Chaver

Information about Chaver

Published on January 12, 2008

Author: Calogera

Source: authorstream.com

Content

Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions
D. Chaver, C. Tenllado, L. Piñuel, M. Prieto, F. Tirado

Index:
- Motivation
- Experimental environment
- Lifting Transform
- Memory hierarchy exploitation
- SIMD optimization
- Conclusions
- Future work

Slide3: Motivation

Slide4: Motivation
- Applications based on the Wavelet Transform: JPEG-2000, MPEG-4
- Usage of the lifting scheme
- Study based on a modern general-purpose microprocessor: Pentium 4
- Objectives: efficient exploitation of the memory hierarchy; use of the SIMD ISA extensions

Slide5: Experimental Environment

Slide6: Experimental Environment
  Platform:          Intel Pentium 4 (2.4 GHz)
  Motherboard:       DFI WT70-EC
  Cache IL1:         NA
  Cache DL1:         8 KB, 64 Byte/Line, Write-Through
  Cache L2:          512 KB, 128 Byte/Line
  Memory:            1 GB RDRAM (PC800)
  Operating System:  RedHat Distribution 7.2 (Enigma)
  Compiler:          Intel ICC compiler, GCC compiler

Slide7: Lifting Transform

Slide8: Lifting Transform
[Diagram: original elements are processed by a 1st and a 2nd lifting step, producing the outputs labelled A and D]

Slide9: Lifting Transform
[Diagram: N levels; each level consists of horizontal filtering (1D lifting transform) followed by vertical filtering (1D lifting transform). Legend: original element, approximation]

Slide10: Lifting Transform
[Diagram: horizontal filtering and vertical filtering]

Slide11: Memory Hierarchy Exploitation

Slide12: Memory Hierarchy Exploitation
- Poor data locality of one component (canonical layouts). E.g. column-major layout when processing image rows (horizontal filtering). Addressed with aggregation (loop tiling).
- Poor data locality of the whole transform. Addressed with other layouts.

Slide13: Memory Hierarchy Exploitation
[Diagram: horizontal filtering and vertical filtering]

Slide14: Memory Hierarchy Exploitation
[Diagram: aggregation applied to the horizontal filtering of the image]

Slide15: Memory Hierarchy Exploitation
Different studied schemes:
- INPLACE: common implementation of the transform. Memory: only requires the original matrix. For most applications it needs post-processing.
- MALLAT: Memory: requires 2 matrices. Stores the image in the expected order.
- INPLACE-MALLAT: Memory: requires 2 matrices. Stores the image in the expected order.

Slide16: Memory Hierarchy Exploitation
[Diagram: INPLACE scheme, logical view and physical view, using matrix 1 only]

Slide17: Memory Hierarchy Exploitation
[Diagram: MALLAT scheme, logical view and physical view, using matrix 1 and matrix 2]

Slide18: Memory Hierarchy Exploitation
[Diagram: INPLACE-MALLAT scheme, logical view and physical view, using matrix 1 and matrix 2]

Slide19: Memory Hierarchy Exploitation
[Figure: execution time breakdown for several sizes comparing both compilers. I, IM and M denote the inplace, inplace-mallat and mallat strategies respectively. Each bar shows the execution time of each level and of the post-processing step.]
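To make the difference between the storage schemes of Slide15 concrete, here is a minimal sketch of one 1D lifting level written in two ways. It is not the authors' code: the Haar-style predict/update filter is chosen only for brevity, and the function names `lift_inplace` and `lift_mallat` are hypothetical. The in-place version leaves approximation and detail coefficients interleaved in the original buffer (hence the post-processing step), while the Mallat-style version writes them to a second buffer in the expected order.

```c
#include <stddef.h>

/* One 1D lifting level computed in place (n must be even).
   Haar-style filter, an assumption made for brevity.
   Result stays interleaved in x: A D A D ... */
void lift_inplace(float *x, size_t n)
{
    for (size_t i = 0; i + 1 < n; i += 2) {
        x[i + 1] -= x[i];            /* predict: detail = odd - even      */
        x[i]     += 0.5f * x[i + 1]; /* update: approx = even + detail/2  */
    }
}

/* Same arithmetic, Mallat-style: results go to a second buffer with all
   approximations in the first half and all details in the second half. */
void lift_mallat(const float *src, float *dst, size_t n)
{
    size_t half = n / 2;
    for (size_t i = 0; i < half; i++) {
        float d = src[2 * i + 1] - src[2 * i];
        dst[half + i] = d;                   /* detail coefficient        */
        dst[i] = src[2 * i] + 0.5f * d;      /* approximation coefficient */
    }
}
```

For the input 1, 3, 5, 9 the in-place version leaves 2, 2, 7, 4 in the buffer (interleaved), whereas the Mallat-style version produces 2, 7, 2, 4, so the next level can work on a contiguous block of approximations.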
Slide20: Memory Hierarchy Exploitation (Conclusions)
- The Mallat and Inplace-Mallat approaches outperform the Inplace approach for levels 2 and above.
- These 2 approaches have a noticeable slowdown for the 1st level: larger working set, more complex access pattern.
- The Inplace-Mallat version achieves the best execution time.
- The ICC compiler outperforms GCC for Mallat and Inplace-Mallat, but not for the Inplace approach.

Slide21: SIMD Optimization

Slide22: SIMD Optimization
- Objective: extract the parallelism available in the Lifting Transform.
- Different strategies: semi-automatic vectorization, hand-coded vectorization.
- Only the horizontal filtering of the transform can be semi-automatically vectorized (when using a column-major layout).

Slide23: SIMD Optimization
Automatic Vectorization (Intel C/C++ Compiler):
- Inner loops
- Simple array index manipulation
- Iterate over contiguous memory locations
- Global variables avoided
- Pointer disambiguation if pointers are employed

Slide24: SIMD Optimization
[Diagram: original elements processed by the 1st and 2nd lifting steps, producing the outputs labelled A and D]

Slide25: SIMD Optimization
[Diagram: column-major layout; horizontal filtering and vectorial horizontal filtering]

Slide26: SIMD Optimization
[Diagram: column-major layout; vertical filtering and vectorial vertical filtering]

Slide27: SIMD Optimization
Horizontal Vectorial Filtering (semi-automatic):

    for(j=2,k=1;j<(#columns-4);j+=2,k++) {
      #pragma vector aligned
      for(i=0;i<#rows;i++) {
        /* 1st operation */
        col3 = col3 + alfa*(col4 + col2);
        /* 2nd operation */
        col2 = col2 + beta*(col3 + col1);
        /* 3rd operation */
        col1 = col1 + gama*(col2 + col0);
        /* 4th operation */
        col0 = col0 + delt*(col1 + col-1);
        /* Last step */
        detail = col1 * phi_inv;
        aprox  = col0 * phi;
      }
    }

Slide28: SIMD Optimization
Hand-coded Vectorization:
- SIMD parallelism has to be explicitly expressed
- Intrinsics allow more flexibility
- Possibility to also vectorize the vertical filtering

Slide29: SIMD Optimization
Horizontal Vectorial Filtering (hand):

    /* 1st operation */
    t2 = _mm_load_ps(col2);
    t4 = _mm_load_ps(col4);
    t3 = _mm_load_ps(col3);
    coeff = _mm_set_ps1(alfa);
    t4 = _mm_add_ps(t2,t4);
    t4 = _mm_mul_ps(t4,coeff);
    t3 = _mm_add_ps(t4,t3);
    _mm_store_ps(col3,t3);
    /* 2nd operation */
    /* 3rd operation */
    /* 4th operation */
    /* Last step */
    _mm_store_ps(detail,t1);
    _mm_store_ps(aprox,t0);

[Diagram: registers t2, t3 and t4 combined by an add, a multiply by the coefficient, and a second add]

Slide30: SIMD Optimization
[Figure: execution time breakdown of the horizontal filtering (1024x1024 pixel image). I, IM and M denote the inplace, inplace-mallat and mallat approaches. S, A and H denote scalar, automatic-vectorized and hand-coded-vectorized.]

Slide31: SIMD Optimization (Conclusions)
- Speedup between 4 and 6 depending on the strategy. Such a high improvement is due not only to the vectorial computations, but also to a considerable reduction in the memory accesses.
- The speedups achieved by the strategies with recursive layouts (i.e. inplace-mallat and mallat) are higher than those of their inplace counterparts, since the computation on the latter can only be vectorized in the first level.
- For ICC, both vectorization approaches (i.e. automatic and hand-tuned) produce similar speedups, which highlights the quality of the ICC vectorizer.

Slide32: SIMD Optimization
[Figure: execution time breakdown of the whole transform (1024x1024 pixel image). I, IM and M denote the inplace, inplace-mallat and mallat approaches. S, A and H denote scalar, automatic-vectorized and hand-coded-vectorized.]

Slide33: SIMD Optimization (Conclusions)
- Speedup between 1.5 and 2 depending on the strategy.
- For ICC the shortest execution time is reached by the mallat version.
- When using GCC both recursive-layout strategies obtain similar results.

Slide34: SIMD Optimization
[Figure: speedup achieved by the different vectorial codes over the inplace-mallat and inplace versions. Shown are the hand-coded ICC, the automatic ICC, and the hand-coded GCC codes.]

Slide35: SIMD Optimization (Conclusions)
- The speedup grows with the image size.
- On average, the speedup is about 1.8 over the inplace-mallat scheme, growing to about 2 when measured against the inplace strategy.
- Focusing on the compilers, ICC clearly outperforms GCC by a significant 20-25% for all the image sizes.

Slide36: Conclusions

Slide37: Conclusions
- Scalar version: we have introduced a new scheme called Inplace-Mallat that outperforms both the Inplace implementation and the Mallat scheme.
- SIMD exploitation: code modifications for the vectorial processing of the lifting algorithm. Two different methodologies with the ICC compiler: semi-automatic and intrinsic-based vectorization. Both provide similar results.
- Speedup: horizontal filtering about 4-6 (vectorization also reduces the pressure on the memory system); whole transform around 2.
- The vectorial Mallat approach outperforms the other schemes and exhibits better scalability.
- Most of our insights are compiler independent.

Slide38: Future work

Slide39: Future work
- 4D layout for a lifting-based scheme
- Measurements using other platforms: Intel Itanium, Intel Pentium 4 with Hyper-Threading
- Parallelization using OpenMP (SMT)

For additional information: http://www.dacya.ucm.es/dchaver
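As an illustration of the two vectorization strategies discussed on Slide23 and Slide29, the first lifting operation can be written out as a complete loop over one triple of columns. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `lift_step_scalar` and `lift_step_sse` are hypothetical, the element type is assumed to be `float`, the columns are assumed not to overlap, and the intrinsics version additionally assumes that `rows` is a multiple of 4 and that every column is 16-byte aligned on an x86 target with SSE.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics (x86 only) */

/* Scalar form: an inner loop with simple indexing over contiguous memory
   and restrict-qualified pointers, matching the Slide23 checklist.
   Computes col3[i] += alfa * (col2[i] + col4[i]). */
void lift_step_scalar(float *restrict col3, const float *restrict col2,
                      const float *restrict col4, float alfa, size_t rows)
{
    for (size_t i = 0; i < rows; i++)
        col3[i] = col3[i] + alfa * (col4[i] + col2[i]);
}

/* Hand-coded form of the same operation, four rows per iteration.
   Assumes rows % 4 == 0 and 16-byte aligned columns. */
void lift_step_sse(float *col3, const float *col2, const float *col4,
                   float alfa, size_t rows)
{
    const __m128 coeff = _mm_set1_ps(alfa);   /* broadcast coefficient */
    for (size_t i = 0; i < rows; i += 4) {
        __m128 t2 = _mm_load_ps(col2 + i);
        __m128 t4 = _mm_load_ps(col4 + i);
        __m128 t3 = _mm_load_ps(col3 + i);
        t4 = _mm_add_ps(t2, t4);              /* col2 + col4           */
        t4 = _mm_mul_ps(t4, coeff);           /* alfa * (col2 + col4)  */
        t3 = _mm_add_ps(t4, t3);              /* add to col3           */
        _mm_store_ps(col3 + i, t3);
    }
}
```

Both functions compute the same values; if alignment cannot be guaranteed, `_mm_loadu_ps` and `_mm_storeu_ps` can be substituted for the aligned load and store.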
