puschel spiral

Information about puschel spiral

Published on January 16, 2008

Author: Pasquale

Source: authorstream.com

Content

SPIRAL: Current Status:  SPIRAL: Current Status José Moura (CMU) Jeremy Johnson (Drexel) Robert Johnson (MathStar) David Padua (UIUC) Viktor Prasanna (USC) Markus Püschel (CMU) Manuela Veloso (CMU) Gavin Haentjens (CMU) Pinit Kumhom (Drexel) Neungsoo Park (USC) David Sepiashvili (CMU) Bryan Singer (CMU) Yevgen Voronenko (Drexel) Edward Wertz (CMU) Jianxin Xiong (UIUC) Faculty Students http://www.ece.cmu.edu/~spiral Markus Püschel Collaborators Christoph Überhuber (TU Vienna) Franz Franchetti (TU Vienna) Sponsor:  Sponsor Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through grant managed by research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting. Moore’s Law and High(est) Performance Scientific Computing:  Moore’s Law and High(est) Performance Scientific Computing SPIRAL:  SPIRAL Automates Implementation Platform-Adaptation Optimization of DSP algorithms SPIRAL system:  SPIRAL system Slide6:  DSP Transform Algorithm DSP Algorithms: Example 4-point DFT:  DSP Algorithms: Example 4-point DFT Cooley/Tukey FFT (size 4): product of structured sparse matrices mathematical notation Fourier transform Identity Permutation Diagonal matrix (twiddles) Kronecker product DSP Algorithms: Terminology:  DSP Algorithms: Terminology Transform Rule Formula parameterized matrix a breakdown strategy product of sparse matrices recursive application of rules uniquely defines an algorithm efficient representation easy manipulation Ruletree few constructs and primitives uniquely defines an algorithm can be translated into code DSP Transforms:  DSP Transforms Others: filters, discrete wavelet transforms, Haar, Hartley, … discrete Fourier transform Walsh-Hadamard transform discrete cosine and sine Transforms (16 types) modified discrete cosine transform two-dimensional transform Rules = Breakdown Strategies:  Rules = Breakdown Strategies base case recursive translation iterative recursive recursive recursive recursive recursive translation iterative/ recursive Formula for a DCT, size 16:  Formula for a DCT, size 16 Slide12:  Algorithm (Formula) Implementation DSP Transform Formulas in SPL:  Formulas in SPL ( compose ( diagonal ( 2*cos(1/16*pi) 2*cos(3/16*pi) 2*cos(5/16*pi) 2*cos(7/16*pi) ) ) ( permutation ( 1 3 4 2 ) ) ( tensor ( I 2 ) ( F 2 ) ) ( permutation ( 1 4 2 3 ) ) ( direct_sum ( compose ( F 2 ) ( diagonal ( 1 sqrt(1/2) ) ) ) ( compose ( matrix ( 1 1 0 ) ( 0 (-1) 1 ) ) ( diagonal ( cos(13/8*pi)-sin(13/8*pi) sin(13/8*pi) cos(13/8*pi)+sin(13/8*pi) ) ) ( matrix ( 1 0 ) ( 1 1 ) ( 0 1 ) ) ( permutation ( 2 1 ) ) • • • • • • • • Slide14:  SPL Syntax (Subset) matrix operations: (compose formula formula ...) (tensor formula formula ...) (direct_sum formula formula ...) direct matrix description: (matrix (a11 a12 ...) (a21 a22 ...) ...) (diagonal (d1 d2 ...)) (permutation (p1 p2 ...)) parameterized matrices: (I n) (F n) scalars: 1.5, 2/7, cos(..), w(3), pi, 1.2e-04 definition of new symbols: (define name formula) (template formula (i-code-list) directives for code generation #codetype real/complex #unroll on/off allows extension of SPL controls loop unrolling SPL Compiler, 4-point FFT:  SPL Compiler, 4-point FFT (compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2)) fast algorithm as formula as SPL program #codetype complex real SPL Compiler: Summary:  SPL Compiler: Summary Parsing Intermediate Code Generation Intermediate Code Restructuring Target Code Generation Symbol Table Abstract Syntax Tree I-Code I-Code C, FORTRAN function Template Table SPL Formula Template Definition Symbol Definition Optimization I-Code SPL Program Built-in optimizations: single static assignment code no reuse of temporary vars only scalar temporary vars constants precomputed limited CSE Extensible through templates Slide17:  Algorithm (Formula) Implementation DSP Transform Search Why Search?:  Why Search? DCT, type IV, size 16 maaaany different formulas large spread in runtimes, even for modest size not due to arithmetic cost best formula is platform-dependent ~31000 formulas Search Methods available in SPIRAL:  Search Methods available in SPIRAL Exhaustive Search Dynamic Programming (DP) Random Search Hill Climbing STEER (similar to a genetic algorithm) Search over algorithm space and implementation options (degree of unrolling) STEER:  STEER Population n: Population n+1: …… …… Mutation Cross-Breeding expand differently swap expansions Survival of Fittest Experimental Results (C code):  Experimental Results (C code) high performance code (compared with FFTW) different transforms search methods (applicable to all transforms) generated high quality code SPIRAL System:  SPIRAL System Available for download (v3.1): www.ece.cmu.edu/~spiral Easy installation (Unix: configure/make; Windows: install shield) Unix/Linux and Windows 98/ME/NT/2000/XP Current transforms: DFT, DHT, WHT, RHT, DCT/DST type I – IV, MDCT, Filters, Wavelets, Toeplitz, Circulants Extensible New version (4.0) in preparation Slide23:  Recent Work Learning to Generate Fast Algorithms:  Learning to Generate Fast Algorithms Learns from given dataset (formulas+runtimes) how to design a fast algorithm (breakdown strategy) Learns from a transform of one size, generates the best algorithm for many sizes Tested for DFT and WHT SIMD Short Vector Extensions:  SIMD Short Vector Extensions + x vector length = 4 (4-way) Extension to instruction set architecture Available on most current architectures (SSE on Pentium, AltiVec on Motorola G4) Originally for multimedia (like MMX for integers) Requires fine grain parallelism Large potential speed-up SIMD instructions are architecture specific No common API (usually assembly hand coding) Performance very sensitive to memory access Automatic vectorization very limited Problems: very difficult to use Vector code generation from SPL formulas:  Vector code generation from SPL formulas Generated Vector Code DFTs: Pentium 4:  Generated Vector Code DFTs: Pentium 4 gflops DFT 2^n, Pentium 4, 2.53 GHz, using Intel C compiler 6.0 n speedups (to C code) up to factor of 3.3 beats hand-tuned vendor library Generated Vector Code, Other Transforms:  Generated Vector Code, Other Transforms normalized runtime normalized runtime transform size transform size 2-dim DCT WHT speedups up to factor of 2.5 Slide29:  Flexible Loop Interleaving (Runtime Gain WHT) Alpha 21264: up to 70% UltraSparc III: up to 60% Athlon XP: up to 55% Pentium 4: up to 45% Parallel Code Generation: Example WHT:  Parallel Code Generation: Example WHT WHT size log(N) 1 thread 8 threads 10 threads PowerPC RS64 III Parallelized constructs: In  A, A  In Code Scheduling:  Code Scheduling Runtime histograms DFT, size 16 ~6500 formulas DCT4, size 16 ~16000 formulas unscheduled scheduled Filters and Wavelets:  Filters and Wavelets New constructs: row/column overlapped tensor product Examples for rules: Conclusions:  Conclusions Automatic code generation for the entire domain of (linear) DSP algorithms Portable high performance across platforms and across time Integration of math (high) level and implementation (low) level Intelligence through search and learning in the space of alternatives Future Plans:  Future Plans Transforms: Radon, Gabor, Hankel, structured matrices, … Target platforms: parallel platforms, DSP processors, SW/HW architectures, FPGAs, ASICs Instructions: Vector, FMAs, prefetching, OpenMP, MPI Beyond transforms: entire DSP applications Other domains amenable to SPIRAL approach? Slide35:  http://www.ece.cmu.edu/~spiral Questions?

Related presentations


Other presentations created by Pasquale

ancient olympic games
02. 05. 2008
0 views

ancient olympic games

VRA Glass Markets Presentation
09. 01. 2008
0 views

VRA Glass Markets Presentation

AppliedMechanics
13. 01. 2008
0 views

AppliedMechanics

Chapter9
15. 01. 2008
0 views

Chapter9

2007Somera
15. 01. 2008
0 views

2007Somera

Excavations2
19. 01. 2008
0 views

Excavations2

NCMA Berry Amendment Rex Bragaw
22. 01. 2008
0 views

NCMA Berry Amendment Rex Bragaw

moscowforumenglish
23. 01. 2008
0 views

moscowforumenglish

BEEDL 123a finalPPT v6
04. 02. 2008
0 views

BEEDL 123a finalPPT v6

universal inbox ericsson2000
04. 02. 2008
0 views

universal inbox ericsson2000

Hydrogen Economy
05. 02. 2008
0 views

Hydrogen Economy

Joos QCSE07
09. 01. 2008
0 views

Joos QCSE07

ravenous uk ltd
29. 01. 2008
0 views

ravenous uk ltd

WNC07
07. 02. 2008
0 views

WNC07

PalliativeCare
15. 01. 2008
0 views

PalliativeCare

electronID
20. 01. 2008
0 views

electronID

WatershedCommunicati onChannels
26. 02. 2008
0 views

WatershedCommunicati onChannels

20070306
28. 02. 2008
0 views

20070306

Course chp
16. 01. 2008
0 views

Course chp

RichmondRegionalMobi litySystem
31. 01. 2008
0 views

RichmondRegionalMobi litySystem

lebilinguisme Mar2703
29. 02. 2008
0 views

lebilinguisme Mar2703

05 a NCEM Brownian v1
25. 01. 2008
0 views

05 a NCEM Brownian v1

Ergo Keween
05. 03. 2008
0 views

Ergo Keween

SLiCE
14. 03. 2008
0 views

SLiCE

Transport lecture11
21. 03. 2008
0 views

Transport lecture11

Fingerhut welcome
03. 04. 2008
0 views

Fingerhut welcome

BHTV BITE2000
08. 04. 2008
0 views

BHTV BITE2000

east vs west
10. 03. 2008
0 views

east vs west

Kids andTableTennis1
14. 04. 2008
0 views

Kids andTableTennis1

Self Leadership Lynn Joseph
16. 04. 2008
0 views

Self Leadership Lynn Joseph

WU Course 3
17. 04. 2008
0 views

WU Course 3

2007611162243264
23. 04. 2008
0 views

2007611162243264

CLI GM Vodcast trial
24. 04. 2008
0 views

CLI GM Vodcast trial

chintoda
07. 05. 2008
0 views

chintoda

insagb
02. 05. 2008
0 views

insagb

Lsn 9 SSA
03. 04. 2008
0 views

Lsn 9 SSA

EYESAFE
22. 01. 2008
0 views

EYESAFE

Wykle2007
16. 01. 2008
0 views

Wykle2007

tdb
07. 02. 2008
0 views

tdb

COED CMS Overview Oct 16 Final
10. 01. 2008
0 views

COED CMS Overview Oct 16 Final

colloq hc99
18. 01. 2008
0 views

colloq hc99

FairCASH Best Path ID
17. 01. 2008
0 views

FairCASH Best Path ID

Light 6
04. 02. 2008
0 views

Light 6