Published on January 16, 2008
SPIRAL: Current Status: SPIRAL: Current Status José Moura (CMU) Jeremy Johnson (Drexel) Robert Johnson (MathStar) David Padua (UIUC) Viktor Prasanna (USC) Markus Püschel (CMU) Manuela Veloso (CMU) Gavin Haentjens (CMU) Pinit Kumhom (Drexel) Neungsoo Park (USC) David Sepiashvili (CMU) Bryan Singer (CMU) Yevgen Voronenko (Drexel) Edward Wertz (CMU) Jianxin Xiong (UIUC) Faculty Students http://www.ece.cmu.edu/~spiral Markus Püschel Collaborators Christoph Überhuber (TU Vienna) Franz Franchetti (TU Vienna) Sponsor: Sponsor Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through grant managed by research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting. Moore’s Law and High(est) Performance Scientific Computing: Moore’s Law and High(est) Performance Scientific Computing SPIRAL: SPIRAL Automates Implementation Platform-Adaptation Optimization of DSP algorithms SPIRAL system: SPIRAL system Slide6: DSP Transform Algorithm DSP Algorithms: Example 4-point DFT: DSP Algorithms: Example 4-point DFT Cooley/Tukey FFT (size 4): product of structured sparse matrices mathematical notation Fourier transform Identity Permutation Diagonal matrix (twiddles) Kronecker product DSP Algorithms: Terminology: DSP Algorithms: Terminology Transform Rule Formula parameterized matrix a breakdown strategy product of sparse matrices recursive application of rules uniquely defines an algorithm efficient representation easy manipulation Ruletree few constructs and primitives uniquely defines an algorithm can be translated into code DSP Transforms: DSP Transforms Others: filters, discrete wavelet transforms, Haar, Hartley, … discrete Fourier transform Walsh-Hadamard transform discrete cosine and sine Transforms (16 types) modified discrete cosine transform two-dimensional transform Rules = Breakdown Strategies: Rules = Breakdown Strategies base case recursive translation iterative recursive recursive recursive recursive recursive translation iterative/ recursive Formula for a DCT, size 16: Formula for a DCT, size 16 Slide12: Algorithm (Formula) Implementation DSP Transform Formulas in SPL: Formulas in SPL ( compose ( diagonal ( 2*cos(1/16*pi) 2*cos(3/16*pi) 2*cos(5/16*pi) 2*cos(7/16*pi) ) ) ( permutation ( 1 3 4 2 ) ) ( tensor ( I 2 ) ( F 2 ) ) ( permutation ( 1 4 2 3 ) ) ( direct_sum ( compose ( F 2 ) ( diagonal ( 1 sqrt(1/2) ) ) ) ( compose ( matrix ( 1 1 0 ) ( 0 (-1) 1 ) ) ( diagonal ( cos(13/8*pi)-sin(13/8*pi) sin(13/8*pi) cos(13/8*pi)+sin(13/8*pi) ) ) ( matrix ( 1 0 ) ( 1 1 ) ( 0 1 ) ) ( permutation ( 2 1 ) ) • • • • • • • • Slide14: SPL Syntax (Subset) matrix operations: (compose formula formula ...) (tensor formula formula ...) (direct_sum formula formula ...) direct matrix description: (matrix (a11 a12 ...) (a21 a22 ...) ...) (diagonal (d1 d2 ...)) (permutation (p1 p2 ...)) parameterized matrices: (I n) (F n) scalars: 1.5, 2/7, cos(..), w(3), pi, 1.2e-04 definition of new symbols: (define name formula) (template formula (i-code-list) directives for code generation #codetype real/complex #unroll on/off allows extension of SPL controls loop unrolling SPL Compiler, 4-point FFT: SPL Compiler, 4-point FFT (compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2)) fast algorithm as formula as SPL program #codetype complex real SPL Compiler: Summary: SPL Compiler: Summary Parsing Intermediate Code Generation Intermediate Code Restructuring Target Code Generation Symbol Table Abstract Syntax Tree I-Code I-Code C, FORTRAN function Template Table SPL Formula Template Definition Symbol Definition Optimization I-Code SPL Program Built-in optimizations: single static assignment code no reuse of temporary vars only scalar temporary vars constants precomputed limited CSE Extensible through templates Slide17: Algorithm (Formula) Implementation DSP Transform Search Why Search?: Why Search? DCT, type IV, size 16 maaaany different formulas large spread in runtimes, even for modest size not due to arithmetic cost best formula is platform-dependent ~31000 formulas Search Methods available in SPIRAL: Search Methods available in SPIRAL Exhaustive Search Dynamic Programming (DP) Random Search Hill Climbing STEER (similar to a genetic algorithm) Search over algorithm space and implementation options (degree of unrolling) STEER: STEER Population n: Population n+1: …… …… Mutation Cross-Breeding expand differently swap expansions Survival of Fittest Experimental Results (C code): Experimental Results (C code) high performance code (compared with FFTW) different transforms search methods (applicable to all transforms) generated high quality code SPIRAL System: SPIRAL System Available for download (v3.1): www.ece.cmu.edu/~spiral Easy installation (Unix: configure/make; Windows: install shield) Unix/Linux and Windows 98/ME/NT/2000/XP Current transforms: DFT, DHT, WHT, RHT, DCT/DST type I – IV, MDCT, Filters, Wavelets, Toeplitz, Circulants Extensible New version (4.0) in preparation Slide23: Recent Work Learning to Generate Fast Algorithms: Learning to Generate Fast Algorithms Learns from given dataset (formulas+runtimes) how to design a fast algorithm (breakdown strategy) Learns from a transform of one size, generates the best algorithm for many sizes Tested for DFT and WHT SIMD Short Vector Extensions: SIMD Short Vector Extensions + x vector length = 4 (4-way) Extension to instruction set architecture Available on most current architectures (SSE on Pentium, AltiVec on Motorola G4) Originally for multimedia (like MMX for integers) Requires fine grain parallelism Large potential speed-up SIMD instructions are architecture specific No common API (usually assembly hand coding) Performance very sensitive to memory access Automatic vectorization very limited Problems: very difficult to use Vector code generation from SPL formulas: Vector code generation from SPL formulas Generated Vector Code DFTs: Pentium 4: Generated Vector Code DFTs: Pentium 4 gflops DFT 2^n, Pentium 4, 2.53 GHz, using Intel C compiler 6.0 n speedups (to C code) up to factor of 3.3 beats hand-tuned vendor library Generated Vector Code, Other Transforms: Generated Vector Code, Other Transforms normalized runtime normalized runtime transform size transform size 2-dim DCT WHT speedups up to factor of 2.5 Slide29: Flexible Loop Interleaving (Runtime Gain WHT) Alpha 21264: up to 70% UltraSparc III: up to 60% Athlon XP: up to 55% Pentium 4: up to 45% Parallel Code Generation: Example WHT: Parallel Code Generation: Example WHT WHT size log(N) 1 thread 8 threads 10 threads PowerPC RS64 III Parallelized constructs: In A, A In Code Scheduling: Code Scheduling Runtime histograms DFT, size 16 ~6500 formulas DCT4, size 16 ~16000 formulas unscheduled scheduled Filters and Wavelets: Filters and Wavelets New constructs: row/column overlapped tensor product Examples for rules: Conclusions: Conclusions Automatic code generation for the entire domain of (linear) DSP algorithms Portable high performance across platforms and across time Integration of math (high) level and implementation (low) level Intelligence through search and learning in the space of alternatives Future Plans: Future Plans Transforms: Radon, Gabor, Hankel, structured matrices, … Target platforms: parallel platforms, DSP processors, SW/HW architectures, FPGAs, ASICs Instructions: Vector, FMAs, prefetching, OpenMP, MPI Beyond transforms: entire DSP applications Other domains amenable to SPIRAL approach? Slide35: http://www.ece.cmu.edu/~spiral Questions?