MAPLD2004 Paper198

Information about MAPLD2004 Paper198

Published on February 28, 2008

Author: Urania

Source: authorstream.com

Content

Design and Analysis of Parallel N-Queens on Reconfigurable Hardware with Handel-C and MPI:  Design and Analysis of Parallel N-Queens on Reconfigurable Hardware with Handel-C and MPI Vikas Aggarwal, Ian Troxel, and Alan D. George High-performance Computing and Simulation (HCS) Research Lab Department of Electrical and Computer Engineering University of Florida Gainesville, FL Outline:  Outline Introduction N-Queens Solutions Backtracking Approach N-Queens Parallelization Experimental Setup Handel-C and Lessons Learned Results and Analysis Conclusions Future Work and Acknowledgements References Introduction:  Introduction N-Queens dates back to the 19th century (studied by Gauss) Classical combinatorial problem, widely used as a benchmark because of its simple and regular structure Problem involves placing N queens on an N  N chessboard such that no queen can attack any other Benchmark code versions include finding the first solution and finding all solutions Introduction :  Introduction Mathematically stated: Find a permutation of the BOARD() vector containing numbers 1:N, such that for any i != j Board( i ) - i != Board( j ) - j Board( i ) + i != Board( j ) + j N-Queens Solutions:  N-Queens Solutions Various approaches to the problem Brute force[2] Local search algorithms[4] Backtracking[2], [7] , [11], [12], [13] Divide and conquer approach[1] Permutation generation[2] Mathematical solutions[6] Graph theory concepts[2] Heuristics and AI[4], [14] Backtracking Approach:  Backtracking Approach One of the only approaches that guarantees a solution, though it can be slow Can be seen as a form of intelligent depth-first search Complexity of backtracking typically rises exponentially with problem size Good test case for performance analysis of RC systems, as the problem is complex even for small data size* Traditional processors provide a suboptimal platform for this iterative application due to serial nature of their processing pipelines Tremendous speedups achieved by adding parallelism at the logic level via RC * For an 8x8 board, 981 moves (876 tests + 105 backtracks) are required for first solution alone Backtracking Approach:  Backtracking Approach Tables provide an estimate of the backtracking approach’s complexity Problem can be made to find first solution or the total number of solutions Total number of solutions is obviously a more challenging problem Interesting observation: 1st solution’s complexity (i.e. number of operations) does not increase monotonically with board size # of operations for 1st solution [7] Number of solutions [8] N-Queens Parallelization:  N-Queens Parallelization Different levels of parallelism added to improve performance Functional Unit replication Parallel column check Parallel row check Q Q Sequential : 11 cycles Parallel column check: 3 cycles Multiple row check appended:1 cycles 11x speedup over sequential operation Note: Assume first four queens have been placed and the fifth queen starts from the 1st row Parallelization Comparison Experimental setup:  Experimental setup Experiments conducted using RC1000 boards from Celoxica, Inc., and Tarari RC boards from Tarari, Inc. Each RC1000 board features a Xilinx Virtex-2000 FPGA, 8 MB of on-card SRAM, and PCI Mezzanine Card (PMC) sockets for connecting two daughter cards Each Tarari board features two user-programmable Xilinx Virtex-II FPGAs in addition to a controller FPGA, 256 MB of DDR SDRAM Configurations designed in Handel-C using Celoxica’s application mapping tool DK-2, along with Xilinx ISE for place and route Performance compared against 2.4 GHz Xeon server and 1.33 GHz Athlon server Celoxica RC1000:  Celoxica RC1000 * Figure courtesy of Celoxica RC1000 manual PCI-based card having one Xilinx FPGA and four memory banks FPGA configured from the host processor over the PCI bus Four memory banks, each of 2MB, accessible to both the FPGA and any other device on the PCI bus Data transfers: The RC1000 provides 3 methods of transferring data over PCI bus between host processor and FPGA: Bulk data transfers performed via memory banks Two unidirectional 8 bit ports, called control and status ports, for direct comm. between FPGA and PCI bus (note: this method used in our experiments) User I/O pins USER1 and USERO for single bit communication with FPGA API-layer calls from host to configure and communicate with RC board Tarari Content Processing Platform:  Tarari Content Processing Platform * Figure courtesy of Tarari CP-DK manual Handel-C Programming Paradigm:  Handel-C Programming Paradigm Handel-C acts as a bridge between VHDL and “C” Comparison with conventional C More explicit provisioning of parallelism within the code Variables declared to have the exact bit-lengths to save space Provides more bit-level manipulations beyond shifts and logic operations Limited support for many ANSI C standards and extensions Comparison with VHDL Application porting is much faster for experienced coders Similar to VHDL behavioral models Lacks VHDL concurrent signal assignments which can be suspended until changes on input triggers (Handel-C requires polling) Provides more higher-level routines Handel-C Design Specifics :  Handel-C Design Specifics Design makes use of the following two approaches Approach 1 Use of an array of binary numbers to hold a ‘1’ at a particular bit position to indicate the location of queen in the column A 32 x 32 board will require an array of 32 elements of 32 bits each Correspondingly use bit-shift operations and logical-and operations to check diagonal and row conditions More closely corresponds to the way the operations will take place on the RC fabric Approach 2 Use of an array of integers instead of binary numbers Correspondingly use the mathematical model of the problem to check the validation conditions Smaller variables yield better device utilization; slices occupied reduce from about 75% to about 15% for similar performance and parallelism Approach 2 found to be more amenable for Handel-C designs Lessons Learned with Handel-C:  Lessons Learned with Handel-C Some interesting observations: Code for which place and route did not work, finally worked when the function parameters were replaced by global variables Less control at lower level with place and route being a consistent problem even with designs using up only 40% of total slices Self-referenced operations (e.g. a=a+x) affect the design adversely, so use intermediate variables Order of operations and conditional statements can affect design Useful to reduce wider-bit operations into a sequence of narrower-bit operations Balancing “if” with “else” branches leads to better designs Comments in the main program sometimes affected the synthesis, leading to place and route errors in fully commented code We are still learning more everyday! Sequential First-Solution Results:  Sequential First-Solution Results Sequential version does not perform well versus the Xeon and Athlon CPUs Algorithm needs an efficient design to minimize resource utilization The results do not include the one-time configuration overhead of ~150 ms RC1000 clock speed @ 40 MHz RC1000 clock speed @ 40 MHz Parallel First-Solution Results:  Parallel First-Solution Results RC1000 clock speed @ 25 MHz The most parallel algorithm runs about 20x faster than sequential algorithm on RC fabric Parallel algorithm with two row checks almost duplicates behavior of 2.4 GHz Xeon server, while 6-row check outperforms it by 74% Further increasing the number of rows checked is likely to further improve performance for larger problem sizes Total Number of Solutions Method:  Total Number of Solutions Method Employ divide-and-conquer approach Seen as a parallel depth-first search Solutions obtained with queen positioned in any row in the first column are independent from solutions with queens in other positions Technique allows for high degree of parallelism (DoP) One-Board Total-Solutions Results:  One-Board Total-Solutions Results Designs on hardware perform around 1.7x faster than Xeon server Performance on both RC platforms similar for same clock rates RC1000 performs a notch better for smaller chess board sizes while Tarari CPP’s performance improves with chess board sizes Almost entire Virtex–II chip on the Tarari is occupied for one FU RC1000 and Tarari clock speed @ 33 MHz Multiple Functional Units (FUs):  Multiple Functional Units (FUs) Used additional FUs per chip to increase parallelism per chip Each FU searches for the number of solutions corresponding to a subset of rows in the first column The controller Handles communication with the host Invokes all FUs in parallel Combines all results Controller Host processor fu1 fu2 fu6 fu7 fu8 fu9 fu10 fu3 fu4 fu5 On board FPGA Total-Solutions Results with Multiple FUs:  Total-Solutions Results with Multiple FUs RC1000 with three FUs performs almost 5x faster than Xeon server Speedup increases near linearly with number of FUs Area occupied scales linearly with number of FUs RC1000 clock speed @ 30 MHz RC speedup vs. Xeon server for board size of 17 MPI for Inter-Board Communication:  MPI for Inter-Board Communication On-board FPGA (with one or multiple FU’s) MPI Host server Host server On-board FPGA (with one or multiple FU’s) To further increase system speedup (having more functional units), multiple boards employed Each FU programmed to search a subset of the solution space Servers communicate using the Message Passing Interface (MPI) to start search in parallel and obtain the final result Total-Solutions Results with MPI:  Total-Solutions Results with MPI Results show total execution time including MPI overhead Minimal MPI overhead incurred (high computation-to-communication ratio) Communication overhead bounded to 3 ms regardless of problem size and initialization overhead is around 750 ms Overhead becomes negligible for large problem sizes Speedup scales near linearly with number of boards 4-board Tarari design performs about 6.5x faster than Xeon server Tarari CPP clock speed @ 33 MHz RC speedup vs. Xeon server for board size of 12 Total-Solutions Results with MPI:  Total-Solutions Results with MPI RC speedup vs. Xeon server for board size of 12 RC1000 clock speed @ 30 MHz Results show total execution time including MPI overhead Minimal MPI overhead incurred (high computation-to-communication ratio) Communication overhead bounded to 3 ms regardless of problem size and initialization overhead is around 750 ms Overhead becomes negligible for large problem sizes Speedup scales near linearly with number of boards 4-board RC1000 design performs about 12x faster than Xeon server Total-Solutions Results with MPI:  Total-Solutions Results with MPI Communication overheads still remain low, while MPI initialization overheads increase with number of boards (now 1316 ms for 8 boards) Heterogeneous mixture of boards employed to solve the problem coordinating via MPI Total of 8 boards (4 RC1000 and 4 Tarari boards) allows up to 16 (43 + 41) FUs 8 boards perform about 21x faster than Xeon server for chess board size of 16 What appears to be an unfair comparison really shows how the approach scales to many more FUs per FPGA (on higher density chips) RC1000 clock speed @ 30 MHz and Tarari clock speed @ 33MHz Conclusions:  Conclusions Parallel backtracking for solving N-Queens problem in RC shows promise for performance N-Queens is an important benchmark in the HPC community RC devices outperform CPUs for N-Queens due to RC’s efficient processing of fine-grained, parallel, bit-manipulation operations Previously inefficient methods for CPUs like backtracking can be improved by reexamining their design This approach can be applied to many other applications Numerous parallel approaches developed at several levels Handel-C lessons learned A “C-based” programming model for application mapping provides a degree of higher-level abstraction, yet still requires programmer to code from a hardware perspective Solutions produced to date show promise for application mapping Future Work and Acknowledgements:  Future Work and Acknowledgements Compare application mappers with HDL design in terms of mapping efficiency Develop and use direct communication between FPGAs to avoid MPI overhead Export approach featured in this talk to variety of algorithms and HPC benchmarks for performance analysis and optimization Develop library of application and middleware kernels for RC-based HPC We wish to thank the following for their support of this research: Department of Defense Xilinx Celoxica Tarari Key vendors of our HPC cluster resources (Intel, AMD, Cisco, Nortel) References:  References [1] “Divide and Conquer under Global Constraints: A Solution to the N-Queens Problem”, Bruce Abramson and Mordechai M. Yung [2] ”Different Perspectives Of The N-queens Problem”, Cengiz Erbas, Seyed Sarkeshikt, Murat M. Tanik, Department of Computer Science and Engineering,Southern Methodist University, Dallas [3] “Algorithms and Complexity”, Herbert S. Wilf, University of Pennsylvania, Philadelphia [4] “Fast search algorithms for N-Queens problem”, Rok Sausic, Jum Gu, appeared in IEEE transactions on Systems, Man, and Cybernetics, Vol 21, 6, pp 1572-76, Nov/Dec 1991 [5] http://www.cit.gu.edu.au/~sosic/nqueens.html [6] http://bridges.canterbury.ac.nz/features/eight.html [7] www.math.utah.edu/~alfeld/queens/queens.html [8] www.jsomers.com/nqueen_demo/nqueens.html [9] A polynomial time algorithm for N-queens problem [10] remus.rutgers.edu/~rhoads/Code/code.html [11] http://www.mactech.com/articles/mactech/Vol.13/13.12/TheEightQueensProblem/index.html [12] http://www2.ilog.com/preview/Discovery/samples/nqueens/ [13] http://www.infosun.fmi.uni-passau.de/br/lehrstuhl/Kurse/Proseminar_ss01/backtracking_nm.pdf [14] “From Alife Agents To A Kingdom Of N Queens”, Han Jing, Jimimg Liu, Cai Qingsheng [15] http://www.wi.leidenuniv.nl/~kosters/nqueens.html [16] http://www.dsitri.de/projects/NQP/

Related presentations


Other presentations created by Urania

rocks ppt
14. 01. 2008
0 views

rocks ppt

Haydn Shaw presentation
09. 01. 2008
0 views

Haydn Shaw presentation

Cosmetica presentacion gris
10. 01. 2008
0 views

Cosmetica presentacion gris

what is femp 5B1 5D
11. 01. 2008
0 views

what is femp 5B1 5D

private sector
12. 01. 2008
0 views

private sector

S06 Lec5a rocksmet
14. 01. 2008
0 views

S06 Lec5a rocksmet

ma dissertation
14. 01. 2008
0 views

ma dissertation

Chapter 6 day 3
14. 01. 2008
0 views

Chapter 6 day 3

chapt 3 power presentation
14. 01. 2008
0 views

chapt 3 power presentation

Music 2 Powerpoint
15. 01. 2008
0 views

Music 2 Powerpoint

suips
17. 01. 2008
0 views

suips

lead awareness
18. 01. 2008
0 views

lead awareness

4 Rafi Ahmad
19. 01. 2008
0 views

4 Rafi Ahmad

weathering
20. 01. 2008
0 views

weathering

wind
21. 01. 2008
0 views

wind

TheAfterlife
15. 01. 2008
0 views

TheAfterlife

Rahman PPT
23. 01. 2008
0 views

Rahman PPT

biotech
24. 01. 2008
0 views

biotech

gbr07 shen gufg 01
04. 02. 2008
0 views

gbr07 shen gufg 01

SBS UAE Presentation
04. 02. 2008
0 views

SBS UAE Presentation

unit 7
05. 02. 2008
0 views

unit 7

a3
08. 02. 2008
0 views

a3

Color seminar
11. 02. 2008
0 views

Color seminar

Life on a cotton plantation
25. 01. 2008
0 views

Life on a cotton plantation

17xsl
15. 01. 2008
0 views

17xsl

VMB SEMIPLENARY SHEFFIELD 2006
16. 01. 2008
0 views

VMB SEMIPLENARY SHEFFIELD 2006

OB ch03
29. 01. 2008
0 views

OB ch03

The Rosary Prayers
29. 01. 2008
0 views

The Rosary Prayers

Ogilvy Presentation
06. 02. 2008
0 views

Ogilvy Presentation

DENA ozone
20. 02. 2008
0 views

DENA ozone

EdgarSpresentation
25. 01. 2008
0 views

EdgarSpresentation

Manzotti
29. 02. 2008
0 views

Manzotti

Peter Loveland Presentation
17. 01. 2008
0 views

Peter Loveland Presentation

PARQUES TEMÁTICOS
19. 03. 2008
0 views

PARQUES TEMÁTICOS

ae bg sec1
20. 03. 2008
0 views

ae bg sec1

edwards
07. 02. 2008
0 views

edwards

c1 3
24. 03. 2008
0 views

c1 3

ecology collab chapter 5 lesson
03. 04. 2008
0 views

ecology collab chapter 5 lesson

Andean Civilizations
13. 02. 2008
0 views

Andean Civilizations

GEOL3026 3
28. 03. 2008
0 views

GEOL3026 3

Realization of Olympism DR CHANG
14. 04. 2008
0 views

Realization of Olympism DR CHANG

182AWHRETraining
17. 04. 2008
0 views

182AWHRETraining

japan60s
21. 04. 2008
0 views

japan60s

employ scs place shaping 110907
22. 04. 2008
0 views

employ scs place shaping 110907

articles 71023 2
24. 04. 2008
0 views

articles 71023 2

2008 PS Summer
10. 03. 2008
0 views

2008 PS Summer

Michael Payne IOC
30. 04. 2008
0 views

Michael Payne IOC

WLOCHY ETAP 17 04 2007
02. 05. 2008
0 views

WLOCHY ETAP 17 04 2007

JTB260401Singapore2
27. 03. 2008
0 views

JTB260401Singapore2

pps 326
14. 02. 2008
0 views

pps 326

IsiorhoHawii2Group
05. 02. 2008
0 views

IsiorhoHawii2Group

29 Lectures PPT
12. 02. 2008
0 views

29 Lectures PPT

procanal
20. 02. 2008
0 views

procanal

WreckedCenters Philly
15. 03. 2008
0 views

WreckedCenters Philly

Brain sportsconcussion
23. 01. 2008
0 views

Brain sportsconcussion

Semeria history pool
18. 01. 2008
0 views

Semeria history pool

royster
09. 01. 2008
0 views

royster

IT Scenario Orissa
31. 01. 2008
0 views

IT Scenario Orissa

HST353 15
14. 02. 2008
0 views

HST353 15

marshall 25june2007
23. 01. 2008
0 views

marshall 25june2007

Rachel Ellaway Virtual Patients
04. 02. 2008
0 views

Rachel Ellaway Virtual Patients

ep050320m
08. 04. 2008
0 views

ep050320m

heathers green cleaners training
22. 01. 2008
0 views

heathers green cleaners training

Kishan
24. 01. 2008
0 views

Kishan