Published on January 16, 2008
Reconfigurable Computing: Current Status and Potential for Spacecraft Computing Systems

Rod Barto
NASA/GSFC Office of Logic Design, Spacecraft Digital Electronics
3312 Moonlight, El Paso, Texas 79904

Reconfigurable Computing is…
- A design methodology by which computational components can be arranged in several ways to perform various computing tasks
- Two types of reconfigurable computing:
  - Static, i.e., the computing system is configured before launch
  - Dynamic, i.e., the computing system can be reconfigured after launch

Static Reconfigurability
- Several examples exist, e.g., Cray
- Typically processing modules connected by an intercommunication mechanism, e.g., Ethernet
- Goals are:
  - To reduce system development costs
  - To provide higher-performance computing

Dynamic Reconfigurability (DR)
- Processing modules that can be reconfigured in flight
- Goal is to provide processing support for algorithms that do not map well onto general-purpose computers, using reduced amounts of hardware

Outline of Paper
- Discuss the computation of a series of algorithms on general-purpose, special-purpose, and DR computers
- Calculate the execution time of an image processing algorithm on a concept DR computer
- Compare the reconfiguration time of a Xilinx FPGA with the algorithm execution time calculated in Section 2
- Obtain an extremely rough estimate of image processing algorithm execution time on a flight computer
- Conclude that the DR computer described offers higher performance than does the flight computer

Section 1: Algorithm Execution on General Purpose (GP), Special Purpose (SP), and DR Computers

Processing Example
- A computing function is the composition of n algorithms executed serially
- Can be executed on a general-purpose computer (GP) or a special-purpose computer (SP)

Execution on a GP Computer
- Processing time of each stage = t_i, i = 1..n
- Total processing time = t_1 + t_2 + … + t_n
- Latency time = t_1 + t_2 + … + t_n (the same sum)
- The GP computer must execute processing stages sequentially, and cannot exploit parallelism in the overall computing function

Processing on an SP Processor
- Each stage is an independently operating processor designed specifically for the algorithm it executes
- Processing time of each stage = t_i, i = 1..n
- Results appear at a rate of one per max(t_i), i = 1..n
- Latency time = t_1 + t_2 + … + t_n
- Performance increase comes from two factors:
  - Pipelining of constituent algorithms, exploiting parallelism
  - Processors being designed specifically for their algorithms

Processing on a DR Computer
- Two processing elements alternately process and reconfigure, i.e., f_odd executes one algorithm while f_even reconfigures for the next algorithm, and so on
[Figure: input → f_odd → f_even → output]

DR Computer Processing Flow
- Performance increase comes from configuring processors specifically for the algorithm they are executing
- There is no increase from exploiting parallelism

Section 2: Execution Time of an Image Processing Algorithm on a Concept DR Computer

DR Computer Concept
- RAM0 is the source for FPGA0 and the destination for FPGA1, etc.
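The alternating process/reconfigure scheme just described can be sketched as a simple schedule simulation. This is a hedged illustration, not part of the original paper: the function name `dr_total_time`, the stage times, and the assumption that the first PE is pre-configured before processing starts are all hypothetical.

```python
def dr_total_time(t, r):
    """Total time to run a serial chain of algorithms (times t) on two
    processing elements that alternate between processing and reconfiguring.
    r = reconfiguration time. Assumes PE0 is pre-configured with the first
    algorithm, and a PE begins reconfiguring for its next algorithm as soon
    as its current algorithm finishes."""
    ready = [0.0, r]          # time each PE is configured and ready
    finish = 0.0              # finish time of the most recent algorithm
    for k, tk in enumerate(t):
        pe = k % 2                        # algorithms alternate between PEs
        start = max(finish, ready[pe])    # need prior result and a ready PE
        finish = start + tk
        ready[pe] = finish + r            # PE then reconfigures for its next turn
    return finish

# If reconfiguration is shorter than each stage, it hides completely:
print(dr_total_time([10.0, 10.0, 10.0], 4.0))   # 30.0 — no stall
# If reconfiguration dominates (as in Section 3's 81 ms vs. 21 ms numbers),
# the schedule stalls waiting for the next configuration:
print(dr_total_time([0.021] * 4, 0.081))
```

The point of the sketch is the `max(finish, ready[pe])` term: reconfiguration cost vanishes whenever it overlaps a long enough processing stage, which is exactly the comparison Section 3 makes.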
- Processing elements are implemented in FPGAs
- FPGA0 and FPGA1 alternately process and reconfigure, as previously discussed
- Input and output not shown
[Figure: FPGA0 and FPGA1 each connected to RAM0 and RAM1]

Algorithm Example: 3x3 Image Convolution
- Rows are shifted in from the source RAM one at a time, pixel-serial, and shifted in parallel into the upper three row registers; the rows are then shifted circularly through the convolution processor
- All the row registers and processing are inside the FPGA
- The results are written to the destination RAM after a latency of 3 row reads
[Figure: source RAM → row registers i+2, i+1, i, i-1 (image width in pixels; parallel shift rows up; circular shift rows through convolution processor) → 3x3 convolution processor → destination RAM]

Convolution Operation
- Used, for example, to compute the intensity gradient (derivative) at pixel (i,j)
- Result = P(i-1,j-1)*m11 + P(i-1,j)*m12 + P(i-1,j+1)*m13 + … + P(i+1,j+1)*m33
[Figure: pixel array and 3x3 convolution mask]

Convolution Calculation
- Arithmetic processing may require some pipelining to produce Result(i,j)

Convolution Timing
- Total time = latency + processing = 20.971 msec
- This assumes pixels can enter the FPGA at a 20 nsec/pixel rate
- Latency = time to read 3 rows: 1024 pixels * 3 rows * 20 nsec/pixel = 61 usec
- Processing = time to stream the remaining 1021 rows through and process: 1024 * 1021 * 20 nsec = 20.910 msec
- Larger convolutions (e.g., 7x7) have longer latencies, but the same computation time
- The calculation is for a mono image; a stereo image would take twice as long

Section 3: Comparing the Reconfiguration Time of a Xilinx FPGA with the Algorithm Execution Time Calculated in Section 2
DR Computer Processing Element: Virtex-4 LX FPGA
- Eight versions: XC4VLX15, -25, -40, -60, -80, -100, -160, -200
- Logic is hierarchically arranged:
  - 2 flip-flops per slice
  - 4 slices per CLB

Time to Configure FPGA
[Figure: FPGA configuration sequence on PROG_B, INIT_B, CCLK, and DONE; total configuration time = Tpl + Tconfig]

Configuration Timing: Tpl
- Tpl = 0.5 usec/frame
- A "frame" is a unit of configuration RAM
- The Tpl period clears the configuration RAM

Configuration Timing: Tconfig
- The FPGA is programmed by a bitstream
- CCLK (the programming clock) can run at 100 MHz
- Parallel mode loads 8 bits per CCLK

Total Configuration Time
- Total = Tpl + Tconfig, plus some extra time amounting to a few CCLK cycles (at 10 nsec each)

Processing and Reconfiguration Time Comparison
- Convolution execution is faster than reconfiguration:
  - Convolution = 21 msec mono, 42 msec stereo
  - Reconfiguration = 81 msec, assuming a -200 device
- The processing shown is well within the FPGA's capabilities
- More complex algorithms may require use of FPGA performance features:
  - Much higher internal clock rates
  - Large internal RAM
  - Dedicated arithmetic support in the -SX series
- What this shows is that it is reasonable to consider alternating execution and reconfiguration of two FPGAs

Section 4: An Extremely Rough Estimate of Image Processing Algorithm Execution Time on a Flight Computer

GP Computing Performance Estimate
- DANGER: really rough estimate!
- Based on data from this paper: "Stereo Vision and Rover Navigation Software for Planetary Exploration", Steven B.
Goldberg, Indelible Systems; Mark Maimone and Larry Matthies, JPL; 2002 IEEE Aerospace Conference
- Available at robotics.jpl.nasa.gov/people/mwm/visnavsw/aero.pdf
- Describes the processing and algorithms to be used on the 2004 Rover missions, and the Rover requirements

Published Vision Algorithm Timing
- Timed on a Pentium III: 700 MHz CPU, 32K L1 cache, 256K L2 cache, 512M RAM, Win2K
- Individual algorithms were explicitly timed (names from the paper) [timing table not reproduced]
- The Gaussian and most vision algorithms involve neighborhood operations that are comparable to an image convolution of some size

Flight Computer Performance
- The flight processor is the RAD6000
- The GESTALT navigation algorithm was timed on 3 processors [timing table not reproduced]
- Assume that the RAD6000 takes 7 times as long as the 500 MHz Pentium

Final Performance Estimate
- Assume RAD6000 time = 7 times the 500 MHz Pentium time
- Assume 500 MHz Pentium time = 7/5 = 1.4 times the 700 MHz Pentium time
- Then, RAD6000 time is 1.4 * 7 = 9.8 times the 700 MHz Pentium time
- Vision algorithm timing can be estimated accordingly
- Remember: this is a really rough estimate!

Section 5: Conclusions

What We Have Shown
- The concept DR computer presented executes a 3x3 neighborhood-type algorithm "a lot" faster than it appears a RAD6000 executes what are probably a number of neighborhood algorithms
- The reader is cautioned not to try to quantify what "a lot" means based on the data given here
- But it is a good enough estimate to tell us that this is worth looking into in more detail

Conclusions
- A Xilinx-based DR computer shows promise for performance enhancement of a vision system
- By extension, the DR computer shows promise for the performance enhancement of other algorithms
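As a sanity check on the estimates above, the arithmetic of Sections 2 and 4 can be reproduced directly. The image size, pixel rate, and scaling ratios come from the text; the variable names are mine.

```python
# Reproduces the timing arithmetic of Sections 2 and 4 (numbers from the text).
W, H = 1024, 1024        # image dimensions in pixels
T_PIX = 20e-9            # 20 nsec per pixel into the FPGA

latency = W * 3 * T_PIX              # read 3 rows before the first result
processing = W * (H - 3) * T_PIX     # stream the remaining 1021 rows
total = latency + processing
print(f"convolution: {total * 1e3:.2f} ms")   # ~20.97 ms for a mono image

# Section 4 scaling estimate: RAD6000 vs. the 700 MHz Pentium III
rad6000_vs_p500 = 7.0          # assumed RAD6000 / 500 MHz Pentium ratio
p500_vs_p700 = 700 / 500       # clock-rate scaling, = 1.4
print(f"RAD6000 ~ {rad6000_vs_p500 * p500_vs_p700:.1f}x the 700 MHz Pentium")
```

Doubling `total` gives the 42 msec stereo figure used in the Section 3 comparison against the 81 msec reconfiguration time.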