Published on August 14, 2007
Slide1: Dependability Benchmarking of VLSI Circuits
  Cristian Constantinescu, Intel Corporation

Outline:
  Neutron SER characterization of microprocessors
    SER scaling trends
    Experimental set-up
    Experimental results
  Other sources of errors
    Memory intermittent faults
    Front side bus intermittent faults
  Using environmental tests as dependability benchmarking tools
    Temperature and Voltage Operating Test
    ESD Operating Test
  Summary
  Backup
    Linpack benchmark
    References

Acknowledgement:
  Neutron SER characterization: Bruce Takala, Steve Wander (LANSCE), Nelson Tam, Pat Armstrong (Intel Corp.)
  Environmental testing: John Blair, Scott Scheuneman (Intel Corp.)

Neutron SER Characterization of Microprocessors:
  (section title slide)

Single Event Upsets:
  Single event upsets (SEU) are induced by:
    Alpha particles, generated during radioactive decay of the package and interconnect materials
    Neutrons, protons, pions, generated by cosmic rays penetrating the atmosphere
  SEU may induce errors both in storage elements and in combinational logic
  Frequency of occurrence of the particle-induced errors: soft error rate (SER)

SER Scaling Trends:
  [Figure: SRAM SER per bit and per chip]
  [Figure: Latch SER per bit and per chip]
  Assumption: SRAM/latch count increases ~2x per generation

Hadron Cascades:
  [Figure: Main constituents of atmospheric hadron cascades]
  Neutrons represent 94% of the hadrons reaching sea level
  For terrestrial applications it makes sense to benchmark the impact of neutron SER

LANSCE Neutron Beam:
  Los Alamos Neutron Science Center (LANSCE)
  Generates high-energy neutrons by spallation: a linear accelerator generates a pulsed proton beam that strikes a tungsten target
  [Figure: Energy dependence of the natural cosmic-ray neutron flux and the LANSCE neutron flux]

Experimental Set Up:
  Itanium processor based server
  Windows NT 4.0 operating system
  Linpack benchmark
    Performs matrix computations
    Derives residues, so it can detect silent data corruption (SDC)
  Fission ion chamber to determine neutron fluence

Deriving MTTF:
  MTTF = T_ua / U
    T_ua: duration of an equivalent experiment, taking place in unaccelerated conditions [h]
    U: total number of upsets (failures) over the duration of the experiment
  T_ua = (F_cp * N_c) / N_f
    F_cp: total number of fission chamber pulses over the duration of the experiment
    N_c: average neutron conversion factor [neutrons/fission pulse/cm²]
    N_f: cosmic-ray induced neutron flux at the desired geographical location and altitude [neutrons/cm²/h]

Experimental Results:
  Ran the Linpack benchmark for square matrices of size 800 and 1000
  Completed 40 runs
  Duration of one run: 10 s to 5 min
  Failure types
    Blue screen
    Hang
    Silent data corruption (SDC)

Experimental Results:
  [Figure: Itanium processor MTTF due to neutrons, as a function of the number of runs]

Experimental Results:
  [Figure: MTTF confidence intervals]
  SDC: one event
    Insufficient for statistical analysis

Practical Considerations:
  Error handling techniques differ greatly from one manufacturer to another
    HW error detection and correction, e.g. ECC, is faster
    FW/SW implemented recovery may be overwhelmed by an accelerated test (near-coincident faults scenario)
    Acceleration factor is an important variable
  Failure prediction and automatic deconfiguration may lead to misleading results
  Multiple experiments
  Beam divergence
  Beam attenuation

Other Sources of Errors:
  (section title slide)

Memory Intermittent Faults:
  Intermittent faults are induced by unstable or marginal hardware
    Intermittent shorts/opens
    Manufacturing residuals
    Timing faults
  [Figure: Number of memory single-bit errors reported by 193 systems over 16 months]
  [Figure: Daily number of memory single-bit errors reported by one system over 16 months]

Front Side Bus Intermittent Faults:
  Front side bus (FSB) errors
    Bursts of single-bit errors (SBE) on the data path
    SBE detected and corrected (data path protected by ECC)
  Failure analysis results
    Intermittent contacts at solder joints
  Fault injection showed that similar faults experienced by control signals induce SDC

Using Environmental Tests as Dependability Benchmarking Tools:
  (section title slide)

Temperature and Voltage Operating Test:
  Ten systems were tested
  Workload: Linpack benchmark
  [Figure: Profile of the test, with temperature levels of 70 °C, 25 °C and -10 °C]
  9 systems experienced SDC
    SDC events: 134 (90.5%)
    Detected errors: 14 (9.5%)
    SDC preceded detected errors

Temperature and Voltage Operating Test:
  [Figure: Distribution of the SDC events]
  Failure analysis results
    Memory controller setup and hold-time violations

ESD Operating Test:
  4 servers from 2 manufacturers
  Workload: Linpack benchmark
  30 test points per server
  20 positive and 20 negative discharges per test point
    Air discharge: 4 kV to 15 kV
    Contact discharge: 8 kV
  One server experienced SDC
    8% of the discharges targeted at the disk bay area (15 kV, air)
  First ESD operating test to reveal SDC in a commercially available server

Summary:
  The need for dependability benchmarking is increasing
    Wider use of COTS components in critical applications
  Technology is a double-edged sword
    Higher performance
    Higher rates of occurrence of transient and intermittent faults
  SDC is a real threat
    We take the correctness of computer data for granted
  Dependability benchmarks should determine whether the circuits/systems under evaluation experience SDC
  Fault injection techniques require in-depth knowledge of the evaluated system
    Appropriate for designers and manufacturers
  Accelerated neutron tests and environmental tests are a "black box" approach
    Capable of unveiling SDC
    In-depth knowledge of the system under test is not required
    Linpack benchmark is available for free
    Can be used both by manufacturers and independent evaluators

Backup:
  (section title slide)

Linpack Benchmark:
  [Figure: Example of Linpack output; large residues indicate SDC]

References:
  "Neutron SER characterization of microprocessors", Proc. of the International Conference on Dependable Systems and Networks, Yokohama, Japan, June 2005, pp. 754-759.
  "Dependability benchmarking using environmental test tools", Proc. of the Reliability and Maintainability Symposium, Alexandria, VA, USA, January 2005, pp. 567-571.
  "Impact of deep submicron technology on dependability of VLSI circuits", Proc. of the International Conference on Dependable Systems and Networks, Washington, DC, USA, June 2002, pp. 205-209.
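
Worked example for the "Deriving MTTF" slide:
  The two formulas on that slide (T_ua = F_cp * N_c / N_f, then MTTF = T_ua / U) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, and the numbers in the usage example are made up for easy arithmetic; they are not measurements from the experiment described in the slides.

```python
def equivalent_unaccelerated_hours(fission_pulses, conversion_factor, natural_flux):
    """T_ua = (F_cp * N_c) / N_f, in hours.

    fission_pulses    -- F_cp, total fission chamber pulses during the experiment
    conversion_factor -- N_c, average neutrons per fission pulse per cm^2
    natural_flux      -- N_f, cosmic-ray neutron flux at the target site,
                         in neutrons per cm^2 per hour
    """
    if natural_flux <= 0:
        raise ValueError("natural_flux must be positive")
    return fission_pulses * conversion_factor / natural_flux


def mttf_hours(fission_pulses, conversion_factor, natural_flux, upsets):
    """MTTF = T_ua / U, where U is the number of upsets (failures) observed."""
    if upsets <= 0:
        # With zero observed upsets the ratio is undefined; only a lower
        # bound on MTTF could be stated, which this sketch does not attempt.
        raise ValueError("at least one upset is needed to estimate MTTF")
    t_ua = equivalent_unaccelerated_hours(
        fission_pulses, conversion_factor, natural_flux
    )
    return t_ua / upsets


if __name__ == "__main__":
    # Illustrative (made-up) inputs, NOT data from the slides.
    t_ua = equivalent_unaccelerated_hours(1000, 2.0, 4.0)
    mttf = mttf_hours(1000, 2.0, 4.0, upsets=5)
    print(f"T_ua = {t_ua} h, MTTF = {mttf} h")
```

  Keeping T_ua as a separate function mirrors the slide, where the equivalent unaccelerated duration is derived first from the beam measurements and only then divided by the upset count.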