### Radiation Models and Hardware Design

Raphael (Rafi) Some - JPL Caltech

- Why is this topic of interest?
  - Why space?
  - Why space based computing?
- Some Radiation basics
- REE: a COTS based onboard high performance computer
- A Space Environment Radiation Fault Model
- Predictions and prognostications
  - Old
  - New
- Conclusions
- This work was carried out by the Jet Propulsion Laboratory under contract with the National Aeronautics and Space Administration, Code AE as part of the NASA Electronic Parts and Packaging Program, and as part of the Remote Exploration and Experimentation Project, a component of the High Performance Computing Program.
- Special thanks to A. Johnston of JPL's Radiation Effects Group for organizing much of the testing and analysis, and for supplying much of the data herein reported.

# Why Space?

- The world is global!
- Military:
  - Space is tactical: control the high ground
- Agricultural/Environmental/Health/Emergency Mgmt:
  - Space is the only place you can see the big picture
- Major players in an emerging space race:
  - US: Communication Transformational Architecture, NOAA weather constellations, NASA sensor web, USDA hemisphere monitoring system...
  - European Union: Galileo, Euro-climate constellations
  - China: to the moon!
  - Wild cards: Russia, Japan

## Why Space Based Computing?

- Historically: minimum processing on board spacecraft
  - A necessary evil to be minimized wherever possible
  - Scientists want all the data
  - Spacecraft have been either bent-pipes or cameras
  - Automation not trusted (never mind autonomy!)
  - Require radiation hardened, ultra-reliable computers
  - Rad-hard computers are:
    - Expensive, slow, heavy and power hungry
    - Run by software
      - Expensive, unreliable, poorly understood and mistrusted

## Why Space Based Computing?

- Current Situation:
  - Cannot tolerate the low bandwidth and comm delays
    - Weather and solid earth modeling: TBs/sat-day
    - Asteroid landing: ms reaction time
    - Rovers: high speed nav and in-transit science, GOPS/20fps
  - New space applications require onboard processing to handle basic functions:
    - Routers/Switches/Protocol Handlers & Translators
    - Quick look product generators
      - Weather Forecasts/Updates
      - Emergency & Disaster Notification/Monitoring
      - Sensor Web Based Situational Evaluations and Updates: Military, Civil, Scientific

## Why Radiation?

- Most significant fault source in space environment: lots of environments, lots of sources, lots of interactions, not well understood
  - Galactic Cosmic Rays (GCRs)
  - Solar Radiation
    - Solar Wind/Protons
    - Coronal Mass Ejections
  - Planetary magnetic fields interaction with GCRs and Sol
    - Van Allen Belts, polar regions and the South Atlantic Anomaly
    - Jovian belts and the Io dynamo
- Radiation is also an issue in particle beam facilities
  - High energy, particle physics, and medical research
  - Advanced fabrication/manufacturing methods
- And now showing up in otherwise benign terrestrial environments
  - Studies by IBM, Intel, Xilinx, and USAF: high altitude and sea level, particle showers and atmospheric/thermal neutrons

- Radiation effects come in several flavors
  - Total Ionizing Dose (TID)
    - Cumulative ionization damage (krads)
    - Causes increased leakage current and threshold shifts
  - Single Event Effects (SEE)
    - Single particle strike
    - LET (Linear Energy Transfer) in MeV-cm<sup>2</sup>/mg
    - Single Event Latch up (SEL)
    - Single Event Upset (SEU) aka soft error
    - Single Event Multiple Upset (SEMU)
    - Single Event Gate Rupture
    - Single Event Micro-dose
  - Of these effects, the only significant issue today is SEU

#### Single-Event Upset in Advanced Twin-Well Process



Shallow Trench Isolated CMOS on epi process

Charge collection region surrounds ion path (top to transition zone)

Approx. 2-2.5 micron depth

#### SOI CMOS Structure



Charge collection depth - approx. 0.13 micron

Source-drain "lateral gain" or "bipolar effect" due to proximity of wells

- TID:
  - Most natural space environments < 3krad/yr
  - State of the art bulk CMOS logic approx. 100krad
  - Commercial SOI >> 100krad
  - State of the art DRAM approx. 35-40krad
  - State of the art Flash <10krad (charge pump)
- SEL:
  - Most environments minimal fluence above LET of 30 MeV
  - State of the art CMOS >70MeV (limit of testing to date)
  - Commercial SOI latch up immune

- SEU aka soft error:
  - Function of charge deposition by charged particle passing through a semiconductor junction vs. charge required to upset (critical q)
  - Charge deposited is a complex function of type of particle, energy of particle, material being traversed, charge collection region (area x depth) and carrier mobility, and is expressed as Linear Energy Transfer (LET)
- LET Threshold is minimum LET which can cause upset
- Effective Cross Sectional Area (aka cross section)
  - Conceptually a measure of the size of the sensitive area of a circuit compared to total circuit area
- SEU Rate:
  - Function of LET threshold and cross section
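
The relationship between LET threshold, cross section, and upset rate can be sketched numerically. The example below is only an illustration: the Weibull-shaped cross-section curve is a common fitting convention assumed here, and the function names, the flux spectrum, and every numeric value are hypothetical rather than data from this work.

```python
import math

def weibull_cross_section(let, let_threshold, sigma_sat, width, shape):
    """Per-bit cross section (cm^2) at a given LET; zero at or below threshold.

    The Weibull form is an assumed curve shape, not a measured device curve.
    """
    if let <= let_threshold:
        return 0.0
    x = (let - let_threshold) / width
    return sigma_sat * (1.0 - math.exp(-(x ** shape)))

def seu_rate(flux_bins, let_threshold, sigma_sat, width=10.0, shape=1.5):
    """Sum flux * cross section over LET bins.

    flux_bins: (LET, flux) pairs, flux in particles/cm^2/day for that bin.
    Returns upsets per bit per day.
    """
    return sum(
        flux * weibull_cross_section(let, let_threshold, sigma_sat, width, shape)
        for let, flux in flux_bins
    )

# Hypothetical spectrum: flux falls steeply as LET rises.
spectrum = [(1.0, 1e3), (5.0, 1e1), (20.0, 1e-1), (40.0, 1e-3)]

low_threshold = seu_rate(spectrum, let_threshold=2.0, sigma_sat=1e-8)
high_threshold = seu_rate(spectrum, let_threshold=15.0, sigma_sat=1e-8)
small_sigma = seu_rate(spectrum, let_threshold=2.0, sigma_sat=1e-11)

print(f"low threshold:  {low_threshold:.3e} upsets/bit/day")
print(f"high threshold: {high_threshold:.3e} upsets/bit/day")
print(f"1000x smaller cross section: {small_sigma:.3e} upsets/bit/day")
```

Raising the threshold removes the high-flux, low-LET part of the spectrum from the sum, while shrinking the saturated cross section scales the whole rate linearly.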

### Summary Thus Far

- Space systems are becoming both necessary and more complex
  - Need high performance computing onboard
- Radiation hardened electronics are poor in power:performance, expensive and generations behind the commercial world
- State of the art digital semiconductor processes are providing sufficient tolerance of TID and SEL for most systems in most space environments
  - Exceptions are Van Allen and Jovian Radiation Belts
- Major issue for space is SEU, which is also becoming an issue for the commercial world
- So, can we build a state of the art, high performance computer suitable for next generation space systems?
  - Can fault tolerance techniques provide adequate dependability without significant degradation in power:performance, mass, etc.?

## <u>REE - A COTS Based Fault Tolerant</u> <u>Cluster Computer</u>




- Processors have internal fault detection \*\*
- Node Main Memory is SECDED protected
- L2 Cache is parity protected
- Mass Memory and node NVM are SECDED protected
- All buses/networks are dual redundant and protocol monitored & ED coded \*\*
- Normal communication between spacecraft and REE is via mass memory
- Spacecraft Control Computer can assume command via back door bus (IIC)
- Spacecraft avionics are SEU hardened
- \*\* Better in later generations, e.g., G4 7455, Rapid I/O, Infiniband
- \*\* SOI provides an order of magnitude improvement; 3 orders of magnitude are needed
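
To make the SECDED (single error correct, double error detect) protection mentioned above concrete, here is a minimal sketch using an extended Hamming (8,4) code. It is an illustration only: memory systems typically protect much wider words, and nothing here describes the actual REE hardware. The function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword.

    Index 0 holds the overall parity bit; indices 1..7 follow the classic
    Hamming(7,4) layout with check bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # make the total parity even
    return code

def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'double'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    overall = sum(code) % 2  # 1 means an odd number of bits flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Single-bit error: the syndrome is its position (0 = parity bit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return "double", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble

word = encode(0b1011)
upset = list(word)
upset[5] ^= 1                 # one simulated SEU
print(decode(upset))          # single error is corrected
upset[2] ^= 1                 # a second upset in the same word
print(decode(upset))          # double error is detected, not corrected
```

Parity alone, as on the L2 cache, corresponds to keeping only the overall parity bit: it detects a single flipped bit but cannot locate or correct it.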

## An Approach to Understanding Radiation Faults



## Radiation Test & Fault Model: Functional Block and Node Levels



### **Functional Element Input Sheet**

**SEU Rate for Data Cache (Per Bit)**

| Orbit or Location    | Environmental Components | Shielding        | Peak Rate (per bit) | Average Rate (per bit) |
|----------------------|--------------------------|------------------|---------------------|------------------------|
| Interplanetary Space | solar min. GCR           |                  |                     | 7.00E-07               |
|                      | solar max. GCR           |                  |                     | 1.90E-07               |
|                      | DCF (protons+ions)       | 60 mil Aluminum  | 9.20E-03            | 1.90E-03               |
|                      |                          | 100 mil Aluminum | 1.60E-03            | 3.80E-04               |
|                      |                          | 250 mil Aluminum | 7.80E-04            | 1.80E-04               |
| 600km - 98°          | solar min. GCR           |                  | 7.00E-07            | 1.90E-07               |
|                      | solar max. GCR           |                  | 1.90E-07            | 6.40E-08               |
|                      | trapped protons          |                  | 1.10E-05            | 1.80E-07               |
|                      | DCF (protons+ions)       | 60 mil Aluminum  | 9.20E-03            | 4.80E-04               |
|                      |                          | 100 mil Aluminum | 1.60E-03            | 8.00E-05               |
|                      |                          | 250 mil Aluminum | 7.80E-04            | 3.50E-05               |
| 600km - 28°          | GCR                      |                  |                     | 1.70E-08               |
|                      | trapped protons          |                  | 1.10E-05            | 3.70E-07               |
| Surface of Mars      | GCR                      |                  |                     | 5.00E-08               |

#### Functional Block Level Detailed Worksheet

|                                                                  | Source | Param    | # Latches | LFR      | LtchFaults |
|------------------------------------------------------------------|--------|----------|-----------|----------|------------|
| Latch fault rate (LFR)                                           |        | 9.54E-07 |           |          |            |
| Totals:                                                          |        |          | 919875    |          | 0.72761438 |
| Chip area, mm^2                                                  |        | 625      |           |          |            |
| Address bus width, bits                                          |        | 32       |           |          |            |
| Data bus width, bits                                             |        | 32       |           |          |            |
| GP Register/instruction/address width, bits                      |        | 32       |           |          |            |
| FP Register width, bits                                          |        | 64       |           |          |            |
| Width of Virtual Address                                         |        | 52       |           |          |            |
| Width of Real (Physical) Address                                 |        | 32       |           |          |            |
| Width of Page offset                                             |        | 12       |           |          |            |
| Width of Real Page Number                                        |        | 40       |           |          |            |
|                                                                  |        |          |           |          |            |
| Number of General Purpose (GP) registers                         |        | 32       |           | 9.54E-07 | 0.0009769  |
| Number of floating point (FP) registers                          |        | 32       | 2048      | 9.54E-07 | 0.00195379 |
| Number of special registers (CR, LR, CTR, XER, FPSCR)            |        | 5        | 160       | 9.54E-07 | 0.00015264 |
| Number of Program Counters (PC)                                  |        | 1        | 32        | 9.54E-07 | 3.0528E-05 |
| Number of Control/Status Registers (CSR)                         |        | 42       | 1344      | 9.54E-07 | 0.00128218 |
| Number of debugging registers (decr, watchaddr, watchinstr)      |        | 3        | 96        | 9.54E-07 | 9.1584E-05 |
| Number of addressing registers (in BIU)                          |        | 4        | 256       | 9.54E-07 | 0.00024422 |
|                                                                  |        |          |           |          |            |
| Number of latches holding current instruction                    |        | 29       |           | 9.54E-07 | 0.00088531 |
| Number of register rename buffers (6 GP, 6 FP, 3 CSR)            |        | 15       |           | 9.54E-07 | 0.00064109 |
| No. of Branch Target Instruction Cache entries                   |        | 64       | 2048      | 9.54E-07 | 0.00195379 |
| No. of Branch History Table entries                              |        | 512      | 1024      | 9.54E-07 | 0.0009769  |
| Number of MMU entries/TLB                                        |        | 128      |           | 9.54E-07 |            |
| Tag bits per TLB entry (dirty, protected, read-only, volatile, write-back/through, coherent, guarded, LRU, changed, referenced) | | | | 9.54E-07 | |
| Width of MMU TLB entry, bits                                     |        | 70       |           | 9.54E-07 |            |
| Number of MMU TLB's                                              |        | 2        | 17920     | 9.54E-07 | 0.01709568 |
| Latches/BAT                                                      |        | 64       |           | 9.54E-07 |            |
| Number of I+D BATs (defined as SPRs)(shadowed)                   |        | 16       | 1024      | 9.54E-07 | 0.0009769  |
| Number of MMU mem segment registers, VSID+SLB                    |        | 17       | 408       | 9.54E-07 | 0.00038923 |
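
Each LtchFaults entry in the worksheet appears to be the latch count multiplied by the per-latch fault rate. The short sketch below reproduces two rows under that reading; the helper name `latch_faults` is hypothetical, and the way the latch counts are derived (count times width) is inferred from the table rather than stated in it.

```python
LATCH_FAULT_RATE = 9.54e-07  # per-latch fault rate (LFR) from the worksheet

def latch_faults(count, bits_each, lfr=LATCH_FAULT_RATE):
    """Return (number of latches, expected latch faults) for one block."""
    latches = count * bits_each
    return latches, latches * lfr

# 32 floating point registers, 64 bits each
fp_regs = latch_faults(32, 64)
# 2 TLBs x 128 entries, 70 bits per entry
tlbs = latch_faults(2 * 128, 70)

print(f"FP registers: {fp_regs[0]} latches, {fp_regs[1]:.8f} faults")
print(f"MMU TLBs:     {tlbs[0]} latches, {tlbs[1]:.8f} faults")
```

Both results agree with the worksheet rows (2048 latches and 17920 latches respectively), which supports the count-times-rate interpretation.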

### Node Level Summary

|                                        | Count | Margin | Latch Faults/hr | Gate Faults/hr | Total Faults/hr |
|----------------------------------------|-------|--------|-----------------|----------------|-----------------|
| Totals per node:                       |       |        | 6.95            | 0.05           | 6.99            |
| Node CPUs per node                     | 2     |        | 5.31            | 0.04           | 5.36            |
| Node Controller (NC) CPU               | 1     | 1.5    | 0.71            | 0.00           | 0.71            |
| Node Controller RAM                    | 1     | 3      | 0.12            | 0.00           | 0.12            |
| Network Interface Units (NIU) per node | 2     | 3      | 0.35            | 0.00           | 0.35            |
| Number of Network Switches per node    | 1     | 3      | 0.16            | 0.00           | 0.16            |
| Bus controller (PCI)                   | 1     | 3      | 0.13            | 0.00           | 0.13            |
| Misc (watchdog, clock, EEPROM, PHRC)   | 1     | 3      | 0.02            | 0.00           | 0.02            |
| Node Controller FPGA                   | 1     | 3      | 0.14            | 0.00           | 0.14            |

### System Fault Rate Summary

**Average Rates (Faults/Hour)**

| Environment                          | Interplanetary Space | Interplanetary Space | Interplanetary Space | Interplanetary Space | 600km-98°    | 600km-98°    | 600km-98°    | 600km-98°    | Mars Surface | 600km-28°    |
|--------------------------------------|----------------------|----------------------|----------------------|----------------------|--------------|--------------|--------------|--------------|--------------|--------------|
| Solar Min/Max                        | Solar Minimum        | Solar Maximum        | Solar Minimum        | Solar Maximum        | Solar Min    | Solar Max    | Solar Min    | Solar Max    | N/A          | N/A          |
| Flare Status                         | No Flare             | No Flare             | Design Case Flare    | Design Case Flare    | No Flare     | No Flare     | Solar Flare  | Solar Flare  | N/A          | N/A          |
| Shielding                            | 100 Mil (Al)         | 100 Mil (Al)         | 100 Mil (Al)         | 100 Mil (Al)         | 100 Mil (Al) | 100 Mil (Al) | 100 Mil (Al) | 100 Mil (Al) | 100 Mil (Al) | 100 Mil (Al) |
| Node CPU (w/o Caches)                | 0.04                 | 0.01                 | 28.46                | 28.43                | 0.05         | 0.02         | 5.89         | 31.27        | 0.00         | 0.03         |
| L1 Cache                             | 0.05                 | 0.01                 | 37.40                | 37.37                | 0.06         | 0.02         | 7.88         | 5.88         | 0.00         | 0.04         |
| RAM per node CPU                     | 0.01                 | 0.00                 | 7.94                 | 7.93                 | 0.01         | 0.00         | 1.67         | 7.87         | 0.00         | 0.01         |
| L2 Cache                             | 0.00                 | 0.00                 | 1.03                 | 1.03                 | 0.00         | 0.00         | 0.22         | 1.67         | 0.00         | 0.00         |
| Node Controller CPU                  | 0.03                 | 0.01                 | 22.10                | 22.08                | 0.04         | 0.01         | 4.61         | 0.22         | 0.00         | 0.02         |
| Node Controller RAM                  | 0.01                 | 0.00                 | 4.20                 | 4.20                 | 0.01         | 0.00         | 0.88         | 4.60         | 0.00         | 0.00         |
| Network Interface Units per node (2) | 0.01                 | 0.00                 | 6.87                 | 6.86                 | 0.01         | 0.01         | 1.14         | 0.88         | 0.00         | 0.01         |
| Network Switch per Node (1)          | 0.01                 | 0.00                 | 3.17                 | 3.17                 | 0.00         | 0.00         | 0.53         | 1.14         | 0.00         | 0.01         |
| PCI Bus Controller                   | 0.00                 | 0.00                 | 2.58                 | 2.58                 | 0.00         | 0.00         | 0.43         | 0.52         | 0.00         | 0.00         |
| Node Controller FPGA                 | 0.00                 | 0.00                 | 1.18                 | 1.18                 | 0.00         | 0.00         | 0.24         | 0.43         | 0.00         | 0.00         |
| Misc (watchdog, clock, EEPROM, PHRC) | 0.00                 | 0.00                 | 2.63                 | 2.63                 | 0.00         | 0.00         | 0.44         | 0.24         | 0.00         | 0.00         |
| Single CPU                           | 0.10                 | 0.03                 | 74.83                | 74.76                | 0.13         | 0.05         | 15.66        | 15.64        | 0.01         | 0.08         |
| Per Node                             | 0.25                 | 0.07                 | 192.40               | 192.21               | 0.31         | 0.12         | 39.57        | 39.53        | 0.02         | 0.21         |
| Per System (20 nodes)                | 4.97                 | 1.33                 | 3852.98              | 3849.34              | 6.27         | 2.39         | 792.27       | 791.37       | 0.34         | 4.14         |
## **Radiation Fault Model Results**

- Initial studies of PPC603, PPC750 and G4 processors using technology scaling factors and available component data show results consistent with experimental data.
- Initial studies show:
  - No appreciable gate fault rate and no clock rate dependent fault rates
  - Fault rates for Mars surface, LEO, and GEO are relatively benign and can be handled with Software Implemented Fault Tolerance (SWIFT) techniques and low cost hardware monitors.

## So, what happened to REE?

- REE was zero-funded for FY'02
  - But it begat:
    - FY'02 ALERT project start: a 3 year project to validate use of a G4 for the next Mars Rover ('09 Launch)
    - FY'03 NMP-ST8 project start: a 4 year project to develop and validate a space borne COTS-based supercomputer

Meanwhile....

- The highest performance rad hard processor to date is the X2000 PPC-750: 133MHz processor - first flight planned for FY'04.

## Predictions and Prognostications

- Past prognostications: how good were they?
- Current road map
- Latest test results
- Process issues affecting SEU tolerance
- Our predictions for the future



- Prediction: at 1 micron, SEU rate will be astronomical!
- Clearly, this did not happen. Why not?
  - Charge collection volume shrank
  - Mfgrs had to keep SEU threshold above alpha particle LET
- And we've been getting it wrong ever since!

### **Clock Rate Dependent Fault Rate Prediction**

- SEU event is approximately 1 ns wide
- Therefore:
  - We will hit a cliff at 500MHz: 50% probability of SEU event catching a clock edge
  - Operation at >1GHz will be impossible: 100% probability of SEU event catching a clock edge
- We never saw the problem. Why not?
- Re-measured SEU event: now < 100 ps
  - High speed circuits required higher electron/hole mobility
  - Higher mobility -> faster SEU event
  - SEU event time remains << clock width
- *Predictions were wrong again!*
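
The arithmetic behind the old prediction, and the reason it failed, can be sketched in a few lines. The simple model here (capture probability equals transient width times clock frequency, capped at 1) is an assumption chosen to match the 50% and 100% figures on the slide; the function name is hypothetical.

```python
def edge_capture_probability(pulse_width_s, clock_hz):
    """Chance that a transient of the given width overlaps a clock edge.

    Simplified model: width divided by clock period, capped at 1.
    """
    return min(1.0, pulse_width_s * clock_hz)

old_500mhz = edge_capture_probability(1e-9, 500e6)   # 1 ns event, 500 MHz
old_1ghz = edge_capture_probability(1e-9, 1e9)       # 1 ns event, 1 GHz
new_1ghz = edge_capture_probability(100e-12, 1e9)    # 100 ps event, 1 GHz

print(f"1 ns event at 500 MHz: {old_500mhz:.0%}")
print(f"1 ns event at 1 GHz:   {old_1ghz:.0%}")
print(f"100 ps event at 1 GHz: {new_1ghz:.0%}")
```

With the re-measured event width, the same model gives a capture probability of at most about 10% at 1 GHz, so the predicted cliff does not appear.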

### Advanced Bulk CMOS for the Year 2008: L = 25 nm

- 50 nm lithography
- "Super-halo" doping to control short-channel effects
- ~ 20 mV threshold voltage fluctuation (1$\sigma$)
- Adaptable to partially depleted SOI



- Scaling for state of the art devices is less aggressive than projected
- Vcc is higher than projected
- Margins are higher than projected
- Fault probability is reduced from projected
- Predictions are wrong yet again!

### Mainstream Technologies for the Near Term

- SIA Roadmap Overestimates Scaling Progress
  - Highest performance devices are for very specialized applications
  - Realistic scaling projections require different tradeoffs
    - On-off ratio must be between  $10^4$  and  $10^6$
    - Power supply voltages for logic will plateau at 1 volt
  - Next technology node is 90 nm
- Benchmarks:
  - Commercial microprocessors
  - High-density memory technology
- SOI Is Becoming a Mainstream Technology: not foreseen
- *Predictions, once again, are wrong!*

## **Design Features of Power PC-Series Microprocessors**

| Device                 | Feature Size (µm) | Die Size (mm<sup>2</sup>) | Core Voltage (V) | Max. Operating Frequency (MHz) |
|------------------------|-------------------|---------------------------|------------------|--------------------------------|
| Motorola MPC750 (G3)   | 0.29              | 67                        | 2.5              | 350                            |
| IBM MPC750             | 0.22              | 40                        | 2.0              | 533                            |
| Motorola MPC7400 (G4)  | 0.20              | 83                        | 1.8              | 500                            |
| Motorola MPC7455 (SOI) | 0.18              | 106                       | 1.6              | 1000                           |
| IBM 750FX (SOI)        | 0.13              | 34                        | 1.4              | 1000                           |

- Note low core voltages of SOI parts
  - As we will see, this doesn't seem to hurt SEU performance

#### Test Results for Power PC Microprocessors



- SOI gives 1000x improvement in cross section, but little reduction in threshold due to "bipolar effect", which magnifies deposited charge
- Net effect: upset rate is significantly lower in SOI parts
- This is a good thing!

### **Scaling Trends for Cross Section**



- 4x reduction in feature size -> 10x reduction in cross section
- Figure includes core voltage reduction: low voltage not a problem!
- *This is a good thing!*
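
If the trend above is read as a power law, the "4x -> 10x" observation fixes the exponent. The sketch below works that out; treating the trend as a single power law across all feature sizes is an assumption made for illustration, and the function name is hypothetical.

```python
import math

# Exponent n such that (1/4) ** n == 1/10, i.e. about 1.66.
exponent = math.log(10) / math.log(4)

def cross_section_ratio(feature_ratio, n=exponent):
    """Relative cross section after scaling feature size by feature_ratio."""
    return feature_ratio ** n

quartered = cross_section_ratio(0.25)  # 4x smaller features
halved = cross_section_ratio(0.5)      # 2x smaller features

print(f"exponent: {exponent:.2f}")
print(f"4x shrink -> cross section x {quartered:.3f}")
print(f"2x shrink -> cross section x {halved:.3f}")
```

An exponent below 2 means the measured cross section shrinks somewhat more slowly than drawn area, which scales as the square of feature size.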

### <u>3-D Modeling of the Effect of Junction Area on Charge</u> <u>Collection from Alpha Particles</u>



As junction area scales down, collected charge will be reduced faster than predicted from classical charge collection depth analyses.

*This is a good thing!*

## Dependence of Neutron Soft Error Rate on Scaling



- Test data from commercial vendor parts
- Actual SEU Rate goes down with decreasing Vcc!
- Includes reduced cross section due to reduced geometries
- Corroborates and validates previous independent results
- *This is a good thing!*

#### Charge Multiplication in Partially Depleted SOI Devices



To take full advantage of SOI we need body ties to reduce lateral gain (bipolar effect). Commercial processes do not currently use these - we could do better and probably will in the future.

## Permanent Damage from Single Particles

- Micro-dose: single particle ionization -> permanent threshold shift
  - DRAMs or Low-Power Logic
    - Continues to be an issue, even for devices with thinner gates
    - Mechanisms in storage capacitors may be a factor
  - Flash Memories
    - Charge pump and high-voltage logic
      - Could be bypassed if vendor brought out terminals
    - Cell errors in multi-level storage technology
      - 8 level flash is on the drawing boards; this could be a problem
- Permanent Damage in Oxides
  - Leakage and worsening of hot electron damage, i.e., interaction with other non-radiation reliability mechanisms
  - Probably not an issue for conventional oxides and circuits
  - May be a factor for storage arrays used in long-duration space missions

# Conclusions

- Scaling Appears to Decrease SEU Sensitivity for Mainstream Technologies
  - Decreased charge collection efficiency
  - Influence of "hardening" commercial devices for atmospheric radiation
- Future Trends Are Difficult to Predict
  - "Most likely" 2008 device will probably follow current trends
  - Commercial SOI processes reduce cross section, but not LET threshold
  - Very low voltage or low power technologies may be more sensitive to SEU
- Complex Device Architectures and Circuit Designs Are "Wildcards"
  - Processors
    - High-speed device design is quite involved (dynamic logic)
    - Changes in cache design may affect error rate (e.g., partial use of DRAMs)
  - DRAMs
    - Architecture for high-speed devices
    - Storage capacitor design continues to evolve
    - Novel designs that work at low voltage
  - No good NVRAM options at this time
    - Chalcogenide
    - FERAM
    - MRAM

# Conclusions

- COTS based high performance computing onboard spacecraft in natural environments is quite doable.
- NVRAM remains an issue, but there are 3 potential solutions being developed for commercial application that we can leverage
  - MRAM, FERAM, Chalcogenide
- We need to continue testing and developing fault models
  - Previous predictions were wrong, this one may be as well
  - New effects continue to emerge and old ones reemerge
  - SEU Rate is a function of competing effects: hard to determine which will be dominant, or the operating point of a given future technology/design
  - Design sensitivity is still an issue: the best processes can be undermined by poor circuit and component designs
  - Permanent damage effects loom on the horizon
  - New materials and processes are coming at a rapid rate, e.g., SiGe
- But, for the foreseeable future, we should be ok....we think.