

Dependability Issues Due to Scaling Towards Nanometer Size Devices:



Aggressive and Adaptive Mitigation Techniques May Be Key for Solution

#### Arun K. Somani

Dependable Computing and Networking Laboratory Department of Electrical and Computer Engineering Iowa State University, Ames, IA, 50011

arun@iastate.edu



# **Technology Scaling**

- Every 30% downscaling of technology node
  - Transistor density doubles
  - Gate delay reduces 30%
  - Operating frequency improves 43%
  - Active power consumption halves
  - 65% energy savings
- Frequency scaling inhibited with recent generations
  - Low power requirements
  - Process variations
  - Reliability concerns
- High speed, low leakage requirements
  - Determines the choice of supply and threshold voltages
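The per-node percentages listed at the top of this slide follow from first-order arithmetic. A minimal sketch, assuming ideal constant-field scaling in which dimensions, supply voltage, and capacitance all shrink by the same factor `s = 0.7`; the function name and the scaling model are illustrative assumptions, not taken from the slides:

```python
def scaling_effects(s: float = 0.7) -> dict:
    """First-order effects of shrinking linear dimensions by a factor s.

    Assumes ideal constant-field scaling: capacitance, voltage and gate
    delay all scale with s (an idealization used only for illustration).
    """
    return {
        "density_gain": 1 / s**2,      # area per transistor scales as s^2
        "gate_delay": s,               # relative to the previous node
        "frequency_gain": 1 / s,       # inverse of gate delay
        "power_per_device": s**2,      # C * V^2 * f = s * s^2 * (1/s)
        "energy_saving": 1 - s**3,     # switching energy C * V^2 = s^3
    }


for name, value in scaling_effects().items():
    print(f"{name}: {value:.2f}")
```

With `s = 0.7` this gives roughly a 2x density gain, a 1.43x frequency gain, about half the active power per device, and about 65% lower switching energy.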



## How Is the Progress Holding Up?

- Drives semiconductor performance
- Enables newer technologies



Source: Intel

#### A Few Things Are Here to Stay

#### Leakage Power in MOSFETs

- Sufficient overdrive required for high speed switching
- Lower V<sub>T</sub> leads to more leakage
- Gate Leakage
  - Tunneling current through gate dielectric
  - High-k dielectrics used in 45nm technology
    - Arrest gate leakage



- Process variations increase with scaling
  - Random and systematic variations in delay, power, yield
  - $V_t \downarrow \rightarrow \text{Delay} \downarrow, L_{eff} \uparrow \rightarrow \text{Delay} \uparrow, V_{dd} \downarrow \rightarrow \text{Delay} \uparrow, T \uparrow \rightarrow \text{Delay} \uparrow$
- Thermal Variation
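The voltage terms in the delay relation above can be illustrated with the alpha-power-law delay model, in which delay is proportional to `Vdd / (Vdd - Vt)**alpha`. This model, the exponent `alpha = 1.3`, and the voltages used are assumptions chosen for illustration; the slides do not specify a delay model, and the effective channel length and temperature terms are not modeled here.

```python
def gate_delay(vdd: float, vt: float, alpha: float = 1.3, k: float = 1.0) -> float:
    """Alpha-power-law delay estimate in arbitrary units (illustrative model)."""
    if vdd <= vt:
        raise ValueError("supply voltage must exceed threshold voltage")
    return k * vdd / (vdd - vt) ** alpha


nominal = gate_delay(vdd=1.0, vt=0.30)
low_vt = gate_delay(vdd=1.0, vt=0.25)    # lower threshold: faster gate
low_vdd = gate_delay(vdd=0.9, vt=0.30)   # lower supply: slower gate

print(f"nominal={nominal:.3f} low_vt={low_vt:.3f} low_vdd={low_vdd:.3f}")
```

Under this model a lower threshold voltage shortens the delay and a lower supply voltage lengthens it, in line with the first and third relations listed.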



#### **Temperature Variations**




Original Source: Anirudh Devgan, IBM Research

#### **Challenges for Future Manufacturing**

- Ultimate limit: 0.3 nm (distance between silicon atoms)
  - Various barriers seen over time
  - Overcome with changes in material and process technology
- Degradation of performance with downscaling
  - Interconnect delay increases with increase in resistance and capacitance of narrow and dense metal lines
- Higher power consumption will continue to be a problem
- Unaffordable manufacturing cost for smaller sizes
  - Semiconductor companies moving towards fab-lite model
  - Yield ramp-up and time-to-market with newer technologies are taking longer





## What to Look Forward To?

- Error tolerance rather than avoidance
- Built-in fault tolerance for all designs
- Selective replication instead of full-scale redundancy
- Design adaptability
  - Key for low overhead solutions
- Design optimizations
  - Dynamic schemes
  - Possible through speculation



## Reliable Overclocking (Aggressive Designs)



- Typically, the clock period is determined by the maximum delay from A to B, which depends on the physical implementation, the operating environment, and temperature and supply voltage variations
- Traditionally, worst-case delays are assumed
  - Results in an overly conservative clock period
- Pipelined processor
  - Longest/slowest stage limits the period of the entire pipeline
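The pipeline constraint can be sketched in a few lines. The stage names and delays below are hypothetical values chosen for illustration, not measurements from the talk:

```python
def pipeline_clock_period(stage_delays_ns, margin_ns: float = 0.0) -> float:
    """Clock period of a synchronous pipeline: slowest stage plus any guard band."""
    return max(stage_delays_ns) + margin_ns


# Hypothetical worst-case stage delays in nanoseconds.
stages = {"fetch": 0.8, "decode": 0.6, "execute": 1.1, "memory": 1.0, "writeback": 0.5}

period = pipeline_clock_period(stages.values())
slowest = max(stages, key=stages.get)
print(f"period = {period} ns, limited by the {slowest} stage")
```

Every stage other than the slowest one finishes early each cycle, which is the slack that better-than-worst-case designs try to recover.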

# Reliable Overclocking (Aggressive Designs) – Contd.

- Problem to address in nanometer design space
  - Provide high performance by exploiting PVT variations
  - Enhance system dependability with low cost solutions
- Clock beyond worst case delay, relying on data dependent delays
- Timing errors may occur at overclocked speeds
- Aggressive, but reliable, design methodologies employ relevant timing error detection and recovery schemes
  - Razor [MICRO'03], Sprite [DSN'07]
- Performance improvement of 15-20% with an error rate below 1%
  - Safety-critical systems and real-time constraints are supported
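One common illustration of data-dependent delay is carry propagation in a ripple-carry adder: the worst case ripples through every bit, but typical operands produce much shorter chains. The adder is an assumed example, not a circuit named in the slides, and the function names are hypothetical.

```python
import random


def longest_carry_chain(a: int, b: int, width: int = 32) -> int:
    """Longest run of bit positions a single carry ripples through in a + b."""
    longest = current = carry = 0
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        if x & y:                 # carry generated at this bit
            current = 1
        elif (x ^ y) and carry:   # incoming carry propagated onward
            current += 1
        else:                     # carry killed, or nothing to propagate
            current = 0
        carry = (x & y) | (carry & (x ^ y))
        longest = max(longest, current)
    return longest


def average_chain(samples: int = 2000, width: int = 32, seed: int = 0) -> float:
    """Mean longest carry chain over random operand pairs."""
    rng = random.Random(seed)
    total = sum(
        longest_carry_chain(rng.getrandbits(width), rng.getrandbits(width), width)
        for _ in range(samples)
    )
    return total / samples


print(longest_carry_chain(0xFFFFFFFF, 1))   # worst case: carry crosses all 32 bits
print(average_chain())                      # typical operands: far shorter
```

A clock period sized for the full-width chain is rarely needed, which is why clocking beyond the worst-case delay produces timing errors only occasionally.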



#### Why Past Solutions Are Not Acceptable

- Traditional techniques
  - TMR solutions incur high cost and performance penalty
  - Dual latching dynamic optimization uses less area
  - False positives and high penalty for error recovery are concerns
  - Static power vs. dynamic power
    - Both are comparable for today's technology
    - Thus logic replication is not a viable alternative
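For reference, the voting step in TMR can be sketched as a bitwise majority over three copies of a result. This is the textbook voter, shown only to make the cost visible; the values are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (b & c) | (a & c)


golden = 0b10110010
upset = golden ^ 0b00001000          # one copy suffers a single-bit flip

print(majority_vote(golden, upset, golden) == golden)
```

Masking one faulty copy requires three full copies of the logic plus the voter, so area and both static and dynamic power roughly triple.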







# Offering More Design Features with Added Redundancy

- Soft Error Mitigation, SEM [DSN'09]
  - Circuit-level speculation, local recovery, no false positives, high fault coverage (like TMR, tolerates both SEU and SET)
  - No performance overhead: operating frequency $f_{sys} \leq 1/t_{pd}$
- Soft and Timing Error Mitigation, STEM [DSN'09]
  - Like SEM, but detects and corrects timing errors
  - Can be deployed in aggressive system designs
  - Timing speculation, like overclocking [DSN'07] and DVS [MICRO'03]





# **Dynamic Frequency Scaling**

- Clock frequency is scaled while satisfying the error rate constraint
- Limits of DFS
  - $F_{MIN}$ (minimum possible frequency)
    - Set by worst-case design settings
  - $F_{MAX}$ (maximum possible frequency)
    - Set by the minimum clock period $T_{MIN}$, as shown in equation (11)

$$T_{CD} \ge D_2 \qquad (9)$$

$$D_2 - D_1 \ge T_{PW} \qquad (10)$$

$$T_{MIN} + D_1 \ge T_{PD} \qquad (11)$$



- $T_{CD}$ = Contamination delay of the logic circuit
- $T_{PD}$ = Propagation delay of the logic circuit
- $T_{PW}$ = Expected soft error/noise pulse width
- $D_1$ = Phase shift between CLK<sub>1</sub> and CLK<sub>2</sub>
- $D_2$ = Phase shift between CLK<sub>2</sub> and CLK<sub>3</sub>
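A candidate clock configuration can be checked against these limits with a small helper. This is a sketch under stated assumptions: the pulse-width constraint is read as $D_2 - D_1 \ge T_{PW}$, the class and method names are hypothetical, and the numbers are illustrative rather than taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class ClockTiming:
    """Timing parameters in a common unit (for example ns); values are illustrative."""
    t_cd: float   # contamination delay of the logic
    t_pd: float   # propagation delay of the logic
    t_pw: float   # expected soft error/noise pulse width
    d1: float     # phase shift D_1
    d2: float     # phase shift D_2

    def min_period(self) -> float:
        """Smallest clock period allowed by T_MIN + D_1 >= T_PD."""
        return self.t_pd - self.d1

    def violations(self, period: float) -> list:
        """Names of the constraints that a given clock period breaks."""
        broken = []
        if self.t_cd < self.d2:
            broken.append("T_CD >= D_2")
        if self.d2 - self.d1 < self.t_pw:      # assumed reading of the constraint
            broken.append("D_2 - D_1 >= T_PW")
        if period + self.d1 < self.t_pd:
            broken.append("T + D_1 >= T_PD")
        return broken


timing = ClockTiming(t_cd=0.5, t_pd=2.0, t_pw=0.125, d1=0.25, d2=0.5)
print(timing.min_period())        # shortest period that still satisfies (11)
print(timing.violations(1.5))     # overclocked past the limit
```

The maximum frequency then follows as the reciprocal of `min_period()`, so a larger phase shift $D_1$ leaves more room for frequency scaling as long as the other two constraints still hold.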

#### **Pipeline Design**

- Using STEM
  - Input clocks are constrained to provide fault tolerance
  - Extra buffer stage ensures that only "gold" data is written to memory
- Stage error signal: Generated from error signal in that stage
- Global error signal is generated from all stages
- Error rates are monitored and used by clock unit
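The feedback loop between the error signals and the clock unit might look like the sketch below. The step-up/step-down policy, the class name, and all numeric values are hypothetical; the slides state only that error rates are monitored and used by the clock unit.

```python
def global_error(stage_errors) -> bool:
    """Global error signal: asserted when any stage reports an error."""
    return any(stage_errors)


class ClockController:
    """Hypothetical error-rate-driven clock tuning policy."""

    def __init__(self, f_min, f_max, step, target_error_rate):
        self.f_min, self.f_max = f_min, f_max
        self.step = step
        self.target = target_error_rate
        self.freq = f_min          # start at the worst-case (safe) frequency

    def update(self, window):
        """window: one tuple of per-stage error flags for each observed cycle."""
        if not window:
            raise ValueError("window must contain at least one cycle")
        errors = sum(global_error(cycle) for cycle in window)
        error_rate = errors / len(window)
        if error_rate > self.target:
            self.freq = max(self.f_min, self.freq - self.step)   # back off
        else:
            self.freq = min(self.f_max, self.freq + self.step)   # speed up
        return self.freq


clean = [(False, False, False)] * 100
noisy = [(False, True, False)] * 5 + clean[:95]

ctrl = ClockController(f_min=1.0, f_max=1.25, step=0.125, target_error_rate=0.01)
for window in (clean, clean, clean, noisy):
    print(ctrl.update(window))
```

The frequency stays clamped between the two DFS limits, rising while the observed error rate stays at or below the target and falling once it exceeds it.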



#### **Performance Analysis**

- Limiting factor for frequency scaling
  - As frequency is scaled up, the number of input combinations whose delay exceeds the new clock period increases

$$N \times t_{ov} + n \times N \times k \times t_{ov} < N \times t_{wc}$$

$$k < \frac{t_{wc} - t_{ov}}{n \times t_{ov}}$$

- $t_{wc}$: worst-case clock period
- $t_{ov}$: overclocked clock period
- $n$: number of cycles to recover
- $N$: total cycles required
- $k$: error rate

#### For STEM Cells

- With a 15% increase in frequency, the error rate must exceed 5.76% before the performance improvement is lost
- For error rates < 1%, a 2.6% increase in frequency is required to compensate for the penalty paid for error correction
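The break-even condition can be evaluated numerically. In the sketch below the recovery penalty `n = 2.6` cycles is an inferred value, chosen because it approximately reproduces both figures quoted above; the slides do not state `n`, and the function names are hypothetical.

```python
def break_even_error_rate(freq_gain: float, recovery_cycles: float) -> float:
    """Largest error rate k that still pays off: (t_wc - t_ov) / (n * t_ov).

    freq_gain is the fractional frequency increase, so t_wc / t_ov = 1 + freq_gain.
    """
    return freq_gain / recovery_cycles


def required_freq_gain(error_rate: float, recovery_cycles: float) -> float:
    """Fractional frequency increase needed to offset recovery at error rate k."""
    return error_rate * recovery_cycles


def speedup(freq_gain: float, error_rate: float, recovery_cycles: float) -> float:
    """Ratio N * t_wc / (N * t_ov * (1 + n * k)); above 1 means a net gain."""
    return (1 + freq_gain) / (1 + recovery_cycles * error_rate)


N_RECOVERY = 2.6   # assumed average recovery penalty in cycles (inferred, not stated)

print(f"{break_even_error_rate(0.15, N_RECOVERY):.2%}")
print(f"{required_freq_gain(0.01, N_RECOVERY):.2%}")
print(f"{speedup(0.15, 0.01, N_RECOVERY):.3f}")
```

At the break-even error rate the speedup is exactly 1, so any lower error rate yields a net performance gain from overclocking.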



#### **Three Interdependent Concerns**

- Performance
  - Device scaling
  - Architectural innovations
  - Better-than-worst-case designs
- Dependability
  - Soft errors, silicon defects
  - Fault mitigation techniques
- Power Consumption
  - Low power design
  - Adaptive control mechanisms
- All managed through aggressive design methodology

