

#### **Dependable Systems**

Faults, errors and failures: Classification and available models

Luca Cassano luca.cassano@polimi.it cassano.faculty.polimi.it/ds.html

Most of the material of these slides has been provided by Prof. Cristiana Bolchini, Politecnico di Milano, Italy

## TOPIC QUESTIONS

What are the problems we are trying to address?

What is the most suitable fault model?

#### **Dependability scenario**





## **Reliability terminology**

| Term    | Description                                                        |
|---------|--------------------------------------------------------------------|
| Fault   | A defect within the system                                         |
| Error   | A deviation from the required operation of the system or subsystem |
| Failure | The system fails to perform its required function                  |





#### Defects

Non-ideal (non-perfect) fabrication of system components

Examples:

- Thinner/thicker wires
- "Holed" transistors' gate, source and drain or wires
- ...

Defects are (of course) permanent, but...

...defects do not always cause a fault!



#### **Faults**

Events that cause a non-ideal (non-perfect) behavior of system components

Examples:

- Stress-induced wire breaks (permanent fault)
- Radiation-induced current pulses (transient fault)
- Interconnect malfunctions due to specific humidity conditions (recurrent/intermittent fault)
- ...



#### Bugs

Non-ideal (non-perfect) source code implementation

Examples:

- Coding errors
- Requirements misinterpretation
- Incomplete support from the OS, libraries or development tools
- ...



#### **Defects + Faults + Bugs**

#### Runtime activation may cause errors

Always keep in mind that defects, faults and bugs may stay silent

- They may not affect any component before the triggering condition occurs

We talk about **fault activation**



#### **Errors**

Any unexpected incorrect behavior of a **component/subsystem** 

Always keep in mind that errors may stay silent too

- They can be masked at any point between the fault location and the output of the system

We talk about **error propagation**
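Fault activation and error masking can be illustrated with a small sketch. The circuit `out = (a AND b) OR c` and the stuck-at-0 fault on its internal net `n` are hypothetical choices made only for this example; they do not come from the slides.

```python
from itertools import product

def circuit(a, b, c, stuck_n=None):
    """Hypothetical circuit out = (a AND b) OR c.

    `stuck_n` (an assumption of this sketch) forces the internal net n
    to a fixed value, modeling a fault on that net.
    """
    n = a & b
    if stuck_n is not None:
        n = stuck_n  # inject the fault on the internal net
    return n, n | c

def classify(a, b, c, stuck_n=0):
    """Compare the fault-free and the faulty circuit for one input vector."""
    good_n, good_out = circuit(a, b, c)
    bad_n, bad_out = circuit(a, b, c, stuck_n)
    if good_n == bad_n:
        return "silent"   # the fault is not activated by this input
    if good_out == bad_out:
        return "masked"   # error on n, masked before reaching the output
    return "failure"      # the error propagates to the output

summary = {}
for a, b, c in product((0, 1), repeat=3):
    summary.setdefault(classify(a, b, c), []).append((a, b, c))
```

In this sketch the fault stays silent for six of the eight input vectors, is masked when `c = 1` forces the output, and reaches the output only for `a = b = 1, c = 0`.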



#### **Failures**

Any unexpected incorrect behavior of the **entire system** 

Examples:

- The system produces an incorrect output (**functional failure**)
- The system produces an output (either correct or not) at the wrong time (**timing failure**)



## The "bathtub" viewpoint

System failures are due to:

- *Infant mortality*: random production defects, process variation...
- *Normal functioning*: constant random fault occurrence due to the working environment (radiation, heat, humidity...)
- *Wearout*: normal long-term use of the system that causes aging of the materials

#### The sum of these effects causes all system failures
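A minimal numerical sketch of the sum of the three effects. It assumes, as a common modeling choice not stated in the slides, that infant mortality and wearout follow Weibull hazard rates (shape below 1 and above 1 respectively) and that random faults have a constant rate. All parameter values are arbitrary.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_rate(t, random_rate=0.01):
    """Overall failure rate as the sum of the three contributions.

    All numeric parameters are illustrative assumptions.
    """
    infant = weibull_hazard(t, shape=0.5, scale=50.0)    # decreasing over time
    wearout = weibull_hazard(t, shape=5.0, scale=100.0)  # increasing over time
    return infant + random_rate + wearout

# Early life, useful life and end of life (arbitrary time units)
rates = {t: bathtub_rate(t) for t in (1, 30, 150)}
```

With these assumed parameters the rate is high at `t = 1`, lowest around `t = 30` and high again at `t = 150`, which gives the bathtub shape.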



Figure: the bathtub curve (horizontal axis: time)

#### Yield

During the production process it is impossible to completely avoid defects (the basic cause of faults and errors)

**Yield**: a measure of the fraction of functioning devices with respect to the entire production (manufacturing yield)
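As a sketch, the manufacturing yield is the ratio of functioning devices to produced devices. The second function uses the Poisson yield model, Y = exp(-A·D); this is a commonly used estimate that the slides do not introduce, so treat it as an assumption. The parameter names are illustrative.

```python
import math

def manufacturing_yield(functioning, produced):
    """Fraction of functioning devices over the entire production."""
    return functioning / produced

def poisson_yield(critical_area_cm2, defect_density_per_cm2):
    """Poisson yield estimate Y = exp(-A * D).

    Assumption of this sketch: defects are independent and uniformly
    distributed, A is the critical area and D the defect density.
    """
    return math.exp(-critical_area_cm2 * defect_density_per_cm2)

measured = manufacturing_yield(900, 1000)   # 900 good devices out of 1000
estimated = poisson_yield(0.5, 0.2)         # A = 0.5 cm^2, D = 0.2 defects/cm^2
```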



# Faults, errors and failures in digital circuits and systems


## **Fabrication defects & functioning faults**

**Fabrication defects**: introduced during component production causing faults, such as:

- Spot defects
- Systematic defects

**Functioning defects**: activated during the functional life of the device because of failure mechanisms, such as:

- Gate-oxide break
- Broken contacts
- Wear-out effects




#### **Spot defects**: due to impurities

- Missing material leads to open circuits (dust particles on the masks)
- Extra material leads to short circuits (dust particles on the silicon surface)

#### **Systematic defects**: usually occur in new processes (e.g., moving from 65nm to 45nm) and are solved over time

- Process variation (modifications in the transistors)
- Mask defects



CAUSES:

- Instabilities in the process conditions
  - random fluctuation in the actual environment
  - inaccuracies in the control or furnace
  - variation in the physical and chemical parameters of the material
- Human errors
  - mis-handling of the materials and of the furnace




- oxide breakdown
  - formation of pinhole defects due to insufficient oxygen at the interface of silicon (Si) and silicon dioxide (SiO2), chemical contamination, nitride cracking during field oxidation, and crystal defects

Oxide breakdown may also be due to the operational conditions, e.g., a large discharge through the oxide causes local breakdown







A physical defect causes a fault if its position and size are such that it produces an open or a short between two lines

#### Critical Area of a defect – CA

Given the diameter x of the defect (assumed to be constant and dependent on the technology), CA is the area where the defect has to occur in order to cause a fault



## Fabrication defects | extra material | scenario 1

120nm defect on 100nm wires with a 200nm pitch (center-to-center distance):

- Growing each wire by 60nm (the defect radius) gives a 20nm overlap between the grown wires
- If the center of a 120nm defect falls anywhere in this overlap area, a short between the wires occurs
- A defect centered outside this area does not cause a short



## Fabrication defects | extra material | scenario 2

140nm defect on 100nm wires with a 200nm pitch (center-to-center distance):

- Growing each wire by 70nm (the defect radius) gives a 40nm overlap between the grown wires
- If the center of a 140nm defect falls anywhere in this overlap area, a short between the wires occurs
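The overlap figures in the two extra-material scenarios follow from simple geometry: the overlap is the defect diameter minus the spacing between the wires. The sketch below reproduces the 20nm and 40nm values. The function names and the `run_length_nm` parameter (the length over which the two wires run in parallel) are assumptions introduced for illustration.

```python
def short_critical_width(defect_diameter_nm, wire_width_nm, pitch_nm):
    """Width of the band where a defect center causes a short.

    Growing both wires by the defect radius makes them overlap by
    (diameter - spacing) when the defect is larger than the spacing.
    """
    spacing_nm = pitch_nm - wire_width_nm
    return max(0, defect_diameter_nm - spacing_nm)

def short_critical_area(defect_diameter_nm, wire_width_nm, pitch_nm, run_length_nm):
    """Critical area for shorts between two parallel wires (hypothetical helper)."""
    width = short_critical_width(defect_diameter_nm, wire_width_nm, pitch_nm)
    return width * run_length_nm

scenario_1 = short_critical_width(120, 100, 200)  # 120nm defect
scenario_2 = short_critical_width(140, 100, 200)  # 140nm defect
harmless = short_critical_width(100, 100, 200)    # defect no larger than the spacing
```

A defect no larger than the 100nm spacing gives a critical width of zero, which matches the remark that layouts with open spaces are less susceptible to short defects.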



#### Fabrication defects | missing material | scenario 1



No current flow interruption



**POLITECNICO MILANO 1863** 

## Fabrication defects | missing material | scenario 2



No current flow interruption

Partial current flow interruption



## Fabrication defects | missing material | scenario 3



No current flow interruption

Partial current flow interruption

Complete current flow interruption




#### **Critical areas**

As defects grow in size, their Critical Areas increase

The rate of the critical area increase is dependent on the spacings in the layout

Layouts with open spaces are less susceptible to short defects

Dually, open defects that land on thin wires may more easily halt current flow than open defects that land on thick wires








# **Critical areas mitigation | wire spreading**

Custom standard cell design is done by hand by layout engineers, who can:

- decrease the critical area for short defects by spreading wires
- decrease the critical area for open defects by widening wires

The optimum balance, for a given area, depends on the defect density distributions for open defects and short defects...

...and of course on the cost!



### **Defects and faults**

Physical defects

- Fabrication defects (missing or extra material)
- Material degradation over time and/or environment, wear-out

Faults

A model of the incorrect behavior due to defects

Errors ...

Failures ...




Depending on the abstraction level at which we look at these defects, we may consider different fault models

Levels:

- Transistor
- Gate
- RTL/Module






The lower the considered abstraction level:

- The closer to the device material behavior
- The higher the accuracy
- The longer the analysis time



### **Different levels of abstraction**





According to the time duration we classify faults as:

- **Permanent**: once the fault occurs, it is always there and stable (caused by a defect rather than a disturbance)
- **Intermittent**: the fault occurs occasionally (unstable hardware or varying hardware states)
- **Transient**: the fault results from temporary environmental conditions



Testing

- Identification of defects after production/manufacturing (production test)
- Periodical health analysis (on-line self test)
- Fault detection/management/tolerance
  - Identification, management and masking of defects and faults occurring during the operational life of the device



#### Assumptions

When analysing digital circuits/systems these assumptions are considered:

- Single fault or single failure
  - Once a fault occurs, there is enough time to detect it before a second one may occur
- No faults on the primary inputs
  - The input data are always correct



| A | B | Out |
|---|---|-----|
| 0 | 0 | 1   |
| 0 | 1 | 1   |
| 1 | 0 | 1   |
| 1 | 1 | 0   |






Due to the previously discussed defects a transistor may be:

Transistor Stuck-on (or stuck-short)

- Always connecting

| A | B | Out |
|---|---|-----|
| 0 | 0 | 1   |
| 0 | 1 | ?   |
| 1 | 0 | 1   |
| 1 | 1 | 0   |





Transistor Stuck-on (or stuck-short)

- Always connecting

| A | B | Out |
|---|---|-----|
| 0 | 0 | 1   |
| 0 | 1 | 1   |
| 1 | 0 | 1   |
| 1 | 1 | ?   |





Due to the previously discussed defects a transistor may be:

Transistor Stuck-open

- Always interrupted






**Transistor level faults** 

Recap: the stuck- model

- Stuck-open: a single transistor is permanently stuck in the open state
- **Stuck-on** (**stuck-short**): a single transistor is permanently shorted irrespective of its gate voltage
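The stuck- model can be explored with a switch-level sketch of a CMOS NAND gate. The sketch assumes the usual structure: two PMOS transistors in parallel in the pull-up network and two NMOS transistors in series in the pull-down network. The transistor names `pA`, `pB`, `nA`, `nB` and the `"?"` marker for an indeterminate output are hypothetical.

```python
from itertools import product

def nand_switch_level(a, b, fault=None, previous_out=None):
    """Switch-level CMOS NAND with an optional single transistor fault.

    fault: (name, kind) with name in {"pA", "pB", "nA", "nB"} and
    kind in {"on", "open"}; these names are assumptions of this sketch.
    """
    conducts = {
        "pA": a == 0, "pB": b == 0,  # PMOS conduct when the gate is low
        "nA": a == 1, "nB": b == 1,  # NMOS conduct when the gate is high
    }
    if fault is not None:
        name, kind = fault
        conducts[name] = (kind == "on")  # stuck-on conducts, stuck-open never does
    pull_up = conducts["pA"] or conducts["pB"]      # parallel PMOS network
    pull_down = conducts["nA"] and conducts["nB"]   # series NMOS network
    if pull_up and pull_down:
        return "?"            # both networks conduct: indeterminate level
    if pull_up:
        return 1
    if pull_down:
        return 0
    return previous_out       # floating output keeps its previous value

def truth_table(fault=None):
    return {(a, b): nand_switch_level(a, b, fault)
            for a, b in product((0, 1), repeat=2)}

n_stuck_on = truth_table(("nA", "on"))
p_stuck_on = truth_table(("pA", "on"))
```

In this sketch a stuck-on NMOS driven by A gives the indeterminate output for A=0, B=1, while a stuck-on PMOS gives it for A=1, B=1. A stuck-open PMOS leaves the output floating for one input vector, so the gate returns `previous_out`, i.e. it behaves like a memory element.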



# **Logic level faults**

Single Stuck-at Fault (SSF)






# **SSF dictionary**

- A: SA0 / SA1
- B: SA0 / SA1
- C: SA0 / SA1
- d: SA0 / SA1
- e: SA0 / SA1
- f: SA0 / SA1
- g: SA0 / SA1
- O: SA0 / SA1






# **Logic level faults**

Equivalences among stuck-at faults may be identified to reduce the fault dictionary

- A: SA0 ≈ d: SA0 ≈ g: SA1
- C: SA0 ≈ f: SA1
- C: SA1 ≈ f: SA0 ≈ e: SA0 ≈ h: SA1
- ...
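Fault equivalence can be checked by brute force: two stuck-at faults are equivalent when they turn the circuit into the same faulty function. The circuit figure is not available here, so the sketch assumes that lines A and d feed a 2-input NAND gate with output g. This reading is consistent with the first equivalence listed above but is not confirmed by the slides.

```python
from itertools import product

LINES = ("A", "d", "g")  # assumed: g = NAND(A, d)

def nand_with_fault(a, d, fault=None):
    """Evaluate g = NAND(A, d) with an optional stuck-at fault (line, value)."""
    values = {"A": a, "d": d}
    if fault is not None and fault[0] in values:
        values[fault[0]] = fault[1]      # stuck-at on an input line
    g = 1 - (values["A"] & values["d"])
    if fault is not None and fault[0] == "g":
        g = fault[1]                     # stuck-at on the output line
    return g

def equivalence_classes():
    """Group the single stuck-at faults by their faulty truth table."""
    classes = {}
    for line, value in product(LINES, (0, 1)):
        signature = tuple(nand_with_fault(a, d, (line, value))
                          for a, d in product((0, 1), repeat=2))
        classes.setdefault(signature, []).append(f"{line}: SA{value}")
    return list(classes.values())

classes = equivalence_classes()
```

Under this assumption the six single stuck-at faults collapse into four classes, and one class contains A: SA0, d: SA0 and g: SA1, so a single test covers all three.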



**Physical Defect**

Undesired connection between two wires (L1 and L2 in the physical model)





**Timing faults**: a logic gate produces the right output but with an increased propagation delay

May cause T<sub>hold</sub>/T<sub>setup</sub> violations in flip-flops!
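A minimal sketch of how an increased propagation delay turns into a setup violation. It assumes the standard single-cycle constraint `clk_to_q + logic_delay + t_setup <= clock_period` between two flip-flops on the same clock. All delays are illustrative values in picoseconds.

```python
def setup_slack(clock_period, clk_to_q, logic_delay, t_setup):
    """Setup slack of a register-to-register path (all values in ps)."""
    return clock_period - (clk_to_q + logic_delay + t_setup)

def violates_setup(clock_period, clk_to_q, logic_delay, t_setup):
    """True when the data arrives too late for the capturing flip-flop."""
    return setup_slack(clock_period, clk_to_q, logic_delay, t_setup) < 0

# Fault-free path: 700 ps of combinational delay
nominal = setup_slack(1000, 100, 700, 150)
# Same path with a delay fault adding 100 ps to one gate
faulty = setup_slack(1000, 100, 800, 150)
```

The gate still computes the right value in the faulty case, but the result reaches the flip-flop 50 ps too late, so a wrong value may be sampled.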



### **Radiation induced faults**

Ionizing radiation is radiation with enough energy to transfer part of it to the particles it hits

As technology scales, microelectronic devices become increasingly sensitive to these effects



# Types

- **Single Event Effects (SEE)**: a measurable effect resulting from the deposition of energy from a single ionizing particle strike
- **Total Ionizing Dose (TID)**: a *cumulative* long-term ionizing damage, mostly due to protons and electrons



# **Single Event Effects**

#### SEEs can take many forms

- Single Event Transients (SETs)
- Single Event Upsets (SEUs)
  - Bit flips in memory cells
- Multiple Cell Upsets (MCUs)
- Single Event Latchups (SELs)
  - Energy from a charged particle leads to excessive supply power
- Single-Event Functional Interrupt (SEFI)

## Classification

Figure: classification of Single Event Effects (surviving labels: Recoverable, Non-recoverable)



### SETs & SEUs

A single particle hits either sequential or combinational logic



SEUs directly affect memory elements

SETs affect combinational logic, but their effect may propagate to memory elements



### **SEU in SRAM Memory**

Effect: bit-flip






# **Multiple Bit/Cell Upsets**

Distributed effect of the radiation, causing several adjacent memory cells to change their content

Independent SEUs affecting different cells within the design
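At the functional level, upsets are usually injected as bit flips on a stored word. The sketch below models an SEU as a single XOR on one bit and a multiple cell upset as a flip of a run of adjacent bits. The function names are hypothetical, and treating neighbouring bit positions as physically adjacent cells is a simplifying assumption.

```python
def inject_seu(word, bit):
    """Flip a single bit of `word` (Single Event Upset)."""
    return word ^ (1 << bit)

def inject_mcu(word, first_bit, count):
    """Flip `count` adjacent bits starting at `first_bit` (Multiple Cell Upset)."""
    mask = ((1 << count) - 1) << first_bit
    return word ^ mask

stored = 0b1010
after_seu = inject_seu(stored, 0)      # one bit flipped
after_mcu = inject_mcu(0b0000, 1, 2)   # two adjacent bits flipped
```

Flipping the same bit twice restores the original value, which matches the fact that rewriting the cell removes the effect of an upset.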



# **SET in combinational logic**

A transient pulse

It might get latched or not

- If it is latched: an erroneous value in the memory element(s)
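Whether a transient gets latched depends on its timing with respect to the clock. The sketch below assumes a simple model: the pulse is captured only if it overlaps the sampling window of the flip-flop, from `t_setup` before the clock edge to `t_hold` after it. Logical and electrical masking along the path are ignored. All names and values are illustrative.

```python
def set_is_latched(pulse_start, pulse_width, clock_edge, t_setup, t_hold):
    """True if the transient overlaps the flip-flop sampling window.

    Simplified temporal model (an assumption of this sketch): times share
    one arbitrary unit and the pulse is assumed to reach the flip-flop input.
    """
    window_start = clock_edge - t_setup
    window_end = clock_edge + t_hold
    pulse_end = pulse_start + pulse_width
    return pulse_start < window_end and pulse_end > window_start

# Clock edge at t = 1000, sampling window [950, 1030]
far_from_edge = set_is_latched(400, 100, 1000, 50, 30)
near_edge = set_is_latched(940, 40, 1000, 50, 30)
```

In this model a pulse that ends well before the clock edge leaves no trace, while the same pulse close to the edge can become an erroneous stored value.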



### **SEU in SRAM-based FPGAs**

Single-Event Upset

- affecting memory elements





SEU in bitstream

- modifies the functionality of the implemented system
- remains corrupted unless the bitstream is re-written (possibly only partially)

SEU in user memory elements

- corrupts the computed data
- a re-execution mitigates the effects



# **Single-Event Functional Interrupts**

Characterizes situations where the event affects a critical signal of the circuit:

- Clock or reset signal
- Control registers

Recovery:

- Refreshing the corrupted data
- Reloading the altered configuration, and possibly
- Power cycling the circuit



### **Aging effects**

Device degradation due to material stress, environmental harshness, wear-out

Instability at first (intermittent problems, performance degradation ...) and permanent faults as the final effect



# Main aging effects

- Electromigration EM
  - occurs in wires and vias as a result of the momentum transfer from electrons to the ions that make up the interconnect lattice, and leads to hard failures such as opens and shorts in metal lines
- Time-Dependent Dielectric Breakdown TDDB
  - related to the deterioration of the gate oxide layer. Gate current causes defects in the oxide, which eventually form a low-impedance path and cause the transistor to permanently fail
- Stress Migration SM
  - similar to electromigration
- Thermal Cycling TC
  - caused by thermal stress due to mismatched coefficients of thermal expansion for adjacent material layers; it causes the transistor to permanently fail

Reference: JEDEC Solid State Technology Division



# **Functional faults | error modeling**

The component, core, system has a different behavior with respect to the expected one

- The exact causes are not known because
  - It is not interesting
  - It is impossible
  - It is too expensive
  - ...
- The effects are the only thing to work on

Two important aspects:

- do not model errors that no fault can generate
- model all possible errors that faults can generate



# **Cross-layer fault/error models**

- Cross-layer reliability analysis is today one of the keywords for the research community
  - Try to combine solutions at different levels of abstraction
    - More precise solutions at lower abstraction levels
    - Faster and cheaper solutions at higher abstraction levels
- Need to propagate models from lower to higher levels
  - Abstract the fault model
    - Simplification needed to deal with the complexity of the system when described at higher abstraction levels
  - Find a match between the corresponding fault models at different abstraction levels



# Cross-layer fault/error models | 2

Example: faults in the CPU registers and functional errors while running an application

- A. Faults at transistor level are erroneous transistor output values
- B. Faults at gate level are erroneous gate output values
- C. Faults at RTL abstraction level:
  - Erroneous value in CPU registers (SEU, MCU, stuck-at, ...)
- D. Functional faults at CPU level are:
  - corruption of the execution of an instruction
  - corruption of a stored value
- E. Functional faults at program execution level are:
  - Erroneous execution of workflow
  - Data errors

Challenge: map the correspondence between A and E (or B and E or C and E) in order to work using E (easier) but still being able to estimate fault coverage, effectiveness of a technique, ...



# **Commonly adopted fault models**

| Fault model                       | Description                                                                                                         |
|-----------------------------------|---------------------------------------------------------------------------------------------------------------------|
| Transistor Stuck-open             | A failure in a pull-up or pull-down transistor of a CMOS gate, causing the device to exhibit a memory-like behavior |
| Transistor Stuck-on               | A transistor is always conducting                                                                                   |
| Single Stuck-at Fault (SSF)       | Line permanently at logic value 0 or 1                                                                              |
| Multiple Stuck-at Fault           | Several lines permanently at logic value 0 or 1                                                                     |
| Bridging Faults                   | Two or more independent lines assume the same logic value                                                           |
| Delay Fault                       | Signal delay on one or more paths                                                                                   |
| SEU                               | Single Event Upset                                                                                                  |
| MCU                               | Multiple Cell Upset                                                                                                 |
| Electromigration                  | Open/Short in metal lines                                                                                           |
| Time-Dependent Dielectric Breakdown | Transistors stop switching                                                                                        |
| Stress Migration                  | Transistors stop switching                                                                                          |
| Thermal Cycling                   | Transistors stop switching                                                                                          |



# Defects, faults, errors, failures

A physical production *defect* may result in a *fault*

A *fault*, when excited, may cause an observable *error*

An *error* is a difference between the correct behavior and the one caused by the presence of a fault in a subcomponent/subsystem

An *error* that propagates to a primary output of the system may cause a *failure*



# QUESTIONS

What are the problems we are trying to address?

What is the most suitable fault model?

# TOPICS

Existing fault models at different levels of abstraction

Time-related characterization of the fault

Platform-related faults