Syllabus For The Subject Architecture Design



1 Introduction


Evidence of soft errors

Types of soft errors

Cost-effective solutions to mitigate the impact of soft errors




Dependability models



Miscellaneous models

Permanent faults in complementary metal oxide semiconductor


Metal failure modes

Gate oxide failure modes

Radiation-induced transient faults in CMOS transistors

The alpha particle

The neutron

Interaction of alpha particles and neutrons with silicon crystals

Architectural fault models for alpha particle and neutron strikes

Silent data corruption and detected unrecoverable errors

Basic definitions: SDC and DUE

SDC and DUE budgets

Soft error scaling trends

SRAM and latch scaling trends

DRAM scaling trends


Historical anecdote



2 Device- and circuit-level modeling, management and mitigation


Modeling circuit-level SERs

Impact of alpha particle or neutron on circuit elements

Critical charge

Timing vulnerability factor

Masking effects in combinatorial logic gates

Vulnerability of clock circuits


Field data collection

Accelerated alpha particle tests

Accelerated neutron tests

Mitigation techniques

Device enhancements

Circuit enhancements


Historical anecdote



3 Architectural vulnerability analyses


AVF basics

Does a bit matter?

SDC and DUE equations

Bit-level SDC and DUE FIT equations

Chip-level SDC and DUE FIT equations


Case study: false DUE from lockstepped checkers

Process-kill versus system-kill DUE AVF

ACE principles

Types of ACE and Un-ACE bits

Point-of-strike model versus propagated fault model

Microarchitectural Un-ACE bits

Idle or invalid state

Misspeculated state

Predictor structures

Ex-ACE state

Architectural Un-ACE bits

NOP instructions

Performance-enhancing operations

Predicated false instructions

Dynamically dead instructions

Logical masking

AVF equations for a hardware structure

Computing AVF with Little’s law

Implications of Little’s law for AVF computation

Computing AVF with a performance model

Limitations of AVF analysis with performance models

ACE analysis using the point-of-strike fault model

AVF results from an Itanium 2 performance model

ACE analysis using the propagated fault model


Historical anecdote



4 Advanced architectural vulnerability analysis


Lifetime analysis of RAM arrays

Basic idea of lifetime analysis

Accounting for structural differences in lifetime analysis

Impact of working set size for lifetime analysis

Granularity of lifetime analysis

Computing the DUE AVF

Lifetime analysis of CAM arrays

Handling false-positive matches in a CAM array

Handling false-negative matches in a CAM array

Effect of cooldown in lifetime analysis

AVF results for cache, data translation buffer, and store buffer

Unknown components

RAM arrays

CAM arrays


Computing AVF using SFI into an RTL model

Comparison of fault injection and ACE analyses

Random sampling in SFI

Determining if an injected fault will result in an error

Case study of SFI

The Illinois SFI study

SFI methodology

Transient faults in pipeline state

Transient faults in logic blocks


Historical anecdote



5 Error coding techniques


Fault detection and ECC for state bits

Basics of error coding

Error detection using parity codes

Single-error correction codes

Single-error correct double-error detect code

Double-error correct triple-error detect code

Cyclic redundancy check

Error detection codes for execution units

AN codes

Residue codes

Parity prediction circuits

Implementation overhead of error detection and correction codes

Number of logic levels

Overhead in area

Scrubbing analysis

DUE FIT from temporal double-bit error with no scrubbing

DUE rate from temporal double-bit error with fixed-interval scrubbing


Historical anecdote



6 Fault detection via redundant execution


Sphere of replication

Components of the sphere of replication

The size of sphere of replication

Output comparison and input replication

Fault detection via cycle-by-cycle lockstepping

Advantages of lockstepping

Disadvantages of lockstepping

Lockstepping in the Stratus ftServer

Lockstepping in the Hewlett-Packard NonStop Himalaya architecture

Lockstepping in the IBM Z-series processors

Fault detection via RMT

RMT in the Marathon Endurance server

RMT in the Hewlett-Packard NonStop advanced architecture

RMT within a single-processor core

A simultaneous multithreaded processor

Design space for SMT in a single core

Output comparison in an SRT processor

Input replication in an SRT processor

Two techniques to enhance performance of an SRT processor

Performance evaluation of an SRT implementation

Alternate single-core RMT implementation

RMT in a multicore architecture

DIVA: RMT using specialized checker processor

RMT enhancements

Relaxed input replication

Relaxed output comparison

Partial RMT


Historical anecdote



7 Hardware error recovery


Classification of hardware error recovery schemes


Forward error recovery

Backward error recovery

Forward error recovery

Fail-over systems

DMR with recovery

Triple modular redundancy


Backward error recovery with fault detection before register commit

Fujitsu SPARC64 V: parity with retry

IBM Z-series: lockstepping with retry

Recovery in an SRT processor

ReVive: backward error recovery using global checkpoints

SafetyNet: backward error recovery using local checkpoints

Backward error recovery with fault detection after I/O commit


Historical anecdote



8 Software detection and recovery  


Fault detection using

Fault detection using software RMT

Error detection by duplicated instructions

Software-implemented fault tolerance

Configurable transient fault detection via dynamic binary translation

Fault detection using hybrid RMT

CRAFT: A Hybrid RMT Implementation

CRAFT evaluation

Fault detection using RVMs

Application-level recovery

Forward error recovery using software RMT and AN codes for fault detection

Log-based backward error recovery in database systems

Checkpoint-based backward error recovery for shared memory


OS-level and VMM-level recoveries









