ARCHITECTURE DESIGN
Contents
1 Introduction
Overview
Evidence of soft errors
Types of soft errors
Cost-effective solutions to mitigate the impact of soft errors
Faults
Errors
Metrics
Dependability models
Reliability
Availability
Miscellaneous models
Permanent faults in complementary metal oxide semiconductor technology
Metal failure modes
Gate oxide failure modes
Radiation-induced transient faults in CMOS transistors
The alpha particle
The neutron
Interaction of alpha particles and neutrons with silicon crystals
Architectural fault models for alpha particle and neutron strikes
Silent data corruption and detected unrecoverable errors
Basic definitions: SDC and DUE
SDC and DUE budgets
Soft error scaling trends
SRAM and latch scaling trends
DRAM scaling trends
Summary
Historical anecdote
References
2 Device- and circuit-level modeling, measurement, and mitigation
Overview
Modeling circuit-level SERs
Impact of alpha particle or neutron on circuit elements
Critical charge
Timing vulnerability factor
Masking effects in combinatorial logic gates
Vulnerability of clock circuits
Measurement
Field data collection
Accelerated alpha particle tests
Accelerated neutron tests
Mitigation techniques
Device enhancements
Circuit enhancements
Summary
Historical anecdote
References
3 Architectural vulnerability analyses
Overview
AVF basics
Does a bit matter?
SDC and DUE equations
Bit-level SDC and DUE FIT equations
Chip-level SDC and DUE FIT equations
False DUE AVF
Case study: false DUE from lockstepped checkers
Process-kill versus system-kill DUE AVF
ACE principles
Types of ACE and Un-ACE bits
Point-of-strike model versus propagated fault model
Microarchitectural Un-ACE bits
Idle or invalid state
Misspeculated state
Predictor structures
Ex-ACE state
Architectural Un-ACE bits
NOP instructions
Performance-enhancing operations
Predicated false instructions
Dynamically dead instructions
Logical masking
AVF equations for a hardware structure
Computing AVF with Little’s law
Implications of Little’s law for AVF computation
Computing AVF with a performance model
Limitations of AVF analysis with performance models
ACE analysis using the point-of-strike fault model
AVF results from an Itanium 2 performance model
ACE analysis using the propagated fault model
Summary
Historical anecdote
References
4 Advanced architectural vulnerability analysis
Overview
Lifetime analysis of RAM arrays
Basic idea of lifetime analysis
Accounting for structural differences in lifetime analysis
Impact of working set size for lifetime analysis
Granularity of lifetime analysis
Computing the DUE AVF
Lifetime analysis of CAM arrays
Handling false-positive matches in a CAM array
Handling false-negative matches in a CAM array
Effect of cooldown in lifetime analysis
AVF results for cache, data translation buffer, and store buffer
Unknown components
RAM arrays
CAM arrays
DUE AVF
Computing AVF using SFI into an RTL model
Comparison of fault injection and ACE analyses
Random sampling in SFI
Determining if an injected fault will result in an error
Case study of SFI
The Illinois SFI study
SFI methodology
Transient faults in pipeline state
Transient faults in logic blocks
Summary
Historical anecdote
References
5 Error coding techniques
Overview
Fault detection and ECC for state bits
Basics of error coding
Error detection using parity codes
Single-error correction codes
Single-error correct double-error detect code
Double-error correct triple-error detect code
Cyclic redundancy check
Error detection codes for execution units
AN codes
Residue codes
Parity prediction circuits
Implementation overhead of error detection and correction codes
Number of logic levels
Overhead in area
Scrubbing analysis
DUE FIT from temporal double-bit error with no scrubbing
DUE rate from temporal double-bit error with fixed-interval scrubbing
Summary
Historical anecdote
References
6 Fault detection via redundant execution
Overview
Sphere of replication
Components of the sphere of replication
The size of the sphere of replication
Output comparison and input replication
Fault detection via cycle-by-cycle lockstepping
Advantages of lockstepping
Disadvantages of lockstepping
Lockstepping in the Stratus ftServer
Lockstepping in the Hewlett-Packard NonStop Himalaya architecture
Lockstepping in the IBM Z-series processors
Fault detection via RMT
RMT in the Marathon Endurance server
RMT in the Hewlett-Packard NonStop Advanced Architecture
RMT within a single-processor core
A simultaneous multithreaded processor
Design space for SMT in a single core
Output comparison in an SRT processor
Input replication in an SRT processor
Two techniques to enhance performance of an SRT processor
Performance evaluation of an SRT implementation
Alternate single-core RMT implementation
RMT in a multicore architecture
DIVA: RMT using specialized checker processor
RMT enhancements
Relaxed input replication
Relaxed output comparison
Partial RMT
Summary
Historical anecdote
References
7 Hardware error recovery
Overview
Classification of hardware error recovery schemes
Reboot
Forward error recovery
Backward error recovery
Forward error recovery
Fail-over systems
DMR with recovery
Triple modular redundancy
Pair-and-spare
Backward error recovery with fault detection before register commit
Fujitsu SPARC64 V: parity with retry
IBM Z-series: lockstepping with retry
Recovery in an SRT processor
ReVive: backward error recovery using global checkpoints
SafetyNet: backward error recovery using local checkpoints
Backward error recovery with fault detection after I/O commit
Summary
Historical anecdote
References
8 Software detection and recovery
Overview
Fault detection using
Fault detection using software RMT
Error detection by duplicated instructions
Software-implemented fault tolerance
Configurable transient fault detection via dynamic binary translation
Fault detection using hybrid RMT
CRAFT: a hybrid RMT implementation
CRAFT evaluation
Fault detection using RVMs
Application-level recovery
Forward error recovery using software RMT and AN codes for fault detection
Log-based backward error recovery in database systems
Checkpoint-based backward error recovery for shared memory programs
OS-level and VMM-level recoveries
Summary
References