RepTFD: Replay Based Transient Fault Detection


الملخص بالإنكليزية

The advances in IC process make future chip multiprocessors (CMPs) more and more vulnerable to transient faults. To detect transient faults, previous core-level schemes provide redundancy for each core separately. As a result, they may leave transient faults in the uncore parts, which consume over 50% area of a modern CMP, escaped from detection. This paper proposes RepTFD, the first core-level transient fault detection scheme with 100% coverage. Instead of providing redundancy for each core separately, RepTFD provides redundancy for a group of cores as a whole. To be specific, it replays the execution of the checked group of cores on a redundant group of cores. Through comparing the execution results between the two groups of cores, all malignant transient faults can be caught. Moreover, RepTFD adopts a novel pending period based record-replay approach, which can greatly reduce the number of execution orders that need to be enforced in the replay-run. Hence, RepTFD brings only 4.76% performance overhead in comparison to the normal execution without fault-tolerance according to our experiments on the RTL design of an industrial CMP named Godson-3. In addition, RepTFD only consumes about 0.83% area of Godson-3, while needing only trivial modifications to existing components of Godson-3.

تحميل البحث