
RepTFD: Replay Based Transient Fault Detection

Published by: Lei Li
Publication date: 2012
Research field: Informatics Engineering
Paper language: English





Advances in IC process technology make future chip multiprocessors (CMPs) more and more vulnerable to transient faults. To detect transient faults, previous core-level schemes provide redundancy for each core separately. As a result, they may allow transient faults in the uncore parts, which consume over 50% of the area of a modern CMP, to escape detection. This paper proposes RepTFD, the first core-level transient fault detection scheme with 100% coverage. Instead of providing redundancy for each core separately, RepTFD provides redundancy for a group of cores as a whole. Specifically, it replays the execution of the checked group of cores on a redundant group of cores. By comparing the execution results between the two groups of cores, all malignant transient faults can be caught. Moreover, RepTFD adopts a novel pending-period-based record-replay approach, which can greatly reduce the number of execution orders that need to be enforced in the replay run. Hence, RepTFD incurs only 4.76% performance overhead in comparison to normal execution without fault tolerance, according to our experiments on the RTL design of an industrial CMP named Godson-3. In addition, RepTFD consumes only about 0.83% of the area of Godson-3, while requiring only trivial modifications to its existing components.
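To make the group-level replay-and-compare idea concrete, here is a minimal software sketch, assuming a toy model in which a "group run" is a nondeterministic interleaving of per-core writes whose order is recorded and then enforced during replay. All names (run_group, the order log, the injected bit flip) are hypothetical stand-ins for the hardware record-replay mechanism described above, not the Godson-3 implementation.

import random

def run_group(program, order_log=None, fault=False):
    # Execute a toy "program" (a list of (core, value) writes). If order_log is
    # given, replay in exactly that order; otherwise pick a nondeterministic
    # order and return it so the redundant group can replay it.
    order = order_log if order_log is not None else random.sample(range(len(program)), len(program))
    state = 0
    for idx in order:
        core, value = program[idx]
        state = state * 31 + value            # toy architectural state
        if fault and idx == order[-1]:
            state ^= 1                        # injected transient bit flip
    return state, order

program = [(core, value) for core in range(4) for value in (1, 2, 3)]   # 4 checked cores

golden, recorded_order = run_group(program)                   # record-run (checked group)
replayed, _ = run_group(program, order_log=recorded_order)    # replay-run (redundant group)
print("fault-free replay matches:", replayed == golden)

faulty, _ = run_group(program, order_log=recorded_order, fault=True)
print("transient fault detected:", faulty != golden)

The toy log above records every event only to keep the example short; the pending-period technique in the paper exists precisely to shrink the set of execution orders that must be recorded and enforced in the replay run.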




Read also

Micro- and nanosatellites have become popular platforms for a variety of commercial and scientific applications, but today they are considered suitable mainly for short and low-priority space missions due to their low reliability. In part, this can be attributed to their reliance upon cheap, low-feature-size, COTS components originally designed for embedded and mobile-market applications, for which traditional hardware-voting concepts are ineffective. Software-fault-tolerance concepts have been shown to be effective for such systems, but have largely been ignored by the space industry due to low maturity, as most have only been researched in theory. In practice, designers of payload instruments and miniaturized satellites are usually forced to sacrifice reliability in order to deliver the level of performance necessary for cutting-edge science and innovative commercial applications. Thus, we developed a software fault-tolerance approach based upon thread-level coarse-grain lockstep, which was validated using fault injection. To offer strong long-term fault coverage, our architecture is implemented as a tiled MPSoC on an FPGA, utilizing partial reconfiguration as well as mixed criticality. This architecture can satisfy the high performance requirements of current and future scientific and commercial space missions at very low cost, while offering the strong fault-coverage guarantees necessary for platform control even for missions of long duration. This architecture was developed for a 4-year ESA project; together with two industrial partners, we are developing a prototype that will then undergo radiation testing.
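As a rough illustration of thread-level coarse-grain lockstep (not the project's MPSoC implementation), the sketch below runs the same deterministic task on three software "tiles", compares their states only at coarse checkpoints, and majority-votes to flag a disagreeing tile. The task, checkpoint interval, and fault injection are assumptions made for the example.

from concurrent.futures import ThreadPoolExecutor
from collections import Counter

def task(tile_id, steps, flip_at=None):
    # Deterministic workload; states are exposed only at coarse checkpoints.
    state = 0
    checkpoints = []
    for i in range(steps):
        state += i * i
        if flip_at == i:
            state ^= 0x4                      # injected transient fault
        if i % 10 == 9:                       # coarse-grain checkpoint
            checkpoints.append(state)
    return checkpoints

with ThreadPoolExecutor(max_workers=3) as pool:
    # Tile 1 suffers an injected fault; tiles 0 and 2 stay correct.
    futures = [pool.submit(task, t, 30, 15 if t == 1 else None) for t in range(3)]
    results = [f.result() for f in futures]

for cp_idx, states in enumerate(zip(*results)):
    vote, _ = Counter(states).most_common(1)[0]
    disagreeing = [t for t, s in enumerate(states) if s != vote]
    if disagreeing:
        print(f"checkpoint {cp_idx}: tiles {disagreeing} disagree with voted value {vote}")

Comparing only at checkpoints, rather than cycle by cycle, is what makes the lockstep "coarse-grain" and keeps the runtime overhead low.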
One of the most studied forms of attack on cyber-physical systems is the replay attack. The statistical similarity of the replayed signal to the true observations makes the replay attack difficult to detect. In this paper, we address the problem of replay attack detection by adding a watermark to the control inputs and then performing resilient detection using a cumulative sum (CUSUM) test on the joint statistics of the innovation signal and the watermarking signal. We derive the expression for the Kullback-Leibler divergence (KLD) between the two joint distributions before and after the replay attack, which is asymptotically inversely proportional to the detection delay. We perform a structural analysis of the derived KLD expression and suggest a technique to improve the KLD for systems with relative degree greater than one. A scheme to find the optimal watermarking signal variance, for a fixed increase in the control cost, that maximizes the KLD under the CUSUM test is presented. We provide various numerical simulation results to support our theory. The proposed method is also compared with a state-of-the-art method.
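A hedged numerical sketch of the watermarking-plus-CUSUM idea follows, under an assumed scalar Gaussian model rather than the paper's system: the monitored residual stays uncorrelated with the private watermark during normal operation but becomes biased by it once measurements are replayed. The expected per-step drift of the CUSUM statistic under attack plays the role of the KLD, so the detection delay scales roughly like the threshold divided by that drift.

import numpy as np

rng = np.random.default_rng(0)
n, attack_start = 400, 200
sigma_v, sigma_w = 1.0, 0.5                   # innovation and watermark std-dev (assumed)

watermark = rng.normal(0, sigma_w, n)         # known private excitation
innovation = rng.normal(0, sigma_v, n)
# Normal operation: measurements reflect the current watermark.
# Replay (from attack_start on): returned measurements ignore the current watermark.
observed = innovation + np.where(np.arange(n) < attack_start, watermark, 0.0)

# CUSUM on the log-likelihood ratio of the residual r = observed - watermark:
# H0 (no attack): r ~ N(0, sigma_v^2); H1 (replay): r ~ N(-watermark, sigma_v^2).
r = observed - watermark
llr = (-(r + watermark) ** 2 + r ** 2) / (2 * sigma_v ** 2)
S, threshold = 0.0, 10.0
for k in range(n):
    S = max(0.0, S + llr[k])                  # CUSUM recursion with reset at zero
    if S > threshold:
        print(f"replay alarm at step {k} (attack started at {attack_start})")
        break

In this toy model the mean of llr after the attack is the per-sample KLD, so raising the watermark variance increases the KLD and shortens the detection delay, at the price of a higher control cost, which is the trade-off the paper optimizes.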
CMOS integrated chips at advanced technology nodes are becoming more vulnerable to various sources of faults, such as manufacturing imprecision, variations, and aging. Additionally, intentional fault attacks (e.g., high-power microwave, cybersecurity threats) and environmental effects (i.e., radiation) also pose reliability threats to integrated circuits. Though traditional hardware-redundancy-based techniques like Triple Modular Redundancy (TMR) and Quadded Logic (QL) mitigate the risk to some extent, they add huge hardware overhead and are not very effective. Truly polymorphic circuits, which are inherently capable of achieving multiple functionalities in a limited footprint, could enhance the fault resilience/recovery of circuits with limited overhead. We demonstrate a novel crosstalk-logic-based polymorphic circuit approach to achieve compact and efficient fault-resilient circuits. We show a range of polymorphic primitive gates and their usage in a functional unit. The functional unit is a single arithmetic circuit that is capable of delivering Multiplication/Sorting/Addition output depending on the control inputs. Using such polymorphic computing units in an ALU would imply that a correct path for functional output is possible even when two-thirds of the ALU is damaged. Our comparison results with respect to existing polymorphic techniques and CMOS reveal 28% and 62% reductions in transistor count, respectively, for the same functionalities. In conjunction with fault detection algorithms, the proposed polymorphic circuit concept can be transformative for fault-tolerant circuit design with minimum overhead.
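Purely as a behavioral illustration (nothing here models crosstalk logic or transistors), the following sketch shows what "one functional unit, multiple functions selected by control inputs" means; the two control signals and their encoding are a hypothetical choice for the example.

def polymorphic_unit(a, b, ctrl0, ctrl1):
    # Behavioral model: one block delivers multiplication, 2-input sorting,
    # or addition depending on the control inputs.
    if (ctrl0, ctrl1) == (0, 0):
        return a * b                    # multiplication mode
    if (ctrl0, ctrl1) == (0, 1):
        return (min(a, b), max(a, b))   # sorting mode
    return a + b                        # addition mode

print(polymorphic_unit(6, 3, 0, 0))     # 18
print(polymorphic_unit(6, 3, 0, 1))     # (3, 6)
print(polymorphic_unit(6, 3, 1, 0))     # 9

In hardware, the point is that these modes share the same physical footprint, so a damaged ALU can still be reconfigured to route a correct functional output through the surviving mode.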
The Network-on-Chip is a promising candidate for addressing communication bottlenecks in many-core processors and neural network processors. In this work, we consider the generalized fault-tolerance topology generation problem, where link or switch failures can happen, for application-specific network-on-chips (ASNoC). Given a user-defined number K, we propose an integer linear programming (ILP) based method to generate ASNoC topologies that can tolerate at most K faults in switches or links. Given the communication requirements between cores and their floorplan, we first propose a convex-cost-flow based method to solve a core mapping problem for building connections between the cores and switches. Second, an ILP-based method is proposed to allocate K+1 switch-disjoint routing paths for every communication flow between the cores. Finally, to reduce switch sizes, we propose sharing the switch ports for the connections between the cores and switches and formulate the port sharing problem as a clique-partitioning problem. Additionally, we propose an ILP-based method to simultaneously solve the core mapping and routing path allocation problems when considering physical link failures only. Experimental results show that the power consumption of fault-tolerant topologies increases almost linearly with K because of the routing path redundancy. When both switch faults and link faults are considered, port sharing reduces the average power consumption of fault-tolerant topologies with K = 1, K = 2, and K = 3 by 18.08%, 28.88%, and 34.20%, respectively. When considering only physical link faults, the experimental results show that, compared to the FTTG algorithm, the proposed method reduces power consumption and hop count by 10.58% and 6.25%, respectively; compared to the DBG-based method, it reduces power consumption and hop count by 21.72% and 9.35%, respectively.
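The sketch below illustrates only the K+1 switch-disjoint routing-path requirement, on a small assumed 2x3 switch mesh, using a greedy shortest-path search instead of the paper's ILP formulation. A greedy search can fail on instances the ILP solves, so this is an illustration of the constraint, not of the method.

from collections import deque

def bfs_path(adj, src, dst, banned):
    # Shortest path from src to dst that avoids 'banned' intermediate switches.
    prev = {src: None}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj[u]:
            if v not in prev and (v == dst or v not in banned):
                prev[v] = u
                q.append(v)
    return None

def switch_disjoint_paths(adj, src, dst, k):
    # Greedily collect up to k+1 paths sharing no intermediate switch.
    paths, banned = [], set()
    for _ in range(k + 1):
        p = bfs_path(adj, src, dst, banned)
        if p is None:
            break
        paths.append(p)
        banned.update(p[1:-1])        # forbid reuse of intermediate switches
    return paths

# Assumed 2x3 mesh of switches s0..s5; one flow enters at s0 and leaves at s5.
adj = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5], 3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}
for p in switch_disjoint_paths(adj, 0, 5, k=1):   # K = 1 -> 2 disjoint paths
    print(p)

With K+1 such paths reserved per flow, any K switch or link faults still leave at least one intact route, which is why the redundancy (and hence power) grows roughly linearly with K.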
Rui Yang, Meng Fang, Lei Han (2021)
Solving multi-goal reinforcement learning (RL) problems with sparse rewards is generally challenging. Existing approaches have utilized goal relabeling on collected experiences to alleviate issues raised by sparse rewards. However, these methods are still limited in efficiency and cannot make full use of experiences. In this paper, we propose Model-based Hindsight Experience Replay (MHER), which exploits experiences more efficiently by leveraging environmental dynamics to generate virtual achieved goals. Replacing original goals with virtual goals generated from interaction with a trained dynamics model leads to a novel relabeling method, model-based relabeling (MBR). Based on MBR, MHER performs both reinforcement learning and supervised learning for efficient policy improvement. Theoretically, we also prove that the supervised part in MHER, i.e., goal-conditioned supervised learning with MBR data, optimizes a lower bound on the multi-goal RL objective. Experimental results in several point-based tasks and simulated robotics environments show that MHER achieves significantly higher sample efficiency than previous state-of-the-art methods.
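A toy sketch of the model-based relabeling step follows, with an assumed linear "learned" dynamics model, a placeholder goal-conditioned policy, and a sparse reward; only the overall flow (imagined rollout from a stored transition, relabel with the imagined achieved goal, recompute the reward) mirrors the description above.

import numpy as np

rng = np.random.default_rng(1)

def learned_dynamics(state, action):
    # Stand-in for a trained model f(s, a) -> s'.
    return state + 0.1 * action + 0.01 * rng.normal(size=state.shape)

def achieved_goal(state):
    return state[:2]                     # assume the goal space is the first two state dims

def reward(goal, ag, eps=0.05):
    return 0.0 if np.linalg.norm(goal - ag) < eps else -1.0   # sparse reward

def model_based_relabel(transition, policy, horizon=3):
    state, action, _, next_state, _ = transition
    s = next_state
    for _ in range(horizon):             # imagined rollout from the stored transition
        s = learned_dynamics(s, policy(s))
    virtual_goal = achieved_goal(s)      # relabel with the imagined achieved goal
    new_r = reward(virtual_goal, achieved_goal(next_state))
    return (state, action, virtual_goal, next_state, new_r)

policy = lambda s: -0.5 * s              # placeholder goal-conditioned policy
t = (np.zeros(4), np.ones(4), np.array([1.0, 1.0]), 0.1 * np.ones(4), -1.0)
print(model_based_relabel(t, policy))

Unlike hindsight relabeling, which can only reuse goals actually reached later in the stored trajectory, the virtual goals here come from the dynamics model, which is what lets MHER squeeze more learning signal out of each stored transition.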