A Checkpoint/Recovery Model Based on Work Stealing for Grid Applications


Abstract in English

This study investigates fault tolerance in large distributed environments, such as grids and computer clusters, in order to find the most effective ways of handling errors caused by the crash of a machine in the environment or by a network disconnection, so that the application can continue running in the presence of faults. In this paper we study a model of the distributed environment and of the parallel applications that run within it. We then present a checkpoint mechanism that ensures the continuity of the work. It relies on a virtual representation of the application (macro dataflow) and is suited to applications that use a work-stealing algorithm to distribute tasks and that run in heterogeneous, dynamic environments. The mechanism adds only a small cost to the cost of parallel execution, which results from saving part of the work during fault-free execution. The study also provides a mathematical model for calculating the time complexity, that is, the cost, of the proposed mechanism.
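
To make the idea of combining work stealing with checkpointing concrete, the following is a minimal single-process simulation. It is a sketch under stated assumptions and not the mechanism proposed in the paper: the function name run_with_recovery, the crash_at fault injector, the in-memory "stable storage" dictionary, and the deterministic victim choice are all hypothetical and chosen only for illustration. It assumes that every change of task ownership is logged, that stable storage never fails, and that at least one worker survives.

```python
from collections import deque


def run_with_recovery(values, n_workers=3, crash_at=None):
    """Square every value using simulated work stealing with a task log.

    crash_at is a hypothetical fault injector: a (worker_id, round) pair at
    which that worker crashes and loses its local queue.
    Returns (results in input order, number of tasks recovered from the log).
    """
    queues = [deque() for _ in range(n_workers)]
    alive = [True] * n_workers
    log = {}        # simulated stable storage: task_id -> (owner, value)
    results = {}    # committed results: task_id -> value squared
    recovered = 0

    def assign(owner, task_id, value):
        queues[owner].append((task_id, value))
        log[task_id] = (owner, value)   # checkpoint the ownership change

    for task_id, value in enumerate(values):
        assign(0, task_id, value)       # all work starts on worker 0

    round_no = 0
    while len(results) < len(values):
        # Fault injection: the crashed worker's local queue is lost.
        if crash_at is not None and crash_at[1] == round_no and alive[crash_at[0]]:
            dead = crash_at[0]
            alive[dead] = False
            queues[dead].clear()
            survivor = alive.index(True)  # assumes at least one survivor
            # Recovery: hand every logged, unfinished task of the dead
            # worker to a surviving worker.
            for task_id, (owner, value) in list(log.items()):
                if owner == dead:
                    assign(survivor, task_id, value)
                    recovered += 1

        for w in range(n_workers):
            if not alive[w]:
                continue
            if queues[w]:
                # The owner works on the newest end of its own queue.
                task_id, value = queues[w].pop()
                results[task_id] = value * value
                del log[task_id]        # finished work needs no checkpoint
            else:
                # Idle worker: steal the oldest task from the busiest victim.
                # Victims keep at least one task so every round makes progress.
                victims = [v for v in range(n_workers)
                           if v != w and alive[v] and len(queues[v]) > 1]
                if victims:
                    victim = max(victims, key=lambda v: len(queues[v]))
                    task_id, value = queues[victim].popleft()
                    assign(w, task_id, value)
        round_no += 1

    return [results[i] for i in range(len(values))], recovered
```

Calling run_with_recovery([1, 2, 3, 4, 5, 6], crash_at=(1, 1)) crashes worker 1 in the second round, after it has stolen a task but before it has executed it; the task is then recovered from the log and executed by a survivor. In this sketch the fault-free overhead is one log write per task assignment or steal, which mirrors in spirit the abstract's remark that the cost comes from saving part of the work during fault-free execution. A real work-stealing scheduler would normally pick victims at random and run workers concurrently.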

