A continuous mathematical model of fault tolerance mechanisms for parallel applications


Abstract in English

In this paper, we introduce a continuous mathematical model for optimizing the trade-off between the overhead of a fault tolerance mechanism and the impact of faults in the execution environment. The fault tolerance mechanism considered is coordinated checkpoint/recovery, and the study is based on a stochastic model of different performance criteria of parallel applications in parallel and distributed environments.
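To make the trade-off concrete, the sketch below uses Young's classical first-order approximation for the checkpoint interval. It is not the model introduced in this paper, only a well-known simplified stand-in. It assumes a fixed checkpoint cost, exponentially distributed failures with a known mean time between failures (MTBF), and negligible restart time. The function and parameter names are hypothetical.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the checkpoint interval
    that minimizes expected overhead: sqrt(2 * C * M).
    Assumption: C is much smaller than M, restart time is ignored."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(interval, checkpoint_cost, mtbf):
    """First-order expected overhead per unit of useful work.
    First term: time spent writing checkpoints.
    Second term: work lost on a fault, on average half an interval,
    weighted by the failure rate 1 / mtbf."""
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    cost = 60.0       # seconds per checkpoint (illustrative value)
    mtbf = 86400.0    # one day between failures (illustrative value)
    best = young_interval(cost, mtbf)
    for interval in (best / 4, best, best * 4):
        frac = overhead_fraction(interval, cost, mtbf)
        print(f"interval={interval:9.1f}s  overhead={frac:.4f}")
```

Checkpointing too often inflates the first term, while checkpointing too rarely inflates the second. The interval returned by `young_interval` balances the two under the stated assumptions.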

