In this paper we present a study of the time cost added to grid computing by the use of a coordinated checkpoint/recovery fault-tolerance protocol. We aim to find a mathematical model that determines the suitable times at which an application should save its checkpoints, so as to minimize the finish time of a parallel application in a grid subject to faults and running a fault-tolerance protocol. We obtain this model by serially modeling the targeted errors, the execution environment, and the chosen fault-tolerance protocol, using Kolmogorov differential equations.
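As a hedged illustration of the kind of modeling such equations support (a minimal two-state availability sketch with an assumed failure rate λ and repair rate μ, not the paper's exact system), the Kolmogorov forward equations describe how the probability of a resource being up evolves over time:

\[
\frac{dP_{\mathrm{up}}(t)}{dt} = -\lambda P_{\mathrm{up}}(t) + \mu P_{\mathrm{down}}(t),
\qquad
\frac{dP_{\mathrm{down}}(t)}{dt} = \lambda P_{\mathrm{up}}(t) - \mu P_{\mathrm{down}}(t),
\]

which, with the initial condition that the resource starts up, solve to

\[
P_{\mathrm{up}}(t) = \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu}\, e^{-(\lambda+\mu)t}.
\]

State probabilities of this form can then feed a progress model of the application, which is what makes an expected finish time computable.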
In this research, we introduce two probabilistic mechanisms to certify parallel applications on a distributed architecture, assuming that no oracles are available for the certification to rely on. We also introduce a cost model for the two mechanisms and compare them.
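As a minimal sketch of how oracle-free probabilistic certification can be quantified (the symbols q, k, and ε are illustrative assumptions, not taken from the source): if a fraction q of the task results are erroneous and the certifier re-executes k tasks drawn independently and uniformly at random on trusted resources, the probability that every re-execution misses the errors is

\[
\Pr[\text{miss}] = (1-q)^{k} \le \varepsilon
\quad\Longleftrightarrow\quad
k \ge \frac{\ln \varepsilon}{\ln(1-q)},
\]

so the certification confidence can be tuned through the number of verifications alone, without any oracle.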
In this research, we are interested in parallel applications that are represented by a data-flow graph built dynamically during execution and that run in a widely distributed, heterogeneous, and dynamic environment. These applications use the principle of work stealing to distribute tasks among the processors, as sketched below.
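A minimal sketch of the work-stealing principle (the Worker class and its methods are hypothetical names for illustration; real runtimes use concurrent, lock-free deques rather than this sequential toy): each processor pops its own tasks from the bottom of its deque, and an idle processor steals the oldest task from the top of a randomly chosen victim's deque.

import collections
import random

class Worker:
    """One processor with its own double-ended task queue (illustrative)."""
    def __init__(self, wid, all_workers):
        self.wid = wid
        self.deque = collections.deque()
        self.all_workers = all_workers   # shared list used to pick steal victims

    def push(self, task):
        self.deque.append(task)          # own work is pushed at the bottom

    def next_task(self):
        if self.deque:
            return self.deque.pop()      # pop own work from the bottom (LIFO)
        return self.steal()              # deque empty: become a thief

    def steal(self):
        victims = [w for w in self.all_workers if w is not self and w.deque]
        if not victims:
            return None                  # no work left anywhere
        victim = random.choice(victims)  # random victim selection
        return victim.deque.popleft()    # steal the oldest task from the top

# Usage: worker 0 is overloaded, worker 1 is idle and steals from it.
workers = []
w0, w1 = Worker(0, workers), Worker(1, workers)
workers.extend([w0, w1])
for i in range(4):
    w0.push("task-%d" % i)
print(w1.next_task())                    # prints "task-0", stolen from w0

Popping locally from the bottom preserves locality for the owner, while stealing from the top takes the oldest tasks, which tend to correspond to the largest untraversed parts of the data-flow graph.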
In this paper, we introduce a continuous mathematical model to optimize the compromise between the overhead of the fault-tolerance mechanism and the impact of faults in the execution environment. The fault-tolerance mechanism considered in this research is a coordinated checkpoint/recovery mechanism, and the study is based on a stochastic model of the different performance criteria of a parallel application on a parallel and distributed environment.
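To make this compromise concrete (a first-order sketch in the style of Young's classical approximation, with assumed symbols: C the cost of one checkpoint, λ the failure rate, and τ the checkpoint interval; the paper's continuous model is more detailed): checkpointing more often raises the overhead rate C/τ, while checkpointing less often raises the expected rework per unit time, roughly λτ/2. Minimizing their sum

\[
H(\tau) = \frac{C}{\tau} + \frac{\lambda \tau}{2},
\qquad
H'(\tau^{\ast}) = 0 \;\Longrightarrow\; \tau^{\ast} = \sqrt{\frac{2C}{\lambda}},
\]

recovers the well-known square-root rule τ* = √(2·C·MTBF), which is the kind of optimum such a model characterizes.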