In this paper we present a study of the time cost added to grid computing by the use of a coordinated checkpoint/recovery fault tolerance protocol. We aim to find a mathematical model that determines the suitable times at which an application should save its checkpoints, so as to minimize the finish time of a parallel application running on a grid in the presence of faults and fault tolerance protocols. We obtain this model by successively modeling the target errors, the execution environment, and the chosen fault tolerance protocol by means of the Kolmogorov differential equations.
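To make this modeling approach concrete, consider a minimal sketch (our own illustration, with hypothetical failure rate λ and repair rate μ, not the paper's full model): a single machine that alternates between an up state and a down state, failing at rate λ and being repaired at rate μ. The Kolmogorov forward equations for the state probabilities of this two-state continuous-time Markov chain are

\[ \frac{dP_{\mathrm{up}}(t)}{dt} = -\lambda\, P_{\mathrm{up}}(t) + \mu\, P_{\mathrm{down}}(t), \qquad \frac{dP_{\mathrm{down}}(t)}{dt} = \lambda\, P_{\mathrm{up}}(t) - \mu\, P_{\mathrm{down}}(t). \]

Chaining systems of this kind over the error model, the execution environment, and the protocol yields the differential system whose solution gives the expected finish time.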
In this paper, we introduce a continuous mathematical model that optimizes the compromise between the overhead of the fault tolerance mechanism and the impact of faults in the execution environment. The fault tolerance mechanism considered in this research is a coordinated checkpoint/recovery mechanism, and the study is based on a stochastic model of different performance criteria of a parallel application running on a parallel and distributed environment.
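As a hedged illustration of this compromise (the classical first-order approximation attributed to Young, not the model derived in this paper), let C be the time to write one checkpoint and MTBF the mean time between failures; the checkpoint period that balances checkpoint overhead against work lost at a failure is approximately

\[ \tau_{\mathrm{opt}} \approx \sqrt{2\, C \cdot \mathrm{MTBF}}. \]

A short period wastes time writing checkpoints, a long period loses more work at each failure, and the optimum trades the two costs against each other.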
The study investigates fault tolerance in large distributed environments, such as grids and clusters of computers, in order to find the most effective ways to deal with the errors caused by the crash of one of the machines in the environment or by a network disconnection, so as to ensure the continuity of the application in the presence of faults. In this paper we study a model of the distributed environment and of the parallel applications running within it. We then provide a checkpoint mechanism that ensures continuity of the work: it relies on a virtual representation of the application (a macro dataflow graph) and is suitable for applications that use a work-stealing algorithm to distribute their tasks over a heterogeneous and dynamic environment.
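To fix ideas, the following minimal Python sketch shows one way such a coordinated checkpoint can block work-stealing workers and save a consistent global state (the Worker and Coordinator classes, the lock-based blocking, and the pickle serialization are our hypothetical assumptions, not the mechanism specified in this paper):

import pickle
import threading
from collections import deque

class Worker:
    def __init__(self, wid, tasks):
        self.wid = wid
        self.deque = deque(tasks)      # local tasks; thieves steal from the other end
        self.lock = threading.Lock()   # protects the deque during steals and snapshots

class Coordinator:
    def __init__(self, workers):
        self.workers = workers

    def checkpoint(self):
        # Coordinated protocol: block every worker first (acquire all locks),
        # then serialize every local deque, then release. Since no worker can
        # execute or steal while all the locks are held, the saved images form
        # a consistent global cut of the computation.
        for w in self.workers:
            w.lock.acquire()
        try:
            return [pickle.dumps((w.wid, list(w.deque))) for w in self.workers]
        finally:
            for w in self.workers:
                w.lock.release()

if __name__ == "__main__":
    workers = [Worker(i, [f"task-{i}-{j}" for j in range(3)]) for i in range(4)]
    images = Coordinator(workers).checkpoint()
    print(f"saved {len(images)} worker images")  # in a real system: stable storage

On recovery, each saved image would be deserialized to rebuild the remaining part of the macro dataflow, which is why only part of the work needs to be kept during fault-free execution.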
This mechanism adds only a small cost to the cost of parallel execution, since it saves part of the work during fault-free execution. The study also provides a mathematical model to calculate the time complexity, i.e. the cost, of this proposed mechanism.
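As a hedged sketch of the shape such a cost model can take (the symbols T, C and τ are our illustrative notation, not the paper's): if a fault-free execution lasts T, a checkpoint costs C, and checkpoints are taken every τ units of time, then roughly T/τ checkpoints are inserted and the fault-free execution time with the mechanism becomes

\[ T_{\mathrm{ckpt}} \approx T \left( 1 + \frac{C}{\tau} \right), \]

so the relative overhead C/τ is the small cost mentioned above, and it shrinks as the checkpoint period grows.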