Overlay multicast, also called Application-Level Multicast (ALM), constructs a multicast delivery tree among end hosts. Unlike traditional IP multicast, where the internal tree nodes are dedicated routers that are relatively stable and do not leave the multicast
tree voluntarily, the non-leaf nodes in an overlay tree are free end hosts that can join or leave the overlay at will, or even crash without notification. A departing node can thus leave suddenly, giving its descendants (and the Rendezvous Point (RP)) no time to prepare the reconnection of the overlay tree; a rearrangement process must therefore be triggered in which each of its descendants rejoins the overlay tree. Until that happens, all of the departed node's downstream nodes are partitioned from the overlay tree and can no longer receive the multicast data. These dynamic characteristics make the overlay tree unstable, which can significantly impact the user.
A key challenge in constructing an efficient and resilient ALM protocol is to provide fast data recovery when overlay node failures partition the data delivery paths. In this paper, we analyze the performance of ALM tree recovery solutions using different metrics.
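The rearrangement process described above can be illustrated with a minimal sketch. This is not any specific ALM protocol: the class names and the naive "reattach every orphan directly under the RP" policy are assumptions chosen for brevity.

```python
# Minimal sketch of overlay-tree recovery: when a non-leaf host departs,
# each of its orphaned children rejoins the tree via the Rendezvous
# Point (RP). The reattachment policy here is deliberately naive.

class Node:
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

class OverlayTree:
    def __init__(self, rp):
        self.rp = rp  # the Rendezvous Point acts as the tree root

    def attach(self, child, parent):
        child.parent = parent
        parent.children.append(child)

    def handle_departure(self, node):
        """Rearrangement: every orphaned child rejoins through the RP."""
        if node.parent:
            node.parent.children.remove(node)
        orphans = list(node.children)
        for child in orphans:
            child.parent = None
            self.attach(child, self.rp)  # naive policy: rejoin at the RP
        return orphans

rp = Node("RP")
tree = OverlayTree(rp)
a, b, c = Node("A"), Node("B"), Node("C")
tree.attach(a, rp)
tree.attach(b, a)
tree.attach(c, a)
rejoined = tree.handle_departure(a)   # A crashes; B and C rejoin via RP
print([n.name for n in rejoined])     # ['B', 'C']
```

A real protocol would instead pick a new parent that keeps the tree shallow and balanced; reattaching everything at the RP only shows the partition-and-rejoin cycle itself.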
Much research has focused on the reliability of Wireless Sensor Networks
(WSNs) used in various applications, especially in the early detection of forest fires, to ensure
the reliability of warning alarms sent by the sensors and to reduce the average
rate of false warnings.
In this research we evaluate the reliability of a WSN used for early fire
detection, mainly in a fir and cedar preserve. We design a hybrid WSN matching
the terrain of the preserve and model it using OPNET 14.5. We
study several scenarios with an increasing fraction of the network failing as
fire breaks out and spreads, starting at 0%, and compare the simulation results with the
results of the mathematical reliability equations for the same scenarios. In
addition, we calculate the final availability by proposing a mechanism that
improves WSN reliability using redundancy, i.e., adding spare sensor nodes that
replace the ones damaged by fire. The results show a remarkable
increase in reliability. We also predict the reliability of the designed network
for different reliability values of the nodes used, by means of one of the
reliability tools, the Reliability Block Diagram.
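The redundancy mechanism and the block-diagram evaluation can be illustrated with a small sketch. The topology and the node reliabilities below are hypothetical examples, not the network or values studied in the thesis.

```python
# Reliability of a simple series-parallel block diagram (illustrative
# only; the topology and node reliabilities are hypothetical).

def series(*rs):
    """Reliability of blocks in series: all blocks must work."""
    out = 1.0
    for r in rs:
        out *= r
    return out

def parallel(*rs):
    """Reliability of redundant (parallel) blocks: at least one works."""
    out = 1.0
    for r in rs:
        out *= (1.0 - r)
    return 1.0 - out

# A cluster of 3 sensing positions, each backed by one spare node,
# feeding a relay node and a sink in series.
sensor = parallel(0.90, 0.90)          # primary node + spare
cluster = series(sensor, sensor, sensor)
system = series(cluster, 0.95, 0.99)   # relay, sink

print(f"sensor with spare:  {sensor:.4f}")   # 0.9900 vs 0.9000 alone
print(f"system reliability: {system:.4f}")
```

This shows the effect the abstract reports qualitatively: adding a spare raises a 0.90 node to an effective 0.99, which propagates through the series blocks of the diagram.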
The increasing reliance on network systems in day-to-day activities requires that they
provide available and reliable services. Jgroup provides available service by creating
multiple replicas of the same service on multiple devices. Jgroup achieves reliable service
by maintaining shared state between the replicas and coordinating their activities
through Remote Method Invocation. Unlike Jgroup, JavaGroups uses message passing to
implement coordination between the replicas.
In this paper, we compare Jgroup and JavaGroups under different Group Method
Invocation modes: Anycast and Multicast in Jgroup, and GET_FIRST and
GET_ALL in JavaGroups.
This paper also improves the performance of ARM (Autonomous Replication
Management), which is embedded with Jgroup (Jgroup/ARM) to support fault
tolerance, by finding a new solution for handling group failure, in which all remaining
replicas fail in rapid succession. In the new solution, only one replica (the group leader)
issues renew events (IamAlive) periodically, instead of every replica in the
group sending them, while the Replication Manager still takes the same period to discover group failure.
Comparison results show that JavaGroups is faster than Jgroup when a single
replica is used, whereas Jgroup outperforms JavaGroups as the number of
replicas increases. The invocation delay in JavaGroups increases noticeably with the size
of the array passed to the invoked method, which makes JavaGroups unsuitable for
applications that exchange large amounts of data and use a large number of servers,
whereas Jgroup remains suitable for such applications.
Results also show that the new proposal reduces the number of renew events to at
most 37.5% of their original number, while Jgroup/ARM takes approximately the same period of time to discover group
failure as in Meling's solution.
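The leader-only renew scheme can be sketched by counting events over a fixed detection window. The replica counts and period below are hypothetical parameters, not the paper's measurements.

```python
# Illustrative count of periodic renew (IamAlive) events: in the
# original scheme every replica sends one per period; in the proposed
# scheme only the group leader does. Numbers here are hypothetical.

def renew_events(replicas, periods, leader_only):
    senders = 1 if leader_only else replicas
    return senders * periods

periods = 10
for n in (2, 3, 4):
    old = renew_events(n, periods, leader_only=False)
    new = renew_events(n, periods, leader_only=True)
    print(f"{n} replicas: {new}/{old} renew events ({new / old:.0%})")
```

The relative saving grows with group size (50% of the events remain for two replicas, 25% for four), which is the qualitative effect behind the reduction the abstract reports.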
This study investigates fault tolerance in large distributed
environments, such as grid computing and clusters of computers, in
order to find the most effective ways to deal with the errors
caused by the crash of one of the devices in the environment, or by
network disconnection, so as to ensure the continuity of the application in
the presence of faults. In this paper we study a model of the
distributed environment and of the parallel applications within it. We
then provide a checkpoint mechanism that ensures
continuity of the work; it uses a virtual representation of the
application (a macro dataflow) and is suitable for applications
that use a work-stealing algorithm to distribute tasks
in heterogeneous and dynamic environments.
This mechanism adds a small cost to the cost of parallel
execution, as a result of saving part of the work during fault-free
execution. The study also provides a mathematical model to
calculate the time complexity, i.e., the cost of the proposed
mechanism.
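The trade-off described above, paying a small fault-free cost to avoid losing all work on a crash, can be sketched with a simple periodic checkpoint loop. This is only an illustration; it is not the thesis's macro-dataflow representation or its work-stealing integration.

```python
# Minimal periodic-checkpoint sketch: the loop saves its state every
# few steps (the fault-free overhead), and after a simulated crash
# execution resumes from the last checkpoint instead of from scratch.

import copy

def run(total_steps, checkpoint_every, crash_at=None):
    state = {"step": 0, "acc": 0}
    checkpoint = copy.deepcopy(state)
    restarted = False
    while state["step"] < total_steps:
        if state["step"] == crash_at and not restarted:
            state = copy.deepcopy(checkpoint)   # recover saved work
            restarted = True
            continue
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            checkpoint = copy.deepcopy(state)   # fault-free overhead
    return state

# Crash at step 7: only steps 6 (after the checkpoint at step 6) are
# redone, and the final result matches the fault-free run.
print(run(total_steps=10, checkpoint_every=3, crash_at=7)["acc"])  # 45
```

Only the work since the last checkpoint is repeated after the fault; the deep copies stand in for the "part of the work kept during fault-free execution" that the abstract mentions.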
Application-Level Multicast networks are easy to deploy: they require no
change in the network layer. Data is delivered over an overlay tree built
from unicast connections between end hosts. These hosts
are free to join or leave whenever they want, or even to leave without telling any other
node, which separates the children of the departed node from the tree and forces them
to rejoin; until they do, these nodes are partitioned from the overlay tree and
cannot receive the data. This distorts the constructed tree and causes the
loss of several packets, which can significantly impact the user.
One of the key challenges in building an efficient and effective overlay
multicast protocol is to provide a robust mechanism that overcomes the sudden departure of a
node from the overlay tree without a significant impact on the performance of the
constructed tree. In this research, we propose a new protocol to solve the problems presented
above.
Failure detection plays a central role in the engineering of
distributed systems. Furthermore, many applications have timing
constraints and require failure detectors that provide quality of
service (QoS) with quantitative timeliness guarantees.
They therefore need failure detectors that are fast and accurate.
Failure detectors are oracles that provide information about process
crashes; they are an important abstraction for fault tolerance in
distributed systems. Although current failure-detector theory
provides great generality and expressiveness, it also poses
significant challenges in developing a robust hierarchy of failure
detectors.
In this paper, we propose an implementation of failure detectors
that uses a dual model of heartbeat and interaction.
First, the heartbeat model is adopted to shorten the detection time.
If the detecting process does not receive the heartbeat message within
the expected time, the interaction model is then used to check the
process further.
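The dual model can be sketched as follows. The timeout value and the direct-ping callback are assumptions standing in for the real network mechanism; timestamps are passed in explicitly so the logic is easy to follow.

```python
# Sketch of the dual heartbeat/interaction failure detector described
# above: a fresh heartbeat answers immediately (fast path); a missing
# heartbeat triggers a direct interaction (ping) before declaring a
# crash. Timeout value and ping callback are illustrative assumptions.

class DualFailureDetector:
    def __init__(self, heartbeat_timeout):
        self.heartbeat_timeout = heartbeat_timeout
        self.last_heartbeat = {}

    def on_heartbeat(self, pid, now):
        self.last_heartbeat[pid] = now

    def probe(self, pid, now, interact):
        """Suspect via heartbeats first; confirm with an interaction."""
        last = self.last_heartbeat.get(pid, float("-inf"))
        if now - last <= self.heartbeat_timeout:
            return "alive"          # fast path: heartbeat still fresh
        # heartbeat overdue: fall back to the interaction model
        return "alive" if interact(pid) else "crashed"

fd = DualFailureDetector(heartbeat_timeout=2.0)
fd.on_heartbeat("p1", now=0.0)
print(fd.probe("p1", now=1.0, interact=lambda pid: False))  # alive
print(fd.probe("p1", now=5.0, interact=lambda pid: False))  # crashed
print(fd.probe("p1", now=5.0, interact=lambda pid: True))   # alive
```

The heartbeat check costs nothing beyond bookkeeping, so detection stays fast in the common case, while the follow-up interaction protects accuracy when a heartbeat is merely delayed.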
We also propose an implementation of hierarchical failure
detectors, which depends on dividing the processes into sub-groups
and electing one leader, called the main process.
The main process then distributes the remaining processes into
groups and chooses one leader for each.
Finally, the failure detector is applied at the chosen leaders, which send their
results to the central process.
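The hierarchical arrangement can be sketched in a few lines. The round-robin partitioning and first-member leader election below are assumptions for illustration; the paper does not specify these policies here.

```python
# Sketch of the hierarchy described above: a main process partitions
# the remaining processes into groups and picks one leader per group;
# leaders run the detector locally and report to the main process.
# Round-robin partitioning and first-member leadership are assumptions.

def build_hierarchy(processes, num_groups):
    main = processes[0]                      # elected main process
    rest = processes[1:]
    groups = [rest[i::num_groups] for i in range(num_groups)]
    leaders = [g[0] for g in groups if g]    # one leader per group
    return main, leaders, groups

main, leaders, groups = build_hierarchy(["p0", "p1", "p2", "p3", "p4"], 2)
print(main)      # p0
print(leaders)   # ['p1', 'p2']
print(groups)    # [['p1', 'p3'], ['p2', 'p4']]
```

Only the leaders exchange monitoring traffic with the main process, so the number of end-to-end monitoring channels grows with the number of groups rather than with the number of processes.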
In this paper we present a study of the time cost
added to grid computing by the use of a
coordinated checkpoint/recovery fault-tolerance protocol. We aim
to find a mathematical model that determines the suitable times
to save the application's checkpoints, so as to achieve a minimum
finish time for a parallel application in grid computing in the presence of faults and
fault-tolerance protocols. We derive this model by serially
modeling the target errors, the execution environment, and the
chosen fault-tolerance protocol, all by means of Kolmogorov differential
equations.
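As a point of comparison for the kind of answer such a model produces, a classical first-order approximation of the optimal checkpoint interval is Young's formula. This is explicitly not the Kolmogorov-based model derived in the paper, and the numbers below are hypothetical.

```python
# Young's first-order approximation of the optimal interval between
# checkpoints: T_opt = sqrt(2 * C * MTBF), where C is the cost of
# writing one checkpoint and MTBF is the mean time between failures.
# Shown for comparison only; not the paper's Kolmogorov-based model.

import math

def young_interval(checkpoint_cost, mtbf):
    """Optimal time between checkpoints (first-order approximation)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

# Hypothetical numbers: 60 s to write a checkpoint, 24 h MTBF.
t_opt = young_interval(60.0, 24 * 3600.0)
print(f"{t_opt:.0f} s")  # 3220 s, i.e. roughly 54 minutes
```

Checkpointing more often than this wastes fault-free time on saves; less often, too much work is redone after each failure; a more detailed model such as the paper's refines where that balance lies.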