Overlay multicast, also called Application-Level Multicast (ALM), constructs a multicast delivery tree among end hosts. Unlike traditional IP multicast, where the internal tree nodes are dedicated routers that are relatively stable and do not leave the multicast
tree voluntarily, the non-leaf nodes in an overlay tree are free end hosts that can join or leave the overlay at will, or even crash without notification. A departing node can thus leave suddenly, giving its descendants (and the Rendezvous Point (RP)) no time to prepare the reconnection of the overlay tree; a rearrangement process must therefore be triggered in which each of its descendants rejoins the overlay tree. Until that happens, all of the departed node's downstream nodes are partitioned from the overlay tree and can no longer receive the multicast data. These dynamic characteristics make the overlay tree unstable, which can significantly impact the user.
A key challenge in constructing an efficient and resilient ALM protocol is to provide fast data recovery when overlay node failures partition the data delivery paths. In this paper, we analyze the performance of ALM tree recovery solutions using different metrics.
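The rearrangement process described above can be illustrated with a minimal sketch. This is not any specific ALM protocol: the class names and the naive "reattach every orphan directly under the RP" policy are assumptions chosen for brevity.

```python
# Minimal sketch of overlay-tree recovery: when a non-leaf host departs,
# each of its orphaned children rejoins the tree via the Rendezvous
# Point (RP). The reattachment policy here is deliberately naive.

class Node:
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

class OverlayTree:
    def __init__(self, rp):
        self.rp = rp  # the Rendezvous Point acts as the tree root

    def attach(self, child, parent):
        child.parent = parent
        parent.children.append(child)

    def handle_departure(self, node):
        """Rearrangement: every orphaned child rejoins through the RP."""
        if node.parent:
            node.parent.children.remove(node)
        orphans = list(node.children)
        for child in orphans:
            child.parent = None
            self.attach(child, self.rp)  # naive policy: rejoin at the RP
        return orphans

rp = Node("RP")
tree = OverlayTree(rp)
a, b, c = Node("A"), Node("B"), Node("C")
tree.attach(a, rp)
tree.attach(b, a)
tree.attach(c, a)
rejoined = tree.handle_departure(a)   # A crashes; B and C rejoin via RP
print([n.name for n in rejoined])     # ['B', 'C']
```

A real protocol would instead pick a new parent that keeps the tree shallow and balanced; reattaching everything at the RP only shows the partition-and-rejoin cycle itself.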
Much research has focused on the reliability of Wireless Sensor Networks
(WSNs) used in various applications, especially in the early detection of forest fires, to ensure
the reliability of warning alarms sent by the sensors and to reduce the average
rate of false warnings.
In this research we evaluate the reliability of a WSN used for early fire
detection, mainly in a fir and cedar preserve. We design a hybrid WSN matching
the terrain of the preserve and model it using OPNET 14.5. We
study several scenarios with an increasing fraction of the network failing as
fire breaks out and spreads, starting at 0%, and compare the simulation results with the
results of the mathematical reliability equations for the same scenarios. In
addition, we calculate the final availability by proposing a mechanism that
improves WSN reliability using redundancy, i.e., adding spare sensor nodes that
replace the ones damaged by fire. The results show a remarkable
increase in reliability. We also predict the reliability of the designed network
for different reliability values of the nodes used, by means of one of the
reliability tools, the Reliability Block Diagram.
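The redundancy mechanism and the block-diagram evaluation can be illustrated with a small sketch. The topology and the node reliabilities below are hypothetical examples, not the network or values studied in the thesis.

```python
# Reliability of a simple series-parallel block diagram (illustrative
# only; the topology and node reliabilities are hypothetical).

def series(*rs):
    """Reliability of blocks in series: all blocks must work."""
    out = 1.0
    for r in rs:
        out *= r
    return out

def parallel(*rs):
    """Reliability of redundant (parallel) blocks: at least one works."""
    out = 1.0
    for r in rs:
        out *= (1.0 - r)
    return 1.0 - out

# A cluster of 3 sensing positions, each backed by one spare node,
# feeding a relay node and a sink in series.
sensor = parallel(0.90, 0.90)          # primary node + spare
cluster = series(sensor, sensor, sensor)
system = series(cluster, 0.95, 0.99)   # relay, sink

print(f"sensor with spare:  {sensor:.4f}")   # 0.9900 vs 0.9000 alone
print(f"system reliability: {system:.4f}")
```

This shows the effect the abstract reports qualitatively: adding a spare raises a 0.90 node to an effective 0.99, which propagates through the series blocks of the diagram.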
The increasing reliance on network systems in day-to-day activities requires that they
provide available and reliable services. Jgroup provides available service by creating
multiple replicas of the same service on multiple devices. Jgroup achieves reliable service
by maintaining shared state between the replicas and coordinating their activities
through Remote Method Invocation. Unlike Jgroup, JavaGroups uses message passing to
implement coordination between the replicas.
In this paper, we compare Jgroup and JavaGroups under different Group Method
Invocation modes: Anycast and Multicast in Jgroup, and GET_FIRST and
GET_ALL in JavaGroups.
This paper also improves the performance of ARM (Autonomous Replication
Management), which is embedded with Jgroup (Jgroup/ARM) to support fault
tolerance, by finding a new solution for handling group failure, in which all remaining
replicas fail in rapid succession. In the new solution, only one replica (the group leader)
issues renew events (IamAlive) periodically, instead of every replica in the
group sending them, while the Replication Manager still takes the same period to discover group failure.
Comparison results show that JavaGroups is faster than Jgroup when a single
replica is used, whereas Jgroup outperforms JavaGroups as the number of
replicas increases. The invocation delay in JavaGroups increases noticeably with the size
of the array passed to the invoked method, which makes JavaGroups unsuitable for
applications that exchange large amounts of data and use a large number of servers,
whereas Jgroup remains suitable for such applications.
Results also show that the new proposal reduces the number of renew events to at
most 37.5% of their original number, while Jgroup/ARM takes approximately the same period of time to discover group
failure as in Meling's solution.
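The leader-only renew scheme can be sketched by counting events over a fixed detection window. The replica counts and period below are hypothetical parameters, not the paper's measurements.

```python
# Illustrative count of periodic renew (IamAlive) events: in the
# original scheme every replica sends one per period; in the proposed
# scheme only the group leader does. Numbers here are hypothetical.

def renew_events(replicas, periods, leader_only):
    senders = 1 if leader_only else replicas
    return senders * periods

periods = 10
for n in (2, 3, 4):
    old = renew_events(n, periods, leader_only=False)
    new = renew_events(n, periods, leader_only=True)
    print(f"{n} replicas: {new}/{old} renew events ({new / old:.0%})")
```

The relative saving grows with group size (50% of the events remain for two replicas, 25% for four), which is the qualitative effect behind the reduction the abstract reports.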
This study investigates fault tolerance in large distributed
environments, such as grid computing and clusters of computers, in
order to find the most effective ways to deal with the errors
caused by the crash of one of the devices in the environment, or by
network disconnection, so as to ensure the continuity of the application in
the presence of faults. In this paper we study a model of the
distributed environment and of the parallel applications within it. We
then provide a checkpoint mechanism that ensures
continuity of the work; it uses a virtual representation of the
application (a macro dataflow) and is suitable for applications
that use a work-stealing algorithm to distribute tasks
in heterogeneous and dynamic environments.
This mechanism adds a small cost to the cost of parallel
execution, as a result of saving part of the work during fault-free
execution. The study also provides a mathematical model to
calculate the time complexity, i.e., the cost of the proposed
mechanism.
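The trade-off described above, paying a small fault-free cost to avoid losing all work on a crash, can be sketched with a simple periodic checkpoint loop. This is only an illustration; it is not the thesis's macro-dataflow representation or its work-stealing integration.

```python
# Minimal periodic-checkpoint sketch: the loop saves its state every
# few steps (the fault-free overhead), and after a simulated crash
# execution resumes from the last checkpoint instead of from scratch.

import copy

def run(total_steps, checkpoint_every, crash_at=None):
    state = {"step": 0, "acc": 0}
    checkpoint = copy.deepcopy(state)
    restarted = False
    while state["step"] < total_steps:
        if state["step"] == crash_at and not restarted:
            state = copy.deepcopy(checkpoint)   # recover saved work
            restarted = True
            continue
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            checkpoint = copy.deepcopy(state)   # fault-free overhead
    return state

# Crash at step 7: only steps 6 (after the checkpoint at step 6) are
# redone, and the final result matches the fault-free run.
print(run(total_steps=10, checkpoint_every=3, crash_at=7)["acc"])  # 45
```

Only the work since the last checkpoint is repeated after the fault; the deep copies stand in for the "part of the work kept during fault-free execution" that the abstract mentions.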
Application-Level Multicast networks are easy to deploy: they require no
change in the network layer. Data is delivered over an overlay tree built
from unicast connections between end hosts. These hosts
are free to join or leave whenever they want, or even to leave without telling any other
node, which separates the children of the departed node from the tree and forces them
to rejoin; until they do, these nodes are partitioned from the overlay tree and
cannot receive the data. This distorts the constructed tree and causes the
loss of several packets, which can significantly impact the user.
One of the key challenges in building an efficient and effective overlay
multicast protocol is to provide a robust mechanism that overcomes the sudden departure of a
node from the overlay tree without a significant impact on the performance of the
constructed tree. In this research, we propose a new protocol to solve the problems presented
above.
Failure detection plays a central role in the engineering of
distributed systems. Furthermore, many applications have timing
constraints and require failure detectors that provide quality of
service (QoS) with quantitative timeliness guarantees.
They therefore need failure detectors that are fast and accurate.
Failure detectors are oracles that provide information about process
crashes; they are an important abstraction for fault tolerance in
distributed systems. Although current failure-detector theory
provides great generality and expressiveness, it also poses
significant challenges in developing a robust hierarchy of failure
detectors.
In this paper, we propose an implementation of failure detectors
that uses a dual model of heartbeat and interaction.
First, the heartbeat model is adopted to shorten the detection time.
If the detecting process does not receive the heartbeat message within
the expected time, the interaction model is then used to check the
process further.
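The dual model can be sketched as follows. The timeout value and the direct-ping callback are assumptions standing in for the real network mechanism; timestamps are passed in explicitly so the logic is easy to follow.

```python
# Sketch of the dual heartbeat/interaction failure detector described
# above: a fresh heartbeat answers immediately (fast path); a missing
# heartbeat triggers a direct interaction (ping) before declaring a
# crash. Timeout value and ping callback are illustrative assumptions.

class DualFailureDetector:
    def __init__(self, heartbeat_timeout):
        self.heartbeat_timeout = heartbeat_timeout
        self.last_heartbeat = {}

    def on_heartbeat(self, pid, now):
        self.last_heartbeat[pid] = now

    def probe(self, pid, now, interact):
        """Suspect via heartbeats first; confirm with an interaction."""
        last = self.last_heartbeat.get(pid, float("-inf"))
        if now - last <= self.heartbeat_timeout:
            return "alive"          # fast path: heartbeat still fresh
        # heartbeat overdue: fall back to the interaction model
        return "alive" if interact(pid) else "crashed"

fd = DualFailureDetector(heartbeat_timeout=2.0)
fd.on_heartbeat("p1", now=0.0)
print(fd.probe("p1", now=1.0, interact=lambda pid: False))  # alive
print(fd.probe("p1", now=5.0, interact=lambda pid: False))  # crashed
print(fd.probe("p1", now=5.0, interact=lambda pid: True))   # alive
```

The heartbeat check costs nothing beyond bookkeeping, so detection stays fast in the common case, while the follow-up interaction protects accuracy when a heartbeat is merely delayed.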
We also propose an implementation of hierarchical failure
detectors, which depends on dividing the processes into sub-groups
and electing one leader, called the main process.
The main process then distributes the remaining processes into
groups and chooses one leader for each.
Finally, the failure detector is applied at the chosen leaders, which send their
results to the central process.
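The hierarchical arrangement can be sketched in a few lines. The round-robin partitioning and first-member leader election below are assumptions for illustration; the paper does not specify these policies here.

```python
# Sketch of the hierarchy described above: a main process partitions
# the remaining processes into groups and picks one leader per group;
# leaders run the detector locally and report to the main process.
# Round-robin partitioning and first-member leadership are assumptions.

def build_hierarchy(processes, num_groups):
    main = processes[0]                      # elected main process
    rest = processes[1:]
    groups = [rest[i::num_groups] for i in range(num_groups)]
    leaders = [g[0] for g in groups if g]    # one leader per group
    return main, leaders, groups

main, leaders, groups = build_hierarchy(["p0", "p1", "p2", "p3", "p4"], 2)
print(main)      # p0
print(leaders)   # ['p1', 'p2']
print(groups)    # [['p1', 'p3'], ['p2', 'p4']]
```

Only the leaders exchange monitoring traffic with the main process, so the number of end-to-end monitoring channels grows with the number of groups rather than with the number of processes.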
In this paper we present a study of the time cost
added to grid computing by the use of a
coordinated checkpoint/recovery fault-tolerance protocol. We aim
to find a mathematical model that determines the suitable times
to save the application's checkpoints, so as to achieve a minimum
finish time for a parallel application in grid computing in the presence of faults and
fault-tolerance protocols. We derive this model by serially
modeling the target errors, the execution environment, and the
chosen fault-tolerance protocol, all by means of Kolmogorov differential
equations.
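As a point of comparison for the kind of answer such a model produces, a classical first-order approximation of the optimal checkpoint interval is Young's formula. This is explicitly not the Kolmogorov-based model derived in the paper, and the numbers below are hypothetical.

```python
# Young's first-order approximation of the optimal interval between
# checkpoints: T_opt = sqrt(2 * C * MTBF), where C is the cost of
# writing one checkpoint and MTBF is the mean time between failures.
# Shown for comparison only; not the paper's Kolmogorov-based model.

import math

def young_interval(checkpoint_cost, mtbf):
    """Optimal time between checkpoints (first-order approximation)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

# Hypothetical numbers: 60 s to write a checkpoint, 24 h MTBF.
t_opt = young_interval(60.0, 24 * 3600.0)
print(f"{t_opt:.0f} s")  # 3220 s, i.e. roughly 54 minutes
```

Checkpointing more often than this wastes fault-free time on saves; less often, too much work is redone after each failure; a more detailed model such as the paper's refines where that balance lies.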