An Improved Multiple Faults Reassignment based Recovery in Cluster Computing


Published by William Jackson

Publication date: 2011

Research field: Informatics Engineering

Paper language: English

Authors: Sanjay Bansal - Sanjeev Sharma

Distributed, Parallel, and Cluster Computing


Abstract

When multiple nodes fail, performance drops far more than with a single node failure. Node failures in cluster computing can be tolerated through multiple-fault-tolerant computing. Existing recovery schemes are efficient for a single fault but not for multiple faults. The recovery scheme proposed in this paper has two phases: a sequential phase and a concurrent phase. In the sequential phase, the loads of all working nodes are distributed evenly by the proposed dynamic rank-based load distribution algorithm. In the concurrent phase, the loads of all failed nodes, as well as newly arriving jobs, are assigned equally to all available nodes by the failed-node job allocation algorithm, which simply finds the least loaded node among the available nodes. Sequential and concurrent execution of the algorithms improves performance as well as resource utilization. The dynamic rank-based algorithm for load redistribution works as a sequential restoration algorithm, and the reassignment algorithm that distributes the jobs of failed nodes to the least loaded computing nodes works as a concurrent recovery reassignment algorithm. Because the load is distributed evenly among all available working nodes with fewer iterations, less iteration time, and lower communication overhead, performance improves. The dynamic ranking algorithm is a low-overhead, fast-converging algorithm for reassigning tasks uniformly among all available nodes. The jobs of failed nodes are reassigned by a low-overhead, efficient failed-job allocation algorithm. Test results demonstrating the effectiveness of the proposed scheme are presented.
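
The abstract does not give the allocation algorithm itself, so the following is only a minimal sketch of the general idea of handing each orphaned job to the least loaded working node. The function name `reassign_jobs`, the representation of load as a single number per node, and the choice to place the largest jobs first are all assumptions made for illustration and are not taken from the paper.

```python
import heapq
from typing import Dict, List


def reassign_jobs(loads: Dict[str, float], orphaned_jobs: List[float]) -> Dict[str, float]:
    """Greedy sketch (hypothetical, not the paper's algorithm).

    loads         -- current load of each working node, keyed by node name
    orphaned_jobs -- costs of jobs left behind by failed nodes (or newly arrived)
    Returns the load of each working node after all jobs are placed.
    """
    if not loads:
        raise ValueError("at least one working node is required")

    # Min-heap keyed on load, so the least loaded node is always on top.
    heap = [(load, node) for node, load in loads.items()]
    heapq.heapify(heap)

    # Assumption: placing the largest jobs first tends to even out the result.
    for cost in sorted(orphaned_jobs, reverse=True):
        load, node = heapq.heappop(heap)             # least loaded node right now
        heapq.heappush(heap, (load + cost, node))    # give it the job

    return {node: load for load, node in heap}
```

Each placement costs O(log n) for n working nodes, which is one way to keep the per-job overhead low; the paper's own ranking and allocation procedures may differ.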


88 - Qian Qu, Ronghua Xu, Seyed Yahya Nikouei 2020

Rapid technological advances in the Internet of Things (IoT) make the blueprint of Smart Cities feasible by integrating heterogeneous cloud/fog/edge computing paradigms that collaboratively provide various smart services in our cities and communities. Thanks to attractive features like fine granularity and loose coupling, the microservices architecture has been proposed to provide scalable and extensible services in large-scale distributed IoT systems. Recent studies have evaluated and analyzed the performance interference between microservices based on scenarios in the cloud computing environment. However, they are not holistic for IoT applications, given the restrictions of edge devices such as computation consumption and network capacity. This paper investigates multiple microservice deployment policies on the edge computing platform. The microservices are developed as Docker containers, and comprehensive experimental results demonstrate the performance and interference of microservices running on benchmark scenarios.

Distributed, Parallel, and Cluster Computing

An Improved Framework of GPU Computing for CFD Applications on Structured Grids using OpenACC

156 - Weicheng Xue, Charles W. Jackson, Christoper J. Roy 2020

This paper focuses on improving the multi-GPU performance of a research CFD code on structured grids. MPI and OpenACC directives are used to scale the code up to 16 GPUs. This paper shows that using 16 P100 GPUs and 16 V100 GPUs can be 30$\times$ and 70$\times$ faster, respectively, than 16 Xeon CPU E5-2680v4 cores for three different test cases. A series of performance issues related to scaling the multi-block CFD code are addressed by applying various optimizations. Performance optimizations such as the pack/unpack message method, removing temporary arrays as arguments to procedure calls, allocating global memory for limiters and connected boundary data, reordering non-blocking MPI_Isend/MPI_Irecv and Wait calls, reducing unnecessary implicit derived type member data movement between the host and the device, and the use of GPUDirect can improve the compute utilization, memory throughput, and asynchronous progression in the multi-block CFD code using modern programming features.

Distributed, Parallel, and Cluster Computing

Resource Trading in Edge Computing-enabled IoV: An Efficient Futures-based Approach

94 - Minghui Liwang, Ruitao Chen, Xianbin Wang 2021

Mobile edge computing (MEC) has become a promising solution for utilizing distributed computing resources to support computation-intensive vehicular applications in dynamic driving environments. To facilitate this paradigm, onsite resource trading serves as a critical enabler. However, dynamic communications and resource conditions could lead to unpredictable trading latency, trading failure, and unfair pricing in the conventional resource trading process. To overcome these challenges, we introduce a novel futures-based resource trading approach in the edge computing-enabled internet of vehicles (IoV), where a forward contract is used to facilitate resource trading negotiations between an MEC server (seller) and a vehicle (buyer) in a given future term. By estimating the historical statistics of future resource supply and network condition, we formulate futures-based resource trading as an optimization problem that aims to maximize the seller's and the buyer's expected utility, while applying risk evaluations to relieve possible losses incurred by the uncertainties in the system. To tackle this problem, we propose an efficient bilateral negotiation approach that helps the participants reach a consensus. Extensive simulations demonstrate that the proposed futures-based resource trading brings considerable utility to both participants, while significantly outperforming the baseline methods on critical factors, e.g., trading failures and fairness, negotiation latency, and cost.

Distributed, Parallel, and Cluster Computing

Rapid Recovery for Systems with Scarce Faults

419 - Chung-Hao Huang 2012

Our goal is to achieve a high degree of fault tolerance through the control of safety-critical systems. This reduces to solving a game between a malicious environment that injects failures and a controller who tries to establish correct behavior. We suggest a new control objective for such systems that offers a better balance between complexity and precision: we seek systems that are k-resilient. In order to be k-resilient, a system needs to be able to rapidly recover from a small number, up to k, of local faults infinitely many times, provided that blocks of up to k faults are separated by short recovery periods in which no fault occurs. k-resilience is a simple but powerful abstraction from the precise distribution of local faults, but much more refined than the traditional objective to maximize the number of local faults. We argue why we believe this to be the right level of abstraction for safety-critical systems when local faults are few and far between. We show that the computational complexity of constructing optimal control with respect to resilience is low and demonstrate the feasibility through an implementation and experimental results.

Systems and Control

An Adaptive Checkpointing Scheme for Peer-to-Peer Based Volunteer Computing Work Flows

770 - Lei Ni, Aaron Harwood 2007

Volunteer Computing, sometimes called Public Resource Computing, is an emerging computational model that is very suitable for work-pooled parallel processing. As more complex grid applications make use of work flows in their design and deployment, it is reasonable to consider the impact of work flow deployment over a Volunteer Computing infrastructure. In this case, the inter-work-flow I/O can lead to a significant increase in I/O demands at the work pool server. A possible solution is to use a Peer-to-Peer based parallel computing architecture to off-load this I/O demand to the workers, where the workers can fulfill some aspects of work flow coordination, I/O checking, etc. However, achieving robustness in such a large-scale system is a challenging hurdle on the way to decentralized execution of work flows and general parallel processes. To increase robustness, we propose and show the merits of an adaptive checkpoint scheme that efficiently checkpoints the status of the parallel processes according to estimates of relevant network and peer parameters. Our scheme uses statistical data observed during runtime to dynamically make checkpoint decisions in a completely decentralized manner. The simulation results support our proposed approach in terms of reduced required runtime.

Distributed, Parallel, and Cluster Computing
