Recomputation Enabled Efficient Checkpointing


Added by Ismail Akturk

Publication date: 2017

Fields: Informatics Engineering

Language: English

Authors: Ismail Akturk, Ulya R. Karpuzcu

Distributed Parallel and Cluster Computing


Abstract

Systematic checkpointing of the machine state makes it possible to restart execution from a safe state when an error is detected. The time and energy overhead of checkpointing, however, grows with checkpointing frequency. Amortizing this overhead becomes especially challenging as expected error rates grow, because checkpointing frequency tends to increase with the error rate. Due to imbalanced technology scaling, recomputing a data value can be more energy efficient than retrieving (i.e., loading) a stored copy. Based on this observation, this paper explores how recomputing data values (which would otherwise be read from a checkpoint in memory or secondary storage) can shrink the machine state to be checkpointed and thereby reduce the checkpointing overhead. Specifically, the resulting amnesic checkpointing framework, AmnesiCHK, can reduce the storage overhead by up to 23.91%; time overhead, by 11.92%; and energy overhead, by 12.53%, even in a relatively small-scale system.
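The core idea, omitting recomputable values from the checkpoint and rebuilding them on restart, can be illustrated with a toy sketch. This is not the AmnesiCHK implementation: the `Recipe` type, the `take_checkpoint` and `restore` functions, and the dictionary-based "machine state" are all hypothetical names chosen for illustration, and the sketch assumes that every recipe input is either stored in the checkpoint or itself recomputable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Recipe:
    """Hypothetical description of how to rebuild one value from others."""
    inputs: Tuple[str, ...]
    fn: Callable[..., int]


def take_checkpoint(state: Dict[str, int], recipes: Dict[str, Recipe]) -> Dict[str, int]:
    """Store only values that have no recipe; recomputable values are omitted."""
    return {name: value for name, value in state.items() if name not in recipes}


def restore(checkpoint: Dict[str, int], recipes: Dict[str, Recipe]) -> Dict[str, int]:
    """Reload stored values, then recompute the omitted ones from their inputs."""
    state = dict(checkpoint)
    pending = dict(recipes)
    while pending:
        progressed = False
        for name, recipe in list(pending.items()):
            # A value can be rebuilt once all of its inputs are available.
            if all(i in state for i in recipe.inputs):
                state[name] = recipe.fn(*(state[i] for i in recipe.inputs))
                del pending[name]
                progressed = True
        if not progressed:
            raise ValueError(f"cannot recompute: {sorted(pending)}")
    return state


# Toy state: 'total' and 'scaled' are derivable, so they need not be stored.
full_state = {"a": 3, "b": 4, "total": 7, "scaled": 70}
recipes = {
    "total": Recipe(("a", "b"), lambda a, b: a + b),
    "scaled": Recipe(("total",), lambda t: t * 10),
}

checkpoint = take_checkpoint(full_state, recipes)
restored = restore(checkpoint, recipes)
```

In this toy case the checkpoint holds two of the four values, and the restart path pays for two recomputations instead of two loads. Whether that trade is worthwhile depends on the relative cost of computation and data retrieval, which is the question the abstract raises.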


Power and Performance Efficient SDN-Enabled Fog Architecture

Adnan Akhunzada, Sherali Zeadally (2021)

Software Defined Networks (SDNs) have dramatically simplified network management. However, enabling pure SDNs to respond in real time while handling massive amounts of data remains a challenging task. In contrast, fog computing has strong potential to serve large surges of data in real time. The SDN control plane enables innovation and greatly simplifies network operations and management, thereby providing a promising way to implement energy- and performance-aware SDN-enabled fog computing. Moreover, power efficiency and performance evaluation in SDN-enabled fog computing is an area that has not yet been fully explored by the research community. We present a novel SDN-enabled fog architecture to improve power efficacy and performance by leveraging cooperative and non-cooperative policy-based computing. Preliminary results from extensive simulation demonstrate an improvement in power utilization as well as in overall performance (i.e., processing time and response time). Finally, we discuss several open research issues that need further investigation.

Distributed Parallel and Cluster Computing; Artificial Intelligence; Networking and Internet Architecture

Checkpointing as a Service in Heterogeneous Cloud Environments

Jiajun Cao, Matthieu Simonin, Gene Cooperman (2014)

A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another.

Distributed Parallel and Cluster Computing

Resource Trading in Edge Computing-enabled IoV: An Efficient Futures-based Approach

Minghui Liwang, Ruitao Chen, Xianbin Wang (2021)

Mobile edge computing (MEC) has become a promising solution to utilize distributed computing resources for supporting computation-intensive vehicular applications in dynamic driving environments. To facilitate this paradigm, onsite resource trading serves as a critical enabler. However, dynamic communications and resource conditions could lead to unpredictable trading latency, trading failure, and unfair pricing in the conventional resource trading process. To overcome these challenges, we introduce a novel futures-based resource trading approach in edge computing-enabled internet of vehicles (IoV), where a forward contract is used to facilitate resource trading related negotiations between an MEC server (seller) and a vehicle (buyer) in a given future term. Through estimating the historical statistics of future resource supply and network condition, we formulate the futures-based resource trading as an optimization problem aiming to maximize the seller's and the buyer's expected utility, while applying risk evaluations to relieve possible losses incurred by the uncertainties in the system. To tackle this problem, we propose an efficient bilateral negotiation approach which helps the participants reach a consensus. Extensive simulations demonstrate that the proposed futures-based resource trading brings considerable utilities to both participants, while significantly outperforming the baseline methods on critical factors, e.g., trading failures and fairness, negotiation latency and cost.

Distributed Parallel and Cluster Computing

MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing

Rohan Garg, Gregory Price, Gene Cooperman (2019)

Transparently checkpointing MPI for fault tolerance and load balancing is a long-standing problem in HPC. The problem has been complicated by the need to provide checkpoint-restart services for all combinations of an MPI implementation over all network interconnects. This work presents MANA (MPI-Agnostic Network-Agnostic transparent checkpointing), a single code base which supports all MPI implementation and interconnect combinations. The agnostic properties imply that one can checkpoint an MPI application under one MPI implementation and perhaps over TCP, and then restart under a second MPI implementation over InfiniBand on a cluster with a different number of CPU cores per node. This technique is based on a novel split-process approach, which enables two separate programs to co-exist within a single process with a single address space. This work overcomes the limitations of the two most widely adopted transparent checkpointing solutions, BLCR and DMTCP/InfiniBand, which require separate modifications to each MPI implementation and/or underlying network API. The runtime overhead is found to be insignificant both for checkpoint-restart within a single host, and when comparing a local MPI computation that was migrated to a remote cluster against an ordinary MPI computation running natively on that same remote cluster.

Distributed Parallel and Cluster Computing; Operating Systems

Khaos: Dynamically Optimizing Checkpointing for Dependable Distributed Stream Processing

Morgan K. Geldenhuys, Benjamin J. J. Pfister, Dominik Scheinert (2021)

Distributed Stream Processing systems are becoming an increasingly essential part of Big Data processing platforms as users grow ever more reliant on their ability to provide fast access to new results. As such, making timely decisions based on these results is dependent on a system's ability to tolerate failure. Typically, these systems achieve fault tolerance and the ability to recover automatically from partial failures by implementing checkpoint and rollback recovery. However, owing to the statistical probability of partial failures occurring in these distributed environments and the variability of workloads upon which jobs are expected to operate, static configurations will often not meet Quality of Service constraints with low overhead. In this paper we present Khaos, a new approach which utilizes the parallel processing capabilities of virtual cloud automation technologies for the automatic runtime optimization of fault tolerance configurations in Distributed Stream Processing jobs. Our approach employs three successive phases that borrow from the principles of Chaos Engineering: establish the steady-state processing conditions, conduct experiments to better understand how the system performs under failure, and use this knowledge to continuously minimize Quality of Service violations. We implemented a prototype of Khaos together with Apache Flink and demonstrate its usefulness experimentally.

Distributed Parallel and Cluster Computing