Checkpointing as a Service in Heterogeneous Cloud Environments

416 0 0.0 ( 0 )

Download Cite

Added by Jiajun Cao

Publication date 2014

fields Informatics Engineering

and research's language is English

Authors Jiajun Cao - Matthieu Simonin - Gene Cooperman

Distributed Parallel and Cluster Computing

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another.

rate research

Links as a Service (LaaS): Feeling Alone in the Shared Cloud

102 - Eitan Zahavi , Alex Shpiner , Ori Rottenstreich 2015

The most demanding tenants of shared clouds require complete isolation from their neighbors, in order to guarantee that their application performance is not affected by other tenants. Unfortunately, while shared clouds can offer an option whereby tenants obtain dedicated servers, they do not offer any network provisioning service, which would shield these tenants from network interference. In this paper, we introduce Links as a Service, a new abstraction for cloud service that provides physical isolation of network links. Each tenant gets an exclusive set of links forming a virtual fat tree, and is guaranteed to receive the exact same bandwidth and delay as if it were alone in the shared cloud. Under simple assumptions, we derive theoretical conditions for enabling LaaS without capacity over-provisioning in fat-trees. New tenants are only admitted in the network when they can be allocated hosts and links that maintain these conditions. Using experiments on real clusters as well as simulations with real-life tenant sizes, we show that LaaS completely avoids the performance degradation caused by traffic from concurrent tenants on shared links. Compared to mere host isolation, LaaS can improve the application performance by up to 200%, at the cost of a 10% reduction in the cloud utilization.

Distributed Parallel and Cluster Computing Networking and Internet Architecture

Market-Oriented Online Bi-Objective Service Scheduling for Pleasingly Parallel Jobs with Variable Resources in Cloud Environments

63 - Bingbing Zheng , Li Pan , Shijun Liu 2021

In this paper, we study the market-oriented online bi-objective service scheduling problem for pleasingly parallel jobs with variable resources in cloud environments, from the perspective of SaaS (Software-as-as-Service) providers who provide job-execution services. The main process of scheduling SaaS services in clouds is: a SaaS provider purchases cloud instances from IaaS providers to schedule end users jobs and charges users accordingly. This problem has several particular features, such as the job-oriented end users, the pleasingly parallel jobs with soft deadline constraints, the online settings, and the variable numbers of resources. For maximizing both the revenue and the user satisfaction rate, we design an online algorithm for SaaS providers to optimally purchase IaaS instances and schedule pleasingly parallel jobs. The proposed algorithm can achieve competitive objectives in polynomial run-time. The theoretical analysis and simulations based on real-world Google job traces as well as synthetic datasets validate the effectiveness and efficiency of our algorithm.

Distributed Parallel and Cluster Computing

Cloudburst: Stateful Functions-as-a-Service

161 - Vikram Sreekanti , Chenggang Wu , Xiayue Charles Lin 2020

Function-as-a-Service (FaaS) platforms and serverless cloud computing are becoming increasingly popular. Current FaaS offerings are targeted at stateless functions that do minimal I/O and communication. We argue that the benefits of serverless computing can be extended to a broader range of applications and algorithms. We present the design and implementation of Cloudburst, a stateful FaaS platform that provides familiar Python programming with low-latency mutable state and communication, while maintaining the autoscaling benefits of serverless computing. Cloudburst accomplishes this by leveraging Anna, an autoscaling key-value store, for state sharing and overlay routing combined with mutable caches co-located with function executors for data locality. Performant cache consistency emerges as a key challenge in this architecture. To this end, Cloudburst provides a combination of lattice-encapsulated state and new definitions and protocols for distributed session consistency. Empirical results on benchmarks and diverse applications show that Cloudburst makes stateful functions practical, reducing the state-management overheads of current FaaS platforms by orders of magnitude while also improving the state of the art in serverless consistency.

Distributed Parallel and Cluster Computing

M2: Malleable Metal as a Service

106 - Apoorve Mohan , Ata Turk , Ravi S. Gudimetla 2018

Existing bare-metal cloud services that provide users with physical nodes have a number of serious disadvantage over their virtual alternatives, including slow provisioning times, difficulty for users to release nodes and then reuse them to handle changes in demand, and poor tolerance to failures. We introduce M2, a bare-metal cloud service that uses network-mounted boot drives to overcome these disadvantages. We describe the architecture and implementation of M2 and compare its agility, scalability, and performance to existing systems. We show that M2 can reduce provisioning time by over 50% while offering richer functionality, and comparable run-time performance with respect to tools that provision images into local disks. M2 is open source and available at https://github.com/CCI-MOC/ims.

Distributed Parallel and Cluster Computing

Service Placement with Provable Guarantees in Heterogeneous Edge Computing Systems

75 - Stephen Pasteris , Shiqiang Wang , Mark Herbster 2019

Mobile edge computing (MEC) is a promising technique for providing low-latency access to services at the network edge. The services are hosted at various types of edge nodes with both computation and communication capabilities. Due to the heterogeneity of edge node characteristics and user locations, the performance of MEC varies depending on where the service is hosted. In this paper, we consider such a heterogeneous MEC system, and focus on the problem of placing multiple services in the system to maximize the total reward. We show that the problem is NP-hard via reduction from the set cover problem, and propose a deterministic approximation algorithm to solve the problem, which has an approximation ratio that is not worse than $left(1-e^{-1}right)/4$. The proposed algorithm is based on two sub-routines that are suitable for small and arbitrarily sized services, respectively. The algorithm is designed using a novel way of partitioning each edge node into multiple slots, where each slot contains one service. The approximation guarantee is obtained via a specialization of the method of conditional expectations, which uses a randomized procedure as an intermediate step. In addition to theoretical guarantees, simulation results also show that the proposed algorithm outperforms other state-of-the-art approaches.

Distributed Parallel and Cluster Computing Optimization and Control