Prediction-Based Power Oversubscription in Cloud Platforms

407 0 0.0 ( 0 )

Download Cite

Added by Pulkit Misra

Publication date 2020

fields Informatics Engineering

and research's language is English

Authors Alok Kumbhare - Reza Azimi - Ioannis Manousakis

Distributed Parallel and Cluster Computing

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Datacenter designers rely on conservative estimates of IT equipment power draw to provision resources. This leaves resources underutilized and requires more datacenters to be built. Prior work has used power capping to shave the rare power peaks and add more servers to the datacenter, thereby oversubscribing its resources and lowering capital costs. This works well when the workloads and their server placements are known. Unfortunately, these factors are unknown in public clouds, forcing providers to limit the oversubscription so that performance is never impacted. In this paper, we argue that providers can use predictions of workload performance criticality and virtual machine (VM) resource utilization to increase oversubscription. This poses many challenges, such as identifying the performance-critical workloads from black-box VMs, creating support for criticality-aware power management, and increasing oversubscription while limiting the impact of capping. We address these challenges for the hardware and software infrastructures of Microsoft Azure. The results show that we enable a 2x increase in oversubscription with minimum impact to critical workloads.

rate research

An Experimental Study on Microservices based Edge Computing Platforms

88 - Qian Qu , Ronghua Xu , Seyed Yahya Nikouei 2020

The rapid technological advances in the Internet of Things (IoT) allows the blueprint of Smart Cities to become feasible by integrating heterogeneous cloud/fog/edge computing paradigms to collaboratively provide variant smart services in our cities and communities. Thanks to attractive features like fine granularity and loose coupling, the microservices architecture has been proposed to provide scalable and extensible services in large scale distributed IoT systems. Recent studies have evaluated and analyzed the performance interference between microservices based on scenarios on the cloud computing environment. However, they are not holistic for IoT applications given the restriction of the edge device like computation consumption and network capacity. This paper investigates multiple microservice deployment policies on the edge computing platform. The microservices are developed as docker containers, and comprehensive experimental results demonstrate the performance and interference of microservices running on benchmark scenarios.

Distributed Parallel and Cluster Computing

Resource Management Schemes for Cloud-Native Platforms with Computing Containers of Docker and Kubernetes

93 - Ying Mao , Yuqi Fu , Suwen Gu 2020

Businesses have made increasing adoption and incorporation of cloud technology into internal processes in the last decade. The cloud-based deployment provides on-demand availability without active management. More recently, the concept of cloud-native application has been proposed and represents an invaluable step toward helping organizations develop software faster and update it more frequently to achieve dramatic business outcomes. Cloud-native is an approach to build and run applications that exploit the cloud computing delivery models advantages. It is more about how applications are created and deployed than where. The container-based virtualization technology, such as Docker and Kubernetes, serves as the foundation for cloud-native applications. This paper investigates the performance of two popular computational-intensive applications, big data, and deep learning, in a cloud-native environment. We analyze the system overhead and resource usage for these applications. Through extensive experiments, we show that the completion time reduces by up to 79.4% by changing the default setting and increases by up to 96.7% due to different resource management schemes on two platforms. Additionally, the resource release is delayed by up to 116.7% across different systems. Our work can guide developers, administrators, and researchers to better design and deploy their applications by selecting and configuring a hosting platform.

Distributed Parallel and Cluster Computing Performance

Machine Learning for Performance Prediction of Spark Cloud Applications

83 - Alexandre Maros , Fabricio Murai , Ana Paula Couto da Silva andn Jussara M. Almeida 2021

Big data applications and analytics are employed in many sectors for a variety of goals: improving customers satisfaction, predicting market behavior or improving processes in public health. These applications consist of complex software stacks that are often run on cloud systems. Predicting execution times is important for estimating the cost of cloud services and for effectively managing the underlying resources at runtime. Machine Learning (ML), providing black box solutions to model the relationship between application performance and system configuration without requiring in-detail knowledge of the system, has become a popular way of predicting the performance of big data applications. We investigate the cost-benefits of using supervised ML models for predicting the performance of applications on Spark, one of todays most widely used frameworks for big data analysis. We compare our approach with textit{Ernest} (an ML-based technique proposed in the literature by the Spark inventors) on a range of scenarios, application workloads, and cloud system configurations. Our experiments show that Ernest can accurately estimate the performance of very regular applications, but it fails when applications exhibit more irregular patterns and/or when extrapolating on bigger data set sizes. Results show that our models match or exceed Ernests performance, sometimes enabling us to reduce the prediction error from 126-187% to only 5-19%.

Distributed Parallel and Cluster Computing Performance

Power-Aware Allocation of Graph Jobs in Geo-Distributed Cloud Networks

92 - Seyyedali Hosseinalipour , Anuj Nayak , Huaiyu Dai 2018

In the era of big-data, the jobs submitted to the clouds exhibit complicated structures represented by graphs, where the nodes denote the sub-tasks each of which can be accommodated at a slot in a server, while the edges indicate the communication constraints among the sub-tasks. We develop a framework for efficient allocation of graph jobs in geo-distributed cloud networks (GDCNs), explicitly considering the power consumption of the datacenters (DCs). We address the following two challenges arising in graph job allocation: i) the allocation problem belongs to NP-hard nonlinear integer programming; ii) the allocation requires solving the NP-complete sub-graph isomorphism problem, which is particularly cumbersome in large-scale GDCNs. We develop a suite of efficient solutions for GDCNs of various scales. For small-scale GDCNs, we propose an analytical approach based on convex programming. For medium-scale GDCNs, we develop a distributed allocation algorithm exploiting the processing power of DCs in parallel. Afterward, we provide a novel low-complexity (decentralized) sub-graph extraction method, based on which we introduce cloud crawlers aiming to extract allocations of good potentials for large-scale GDCNs. Given these suggested strategies, we further investigate strategy selection under both fixed and adaptive DC pricing schemes, and propose an online learning algorithm for each.

Distributed Parallel and Cluster Computing

Learning-based Dynamic Cache Management in a Cloud

112 - Jinhwan Choi , Yu Gu , Jinoh Kim 2019

Caches are an important component of modern computing systems given their significant impact on performance. In particular, caches play a key role in the cloud due to the nature of large-scale, data-intensive processing. One of the key challenges for the cloud providers is how to share the caching capacity among tenants, under the circumstance that each often requires a different degree of quality of service (QoS) with respect to data access performance. The invariant is that the individual tenants QoS requirements should be satisfied while the cache usage is optimized in a system-wide manner. In this paper, we introduce a learning-based approach for dynamic cache management in a cloud, which is based on the estimation of data access pattern of a tenant and the prediction of cache performance for the access pattern in question. We consider a variety of probability distributions to estimate the data access pattern, and examine a set of learning-based regression techniques to predict the cache hit rate for the access pattern. The predicted cache hit rate is then used to make a decision whether reallocating cache space is needed to meet the QoS requirement for the tenant. Our experimental results with an extensive set of synthetic traces and the YCSB benchmark show that the proposed method consistently optimizes the cache space while satisfying the QoS requirement.

Distributed Parallel and Cluster Computing