Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Differentiate Quality of Experience Scheduling for Deep Learning Applications with Docker Containers in the Cloud

480 0 0.0 ( 0 )

Download Cite

Added by Ying Mao

Publication date 2020

fields Informatics Engineering

and research's language is English

Authors Ying Mao - Weifeng Yan - Yun Song

Distributed Parallel and Cluster Computing Performance

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

With the prevalence of big-data-driven applications, such as face recognition on smartphones and tailored recommendations from Google Ads, we are on the road to a lifestyle with significantly more intelligence than ever before. For example, Aipoly Vision [1] is an object and color recognizer that helps the blind, visually impaired, and color blind understand their surroundings. At the back end side of their intelligence, various neural networks powered models are running to enable quick responses to users. Supporting those models requires lots of cloud-based computational resources, e.g. CPUs and GPUs. The cloud providers charge their clients by the amount of resources that they occupied. From clients perspective, they have to balance the budget and quality of experiences (e.g. response time). The budget leans on individual business owners and the required Quality of Experience (QoE) depends on usage scenarios of different applications, for instance, an autonomous vehicle requires realtime response, but, unlocking your smartphone can tolerate delays. However, cloud providers fail to offer a QoE based option to their clients. In this paper, we propose DQoES, a differentiate quality of experience scheduler for deep learning applications. DQoES accepts clients specification on targeted QoEs, and dynamically adjust resources to approach their targets. Through extensive, cloud-based experiments, DQoES demonstrates that it can schedule multiple concurrent jobs with respect to various QoEs and achieve up to 8x times more satisfied models compared to the existing system.

rate research

Resource Management Schemes for Cloud-Native Platforms with Computing Containers of Docker and Kubernetes

93 - Ying Mao , Yuqi Fu , Suwen Gu 2020

Businesses have made increasing adoption and incorporation of cloud technology into internal processes in the last decade. The cloud-based deployment provides on-demand availability without active management. More recently, the concept of cloud-native application has been proposed and represents an invaluable step toward helping organizations develop software faster and update it more frequently to achieve dramatic business outcomes. Cloud-native is an approach to build and run applications that exploit the cloud computing delivery models advantages. It is more about how applications are created and deployed than where. The container-based virtualization technology, such as Docker and Kubernetes, serves as the foundation for cloud-native applications. This paper investigates the performance of two popular computational-intensive applications, big data, and deep learning, in a cloud-native environment. We analyze the system overhead and resource usage for these applications. Through extensive experiments, we show that the completion time reduces by up to 79.4% by changing the default setting and increases by up to 96.7% due to different resource management schemes on two platforms. Additionally, the resource release is delayed by up to 116.7% across different systems. Our work can guide developers, administrators, and researchers to better design and deploy their applications by selecting and configuring a hosting platform.

Distributed Parallel and Cluster Computing Performance

Speculative Container Scheduling for Deep Learning Applications in a Kubernetes Cluster

84 - Ying Mao , Yuqi Fu , Wenjia Zheng 2020

In the past decade, we have witnessed a dramatically increasing volume of data collected from varied sources. The explosion of data has transformed the world as more information is available for collection and analysis than ever before. To maximize the utilization, various machine and deep learning models have been developed, e.g. CNN [1] and RNN [2], to study data and extract valuable information from different perspectives. While data-driven applications improve countless products, training models for hyperparameter tuning is still a time-consuming and resource-intensive process. Cloud computing provides infrastructure support for the training of deep learning applications. The cloud service providers, such as Amazon Web Services [3], create an isolated virtual environment (virtual machines and containers) for clients, who share physical resources, e.g., CPU and memory. On the cloud, resource management schemes are implemented to enable better sharing among users and boost the system-wide performance. However, general scheduling approaches, such as spread priority and balanced resource schedulers, do not work well with deep learning workloads. In this project, we propose SpeCon, a novel container scheduler that is optimized for shortlived deep learning applications. Based on virtualized containers, such as Kubernetes [4] and Docker [5], SpeCon analyzes the common characteristics of training processes. We design a suite of algorithms to monitor the progress of the training and speculatively migrate the slow-growing models to release resources for fast-growing ones. Specifically, the extensive experiments demonstrate that SpeCon improves the completion time of an individual job by up to 41.5%, 14.8% system-wide and 24.7% in terms of makespan.

Distributed Parallel and Cluster Computing Performance

Machine Learning for Performance Prediction of Spark Cloud Applications

83 - Alexandre Maros , Fabricio Murai , Ana Paula Couto da Silva andn Jussara M. Almeida 2021

Big data applications and analytics are employed in many sectors for a variety of goals: improving customers satisfaction, predicting market behavior or improving processes in public health. These applications consist of complex software stacks that are often run on cloud systems. Predicting execution times is important for estimating the cost of cloud services and for effectively managing the underlying resources at runtime. Machine Learning (ML), providing black box solutions to model the relationship between application performance and system configuration without requiring in-detail knowledge of the system, has become a popular way of predicting the performance of big data applications. We investigate the cost-benefits of using supervised ML models for predicting the performance of applications on Spark, one of todays most widely used frameworks for big data analysis. We compare our approach with textit{Ernest} (an ML-based technique proposed in the literature by the Spark inventors) on a range of scenarios, application workloads, and cloud system configurations. Our experiments show that Ernest can accurately estimate the performance of very regular applications, but it fails when applications exhibit more irregular patterns and/or when extrapolating on bigger data set sizes. Results show that our models match or exceed Ernests performance, sometimes enabling us to reduce the prediction error from 126-187% to only 5-19%.

Distributed Parallel and Cluster Computing Performance

DRAPS: Dynamic and Resource-Aware Placement Scheme for Docker Containers in a Heterogeneous Cluster

111 - Ying Mao , Jenna Oak , Anthony Pompili 2018

Virtualization is a promising technology that has facilitated cloud computing to become the next wave of the Internet revolution. Adopted by data centers, millions of applications that are powered by various virtual machines improve the quality of services. Although virtual machines are well-isolated among each other, they suffer from redundant boot volumes and slow provisioning time. To address limitations, containers were born to deploy and run distributed applications without launching entire virtual machines. As a dominant player, Docker is an open-source implementation of container technology. When managing a cluster of Docker containers, the management tool, Swarmkit, does not take the heterogeneities in both physical nodes and virtualized containers into consideration. The heterogeneity lies in the fact that different nodes in the cluster may have various configurations, concerning resource types and availabilities, etc., and the demands generated by services are varied, such as CPU-intensive (e.g. Clustering services) as well as memory-intensive (e.g. Web services). In this paper, we target on investigating the Docker container cluster and developed, DRAPS, a resource-aware placement scheme to boost the system performance in a heterogeneous cluster.

Distributed Parallel and Cluster Computing

A Comparative Study of Containers and Virtual Machines in Big Data Environment

181 - Qi Zhang , Ling Liu , Calton Pu 2018

Container technique is gaining increasing attention in recent years and has become an alternative to traditional virtual machines. Some of the primary motivations for the enterprise to adopt the container technology include its convenience to encapsulate and deploy applications, lightweight operations, as well as efficiency and flexibility in resources sharing. However, there still lacks an in-depth and systematic comparison study on how big data applications, such as Spark jobs, perform between a container environment and a virtual machine environment. In this paper, by running various Spark applications with different configurations, we evaluate the two environments from many interesting aspects, such as how convenient the execution environment can be set up, what are makespans of different workloads running in each setup, how efficient the hardware resources, such as CPU and memory, are utilized, and how well each environment can scale. The results show that compared with virtual machines, containers provide a more easy-to-deploy and scalable environment for big data workloads. The research work in this paper can help practitioners and researchers to make more informed decisions on tuning their cloud environment and configuring the big data applications, so as to achieve better performance and higher resources utilization.

Distributed Parallel and Cluster Computing Performance

comments

Fetching comments

University of Babylon

Additional details More universities

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Differentiate Quality of Experience Scheduling for Deep Learning Applications with Docker Containers in the Cloud

Ask ChatGPT about the research

No Arabic abstract

Read More