uqSim: Scalable and Validated Simulation of Cloud Microservices

86 0 0.0 ( 0 )

Download Cite

Added by Christina Delimitrou

Publication date 2019

fields Informatics Engineering

and research's language is English

Authors Yanqi Zhang - Yu Gan - Christina Delimitrou

Distributed Parallel and Cluster Computing

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Current cloud services are moving away from monolithic designs and towards graphs of many loosely-coupled, single-concerned microservices. Microservices have several advantages, including speeding up development and deployment, allowing specialization of the software infrastructure, and helping with debugging and error isolation. At the same time they introduce several hardware and software challenges. Given that most of the performance and efficiency implications of microservices happen at scales larger than what is available outside production deployments, studying such effects requires designing the right simulation infrastructures. We present uqSim, a scalable and validated queueing network simulator designed specifically for interactive microservices. uqSim provides detailed intra- and inter-microservice models that allow it to faithfully reproduce the behavior of complex, many-tier applications. uqSim is also modular, allowing reuse of individual models across microservices and end-to-end applications. We have validated uqSim both against simple and more complex microservices graphs, and have shown that it accurately captures performance in terms of throughput and tail latency. Finally, we use uqSim to model the tail at scale effects of request fanout, and the performance impact of power management in latency-sensitive microservices.

rate research

An Open-Source Benchmark Suite for Cloud and IoT Microservices

74 - Yu Gan , Yanqi Zhang , Dailun Cheng 2019

Cloud services have recently started undergoing a major shift from monolithic applications, to graphs of hundreds of loosely-coupled microservices. Microservices fundamentally change a lot of assumptions current cloud systems are designed with, and present both opportunities and challenges when optimizing for quality of service (QoS) and utilization. In this paper we explore the implications microservices have across the cloud system stack. We first present DeathStarBench, a novel, open-source benchmark suite built with microservices that is representative of large end-to-end services, modular and extensible. DeathStarBench includes a social network, a media service, an e-commerce site, a banking system, and IoT applications for coordination control of UAV swarms. We then use DeathStarBench to study the architectural characteristics of microservices, their implications in networking and operating systems, their challenges with respect to cluster management, and their trade-offs in terms of application design and programming frameworks. Finally, we explore the tail at scale effects of microservices in real deployments with hundreds of users, and highlight the increased pressure they put on performance predictability.

Distributed Parallel and Cluster Computing

Sage: Using Unsupervised Learning for Scalable Performance Debugging in Microservices

118 - Yu Gan , Mingyu Liang , Sundar Dev 2021

Cloud applications are increasingly shifting from large monolithic services to complex graphs of loosely-coupled microservices. Despite the advantages of modularity and elasticity microservices offer, they also complicate cluster management and performance debugging, as dependencies between tiers introduce backpressure and cascading QoS violations. We present Sage, a machine learning-driven root cause analysis system for interactive cloud microservices. Sage leverages unsupervised ML models to circumvent the overhead of trace labeling, captures the impact of dependencies between microservices to determine the root cause of unpredictable performance online, and applies corrective actions to recover a cloud services QoS. In experiments on both dedicated local clusters and large clusters on Google Compute Engine we show that Sage consistently achieves over 93% accuracy in correctly identifying the root cause of QoS violations, and improves performance predictability.

Distributed Parallel and Cluster Computing Performance

A Scalable and Cloud-Native Hyperparameter Tuning System

186 - Johnu George , Ce Gao , Richard Liu 2020

In this paper, we introduce Katib: a scalable, cloud-native, and production-ready hyperparameter tuning system that is agnostic of the underlying machine learning framework. Though there are multiple hyperparameter tuning systems available, this is the first one that caters to the needs of both users and administrators of the system. We present the motivation and design of the system and contrast it with existing hyperparameter tuning systems, especially in terms of multi-tenancy, scalability, fault-tolerance, and extensibility. It can be deployed on local machines, or hosted as a service in on-premise data centers, or in private/public clouds. We demonstrate the advantage of our system using experimental results as well as real-world, production use cases. Katib has active contributors from multiple companies and is open-sourced at emph{https://github.com/kubeflow/katib} under the Apache 2.0 license.

Distributed Parallel and Cluster Computing Machine Learning

Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters

226 - Shaohuai Shi , Xianhao Zhou , Shutao Song 2020

Distributed training techniques have been widely deployed in large-scale deep neural networks (DNNs) training on dense-GPU clusters. However, on public cloud clusters, due to the moderate inter-connection bandwidth between instances, traditional state-of-the-art distributed training systems cannot scale well in training large-scale models. In this paper, we propose a new computing and communication efficient top-k sparsification communication library for distributed training. To further improve the system scalability, we optimize I/O by proposing a simple yet efficient multi-level data caching mechanism and optimize the update operation by introducing a novel parallel tensor operator. Experimental results on a 16-node Tencent Cloud cluster (each node with 8 Nvidia Tesla V100 GPUs) show that our system achieves 25%-40% faster than existing state-of-the-art systems on CNNs and Transformer. We finally break the record on DAWNBench on training ResNet-50 to 93% top-5 accuracy on ImageNet.

Distributed Parallel and Cluster Computing Artificial Intelligence

Simulation of Quantum Many-Body Systems on Amazon Cloud

94 - Justin A. Reyes , Eduardo R. Mucciolo , Dan Marinescu 2019

Quantum many-body systems (QMBs) are some of the most challenging physical systems to simulate numerically. Methods involving approximations for tensor network (TN) contractions have proven to be viable alternatives to algorithms such as quantum Monte Carlo or simulated annealing. However, these methods are cumbersome, difficult to implement, and often have significant limitations in their accuracy and efficiency when considering systems in more than one dimension. In this paper, we explore the exact computation of TN contractions on two-dimensional geometries and present a heuristic improvement of TN contraction that reduces the computing time, the amount of memory, and the communication time. We run our algorithm for the Ising model using memory optimized x1.32x large instances on Amazon Web Services (AWS) Elastic Compute Cloud (EC2). Our results show that cloud computing is a viable alternative to supercomputers for this class of scientific applications.

Distributed Parallel and Cluster Computing Strongly Correlated Electrons Quantum Physics