Systems for processing big data, such as Hadoop, Spark, and massively parallel databases, need to run workloads on behalf of multiple tenants simultaneously. The abundant disk-based storage in these systems is usually complemented by a smaller, but much faster, cache. Cache is a precious resource: tenants who get to use it can see performance improvements of two orders of magnitude. Cache is also a limited and hence shared resource: unlike a CPU core, which can be used by only one tenant at a time, a cached data item can be accessed by multiple tenants simultaneously. Cache therefore has to be shared across tenants by a multi-tenancy-aware policy, with each tenant having its own priorities and workload characteristics. In this paper, we develop cache allocation strategies that speed up the overall workload while being fair to each tenant. We build a novel fairness model targeted at the shared-resource setting that incorporates not only the standard concepts of Pareto efficiency and sharing incentive, but also envy freeness, defined via the notion of the core from cooperative game theory. Our cache management platform, ROBUS, uses randomization over small time batches, and we develop a proportionally fair allocation mechanism that satisfies the core property in expectation. We show that this algorithm and related fair algorithms can be approximated to arbitrary precision in polynomial time. We evaluate these algorithms on a ROBUS prototype implemented on Spark, with the RDD store used as the cache. Our evaluation on a synthetically generated, industry-standard workload shows that our algorithms provide a speedup close to that of performance-optimal algorithms while guaranteeing fairness across tenants.
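To make the batched, proportionally fair allocation idea concrete, the sketch below optimizes Nash social welfare (the objective underlying proportional fairness) for a single batch over toy tenant utilities. The per-tenant utility curves, the block granularity, and the exhaustive search are illustrative assumptions, not the ROBUS mechanism itself, which randomizes over cache configurations to satisfy the core property in expectation.

"""
Minimal sketch (assumed setup, not the authors' ROBUS implementation) of a
proportionally fair cache split for one scheduling batch: pick the division
of B cache blocks that maximizes the product of tenant utilities.
"""
import itertools
import math

def proportional_fair_split(block_values, total_blocks):
    """block_values[i][b] = utility tenant i gets from b cache blocks
    (non-decreasing in b, with block_values[i][0] > 0 to keep log finite).
    Exhaustively searches block splits; fine for a toy number of tenants."""
    tenants = len(block_values)
    best_split, best_nsw = None, -math.inf
    # Enumerate all ways to split total_blocks across tenants.
    for split in itertools.product(range(total_blocks + 1), repeat=tenants):
        if sum(split) != total_blocks:
            continue
        # Sum of logs == log of the product (Nash social welfare).
        nsw = sum(math.log(block_values[i][b]) for i, b in enumerate(split))
        if nsw > best_nsw:
            best_nsw, best_split = nsw, split
    return best_split

if __name__ == "__main__":
    # Hypothetical utilities: tenant A benefits steeply from cache,
    # tenant B only mildly. Values are hit counts + 1 to avoid log(0).
    utilities = [
        [1, 40, 70, 90, 100, 105],   # tenant A
        [1, 10, 18, 24, 28, 30],     # tenant B
    ]
    print(proportional_fair_split(utilities, total_blocks=5))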
With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across a range of products. These models are typically trained on shared, multi-tenant GPU clusters. As with existing cluster computing workloads, scheduling frameworks aim to provide high efficiency, resource isolation, fair sharing across users, and similar features. However, Deep Neural Network (DNN) workloads, predominantly trained on GPUs, differ in two significant ways from traditional big data analytics workloads. First, from a cluster utilization perspective, GPUs are a monolithic resource that cannot be shared at a fine granularity across users. Second, from a workload perspective, deep learning frameworks require gang scheduling, which reduces scheduling flexibility and makes the jobs themselves inelastic to failures at runtime. In this paper we present a detailed workload characterization of a two-month trace from a multi-tenant GPU cluster in a large enterprise. By correlating scheduler logs with logs from individual jobs, we study three distinct issues that affect cluster utilization for DNN training workloads on multi-tenant clusters: (1) the effect of gang scheduling and locality constraints on queuing, (2) the effect of locality on GPU utilization, and (3) failures during training. Based on our experience running this large-scale operation, we provide design guidelines for next-generation cluster schedulers for DNN training workloads.
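As a rough illustration of why gang scheduling and locality constraints cause queuing, the sketch below shows a toy placement routine: a job that needs k GPUs either waits until a single machine can host the whole gang (strict locality) or is spread across machines (relaxed locality) at the cost of utilization. Machine sizes, job sizes, and the placement heuristic are assumptions for illustration, not the production scheduler studied in the paper.

"""
Toy sketch (assumed, not the paper's scheduler): a DNN job asking for k GPUs
runs only when all k are simultaneously free, and a relaxed placement may
scatter it across machines.
"""

def try_gang_place(free_gpus_per_machine, gpus_needed, strict_locality=True):
    """Return a {machine: gpus} placement, or None if the gang cannot start."""
    # Strict locality: the whole gang must fit on a single machine.
    if strict_locality:
        for m, free in enumerate(free_gpus_per_machine):
            if free >= gpus_needed:
                return {m: gpus_needed}
        return None  # job keeps waiting in the queue
    # Relaxed locality: spread across machines, largest free pool first.
    placement, remaining = {}, gpus_needed
    for m in sorted(range(len(free_gpus_per_machine)),
                    key=lambda i: -free_gpus_per_machine[i]):
        take = min(free_gpus_per_machine[m], remaining)
        if take > 0:
            placement[m] = take
            remaining -= take
        if remaining == 0:
            return placement
    return None  # even spreading cannot gather enough GPUs right now

if __name__ == "__main__":
    free = [3, 2, 3]          # 8 free GPUs in total, but fragmented
    print(try_gang_place(free, 4, strict_locality=True))   # None -> job queues
    print(try_gang_place(free, 4, strict_locality=False))  # spread placement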
An in-memory cache system is an important component of a cloud for data access performance. Because tenants may have different performance goals for data access depending on the nature of their tasks, effectively managing the memory cache is a crucial concern in such a shared computing environment. Two extreme methods for managing the memory cache are unlimited sharing and complete isolation, both of which are inefficient, requiring excessive cache capacity to meet per-tenant performance requirements. In this paper, we present a new cache model that combines global caching (based on unlimited sharing) and static caching (offering complete isolation) for a private cloud, in which it is critical to offer guaranteed performance while minimizing operating cost. We also present a cache insertion algorithm tailored to the proposed cache model. The results of an extensive set of experiments, conducted in both simulation and emulation settings, confirm the validity of the presented cache architecture and insertion algorithm, showing efficient use of cache space while meeting per-tenant performance requirements.
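A minimal sketch of the hybrid idea follows, assuming per-tenant reserved LRU partitions plus one shared global LRU region and a simple "reserved first, then shared" insertion rule; the paper's actual insertion algorithm and partition sizing are more involved.

"""
Sketch only (assumed insertion rule, not the paper's algorithm): each tenant
gets a statically reserved partition (isolation) and all tenants share a
global LRU partition (sharing).
"""
from collections import OrderedDict

class HybridCache:
    def __init__(self, static_quota, global_capacity):
        # static_quota: {tenant: reserved slots}; each partition is LRU.
        self.static = {t: OrderedDict() for t in static_quota}
        self.quota = dict(static_quota)
        self.global_part = OrderedDict()   # shared LRU region
        self.global_cap = global_capacity

    def get(self, tenant, key):
        for part in (self.static[tenant], self.global_part):
            if key in part:
                part.move_to_end(key)      # refresh LRU recency
                return part[key]
        return None                        # cache miss

    def put(self, tenant, key, value):
        reserved = self.static[tenant]
        if key in reserved or len(reserved) < self.quota[tenant]:
            reserved[key] = value
            reserved.move_to_end(key)
            return
        # Reserved partition is full: fall back to the shared global region.
        self.global_part[key] = value
        self.global_part.move_to_end(key)
        if len(self.global_part) > self.global_cap:
            self.global_part.popitem(last=False)   # evict global LRU victim

if __name__ == "__main__":
    cache = HybridCache(static_quota={"A": 2, "B": 1}, global_capacity=2)
    for k in ["a1", "a2", "a3", "a4"]:
        cache.put("A", k, k.upper())
    print(cache.get("A", "a1"), cache.get("A", "a3"), cache.get("A", "a4"))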
Web application performance is heavily reliant on the hit rate of memory-based caches. Current DRAM-based web caches statically partition their memory across the applications sharing the cache. This causes underutilization of memory, which negatively impacts cache hit rates. We present Memshare, a novel web memory cache that dynamically manages memory across applications. Memshare provides a resource sharing model that guarantees private memory to each application while dynamically allocating the remaining shared memory to optimize overall hit rate. Today's high cost of DRAM storage and the availability of high-performance CPUs and memory bandwidth make web caches memory-capacity bound. Memshare's log-structured design allows it to provide significantly higher hit rates and to dynamically partition memory among applications, at the expense of increased CPU and memory bandwidth consumption. In addition, Memshare allows each application to use its own eviction policy for its objects, independent of other applications. We implemented Memshare and ran it on a week-long trace from a commercial memcached provider. We demonstrate that Memshare increases the combined hit rate of the applications in the trace by 6.1% (from 84.7% to 90.8%) and reduces the total number of misses by 39.7%, without affecting system throughput or latency. Even for single-tenant applications, Memshare increases the average hit rate of the current state-of-the-art memory cache by an additional 2.7% on our real-world trace.
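The sketch below illustrates the arbitration idea at a high level: guaranteed private memory per application plus greedy assignment of the shared memory to whichever application is estimated to gain the most hits from its next slab. The marginal-benefit curves, slab size, and greedy loop are hypothetical inputs and mechanics; Memshare itself derives such estimates online from its log-structured store.

"""
Illustrative sketch (assumed inputs, not Memshare's implementation) of
dividing shared memory by estimated marginal hit-rate benefit.
"""

def allocate_shared_memory(private_mb, shared_mb, marginal_hits_per_mb,
                           slab_mb=64):
    """Greedy arbitration: repeatedly give one slab of shared memory to the
    app with the largest estimated hits-per-MB gain for its next slab."""
    alloc = {app: private_mb[app] for app in private_mb}   # guaranteed part
    extra = {app: 0 for app in private_mb}
    remaining = shared_mb
    while remaining >= slab_mb:
        # marginal_hits_per_mb[app] lists the gain of the 1st, 2nd, ... slab.
        def gain(app):
            slabs_taken = extra[app] // slab_mb
            curve = marginal_hits_per_mb[app]
            return curve[min(slabs_taken, len(curve) - 1)]
        best = max(private_mb, key=gain)
        alloc[best] += slab_mb
        extra[best] += slab_mb
        remaining -= slab_mb
    return alloc

if __name__ == "__main__":
    private = {"app1": 256, "app2": 256}
    # Hypothetical diminishing returns: app1 benefits more from early slabs.
    curves = {"app1": [9.0, 6.0, 2.0, 0.5], "app2": [4.0, 3.5, 3.0, 2.5]}
    print(allocate_shared_memory(private, shared_mb=512,
                                 marginal_hits_per_mb=curves))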
Container technologies have been evolving rapidly in the cloud-native era. Kubernetes, as a production-grade container orchestration platform, has proven successful at managing containerized applications in on-premises datacenters. However, Kubernetes lacks sufficient multi-tenant support by design, meaning that in cloud environments, dedicated clusters are required to serve multiple users, i.e., tenants. This limitation significantly diminishes the benefits of cloud computing and makes it difficult to build multi-tenant software-as-a-service (SaaS) products on Kubernetes. In this paper, we propose VirtualCluster, a new multi-tenant framework that extends Kubernetes with adequate multi-tenant support. VirtualCluster provides both control plane and data plane isolation while sharing the underlying compute resources among tenants. The new framework preserves API compatibility by avoiding modifications to the Kubernetes core components; hence, it can be easily integrated with existing Kubernetes use cases. Our experimental results show that the overheads introduced by VirtualCluster, in terms of latency and throughput, are moderate.
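As a conceptual sketch of the sync pattern implied by control plane isolation, the code below copies tenant-namespaced objects into a shared "super" cluster store under tenant-prefixed namespaces so that identically named objects from different tenants cannot collide. The dict-based object representation and the translation rule are illustrative assumptions; VirtualCluster's actual syncer operates against the Kubernetes API.

"""
Conceptual sketch (assumed data model, not VirtualCluster's syncer): each
tenant has its own control plane, and a syncer pushes the tenant's objects
into the shared super cluster under a translated namespace.
"""

def translated_namespace(tenant, namespace):
    # Prefix tenant identity so identical tenant namespaces stay isolated.
    return f"{tenant}-{namespace}"

def sync_downward(tenant, tenant_objects, super_cluster_store):
    """Push tenant objects (e.g. pod specs) down into the super cluster store."""
    for obj in tenant_objects:
        target_ns = translated_namespace(tenant, obj["namespace"])
        key = (target_ns, obj["name"])
        desired = {**obj, "namespace": target_ns}
        if super_cluster_store.get(key) != desired:
            super_cluster_store[key] = desired     # create or update
    return super_cluster_store

if __name__ == "__main__":
    super_store = {}
    # Two tenants both using a namespace literally called "default".
    sync_downward("tenant-a", [{"name": "web", "namespace": "default",
                                "image": "nginx"}], super_store)
    sync_downward("tenant-b", [{"name": "web", "namespace": "default",
                                "image": "nginx"}], super_store)
    for key in sorted(super_store):
        print(key)   # ('tenant-a-default', 'web') and ('tenant-b-default', 'web')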
In recent years, machine intelligence (MI) applications have emerged as a major driver for the computing industry. Optimizing these workloads is important but complicated. As memory demands grow and data movement overheads increasingly limit performance, determining the best GPU caching policy to use for a diverse range of MI workloads represents one important challenge. To study this, we evaluate 17 MI applications and characterize their behaviors using a range of GPU caching strategies. In our evaluations, we find that the choice of caching policy in GPU caches involves multiple performance trade-offs and interactions, and there is no one-size-fits-all GPU caching policy for MI workloads. Based on detailed simulation results, we motivate and evaluate a set of cache optimizations that consistently match the performance of the best static GPU caching policies.
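To illustrate why no single caching policy wins across workloads, the sketch below replays two synthetic access traces through a tiny cache model under plain LRU and an LRU variant that bypasses lines on their first touch. The traces, cache size, and policies are illustrative stand-ins for the detailed GPU simulation used in the paper.

"""
Minimal sketch (synthetic traces, assumed policies) comparing hit rates of
LRU and a first-touch-bypass variant on reuse-heavy vs. streaming-mixed
access patterns.
"""
from collections import OrderedDict

def simulate(trace, capacity, bypass_first_touch=False):
    cache, seen, hits = OrderedDict(), set(), 0
    for line in trace:
        if line in cache:
            hits += 1
            cache.move_to_end(line)
            continue
        if bypass_first_touch and line not in seen:
            seen.add(line)        # do not pollute the cache on first touch
            continue
        cache[line] = True
        if len(cache) > capacity:
            cache.popitem(last=False)     # evict the LRU line
    return hits / len(trace)

if __name__ == "__main__":
    reuse_heavy = [i % 8 for i in range(1000)]        # small, hot working set
    streaming_mix = []
    for i in range(1000):
        streaming_mix += [i % 8, 1000 + i]            # hot line + streaming line
    for name, trace in [("reuse-heavy", reuse_heavy),
                        ("streaming-mix", streaming_mix)]:
        print(name,
              "LRU:", round(simulate(trace, capacity=8), 3),
              "bypass:", round(simulate(trace, capacity=8,
                                        bypass_first_touch=True), 3))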