Container technologies have been evolving rapidly in the cloud-native era. Kubernetes, a production-grade container orchestration platform, has proven successful at managing containerized applications in on-premises datacenters. However, Kubernetes lacks sufficient multi-tenant support by design, meaning that in cloud environments, dedicated clusters are required to serve multiple users, i.e., tenants. This limitation significantly diminishes the benefits of cloud computing and makes it difficult to build multi-tenant software-as-a-service (SaaS) products on Kubernetes. In this paper, we propose VirtualCluster, a new multi-tenant framework that extends Kubernetes with adequate multi-tenant support. VirtualCluster provides both control plane and data plane isolation while sharing the underlying compute resources among tenants. The new framework preserves API compatibility by avoiding modification of the Kubernetes core components, so it can be easily integrated with existing Kubernetes use cases. Our experimental results show that the overheads introduced by VirtualCluster, in terms of latency and throughput, are moderate.
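To make the isolation model concrete, the following toy sketch mimics the split the abstract describes: each tenant interacts with a private control plane, while a syncer multiplexes tenant pods onto one shared data plane. All class and method names here are hypothetical illustrations, not the actual VirtualCluster API.

    # Illustrative sketch only: a toy model of per-tenant control planes
    # sharing one node pool, in the spirit of VirtualCluster. Names are
    # hypothetical, not the real VirtualCluster API.

    class TenantControlPlane:
        """Each tenant sees a private API surface (control plane isolation)."""
        def __init__(self, tenant_id):
            self.tenant_id = tenant_id
            self.pods = {}          # tenant-visible pod objects

        def create_pod(self, name, spec):
            self.pods[name] = {"spec": spec, "phase": "Pending"}
            return self.pods[name]

    class SuperClusterSyncer:
        """Propagates tenant pods into one shared cluster (data plane sharing),
        prefixing names so tenants cannot collide with each other."""
        def __init__(self):
            self.shared_pods = {}   # what actually runs on the shared nodes

        def sync(self, tcp):
            for name, pod in tcp.pods.items():
                shared_name = f"{tcp.tenant_id}-{name}"   # isolate by renaming
                self.shared_pods[shared_name] = pod["spec"]
                pod["phase"] = "Running"

    tenant_a = TenantControlPlane("tenant-a")
    tenant_a.create_pod("web", {"image": "nginx"})
    syncer = SuperClusterSyncer()
    syncer.sync(tenant_a)
    print(syncer.shared_pods)   # {'tenant-a-web': {'image': 'nginx'}}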
Deep learning (DL), a form of machine learning, is becoming increasingly popular in several application domains. As a result, cloud-based Deep Learning as a Service (DLaaS) platforms have become essential infrastructure in many organizations. These systems accept, schedule, manage, and execute DL training jobs at scale. This paper explores dependability in the context of a DLaaS platform used at IBM. We begin by explaining how DL training workloads differ from other workloads and which features ensure dependability in this context. We then describe the architecture, design, and implementation of a cloud-based orchestration system for DL training. We show how this system has been architected with dependability in mind while also being horizontally scalable, elastic, flexible, and efficient. We also present an initial empirical evaluation of the overheads introduced by our platform and discuss the tradeoffs between efficiency and dependability.
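As one concrete example of a dependability feature such a platform might offer (an assumption on our part; the abstract does not name specific mechanisms), the sketch below shows retry-from-checkpoint for a training job. All names and the fault model are illustrative.

    # Hedged sketch: one dependability pattern for a DL training orchestrator,
    # retry-from-checkpoint. The API shown is hypothetical, not IBM's actual
    # implementation.

    import random

    checkpoint = {}   # job_id -> last completed epoch, persisted progress

    def run_training(job_id, start_epoch, total_epochs):
        """Stand-in for a training container; fails randomly to model node loss."""
        for epoch in range(start_epoch, total_epochs):
            if random.random() < 0.1:            # simulated infrastructure fault
                raise RuntimeError(f"{job_id} failed at epoch {epoch}")
            checkpoint[job_id] = epoch + 1       # record progress after each epoch
        return "Succeeded"

    def schedule_with_retries(job_id, total_epochs, max_retries=3):
        """Dependability: resume from the last checkpoint instead of epoch 0."""
        for attempt in range(max_retries + 1):
            try:
                return run_training(job_id, checkpoint.get(job_id, 0), total_epochs)
            except RuntimeError as err:
                print(f"attempt {attempt}: {err}, resuming from checkpoint")
        return "Failed"

    print(schedule_with_retries("job-42", total_epochs=20))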
This paper presents results from the ongoing development of the Cloud Services Delivery Infrastructure (CSDI), which provides a basis for infrastructure-centric cloud service provisioning, operation, and management in a multi-cloud, multi-provider environment, defined as the Zero Touch Provisioning, Operation and Management (ZTP/ZTPOM) model. The presented work draws on use cases from data-intensive research that require high-performance computing resources and large storage volumes, typically distributed between datacenters and often involving multiple cloud providers. Automation for large-scale scientific (and industrial) applications should include provisioning of both inter-cloud network infrastructure and intra-cloud application resources. It should support the complete application operation workflow, together with the changes to the application infrastructure and resources that can occur during the application lifecycle. The authors investigate existing technologies for automating service provisioning and management processes, aiming to cross-pollinate best practices from currently disconnected domains such as cloud-based application provisioning and multi-domain high-performance network provisioning. The paper builds on the authors' earlier research, the Open Cloud eXchange (OCX), which was proposed to address the last-mile problem in cloud service delivery to campuses over trans-national backbone networks such as GEANT. OCX will serve as an integral component of the prospective ZTP infrastructure over the GEANT network. Another important component, the Marketplace, is defined for cloud service and application discovery (in a general inter-cloud environment) and may also support additional services such as service composition and trust brokering for establishing customer-provider federations.
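As a rough illustration of what a zero-touch workflow could look like in practice (the paper defines ZTP/ZTPOM architecturally, not as code), the following sketch models provisioning as an ordered plan spanning inter-cloud network and intra-cloud resources. All step names and parameters are hypothetical.

    # Illustrative only: ZTP-style provisioning as an ordered, recorded plan.
    # Step names, providers, and parameters are hypothetical assumptions.

    PLAN = [
        ("provision_intercloud_network", {"provider": "GEANT", "bandwidth_gbps": 10}),
        ("provision_compute", {"provider": "cloud-a", "vms": 8}),
        ("provision_storage", {"provider": "cloud-b", "volume_tb": 50}),
        ("deploy_application", {"workflow": "data-intensive-pipeline"}),
    ]

    def execute(plan):
        """Run each step once and record its state, so later operation and
        management actions can adjust the infrastructure without manual steps."""
        state = {}
        for step, params in plan:
            print(f"executing {step} with {params}")
            state[step] = "done"
        return state

    print(execute(PLAN))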
Deep learning (DL) is becoming increasingly popular in several application domains and has made feasible and accurate several new application features involving computer vision, speech recognition and synthesis, self-driving automobiles, drug design, and more. As a result, large-scale on-premises and cloud-hosted deep learning platforms have become essential infrastructure in many organizations. These systems accept, schedule, manage, and execute DL training jobs at scale. This paper describes the design, implementation, and our experiences with FfDL, a DL platform used at IBM. We describe how our design balances dependability with scalability, elasticity, flexibility, and efficiency. We examine FfDL qualitatively, through a retrospective look at the lessons learned from building, operating, and supporting FfDL, and quantitatively, through a detailed empirical evaluation of FfDL that covers the overheads introduced by the platform for various deep learning models, the load and performance observed in a real case study using FfDL within our organization, the frequency of various faults observed, including unanticipated faults, and experiments demonstrating the benefits of various scheduling policies. FfDL has been open-sourced.
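To give a flavor of how scheduling policies can be compared on a job trace (a simplification we introduce; FfDL's actual schedulers are considerably richer), consider this toy single-GPU simulation contrasting FIFO with a per-user fair-share rotation.

    # Toy comparison of two scheduling policies on one GPU; the trace and
    # policies are illustrative assumptions, not FfDL code.

    from collections import deque

    jobs = [("u1", 4), ("u1", 4), ("u2", 1), ("u2", 1)]  # (user, duration)

    def fifo(trace):
        """Serve jobs in arrival order; return per-job completion times."""
        t, done = 0, []
        for user, dur in trace:
            t += dur
            done.append((user, t))
        return done

    def round_robin_by_user(trace):
        """Fair-share flavor: alternate users so short jobs are not starved."""
        queues = {}
        for user, dur in trace:
            queues.setdefault(user, deque()).append(dur)
        t, done = 0, []
        while any(queues.values()):
            for user, q in queues.items():
                if q:
                    t += q.popleft()
                    done.append((user, t))
        return done

    print("FIFO:      ", fifo(jobs))        # u2's first job finishes at t=9
    print("Fair-share:", round_robin_by_user(jobs))  # ...versus t=5 here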
Driven by emerging tolerance-critical use cases of future communication networks, the demand on cloud computing service providers for reliable and timely service delivery is set to increase dramatically in the upcoming era. Advanced techniques to resolve the congestion of task queues are therefore called for. In this study, we propose to exploit the impatient behavior of cloud service tenants for distributed, risk-based queue management, which enables profitability-sensitive task dropping while protecting the tenants' data privacy. Regarding the service provider's data privacy, we propose a dynamic online learning scheme that allows a tenant to learn the queue dynamics from an adaptive number of observations of its own position in the queue, so as to make a rational decision about its impatient behavior.
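A minimal sketch of the tenant-side decision, under reward, cost, and observation parameters of our own choosing: the tenant estimates the service rate online purely from movements of its own queue position, then stays in the queue only while the task remains profitable. The update rule and threshold are illustrative, not the paper's exact scheme.

    # Hedged sketch: privacy-preserving impatience. The tenant observes only
    # its own position; the Bernoulli service model and profitability test
    # are illustrative assumptions.

    import random

    def tenant_policy(position, reward=10.0, wait_cost=0.5, observations=20):
        """Return True to stay in the queue, False to renege (drop the task)."""
        rate_est, pos = 1.0, position
        for t in range(1, observations + 1):
            advanced = random.random() < 0.3      # stand-in for observed moves
            pos = max(pos - int(advanced), 0)
            # online running-average estimate of positions served per slot
            rate_est += (int(advanced) - rate_est) / t
        expected_wait = pos / max(rate_est, 1e-6)
        return reward > wait_cost * expected_wait  # profitability test

    print("stay" if tenant_policy(position=12) else "drop")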
The in-memory cache system is an important component of a cloud for data access performance. As tenants may have different performance goals for data access, depending on the nature of their tasks, effectively managing the memory cache is a crucial concern in such a shared computing environment. Two extreme methods for managing the memory cache are unlimited sharing and complete isolation, both of which are inefficient, incurring expensive storage overhead to meet per-tenant performance requirements. In this paper, we present a new cache model that incorporates global caching (based on unlimited sharing) and static caching (offering complete isolation) for a private cloud, in which it is critical to offer guaranteed performance while minimizing operating cost. The paper also presents a cache insertion algorithm tailored to the proposed cache model. The results of an extensive set of experiments, conducted in both simulation and emulation settings, confirm the validity of the presented cache architecture and insertion algorithm, showing optimized use of the cache space to meet per-tenant performance requirements.
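The following sketch, under assumed partition sizes and plain LRU eviction, illustrates only the two-level structure the abstract describes: a guaranteed per-tenant static partition with spill-over into a shared global pool. The paper's actual insertion algorithm is more refined than this.

    # A minimal sketch of the hybrid cache idea: private-first insertion with
    # overflow into a shared LRU pool. Sizes and eviction are assumptions.

    from collections import OrderedDict

    class HybridCache:
        def __init__(self, static_size, global_size):
            self.static = {}                   # tenant_id -> private partition
            self.static_size = static_size     # guaranteed per-tenant slots
            self.shared = OrderedDict()        # global pool shared by all
            self.global_size = global_size

        def put(self, tenant, key, value):
            part = self.static.setdefault(tenant, OrderedDict())
            if len(part) < self.static_size:   # insertion rule: private first
                part[key] = value
                return
            self.shared[(tenant, key)] = value  # then spill to the shared pool
            self.shared.move_to_end((tenant, key))
            if len(self.shared) > self.global_size:
                self.shared.popitem(last=False)  # evict global LRU victim

        def get(self, tenant, key):
            part = self.static.get(tenant, {})
            if key in part:
                return part[key]
            return self.shared.get((tenant, key))

    cache = HybridCache(static_size=2, global_size=4)
    for i in range(4):
        cache.put("tenant-a", f"k{i}", i)
    print(cache.get("tenant-a", "k0"), cache.get("tenant-a", "k3"))  # 0 3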