No Arabic abstract
Deep learning (DL), a form of machine learning, is becoming increasingly popular in several application domains. As a result, cloud-based Deep Learning as a Service (DLaaS) platforms have become an essential infrastructure in many organizations. These systems accept, schedule, manage and execute DL training jobs at scale. This paper explores dependability in the context of a DLaaS platform used in IBM. We begin by explaining how DL training workloads are different, and what features ensure dependability in this context. We then describe the architecture, design and implementation of a cloud-based orchestration system for DL training. We show how this system has been architected with dependability in mind while also being horizontally scalable, elastic, flexible and efficient. We also present an initial empirical evaluation of the overheads introduced by our platform, and discuss tradeoffs between efficiency and dependability.
Deep learning (DL) is becoming increasingly popular in several application domains and has made several new application features involving computer vision, speech recognition and synthesis, self-driving automobiles, drug design, etc. feasible and accurate. As a result, large scale on-premise and cloud-hosted deep learning platforms have become essential infrastructure in many organizations. These systems accept, schedule, manage and execute DL training jobs at scale. This paper describes the design, implementation and our experiences with FfDL, a DL platform used at IBM. We describe how our design balances dependability with scalability, elasticity, flexibility and efficiency. We examine FfDL qualitatively through a retrospective look at the lessons learned from building, operating, and supporting FfDL; and quantitatively through a detailed empirical evaluation of FfDL, including the overheads introduced by the platform for various deep learning models, the load and performance observed in a real case study using FfDL within our organization, the frequency of various faults observed including unanticipated faults, and experiments demonstrating the benefits of various scheduling policies. FfDL has been open-sourced.
Container technologies have been evolving rapidly in the cloud-native era. Kubernetes, as a production-grade container orchestration platform, has been proven to be successful at managing containerized applications in on-premises datacenters. However, Kubernetes lacks sufficient multi-tenant supports by design, meaning in cloud environments, dedicated clusters are required to serve multiple users, i.e., tenants. This limitation significantly diminishes the benefits of cloud computing, and makes it difficult to build multi-tenant software as a service (SaaS) products using Kubernetes. In this paper, we propose Virtual-Cluster, a new multi-tenant framework that extends Kubernetes with adequate multi-tenant supports. Basically, VirtualCluster provides both control plane and data plane isolations while sharing the underlying compute resources among tenants. The new framework preserves the API compatibility by avoiding modifying the Kubernetes core components. Hence, it can be easily integrated with existing Kubernetes use cases. Our experimental results show that the overheads introduced by VirtualCluster, in terms of latency and throughput, is moderate.
Blockchain has attracted a broad range of interests from start-ups, enterprises and governments to build next generation applications in a decentralized manner. Similar to cloud platforms, a single blockchain-based system may need to serve multiple tenants simultaneously. However, design of multi-tenant blockchain-based systems is challenging to architects in terms of data and performance isolation, as well as scalability. First, tenants must not be able to read other tenants data and tenants with potentially higher workload should not affect read/write performance of other tenants. Second, multi-tenant blockchain-based systems usually require both scalability for each individual tenant and scalability with number of tenants. Therefore, in this paper, we propose a scalable platform architecture for multi-tenant blockchain-based systems to ensure data integrity while maintaining data privacy and performance isolation. In the proposed architecture, each tenant has an individual permissioned blockchain to maintain their own data and smart contracts. All tenant chains are anchored into a main chain, in a way that minimizes cost and load overheads. The proposed architecture has been implemented in a proof-of-concept prototype with our industry partner, Laava ID Pty Ltd (Laava). We evaluate our proposal in a three-fold way: fulfilment of the identified requirements, qualitative comparison with design alternatives, and quantitative analysis. The evaluation results show that the proposed architecture can achieve data integrity, performance isolation, data privacy, configuration flexibility, availability, cost efficiency and scalability.
Deep learning driven by large neural network models is overtaking traditional machine learning methods for understanding unstructured and perceptual data domains such as speech, text, and vision. At the same time, the as-a-Service-based business model on the cloud is fundamentally transforming the information technology industry. These two trends: deep learning, and as-a-service are colliding to give rise to a new business model for cognitive application delivery: deep learning as a service in the cloud. In this paper, we will discuss the details of the software architecture behind IBMs deep learning as a service (DLaaS). DLaaS provides developers the flexibility to use popular deep learning libraries such as Caffe, Torch and TensorFlow, in the cloud in a scalable and resilient manner with minimal effort. The platform uses a distribution and orchestration layer that facilitates learning from a large amount of data in a reasonable amount of time across compute nodes. A resource provisioning layer enables flexible job management on heterogeneous resources, such as graphics processing units (GPUs) and central processing units (CPUs), in an infrastructure as a service (IaaS) cloud.
In cloud computing, network Denial of Service (DoS) attacks are well studied and defenses have been implemented, but severe DoS attacks on a victims working memory by a single hostile VM are not well understood. Memory DoS attacks are Denial of Service (or Degradation of Service) attacks caused by contention for hardware memory resources on a cloud server. Despite the strong memory isolation techniques for virtual machines (VMs) enforced by the software virtualization layer in cloud servers, the underlying hardware memory layers are still shared by the VMs and can be exploited by a clever attacker in a hostile VM co-located on the same server as the victim VM, denying the victim the working memory he needs. We first show quantitatively the severity of contention on different memory resources. We then show that a malicious cloud customer can mount low-cost attacks to cause severe performance degradation for a Hadoop distributed application, and 38X delay in response time for an E-commerce website in the Amazon EC2 cloud. Then, we design an effective, new defense against these memory DoS attacks, using a statistical metric to detect their existence and execution throttling to mitigate the attack damage. We achieve this by a novel re-purposing of existing hardware performance counters and duty cycle modulation for security, rather than for improving performance or power consumption. We implement a full prototype on the OpenStack cloud system. Our evaluations show that this defense system can effectively defeat memory DoS attacks with negligible performance overhead.