No Arabic abstract
Deep learning (DL) is becoming increasingly popular in several application domains and has made several new application features involving computer vision, speech recognition and synthesis, self-driving automobiles, drug design, etc. feasible and accurate. As a result, large scale on-premise and cloud-hosted deep learning platforms have become essential infrastructure in many organizations. These systems accept, schedule, manage and execute DL training jobs at scale. This paper describes the design, implementation and our experiences with FfDL, a DL platform used at IBM. We describe how our design balances dependability with scalability, elasticity, flexibility and efficiency. We examine FfDL qualitatively through a retrospective look at the lessons learned from building, operating, and supporting FfDL; and quantitatively through a detailed empirical evaluation of FfDL, including the overheads introduced by the platform for various deep learning models, the load and performance observed in a real case study using FfDL within our organization, the frequency of various faults observed including unanticipated faults, and experiments demonstrating the benefits of various scheduling policies. FfDL has been open-sourced.
Deep learning (DL), a form of machine learning, is becoming increasingly popular in several application domains. As a result, cloud-based Deep Learning as a Service (DLaaS) platforms have become an essential infrastructure in many organizations. These systems accept, schedule, manage and execute DL training jobs at scale. This paper explores dependability in the context of a DLaaS platform used in IBM. We begin by explaining how DL training workloads are different, and what features ensure dependability in this context. We then describe the architecture, design and implementation of a cloud-based orchestration system for DL training. We show how this system has been architected with dependability in mind while also being horizontally scalable, elastic, flexible and efficient. We also present an initial empirical evaluation of the overheads introduced by our platform, and discuss tradeoffs between efficiency and dependability.
In this work we detail a novel open source library, called MMLSpark, that combines the flexible deep learning library Cognitive Toolkit, with the distributed computing framework Apache Spark. To achieve this, we have contributed Java Language bindings to the Cognitive Toolkit, and added several new components to the Spark ecosystem. In addition, we also integrate the popular image processing library OpenCV with Spark, and present a tool for the automated generation of PySpark wrappers from any SparkML estimator and use this tool to expose all work to the PySpark ecosystem. Finally, we provide a large library of tools for working and developing within the Spark ecosystem. We apply this work to the automated classification of Snow Leopards from camera trap images, and provide an end to end solution for the non-profit conservation organization, the Snow Leopard Trust.
Blockchain has attracted a broad range of interests from start-ups, enterprises and governments to build next generation applications in a decentralized manner. Similar to cloud platforms, a single blockchain-based system may need to serve multiple tenants simultaneously. However, design of multi-tenant blockchain-based systems is challenging to architects in terms of data and performance isolation, as well as scalability. First, tenants must not be able to read other tenants data and tenants with potentially higher workload should not affect read/write performance of other tenants. Second, multi-tenant blockchain-based systems usually require both scalability for each individual tenant and scalability with number of tenants. Therefore, in this paper, we propose a scalable platform architecture for multi-tenant blockchain-based systems to ensure data integrity while maintaining data privacy and performance isolation. In the proposed architecture, each tenant has an individual permissioned blockchain to maintain their own data and smart contracts. All tenant chains are anchored into a main chain, in a way that minimizes cost and load overheads. The proposed architecture has been implemented in a proof-of-concept prototype with our industry partner, Laava ID Pty Ltd (Laava). We evaluate our proposal in a three-fold way: fulfilment of the identified requirements, qualitative comparison with design alternatives, and quantitative analysis. The evaluation results show that the proposed architecture can achieve data integrity, performance isolation, data privacy, configuration flexibility, availability, cost efficiency and scalability.
Container technologies have been evolving rapidly in the cloud-native era. Kubernetes, as a production-grade container orchestration platform, has been proven to be successful at managing containerized applications in on-premises datacenters. However, Kubernetes lacks sufficient multi-tenant supports by design, meaning in cloud environments, dedicated clusters are required to serve multiple users, i.e., tenants. This limitation significantly diminishes the benefits of cloud computing, and makes it difficult to build multi-tenant software as a service (SaaS) products using Kubernetes. In this paper, we propose Virtual-Cluster, a new multi-tenant framework that extends Kubernetes with adequate multi-tenant supports. Basically, VirtualCluster provides both control plane and data plane isolations while sharing the underlying compute resources among tenants. The new framework preserves the API compatibility by avoiding modifying the Kubernetes core components. Hence, it can be easily integrated with existing Kubernetes use cases. Our experimental results show that the overheads introduced by VirtualCluster, in terms of latency and throughput, is moderate.
Swift for TensorFlow is a deep learning platform that scales from mobile devices to clusters of hardware accelerators in data centers. It combines a language-integrated automatic differentiation system and multiple Tensor implementations within a modern ahead-of-time compiled language oriented around mutable value semantics. The resulting platform has been validated through use in over 30 deep learning models and has been employed across data center and mobile applications.