No Arabic abstract
This paper presents BigDL (a distributed deep learning framework for Apache Spark), which has been used by a variety of users in the industry for building deep learning applications on production big data platforms. It allows deep learning applications to run on the Apache Hadoop/Spark cluster so as to directly process the production data, and as a part of the end-to-end data analysis pipeline for deployment and management. Unlike existing deep learning frameworks, BigDL implements distributed, data parallel training directly on top of the functional compute model (with copy-on-write and coarse-grained operations) of Spark. We also share real-world experience and war stories of users that have adopted BigDL to address their challenges(i.e., how to easily build end-to-end data analysis and deep learning pipelines for their production data).
A major driver behind the success of modern machine learning algorithms has been their ability to process ever-larger amounts of data. As a result, the use of distributed systems in both research and production has become increasingly prevalent as a means to scale to this growing data. At the same time, however, distributing the learning process can drastically complicate the implementation of even simple algorithms. This is especially problematic as many machine learning practitioners are not well-versed in the design of distributed systems, let alone those that have complicated communication topologies. In this work we introduce Launchpad, a programming model that simplifies the process of defining and launching distributed systems that is specifically tailored towards a machine learning audience. We describe our framework, its design philosophy and implementation, and give a number of examples of common learning algorithms whose designs are greatly simplified by this approach.
In recent years, data and computing resources are typically distributed in the devices of end users, various regions or organizations. Because of laws or regulations, the distributed data and computing resources cannot be directly shared among different regions or organizations for machine learning tasks. Federated learning emerges as an efficient approach to exploit distributed data and computing resources, so as to collaboratively train machine learning models, while obeying the laws and regulations and ensuring data security and data privacy. In this paper, we provide a comprehensive survey of existing works for federated learning. We propose a functional architecture of federated learning systems and a taxonomy of related techniques. Furthermore, we present the distributed training, data communication, and security of FL systems. Finally, we analyze their limitations and propose future research directions.
This paper presents a deep learning framework based on Long Short-term Memory Network(LSTM) that predicts price movement of cryptocurrencies from trade-by-trade data. The main focus of this study is on predicting short-term price changes in a fixed time horizon from a looking back period. By carefully designing features and detailed searching for best hyper-parameters, the model is trained to achieve high performance on nearly a year of trade-by-trade data. The optimal model delivers stable high performance(over 60% accuracy) on out-of-sample test periods. In a realistic trading simulation setting, the prediction made by the model could be easily monetized. Moreover, this study shows that the LSTM model could extract universal features from trade-by-trade data, as the learned parameters well maintain their high performance on other cryptocurrency instruments that were not included in training data. This study exceeds existing researches in term of the scale and precision of data used, as well as the high prediction accuracy achieved.
This paper introduces RankMap, a platform-aware end-to-end framework for efficient execution of a broad class of iterative learning algorithms for massive and dense datasets. Our framework exploits data structure to factorize it into an ensemble of lower rank subspaces. The factorization creates sparse low-dimensional representations of the data, a property which is leveraged to devise effective mapping and scheduling of iterative learning algorithms on the distributed computing machines. We provide two APIs, one matrix-based and one graph-based, which facilitate automated adoption of the framework for performing several contemporary learning applications. To demonstrate the utility of RankMap, we solve sparse recovery and power iteration problems on various real-world datasets with up to 1.8 billion non-zeros. Our evaluations are performed on Amazon EC2 and IBM iDataPlex servers using up to 244 cores. The results demonstrate up to two orders of magnitude improvements in memory usage, execution speed, and bandwidth compared with the best reported prior work, while achieving the same level of learning accuracy.
Grid computing systems require innovative methods and tools to identify cybersecurity incidents and perform autonomous actions i.e. without administrator intervention. They also require methods to isolate and trace job payload activity in order to protect users and find evidence of malicious behavior. We introduce an integrated approach of security monitoring via Security by Isolation with Linux Containers and Deep Learning methods for the analysis of real time data in Grid jobs running inside virtualized High-Throughput Computing infrastructure in order to detect and prevent intrusions. A dataset for malware detection in Grid computing is described. We show in addition the utilization of generative methods with Recurrent Neural Networks to improve the collected dataset. We present Arhuaco, a prototype implementation of the proposed methods. We empirically study the performance of our technique. The results show that Arhuaco outperforms other methods used in Intrusion Detection Systems for Grid Computing. The study is carried out in the ALICE Collaboration Grid, part of the Worldwide LHC Computing Grid.