ﻻ يوجد ملخص باللغة العربية
Machine learning has proved to be a useful tool for extracting knowledge from scientific data in numerous research fields, including astrophysics, genomics, and molecular dynamics. Often, data sets from these research areas need to be processed in distributed platforms due to their magnitude. This can be done using one of the various distributed machine learning libraries available. One of these libraries is dislib, a distributed machine learning library for Python especially designed to process large scale data sets on HPC clusters, which makes dislib an ideal candidate for analyzing scientific data. However, dislibs main distributed data structure, called Dataset, has some limitations, including poor performance in certain operations and low flexibility and usability. In this paper, we propose a novel distributed data structure for dislib, called ds-array, that addresses dislibs main limitations in data management. Ds-arrays simplify distributed data management in dislib by exposing a NumPy-like API, provide more flexibility, and reduce the computational complexity of some operations. This results in performance improvements of up to two orders of magnitude over Datasets, while also greatly improving scalability and usability.
TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow
A major driver behind the success of modern machine learning algorithms has been their ability to process ever-larger amounts of data. As a result, the use of distributed systems in both research and production has become increasingly prevalent as a
In recent years, data and computing resources are typically distributed in the devices of end users, various regions or organizations. Because of laws or regulations, the distributed data and computing resources cannot be directly shared among differ
This paper explores serverless cloud computing for double machine learning. Being based on repeated cross-fitting, double machine learning is particularly well suited to exploit the high level of parallelism achievable with serverless computing. It a
Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this paper, we fo