No Arabic abstract
This paper introduces a modeling framework for distributed regression with agents/experts observing attribute-distributed data (heterogeneous data). Under this model, a new algorithm, the iterative covariance optimization algorithm (ICOA), is designed to reshape the covariance matrix of the training residuals of individual agents so that the linear combination of the individual estimators minimizes the ensemble training error. Moreover, a scheme (Minimax Protection) is designed to provide a trade-off between the number of data instances transmitted among the agents and the performance of the ensemble estimator without undermining the convergence of the algorithm. This scheme also provides an upper bound (with high probability) on the test error of the ensemble estimator. The efficacy of ICOA combined with Minimax Protection and the comparison between the upper bound and actual performance are both demonstrated by simulations.
We consider energy minimization for data-intensive applications run on large number of servers, for given performance guarantees. We consider a system, where each incoming application is sent to a set of servers, and is considered to be completed if a subset of them finish serving it. We consider a simple case when each server core has two speed levels, where the higher speed can be achieved by higher power for each core independently. The core selects one of the two speeds probabilistically for each incoming application request. We model arrival of application requests by a Poisson process, and random service time at the server with independent exponential random variables. Our model and analysis generalizes to todays state-of-the-art in CPU energy management where each core can independently select a speed level from a set of supported speeds and corresponding voltages. The performance metrics under consideration are the mean number of applications in the system and the average energy expenditure. We first provide a tight approximation to study this previously intractable problem and derive closed form approximate expressions for the performance metrics when service times are exponentially distributed. Next, we study the trade-off between the approximate mean number of applications and energy expenditure in terms of the switching probability.
This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism replicates the model on every computational resource to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consistent. Despite the conceptual simplicity of the technique, the subtle dependencies between computation and communication make it non-trivial to optimize the distributed training efficiency. As of v1.5, PyTorch natively provides several techniques to accelerate distributed data parallel, including bucketing gradients, overlapping computation with communication, and skipping gradient synchronization. Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.
As numerous machine learning and other algorithms increase in complexity and data requirements, distributed computing becomes necessary to satisfy the growing computational and storage demands, because it enables parallel execution of smaller tasks that make up a large computing job. However, random fluctuations in task service times lead to straggling tasks with long execution times. Redundancy, in the form of task replication and erasure coding, provides diversity that allows a job to be completed when only a subset of redundant tasks is executed, thus removing the dependency on the straggling tasks. In situations of constrained resources (here a fixed number of parallel servers), increasing redundancy reduces the available resources for parallelism. In this paper, we characterize the diversity vs. parallelism trade-off and identify the optimal strategy, among replication, coding and splitting, which minimizes the expected job completion time. We consider three common service time distributions and establish three models that describe scaling of these distributions with the task size. We find that different distributions with different scaling models operate optimally at different levels of redundancy, and thus may require very different code rates.
The paper tackles the issue of $textit{checking}$ that all copies of a large data set replicated at several nodes of a network are identical. The fact that the replicas may be located at distant nodes prevents the system from verifying their equality locally, i.e., by having each node consult only nodes in its vicinity. On the other hand, it remains possible to assign $textit{certificates}$ to the nodes, so that verifying the consistency of the replicas can be achieved locally. However, we show that, as the data set is large, classical certification mechanisms, including distributed Merlin-Arthur protocols, cannot guarantee good completeness and soundness simultaneously, unless they use very large certificates. The main result of this paper is a distributed $textit{quantum}$ Merlin-Arthur protocol enabling the nodes to collectively check the consistency of the replicas, based on small certificates, and in a single round of message exchange between neighbors, with short messages. In particular, the certificate-size is logarithmic in the size of the data set, which gives an exponential advantage over classical certification mechanisms.
Data-intensive applications are becoming commonplace in all science disciplines. They are comprised of a rich set of sub-domains such as data engineering, deep learning, and machine learning. These applications are built around efficient data abstractions and operators that suit the applications of different domains. Often lack of a clear definition of data structures and operators in the field has led to other implementations that do not work well together. The HPTMT architecture that we proposed recently, identifies a set of data structures, operators, and an execution model for creating rich data applications that links all aspects of data engineering and data science together efficiently. This paper elaborates and illustrates this architecture using an end-to-end application with deep learning and data engineering parts working together.