ﻻ يوجد ملخص باللغة العربية
As numerous machine learning and other algorithms increase in complexity and data requirements, distributed computing becomes necessary to satisfy the growing computational and storage demands, because it enables parallel execution of smaller tasks that make up a large computing job. However, random fluctuations in task service times lead to straggling tasks with long execution times. Redundancy, in the form of task replication and erasure coding, provides diversity that allows a job to be completed when only a subset of redundant tasks is executed, thus removing the dependency on the straggling tasks. In situations of constrained resources (here a fixed number of parallel servers), increasing redundancy reduces the available resources for parallelism. In this paper, we characterize the diversity vs. parallelism trade-off and identify the optimal strategy, among replication, coding and splitting, which minimizes the expected job completion time. We consider three common service time distributions and establish three models that describe scaling of these distributions with the task size. We find that different distributions with different scaling models operate optimally at different levels of redundancy, and thus may require very different code rates.
The recent Natural Language Processing techniques have been refreshing the state-of-the-art performance at an incredible speed. Training huge language models is therefore an imperative demand in both industry and academy. However, huge language model
Master-worker distributed computing systems use task replication in order to mitigate the effect of slow workers, known as stragglers. Tasks are grouped into batches and assigned to one or more workers for execution. We first consider the case when t
In cloud computing systems, assigning a task to multiple servers and waiting for the earliest copy to finish is an effective method to combat the variability in response time of individual servers, and reduce latency. But adding redundancy may result
Repair of multiple partially failed cache nodes is studied in a distributed wireless content caching system, where $r$ out of a total of $n$ cache nodes lose part of their cached data. Broadcast repair of failed cache contents at the network edge is
This paper considers the problem of asynchronous distributed multi-agent optimization on server-based system architecture. In this problem, each agent has a local cost, and the goal for the agents is to collectively find a minimum of their aggregate