A distributed computing scenario is considered, where the computational power of a set of worker nodes is used to perform a certain computation task over a dataset that is dispersed among the workers. Lagrange coded computing (LCC), proposed by Yu et al., leverages the well-known Lagrange polynomial to perform polynomial evaluation of the dataset in such a scenario in an efficient parallel fashion while keeping the data private amidst possible collusion of workers. This solution relies on quantizing the data into a finite field, so that Shamir's secret sharing, one of its main building blocks, can be employed. Such a solution, however, does not scale well with the size of the dataset, mainly due to computation overflows. To address this critical issue, we propose a novel extension of LCC to the analog domain, referred to as analog LCC (ALCC). All the operations in the proposed ALCC protocol are done over the infinite fields $\mathbb{R}$/$\mathbb{C}$, but practical implementations use floating-point numbers. We characterize the privacy of data in ALCC, against any subset of colluding workers up to a certain size, in terms of the distinguishing security (DS) and the mutual information security (MIS) metrics. Also, the accuracy of the outcome is characterized in a practical setting assuming operations are performed using floating-point numbers. Consequently, a fundamental trade-off between the accuracy of the outcome of ALCC and its privacy level is observed and is numerically evaluated. Moreover, we implement the proposed scheme to perform matrix-matrix multiplication over a batch of matrices. It is observed that ALCC is superior to the state-of-the-art LCC, implemented using fixed-point numbers, assuming both schemes use an equal number of bits to represent data symbols.
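The Lagrange encoding idea described above can be illustrated with a minimal floating-point sketch. It is not the ALCC protocol itself: the real-valued evaluation points `betas` and `alphas`, the Gaussian noise, the target function x**2, and all function names are assumptions made for illustration, and the sketch does not analyze the privacy level.

```python
import numpy as np

def lagrange_eval(xs, ys, z):
    """Evaluate at z the unique polynomial passing through the points (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (z - xj) / (xi - xj)
        total += yi * basis
    return total

def analog_lcc_square(data, num_noise=1, noise_std=10.0, seed=0):
    """Compute x**2 for every x in data through Lagrange-coded shares (illustrative only)."""
    rng = np.random.default_rng(seed)
    k, t, deg_f = len(data), num_noise, 2
    # f(u(z)) has degree deg_f * (k + t - 1), so this many results determine it.
    num_workers = deg_f * (k + t - 1) + 1
    betas = np.arange(k + t, dtype=float)         # assumed points carrying data and noise
    alphas = 0.25 + 0.5 * np.arange(num_workers)  # assumed worker points, disjoint from betas
    # Encoding polynomial u: u(betas[j]) is data[j] for j < k, random noise otherwise.
    secrets = list(data) + list(rng.normal(0.0, noise_std, size=t))
    shares = [lagrange_eval(betas, secrets, a) for a in alphas]
    # Each worker applies the target polynomial to its own share only.
    results = [s ** 2 for s in shares]
    # The master interpolates f(u(z)) and reads it off at the data points.
    return [lagrange_eval(alphas, results, b) for b in betas[:k]]
```

For example, `analog_lcc_square([3.0, -1.5])` returns values close to 9.0 and 2.25, up to floating-point round-off. Increasing `noise_std` hides the data better but amplifies that round-off, which is the kind of accuracy versus privacy tension the abstract refers to.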
One of the major challenges in using distributed learning to train complicated models with large data sets is dealing with the straggler effect. As a solution, coded computation has been recently proposed to efficiently add redundancy to the computation tasks. In this technique, coding is used across data sets, and computation is done over coded data, such that the results of an arbitrary subset of worker nodes with a certain size are enough to recover the final results. The major challenges with those approaches are (1) they are limited to polynomial function computations, (2) the size of the subset of servers that we need to wait for grows with the product of the size of the data set and the model complexity (the degree of the polynomial), which can be prohibitively large, and (3) they are not numerically stable for computation over real numbers. In this paper, we propose Berrut Approximated Coded Computing (BACC) as an alternative approach, which is not limited to polynomial function computation. In addition, the master node can approximately calculate the final results using the outcomes of any arbitrary subset of available worker nodes. The approximation approach is proven to be numerically stable with low computational complexity. In addition, the accuracy of the approximation is established theoretically and verified by simulation results in different settings such as distributed learning problems. In particular, BACC is used to train a deep neural network on a cluster of servers, where it outperforms repetitive computation (repetition coding) in terms of the rate of convergence.
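As a rough illustration of decoding from an arbitrary subset of workers, the sketch below uses Berrut's rational interpolant for both encoding and approximate decoding. The Chebyshev-type node choices, the scalar data, and the function names `berrut_eval` and `bacc_demo` are assumptions of this sketch and are not taken from the abstract.

```python
import numpy as np

def berrut_eval(nodes, values, z):
    """Berrut's rational interpolant through (nodes[i], values[i]), evaluated at z.

    r(z) = sum_i w_i(z) * values[i] / sum_i w_i(z), with w_i(z) = (-1)**i / (z - nodes[i]),
    where the nodes are taken in sorted order so that the signs alternate.
    """
    order = np.argsort(nodes)
    x = np.asarray(nodes, dtype=float)[order]
    y = np.asarray(values, dtype=float)[order]
    diff = z - x
    hit = np.isclose(diff, 0.0)
    if hit.any():  # z coincides with a node: return the value stored there
        return float(y[hit][0])
    w = (-1.0) ** np.arange(len(x)) / diff
    return float(w @ y / w.sum())

def bacc_demo(data, f, num_workers, returned):
    """Approximate f(data[j]) for all j from the workers listed in `returned`."""
    k = len(data)
    alphas = np.cos((2 * np.arange(k) + 1) * np.pi / (2 * k))         # assumed encoding points
    zs = np.cos(np.arange(num_workers) * np.pi / (num_workers - 1))   # assumed worker points
    shares = [berrut_eval(alphas, data, z) for z in zs]               # coded inputs
    results = {i: f(shares[i]) for i in returned}                     # stragglers never reply
    idx = sorted(results)
    got = [results[i] for i in idx]
    # Interpolate f(u(z)) through whatever arrived and evaluate at the encoding points.
    return [berrut_eval(zs[idx], got, a) for a in alphas]
```

A call such as `bacc_demo([0.2, -0.4, 0.9], np.tanh, num_workers=30, returned=range(0, 30, 2))` returns approximations of `tanh` at the three inputs. Because `f` is not required to be a polynomial, there is no exact recovery threshold here; the result is an estimate that is expected to improve as more workers respond.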
We consider the problem of coded computing, where a computational task is performed in a distributed fashion in the presence of adversarial workers. We propose techniques to break the adversarial toleration threshold barrier previously known in coded computing. More specifically, we leverage list-decoding techniques for folded Reed-Solomon codes and propose novel algorithms to recover the correct codeword using side information. In the coded computing setting, we show how the master node can perform certain carefully designed extra computations to obtain the side information. The workload of computing this side information is negligible compared to the computations done by each worker. This side information is then utilized to prune the output of the list decoder and uniquely recover the true outcome. We further propose folded Lagrange coded computing (FLCC) to incorporate the developed techniques into a specific coded computing setting. Our results show that FLCC outperforms LCC by breaking the barrier on the number of adversaries that can be tolerated. In particular, the corresponding threshold in FLCC is improved by a factor of two asymptotically compared to that of LCC.
Cloud providers have recently introduced new offerings whereby spare computing resources are accessible at discounts compared to on-demand computing. Exploiting this opportunity is challenging because such resources are accessed with low priority and can therefore elastically leave (through preemption) and join the computation at any time. In this paper, we design a new technique called coded elastic computing, enabling distributed computations over elastic resources. The proposed technique allows machines to leave the computation without sacrificing the algorithm-level performance and, at the same time, adaptively reduces the workload at existing machines when new ones join the computation. Leveraging coded redundancy, our approach can achieve a computational cost similar to that of the original (noiseless) method when all machines are present; the cost gracefully increases when machines are preempted and decreases when machines join. The performance of the proposed technique is evaluated on matrix-vector multiplication and linear regression tasks. In experimental validations, it can achieve exactly the same numerical result as the noiseless computation, while reducing the computation time by 46% when compared to non-adaptive coding schemes.
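The role of coded redundancy in tolerating preemption can be sketched for matrix-vector multiplication as follows. This is a simplified illustration only: the random real-valued generator matrix and the names `encode_blocks` and `decode` are assumptions, and the adaptive workload reallocation that the abstract describes is not reproduced.

```python
import numpy as np

def encode_blocks(a, k, n, seed=0):
    """Split A into k row blocks and mix them into n coded blocks, one per machine."""
    rng = np.random.default_rng(seed)
    blocks = np.stack(np.split(a, k, axis=0))   # shape (k, rows // k, cols)
    g = rng.standard_normal((n, k))             # any k rows are invertible with probability 1
    coded = np.einsum("nk,krc->nrc", g, blocks)
    return g, coded

def decode(g, partial, alive):
    """Recover A @ x from the products returned by any k surviving machines."""
    k = g.shape[1]
    use = sorted(alive)[:k]
    y = np.stack([partial[i] for i in use])     # shape (k, rows // k)
    blocks = np.linalg.solve(g[use], y)         # undo the mixing block by block
    return blocks.reshape(-1)                   # stack the k block results back into A @ x

# Usage: 5 machines hold coded blocks; machines 1 and 3 are preempted mid-computation.
a = np.arange(24, dtype=float).reshape(6, 4)
x = np.array([1.0, 2.0, 3.0, 4.0])
g, coded = encode_blocks(a, k=3, n=5)
partial = {i: coded[i] @ x for i in range(5)}
recovered = decode(g, partial, alive={0, 2, 4})
```

In this sketch `recovered` matches `a @ x` up to floating-point round-off, regardless of which three machines survive.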
Coded computation can be used to speed up distributed learning in the presence of straggling workers. Partial recovery of the gradient vector can further reduce the computation time at each iteration; however, this can result in biased estimators, which may slow down convergence, or even cause divergence. Estimator bias will be particularly prevalent when the straggling behavior is correlated over time, which results in the gradient estimators being dominated by a few fast servers. To mitigate biased estimators, we design a timely dynamic encoding framework for partial recovery that includes an ordering operator that changes the codewords and computation orders at workers over time. To regulate the recovery frequencies, we adopt an age metric in the design of the dynamic encoding scheme. We show through numerical results that the proposed dynamic encoding strategy increases the timeliness of the recovered computations, which in turn reduces the bias in model updates and accelerates the convergence compared to the conventional static partial recovery schemes.
We consider a MapReduce-type task running in a distributed computing model which consists of $K$ edge computing nodes distributed across the edge of the network and a Master node that assists the edge nodes in computing output functions. The Master node and the edge nodes, both equipped with some storage memory and computing capabilities, are connected through a multicast network. We define the communication time spent during the transmission for the sequential implementation (all nodes send symbols sequentially) and the parallel implementation (the Master node can send symbols during the edge nodes' transmission), respectively. We propose a mixed coded distributed computing scheme that divides the system into two subsystems, where the coded distributed computing (CDC) strategy proposed by Songze Li et al. is applied to the first subsystem and a novel master-aided CDC strategy is applied to the second subsystem. We prove that this scheme is optimal, i.e., it achieves the minimum communication time for both the sequential and parallel implementations, and establish an optimal information-theoretic tradeoff between the overall communication time, the computation load, and the Master node's storage capacity. This demonstrates that incorporating a Master node with storage and computing capabilities can further reduce the communication time. For the sequential implementation, we deduce the approximately optimal file allocation between the two subsystems, which shows that the Master node should map as many files as possible in order to achieve a smaller communication time. For the parallel implementation, if the Master node's storage and computing capabilities are sufficiently large (it need not store and map all files), then the proposed scheme requires at most 1/2 of the minimum communication time of a system without the help of the Master node.