No Arabic abstract
The coflow scheduling problem has emerged as a popular abstraction in the last few years to study data communication problems within a data center. In this basic framework, each coflow has a set of communication demands and the goal is to schedule many coflows in a manner that minimizes the total weighted completion time. A coflow is said to complete when all its communication needs are met. This problem has been extremely well studied for the case of complete bipartite graphs that model a data center with full bisection bandwidth and several approximation algorithms and effective heuristics have been proposed recently. In this work, we study a slightly different model of coflow scheduling in general graphs (to capture traffic between data centers) and develop practical and efficient approximation algorithms for it. Our main result is a randomized 2 approximation algorithm for the single path and free path model, significantly improving prior work. In addition, we demonstrate via extensive experiments that the algorithm is practical, easy to implement and performs well in practice.
This paper provides three nearly-optimal algorithms for scheduling $t$ jobs in the $mathsf{CLIQUE}$ model. First, we present a deterministic scheduling algorithm that runs in $O(mathsf{GlobalCongestion} + mathsf{dilation})$ rounds for jobs that are sufficiently efficient in terms of their memory. The $mathsf{dilation}$ is the maximum round complexity of any of the given jobs, and the $mathsf{GlobalCongestion}$ is the total number of messages in all jobs divided by the per-round bandwidth of $n^2$ of the $mathsf{CLIQUE}$ model. Both are inherent lower bounds for any scheduling algorithm. Then, we present a randomized scheduling algorithm which runs $t$ jobs in $O(mathsf{GlobalCongestion} + mathsf{dilation}cdotlog{n}+t)$ rounds and only requires that inputs and outputs do not exceed $O(nlog n)$ bits per node, which is met by, e.g., almost all graph problems. Lastly, we adjust the emph{random-delay-based} scheduling algorithm [Ghaffari, PODC15] from the $mathsf{CLIQUE}$ model and obtain an algorithm that schedules any $t$ jobs in $O(t / n + mathsf{LocalCongestion} + mathsf{dilation}cdotlog{n})$ rounds, where the $mathsf{LocalCongestion}$ relates to the congestion at a single node of the $mathsf{CLIQUE}$. We compare this algorithm to the previous approaches and show their benefit. We schedule the set of jobs on-the-fly, without a priori knowledge of its parameters or the communication patterns of the jobs. In light of the inherent lower bounds, all of our algorithms are nearly-optimal. We exemplify the power of our algorithms by analyzing the message complexity of the state-of-the-art MIS protocol [Ghaffari, Gouleakis, Konrad, Mitrovic and Rubinfeld, PODC18], and we show that we can solve $t$ instances of MIS in $O(t + loglogDeltalog{n})$ rounds, that is, in $O(1)$ amortized time, for $tgeq loglogDeltalog{n}$.
Coflow scheduling improves data-intensive application performance by improving their networking performance. State-of-the-art online coflow schedulers in essence approximate the classic Shortest-Job-First (SJF) scheduling by learning the coflow size online. In particular, they use multiple priority queues to simultaneously accomplish two goals: to sieve long coflows from short coflows, and to schedule short coflows with high priorities. Such a mechanism pays high overhead in learning the coflow size: moving a large coflow across the queues delays small and other large coflows, and moving similar-sized coflows across the queues results in inadvertent round-robin scheduling. We propose Philae, a new online coflow scheduler that exploits the spatial dimension of coflows, i.e., a coflow has many flows, to drastically reduce the overhead of coflow size learning. Philae pre-schedules sampled flows of each coflow and uses their sizes to estimate the average flow size of the coflow. It then resorts to Shortest Coflow First, where the notion of shortest is determined using the learned coflow sizes and coflow contention. We show that the sampling-based learning is robust to flow size skew and has the added benefit of much improved scalability from reduced coordinator-local agent interactions. Our evaluation using an Azure testbed, a publicly available production cluster trace from Facebook shows that compared to the prior art Aalo, Philae reduces the coflow completion time (CCT) in average (P90) cases by 1.50x (8.00x) on a 150-node testbed and 2.72x (9.78x) on a 900-node testbed. Evaluation using additional traces further demonstrates Philaes robustness to flow size skew.
We study the problem of clock synchronization in highly dynamic networks, where communication links can appear or disappear at any time. The nodes in the network are equipped with hardware clocks, but the rate of the hardware clocks can vary arbitrarily within specific bounds, and the estimates that nodes can obtain about the clock values of other nodes are inherently inaccurate. Our goal in this setting is to output a logical clock at each node such that the logical clocks of any two nodes are not too far apart, and nodes that remain close to each other in the network for a long time are better synchronized than distant nodes. This property is called gradient clock synchronization. Gradient clock synchronization has been widely studied in the static setting, where the network topology does not change. We show that the asymptotically optimal bounds obtained for the static case also apply to our highly dynamic setting: if two nodes remain at distance $d$ from each other for sufficiently long, it is possible to upper bound the difference between their clock values by $O(d log (D / d))$, where $D$ is the diameter of the network. This is known to be optimal even for static networks. Furthermore, we show that our algorithm has optimal stabilization time: when a path of length $d$ appears between two nodes, the time required until the clock skew between the two nodes is reduced to $O(d log (D / d))$ is $O(D)$, which we prove to be optimal. Finally, the techniques employed for the more intricate analysis of the algorithm for dynamic graphs provide additional insights that are also of interest for the static setting. In particular, we establish self-stabilization of the gradient property within $O(D)$ time.
Reconfigurable optical topologies are emerging as a promising technology to improve the efficiency of datacenter networks. This paper considers the problem of scheduling opportunistic links in such reconfigurable datacenters. We study the online setting and aim to minimize flow completion times. The problem is a two-tier generalization of classic switch scheduling problems. We present a stable-matching algorithm which is $2cdot (2/varepsilon+1)$-competitive against an optimal offline algorithm, in a resource augmentation model: the online algorithm runs $2+varepsilon$ times faster. Our algorithm and result are fairly general and allow for different link delays and also apply to hybrid topologies which combine fixed and reconfigurable links. Our analysis is based on LP relaxation and dual fitting.
This paper presents improved approximation algorithms for the problem of multiprocessor scheduling under uncertainty, or SUU, in which the execution of each job may fail probabilistically. This problem is motivated by the increasing use of distributed computing to handle large, computationally intensive tasks. In the SUU problem we are given n unit-length jobs and m machines, a directed acyclic graph G of precedence constraints among jobs, and unrelated failure probabilities q_{ij} for each job j when executed on machine i for a single timestep. Our goal is to find a schedule that minimizes the expected makespan, which is the expected time at which all jobs complete. Lin and Rajaraman gave the first approximations for this NP-hard problem for the special cases of independent jobs, precedence constraints forming disjoint chains, and precedence constraints forming trees. In this paper, we present asymptotically better approximation algorithms. In particular, we give an O(loglog min(m,n))-approximation for independent jobs (improving on the previously best O(log n)-approximation). We also give an O(log(n+m) loglog min(m,n))-approximation algorithm for precedence constraints that form disjoint chains (improving on the previously best O(log(n)log(m)log(n+m)/loglog(n+m))-approximation by a (log n/loglog n)^2 factor when n = poly(m). Our algorithm for precedence constraints forming chains can also be used as a component for precedence constraints forming trees, yielding a similar improvement over the previously best algorithms for trees.