No Arabic abstract
Internet supercomputing is an approach to solving partitionable, computation-intensive problems by harnessing the power of a vast number of interconnected computers. This paper presents a new algorithm for the problem of using network supercomputing to perform a large collection of independent tasks, while dealing with undependable processors. The adversary may cause the processors to return bogus results for tasks with certain probabilities, and may cause a subset $F$ of the initial set of processors $P$ to crash. The adversary is constrained in two ways. First, for the set of non-crashed processors $P-F$, the emph{average} probability of a processor returning a bogus result is inferior to $frac{1}{2}$. Second, the adversary may crash a subset of processors $F$, provided the size of $P-F$ is bounded from below. We consider two models: the first bounds the size of $P-F$ by a fractional polynomial, the second bounds this size by a poly-logarithm. Both models yield adversaries that are much stronger than previously studied. Our randomized synchronous algorithm is formulated for $n$ processors and $t$ tasks, with $nle t$, where depending on the number of crashes each live processor is able to terminate dynamically with the knowledge that the problem is solved with high probability. For the adversary constrained by a fractional polynomial, the round complexity of the algorithm is $O(frac{t}{n^varepsilon}log{n}log{log{n}})$, its work is $O(tlog{n} log{log{n}})$ and message complexity is $O(nlog{n}log{log{n}})$. For the poly-log constrained adversary, the round complexity is $O(t)$, work is $O(t n^{varepsilon})$, %$O(t , poly log{n})$, and message complexity is $O(n^{1+varepsilon})$ %$O(n , poly log{n})$. All bounds are shown to hold with high probability.
Internet supercomputing is an approach to solving partitionable, computation-intensive problems by harnessing the power of a vast number of interconnected computers. For the problem of using network supercomputing to perform a large collection of independent tasks, prior work introduced a decentralized approach and provided randomized synchronous algorithms that perform all tasks correctly with high probability, while dealing with misbehaving or crash-prone processors. The main weaknesses of existing algorithms is that they assume either that the emph{average} probability of a non-crashed processor returning incorrect results is inferior to $frac{1}{2}$, or that the probability of returning incorrect results is known to emph{each} processor. Here we present a randomized synchronous distributed algorithm that tightly estimates the probability of each processor returning correct results. Starting with the set $P$ of $n$ processors, let $F$ be the set of processors that crash. Our algorithm estimates the probability $p_i$ of returning a correct result for each processor $i in P-F$, making the estimates available to all these processors. The estimation is based on the $(epsilon, delta)$-approximation, where each estimated probability $tilde{p_i}$ of $p_i$ obeys the bound ${sf Pr}[p_i(1-epsilon) leq tilde{p_i} leq p_i(1+epsilon)] > 1 - delta$, for any constants $delta >0$ and $epsilon >0$ chosen by the user. An important aspect of this algorithm is that each processor terminates without global coordination. We assess the efficiency of the algorithm in three adversarial models as follows. For the model where the number of non-crashed processors $|P-F|$ is linearly bounded the time complexity $T(n)$ of the algorithm is $Theta(log{n})$, work complexity $W(n)$ is $Theta(nlog{n})$, and message complexity $M(n)$ is $Theta(nlog^2n)$.
Low-latency online services have strict Service Level Objectives (SLOs) that require datacenter systems to support high throughput at microsecond-scale tail latency. Dataplane operating systems have been designed to scale up multi-core servers with minimal overhead for such SLOs. However, as application demands continue to increase, scaling up is not enough, and serving larger demands requires these systems to scale out to multiple servers in a rack. We present RackSched, the first rack-level microsecond-scale scheduler that provides the abstraction of a rack-scale computer (i.e., a huge server with hundreds to thousands of cores) to an external service with network-system co-design. The core of RackSched is a two-layer scheduling framework that integrates inter-server scheduling in the top-of-rack (ToR) switch with intra-server scheduling in each server. We use a combination of analytical results and simulations to show that it provides near-optimal performance as centralized scheduling policies, and is robust for both low-dispersion and high-dispersion workloads. We design a custom switch data plane for the inter-server scheduler, which realizes power-of-k-choices, ensures request affinity, and tracks server loads accurately and efficiently. We implement a RackSched prototype on a cluster of commodity servers connected by a Barefoot Tofino switch. End-to-end experiments on a twelve-server testbed show that RackSched improves the throughput by up to 1.44x, and scales out the throughput near linearly, while maintaining the same tail latency as one server until the system is saturated.
To improve customer experience, datacenter operators offer support for simplifying application and resource management. For example, running workloads of workflows on behalf of customers is desirable, but requires increasingly more sophisticated autoscaling policies, that is, policies that dynamically provision resources for the customer. Although selecting and tuning autoscaling policies is a challenging task for datacenter operators, so far relatively few studies investigate the performance of autoscaling for workloads of workflows. Complementing previous knowledge, in this work we propose the first comprehensive performance study in the field. Using trace-based simulation, we compare state-of-the-art autoscaling policies across multiple application domains, workload arrival patterns (e.g., burstiness), and system utilization levels. We further investigate the interplay between autoscaling and regular allocation policies, and the complexity cost of autoscaling. Our quantitative study focuses not only on traditional performance metrics and on state-of-the-art elasticity metrics, but also on time- and memory-related autoscaling-complexity metrics. Our main results give strong and quantitative evidence about previously unreported operational behavior, for example, that autoscaling policies perform differently across application domains and by how much they differ.
Scientific workflows are a cornerstone of modern scientific computing. They are used to describe complex computational applications that require efficient and robust management of large volumes of data, which are typically stored/processed at heterogeneous, distributed resources. The workflow research and development community has employed a number of methods for the quantitative evaluation of existing and novel workflow algorithms and systems. In particular, a common approach is to simulate workflow executions. In previous work, we have presented a collection of tools that have been used for aiding research and development activities in the Pegasus project, and that have been adopted by others for conducting workflow research. Despite their popularity, there are several shortcomings that prevent easy adoption, maintenance, and consistency with the evolving structures and computational requirements of production workflows. In this work, we present WorkflowHub, a community framework that provides a collection of tools for analyzing workflow execution traces, producing realistic synthetic workflow traces, and simulating workflow executions. We demonstrate the realism of the generated synthetic traces by comparing simulated executions of these traces with actual workflow executions. We also contrast these results with those obtained when using the previously available collection of tools. We find that our framework not only can be used to generate representative synthetic workflow traces (i.e., with workflow structures and task characteristics distributions that resembles those in traces obtained from real-world workflow executions), but can also generate representative workflow traces at larger scales than that of available workflow traces.
As communication networks are growing at a fast pace, the need for more scalable approaches to operate such networks is pressing. Decentralization and locality are key concepts to provide scalability. Existing models for which local algorithms are designed fail to model an important aspect of many modern communication networks such as software-defined networks: the possibility to precompute distributed network state. We take this as an opportunity to study the fundamental question of how and to what extent local algorithms can benefit from preprocessing. In particular, we show that preprocessing allows for significant speedups of various networking problems. A main benefit is the precomputation of structural primitives, where purely distributed algorithms have to start from scratch. Maybe surprisingly, we also show that there are strict limitations on how much preprocessing can help in different scenarios. To this end, we provide approximation bounds for the maximum independent set problem---which however show that our obtained speedups are asymptotically optimal. Even though we show that physical link failures in general hinder the power of preprocessing, we can still facilitate the precomputation of symmetry breaking processes to bypass various runtime barriers. We believe that our model and results are of interest beyond the scope of this paper and apply to other dynamic networks as well.