Distributed-Memory Load Balancing with Cyclic Token-based Work-Stealing Applied to Reverse Time Migration

98 0 0.0 ( 0 )

Download Cite

Added by \\'Italo Assis

Publication date 2019

fields Informatics Engineering

and research's language is English

Authors Italo A. S. Assis - Ant^onio D. S. Oliveira - Tiago Barros

Distributed Parallel and Cluster Computing

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Reverse time migration (RTM) is a prominent technique in seismic imaging. Its resulting subsurface images are used in the industry to investigate with higher confidence the existence and the conditions of oil and gas reservoirs. Because of its high computational cost, RTM must make use of parallel computers. Balancing the workload distribution of an RTM is a growing challenge in distributed computing systems. The competition for shared resources and the differently-sized tasks of the RTM are some of the possible sources of load imbalance. Although many load balancing techniques exist, scaling up for large problems and large systems remains a challenge because synchronization overhead also scales. This paper proposes a cyclic token-based work-stealing (CTWS) algorithm for distributed memory systems applied to RTM. The novel cyclic token approach reduces the number of failed steals, avoids communication overhead, and simplifies the victim selection and the termination strategy. The proposed method is implemented as a C library using the one-sided communication feature of the message passing interface (MPI) standard. Results obtained by applying the proposed technique to balance the workload of a 3D RTM system present a factor of 14.1% speedup and reductions of the load imbalance of 78.4% when compared to the conventional static distribution.

rate research

Efficient Load-Balancing through Distributed Token Dropping

203 - Sebastian Brandt , Barbara Keller , Joel Rybicki 2020

We introduce a new graph problem, the token dropping game, and we show how to solve it efficiently in a distributed setting. We use the token dropping game as a tool to design an efficient distributed algorithm for stable orientations and more generally for locally optimal semi-matchings. The prior work by Czygrinow et al. (DISC 2012) finds a stable orientation in $O(Delta^5)$ rounds in graphs of maximum degree $Delta$, while we improve it to $O(Delta^4)$ and also prove a lower bound of $Omega(Delta)$.

Distributed Parallel and Cluster Computing

Distributed Selfish Load Balancing with Weights and Speeds

548 - C. P. J. Adolphs , P. Berenbrink 2011

In this paper we consider neighborhood load balancing in the context of selfish clients. We assume that a network of n processors and m tasks is given. The processors may have different speeds and the tasks may have different weights. Every task is controlled by a selfish user. The objective of the user is to allocate his/her task to a processor with minimum load. We revisit the concurrent probabilistic protocol introduced in [6], which works in sequential rounds. In each round every task is allowed to query the load of one randomly chosen neighboring processor. If that load is smaller the task will migrate to that processor with a suitably chosen probability. Using techniques from spectral graph theory we obtain upper bounds on the expected convergence time towards approximate and exact Nash equilibria that are significantly better than the previous results in [6]. We show results for uniform tasks on non-uniform processors and the general case where the tasks have different weights and the machines have speeds. To the best of our knowledge, these are the first results for this general setting.

Distributed Parallel and Cluster Computing Data Structures and Algorithms

Performance Analysis and Optimization of a Hybrid Distributed Reverse Time Migration Application

58 - Sri Raj Paul , John Mellor-Crummey , Mauricio Araya-Polo 2016

Applications to process seismic data employ scalable parallel systems to produce timely results. To fully exploit emerging processor architectures, application will need to employ threaded parallelism within a node and message passing across nodes. Today, MPI+OpenMP is the preferred programming model for this task. However, tuning hybrid programs for clusters is difficult. Performance tools can help users identify bottlenecks and uncover opportunities for improvement. This poster describes our experiences of applying Rice Universitys HPCToolkit and hardware performance counters to gain insight into an MPI+OpenMP code that performs Reverse Time Migration (RTM) on a cluster of multicore processors. The tools provided us with insights into the effectiveness of the domain decomposition strategy, the use of threaded parallelism, and functional unit utilization in individual cores. By applying insights obtained from the tools, we were able to improve the performance of the RTM code by roughly 30 percent.

Distributed Parallel and Cluster Computing

Local Deal-Agreement Based Monotonic Distributed Algorithms for Load Balancing in General Graphs

111 - Yefim Dinitz , Shlomi Dolev , Manish Kumar 2020

In computer networks, participants may cooperate in processing tasks, so that loads are balanced among them. We present local distributed algorithms that (repeatedly) use local imbalance criteria to transfer loads concurrently across the participants of the system, iterating until all loads are balanced. Our algorithms are based on a short local deal-agreement communication of proposal/deal, based on the neighborhood loads. They converge monotonically, always providing a better state as the execution progresses. Besides, our algorithms avoid making loads temporarily negative. Thus, they may be considered anytime ones, in the sense that they can be stopped at any time during the execution. We show that our synchronous load balancing algorithms achieve $epsilon$-Balanced state for the continuous setting and 1-Balanced state for the discrete setting in all graphs, within $O(n D log(n K/epsilon))$ and $O(n D log(n K/D) + n D^2)$ time, respectively, where $n$ is the number of nodes, $K$ is the initial discrepancy, $D$ is the graph diameter, and $epsilon$ is the final discrepancy. Our other monotonic synchronous and asynchronous algorithms for the discrete setting are generalizations of the first presented algorithms, where load balancing is performed concurrently with more than one neighbor. These algorithms arrive at a 1-Balanced state in time $O(n K^2)$ in general graphs, but have a potential to be faster as the loads are balanced among all neighbors, rather than with only one; we describe a scenario that demonstrates the potential for a fast ($O(1)$) convergence. Our asynchronous algorithm avoids the need to wait for the slowest participants activity prior to making the next load balancing steps as synchronous settings restrict. We also introduce a self-stabilizing version of our asynchronous algorithm.

Distributed Parallel and Cluster Computing

The Power of Both Choices: Practical Load Balancing for Distributed Stream Processing Engines

453 - Muhammad Anis Uddin Nasir , Gianmarco De Francisci Morales , Davidn Garcia-Soriano 2015

We study the problem of load balancing in distributed stream processing engines, which is exacerbated in the presence of skew. We introduce Partial Key Grouping (PKG), a new stream partitioning scheme that adapts the classical power of two choices to a distributed streaming setting by leveraging two novel techniques: key splitting and local load estimation. In so doing, it achieves better load balancing than key grouping while being more scalable than shuffle grouping. We test PKG on several large datasets, both real-world and synthetic. Compared to standard hashing, PKG reduces the load imbalance by up to several orders of magnitude, and often achieves nearly-perfect load balance. This result translates into an improvement of up to 60% in throughput and up to 45% in latency when deployed on a real Storm cluster.

Distributed Parallel and Cluster Computing