Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Balanced Partitioning of Several Cache-Oblivious Algorithms

82 0 0.0 ( 0 )

Download Cite

Added by Yuan Tang

Publication date 2020

fields Informatics Engineering

and research's language is English

Authors Yuan Tang - Weiguo Gao

Distributed Parallel and Cluster Computing Data Structures and Algorithms

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Frigo et al. proposed an ideal cache model and a recursive technique to design sequential cache-efficient algorithms in a cache-oblivious fashion. Ballard et al. pointed out that it is a fundamental open problem to extend the technique to an arbitrary architecture. Ballard et al. raised another open question on how to parallelize Strassens algorithm exactly and efficiently on an arbitrary number of processors. We propose a novel way of partitioning a cache-oblivious algorithm to achieve perfect strong scaling on an arbitrary number, even a prime number, of processors within a certain range in a shared-memory setting. Our approach is Processor-Aware but Cache-Oblivious (PACO). We demonstrate our approach on several important cache-oblivious algorithms, including LCS, 1D, GAP, classic rectangular matrix multiplication on a semiring, and Strassens algorithm. We discuss how to extend our approach to a distributed-memory architecture, or even a heterogeneous computing system. Hence, our work may provide a new perspective on the fundamental open problem of extending the recursive cache-oblivious technique to an arbitrary architecture. We provide an almost exact solution to the open problem on parallelizing Strassen. Our approach may provide a new perspective on extending the recursive cache-oblivious technique to an arbitrary architecture. All our algorithms demonstrate better scalability or better overall parallel cache complexities than the best known algorithms. Preliminary experiments justify our theoretical prediction that the PACO algorithms can outperform significantly state-of-the-art Processor-Oblivious (PO) and Processor-Aware (PA) counterparts.

rate research

Optimal Cache-Oblivious Mesh Layouts

460 - Michael A. Bender , Bradley C. Kuszmaul , Shang-Hua Teng 2009

A mesh is a graph that divides physical space into regularly-shaped regions. Meshes computations form the basis of many applications, e.g. finite-element methods, image rendering, and collision detection. In one important mesh primitive, called a mesh update, each mesh vertex stores a value and repeatedly updates this value based on the values stored in all neighboring vertices. The performance of a mesh update depends on the layout of the mesh in memory. This paper shows how to find a memory layout that guarantees that the mesh update has asymptotically optimal memory performance for any set of memory parameters. Such a memory layout is called cache-oblivious. Formally, for a $d$-dimensional mesh $G$, block size $B$, and cache size $M$ (where $M=Omega(B^d)$), the mesh update of $G$ uses $O(1+|G|/B)$ memory transfers. The paper also shows how the mesh-update performance degrades for smaller caches, where $M=o(B^d)$. The paper then gives two algorithms for finding cache-oblivious mesh layouts. The first layout algorithm runs in time $O(|G|log^2|G|)$ both in expectation and with high probability on a RAM. It uses $O(1+|G|log^2(|G|/M)/B)$ memory transfers in expectation and $O(1+(|G|/B)(log^2(|G|/M) + log|G|))$ memory transfers with high probability in the cache-oblivious and disk-access machine (DAM) models. The layout is obtained by finding a fully balanced decomposition tree of $G$ and then performing an in-order traversal of the leaves of the tree. The second algorithm runs faster by almost a $log|G|/loglog|G|$ factor in all three memory models, both in expectation and with high probability. The layout obtained by finding a relax-balanced decomposition tree of $G$ and then performing an in-order traversal of the leaves of the tree.

Data Structures and Algorithms Computational Engineering Mathematical Software

Dynamic Balanced Graph Partitioning

92 - Chen Avin , Marcin Bienkowski , Andreas Loukas 2015

This paper initiates the study of the classic balanced graph partitioning problem from an online perspective: Given an arbitrary sequence of pairwise communication requests between $n$ nodes, with patterns that may change over time, the objective is to service these requests efficiently by partitioning the nodes into $ell$ clusters, each of size $k$, such that frequently communicating nodes are located in the same cluster. The partitioning can be updated dynamically by migrating nodes between clusters. The goal is to devise online algorithms which jointly minimize the amount of inter-cluster communication and migration cost. The problem features interesting connections to other well-known online problems. For example, scenarios with $ell=2$ generalize online paging, and scenarios with $k=2$ constitute a novel online variant of maximum matching. We present several lower bounds and algorithms for settings both with and without cluster-size augmentation. In particular, we prove that any deterministic online algorithm has a competitive ratio of at least $k$, even with significant augmentation. Our main algorithmic contributions are an $O(k log{k})$-competitive deterministic algorithm for the general setting with constant augmentation, and a constant competitive algorithm for the maximum matching variant.

Data Structures and Algorithms

Distributed algorithms for fractional coloring

81 - Nicolas Bousquet , Louis Esperet , Franc{c}ois Pirot 2020

In this paper we study fractional coloring from the angle of distributed computing. Fractional coloring is the linear relaxation of the classical notion of coloring, and has many applications, in particular in scheduling. It was proved by Hasemann, Hirvonen, Rybicki and Suomela (2016) that for every real $alpha>1$ and integer $Delta$, a fractional coloring of total weight at most $alpha(Delta+1)$ can be obtained deterministically in a single round in graphs of maximum degree $Delta$, in the LOCAL model of computation. However, a major issue of this result is that the output of each vertex has unbounded size. Here we prove that even if we impose the more realistic assumption that the output of each vertex has constant size, we can find fractional colorings of total weight arbitrarily close to known tight bounds for the fractional chromatic number in several cases of interest. More precisely, we show that for any fixed $epsilon > 0$ and $Delta$, a fractional coloring of total weight at most $Delta+epsilon$ can be found in $O(log^*n)$ rounds in graphs of maximum degree $Delta$ with no $K_{Delta+1}$, while finding a fractional coloring of total weight at most $Delta$ in this case requires $Omega(log log n)$ rounds for randomized algorithms and $Omega( log n)$ rounds for deterministic algorithms. We also show how to obtain fractional colorings of total weight at most $2+epsilon$ in grids of any fixed dimension, for any $epsilon>0$, in $O(log^*n)$ rounds. Finally, we prove that in sparse graphs of large girth from any proper minor-closed family we can find a fractional coloring of total weight at most $2+epsilon$, for any $epsilon>0$, in $O(log n)$ rounds.

Distributed Parallel and Cluster Computing Data Structures and Algorithms Combinatorics

Fast Deterministic Algorithms for Highly-Dynamic Networks

117 - Keren Censor-Hillel , Neta Dafni , Victor I. Kolobov 2019

This paper provides an algorithmic framework for obtaining fast distributed algorithms for a highly-dynamic setting, in which *arbitrarily many* edge changes may occur in each round. Our algorithm significantly improves upon prior work in its combination of (1) having an $O(1)$ amortized time complexity, (2) using only $O(log{n})$-bit messages, (3) not posing any restrictions on the dynamic behavior of the environment, (4) being deterministic, (5) having strong guarantees for intermediate solutions, and (6) being applicable for a wide family of tasks. The tasks for which we deduce such an algorithm are maximal matching, $(degree+1)$-coloring, 2-approximation for minimum weight vertex cover, and maximal independent set (which is the most subtle case). For some of these tasks, node insertions can also be among the allowed topology changes, and for some of them also abrupt node deletions.

Distributed Parallel and Cluster Computing Data Structures and Algorithms

Energy-Efficient Scheduling: Classification, Bounds, and Algorithms

108 - Pragati Agrawal , Shrisha Rao 2016

The problem of attaining energy efficiency in distributed systems is of importance, but a general, non-domain-specific theory of energy-minimal scheduling is far from developed. In this paper, we classify the problems of energy-minimal scheduling and present theoretical foundations of the same. We derive results concerning energy-minimal scheduling of independent jobs in a distributed system with functionally similar machines with different working and idle power ratings. The machines considered in our system can have identical as well as different speeds. If the jobs can be divided into arbitrary parts, we show that the minimum-energy schedule can be generated in linear time and give exact scheduling algorithms. For the cases where jobs are non-divisible, we prove that the scheduling problems are NP-hard and also give approximation algorithms for the same along with their bounds.

Distributed Parallel and Cluster Computing Data Structures and Algorithms

comments

Fetching comments

University of Mosul

Additional details More universities

Balanced Partitioning of Several Cache-Oblivious Algorithms

Ask ChatGPT about the research

No Arabic abstract

Read More