No Arabic abstract
Counting k-cliques in a graph is an important problem in graph analysis with many applications. Counting k-cliques is typically done by traversing search trees starting at each vertex in the graph. An important optimization is to eliminate search tree branches that discover the same clique redundantly. Eliminating redundant clique discovery is typically done via graph orientation or pivoting. Parallel implementations for both of these approaches have demonstrated promising performance on CPUs. In this paper, we present our GPU implementations of k-clique counting for both the graph orientation and pivoting approaches. Our implementations explore both vertex-centric and edge-centric parallelization schemes, and replace recursive search tree traversal with iterative traversal based on an explicitly-managed shared stack. We also apply various optimizations to reduce memory consumption and improve the utilization of parallel execution resources. Our evaluation shows that our best GPU implementation outperforms the best state-of-the-art parallel CPU implementation by a geometric mean speedup of 12.39x, 6.21x, and 18.99x for k = 4, 7, and 10, respectively. We also evaluate the impact of the choice of parallelization scheme and the incremental speedup of each optimization. Our code will be open-sourced to enable further research on parallelizing k-clique counting on GPUs.
In this paper, we propose a novel method to compute triangle counting on GPUs. Unlike previous formulations of graph matching, our approach is BFS-based by traversing the graph in an all-source-BFS manner and thus can be mapped onto GPUs in a massively parallel fashion. Our implementation uses the Gunrock programming model and we evaluate our implementation in runtime and memory consumption compared with previous state-of-the-art work. We sustain a peak traversed-edges-per-second (TEPS) rate of nearly 10 GTEPS. Our algorithm is the most scalable and parallel among all existing GPU implementations and also outperforms all existing CPU distributed implementations. This work specifically focuses on leveraging our implementation on the triangle counting problem for the Subgraph Isomorphism Graph Challenge 2019, demonstrating a geometric mean speedup over the 2018 champion of 3.84x.
In this paper, we study new batch-dynamic algorithms for the $k$-clique counting problem, which are dynamic algorithms where the updates are batches of edge insertions and deletions. We study this problem in the parallel setting, where the goal is to obtain algorithms with low (polylogarithmic) depth. Our first result is a new parallel batch-dynamic triangle counting algorithm with $O(Deltasqrt{Delta+m})$ amortized work and $O(log^* (Delta+m))$ depth with high probability, and $O(Delta+m)$ space for a batch of $Delta$ edge insertions or deletions. Our second result is an algebraic algorithm based on parallel fast matrix multiplication. Assuming that a parallel fast matrix multiplication algorithm exists with parallel matrix multiplication constant $omega_p$, the same algorithm solves dynamic $k$-clique counting with $Oleft(minleft(Delta m^{frac{(2k - 1)omega_p}{3(omega_p + 1)}}, (Delta+m)^{frac{2(k + 1)omega_p}{3(omega_p + 1)}}right)right)$ amortized work and $O(log (Delta+m))$ depth with high probability, and $Oleft((Delta+m)^{frac{2(k + 1)omega_p}{3(omega_p + 1)}}right)$ space. Using a recently developed parallel $k$-clique counting algorithm, we also obtain a simple batch-dynamic algorithm for $k$-clique counting on graphs with arboricity $alpha$ running in $O(Delta(m+Delta)alpha^{k-4})$ expected work and $O(log^{k-2} n)$ depth with high probability, and $O(m + Delta)$ space. Finally, we present a multicore CPU implementation of our parallel batch-dynamic triangle counting algorithm. On a 72-core machine with two-way hyper-threading, our implementation achieves 36.54--74.73x parallel speedup, and in certain cases achieves significant speedups over existing parallel algorithms for the problem, which are not theoretically-efficient.
In this work, we use algebraic methods for studying distance computation and subgraph detection tasks in the congested clique model. Specifically, we adapt parallel matrix multiplication implementations to the congested clique, obtaining an $O(n^{1-2/omega})$ round matrix multiplication algorithm, where $omega < 2.3728639$ is the exponent of matrix multiplication. In conjunction with known techniques from centralised algorithmics, this gives significant improvements over previous best upper bounds in the congested clique model. The highlight results include: -- triangle and 4-cycle counting in $O(n^{0.158})$ rounds, improving upon the $O(n^{1/3})$ triangle detection algorithm of Dolev et al. [DISC 2012], -- a $(1 + o(1))$-approximation of all-pairs shortest paths in $O(n^{0.158})$ rounds, improving upon the $tilde{O} (n^{1/2})$-round $(2 + o(1))$-approximation algorithm of Nanongkai [STOC 2014], and -- computing the girth in $O(n^{0.158})$ rounds, which is the first non-trivial solution in this model. In addition, we present a novel constant-round combinatorial algorithm for detecting 4-cycles.
We design fast deterministic algorithms for distance computation in the congested clique model. Our key contributions include: -- A $(2+epsilon)$-approximation for all-pairs shortest paths in $O(log^2{n} / epsilon)$ rounds on unweighted undirected graphs. With a small additional additive factor, this also applies for weighted graphs. This is the first sub-polynomial constant-factor approximation for APSP in this model. -- A $(1+epsilon)$-approximation for multi-source shortest paths from $O(sqrt{n})$ sources in $O(log^2{n} / epsilon)$ rounds on weighted undirected graphs. This is the first sub-polynomial algorithm obtaining this approximation for a set of sources of polynomial size. Our main techniques are new distance tools that are obtained via improved algorithms for sparse matrix multiplication, which we leverage to construct efficient hopsets and shortest paths. Furthermore, our techniques extend to additional distance problems for which we improve upon the state-of-the-art, including diameter approximation, and an exact single-source shortest paths algorithm for weighted undirected graphs in $tilde{O}(n^{1/6})$ rounds.
While algebrisation constitutes a powerful technique in the design and analysis of centralised algorithms, to date there have been hardly any applications of algebraic techniques in the context of distributed graph algorithms. This work is a case study that demonstrates the potential of algebrisation in the distributed context. We will focus on distributed graph algorithms in the congested clique model; the graph problems that we will consider include, e.g., the triangle detection problem and the all-pairs shortest path problem (APSP). There is plenty of prior work on combinatorial algorithms in the congested clique model: for example, Dolev et al. (DISC 2012) gave an algorithm for triangle detection with a running time of $tilde O(n^{1/3})$, and Nanongkai (STOC 2014) gave an approximation algorithm for APSP with a running time of $tilde O(n^{1/2})$. In this work, we will use algebraic techniques -- in particular, algorithms based on fast matrix multiplication -- to solve both triangle detection and the unweighted APSP in time $O(n^{0.15715})$; for weighted APSP, we give a $(1+o(1))$-approximation with this running time, as well as an exact $tilde O(n^{1/3})$ solution.