In this paper, we study new batch-dynamic algorithms for the $k$-clique counting problem, which are dynamic algorithms where the updates are batches of edge insertions and deletions. We study this problem in the parallel setting, where the goal is to obtain algorithms with low (polylogarithmic) depth. Our first result is a new parallel batch-dynamic triangle counting algorithm with $O(\Delta\sqrt{\Delta+m})$ amortized work and $O(\log^*(\Delta+m))$ depth with high probability, and $O(\Delta+m)$ space for a batch of $\Delta$ edge insertions or deletions. Our second result is an algebraic algorithm based on parallel fast matrix multiplication. Assuming that a parallel fast matrix multiplication algorithm exists with parallel matrix multiplication constant $\omega_p$, this algorithm solves dynamic $k$-clique counting with $O\left(\min\left(\Delta m^{\frac{(2k-1)\omega_p}{3(\omega_p+1)}}, (\Delta+m)^{\frac{2(k+1)\omega_p}{3(\omega_p+1)}}\right)\right)$ amortized work and $O(\log(\Delta+m))$ depth with high probability, and $O\left((\Delta+m)^{\frac{2(k+1)\omega_p}{3(\omega_p+1)}}\right)$ space. Using a recently developed parallel $k$-clique counting algorithm, we also obtain a simple batch-dynamic algorithm for $k$-clique counting on graphs with arboricity $\alpha$ running in $O(\Delta(m+\Delta)\alpha^{k-4})$ expected work and $O(\log^{k-2} n)$ depth with high probability, and $O(m+\Delta)$ space. Finally, we present a multicore CPU implementation of our parallel batch-dynamic triangle counting algorithm. On a 72-core machine with two-way hyper-threading, our implementation achieves 36.54--74.73x parallel speedup, and in certain cases achieves significant speedups over existing parallel algorithms for the problem, which are not theoretically efficient.
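To make the dynamic-counting setting concrete, the following is a minimal sequential Python sketch of the bookkeeping a batch-dynamic triangle counter performs: adjacency sets plus a running count, updated edge by edge via neighborhood intersections. All names are ours, and this is only an illustration of the invariant being maintained, not the paper's parallel algorithm, which processes each batch in parallel to achieve the stated work, depth, and space bounds.

    from collections import defaultdict

    class DynamicTriangleCounter:
        """Sketch: maintain a global triangle count under edge updates."""

        def __init__(self):
            self.adj = defaultdict(set)  # vertex -> set of neighbors
            self.triangles = 0

        def insert_batch(self, edges):
            for u, v in edges:
                if u == v or v in self.adj[u]:
                    continue  # skip self-loops and duplicate edges
                # each common neighbor of u and v closes one new triangle
                self.triangles += len(self.adj[u] & self.adj[v])
                self.adj[u].add(v)
                self.adj[v].add(u)

        def delete_batch(self, edges):
            for u, v in edges:
                if v not in self.adj[u]:
                    continue  # edge not present
                self.adj[u].discard(v)
                self.adj[v].discard(u)
                # every remaining common neighbor formed a triangle with (u, v)
                self.triangles -= len(self.adj[u] & self.adj[v])

For example, inserting the batch [(0, 1), (1, 2), (0, 2)] leaves the counter at 1 triangle, and deleting any one of those edges brings it back to 0.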
Maintaining a $k$-core decomposition quickly in a dynamic graph is an important problem in many applications, including social network analytics, graph visualization, centrality measure computations, and community detection algorithms. The main challenge in designing efficient $k$-core decomposition algorithms is that a single change to the graph can cause the decomposition to change significantly. We present the first parallel batch-dynamic algorithm for maintaining an approximate $k$-core decomposition that is efficient in both theory and practice. Given an initial graph with $m$ edges, and a batch of $B$ updates, our algorithm maintains a $(2+\delta)$-approximation of the coreness values for all vertices (for any constant $\delta > 0$) in $O(B\log^2 m)$ amortized work and $O(\log^2 m \log\log m)$ depth (parallel time) with high probability. Our algorithm also maintains a low out-degree orientation of the graph in the same bounds. We implemented and experimentally evaluated our algorithm on a 30-core machine with two-way hyper-threading on $11$ graphs of varying densities and sizes. Compared to the state-of-the-art algorithms, our algorithm achieves up to a 114.52x speedup against the best multicore implementation and up to a 497.63x speedup against the best sequential algorithm, obtaining results for graphs that are orders of magnitude larger than those used in previous studies. In addition, we present the first approximate static $k$-core algorithm with linear work and polylogarithmic depth. We show that on a 30-core machine with two-way hyper-threading, our implementation achieves up to a 3.9x speedup in the static case over the previous state-of-the-art parallel algorithm.
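For reference, the coreness values being approximated are those computed by the classic sequential bucket-based peeling algorithm, sketched below in Python (illustrative code of ours, not the paper's parallel data structure). It runs in $O(m+n)$ time but is inherently sequential; the paper's contribution is maintaining a $(2+\delta)$-approximation of these values under batches of updates with polylogarithmic depth.

    def coreness(adj):
        """Exact coreness of every vertex via bucket-based peeling.

        adj: dict mapping each vertex to the set of its neighbors.
        """
        deg = {v: len(nbrs) for v, nbrs in adj.items()}
        max_deg = max(deg.values(), default=0)
        buckets = [set() for _ in range(max_deg + 1)]
        for v, d in deg.items():
            buckets[d].add(v)
        core = {}
        d = 0
        while len(core) < len(adj):
            if buckets[d]:
                v = buckets[d].pop()
                core[v] = d  # v is peeled at level d
                for w in adj[v]:
                    # degrees are clamped at d, so no vertex falls below level d
                    if w not in core and deg[w] > d:
                        buckets[deg[w]].discard(w)
                        deg[w] -= 1
                        buckets[deg[w]].add(w)
            else:
                d += 1
        return core

On a triangle with a pendant vertex attached, e.g. adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}, this returns coreness 2 for the three triangle vertices and 1 for the pendant vertex.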
Subgraph counting is a fundamental problem in analyzing massive graphs, often studied in the context of social and complex networks. There is a rich literature on designing efficient, accurate, and scalable algorithms for this problem. In this work, we tackle this challenge and design several new algorithms for subgraph counting in the Massively Parallel Computation (MPC) model: Given a graph $G$ over $n$ vertices, $m$ edges, and $T$ triangles, our first main result is an algorithm that, with high probability, outputs a $(1+\varepsilon)$-approximation to $T$, with optimal round and space complexity provided the space per machine satisfies $S \geq \max\{\sqrt{m}, n^2/m\}$, assuming $T=\Omega(\sqrt{m/n})$. Our second main result is an $\tilde{O}_{\delta}(\log\log n)$-round algorithm for exactly counting the number of triangles, parametrized by the arboricity $\alpha$ of the input graph. The space per machine is $O(n^{\delta})$ for any constant $\delta$, and the total space is $O(m\alpha)$, which matches the time complexity of (combinatorial) triangle counting in the sequential model. We also prove that this result can be extended to exactly counting $k$-cliques for any constant $k$, with the same round complexity and total space $O(m\alpha^{k-2})$. Alternatively, allowing $O(\alpha^2)$ space per machine, the total space requirement reduces to $O(n\alpha^2)$. Finally, we prove that a recent result of Bera, Pashanasangi, and Seshadhri (ITCS 2020) for exactly counting all subgraphs of size at most $5$ can be implemented in the MPC model in $\tilde{O}_{\delta}(\sqrt{\log n})$ rounds, $O(n^{\delta})$ space per machine, and $O(m\alpha^3)$ total space. Therefore, this result also exhibits the phenomenon that a time bound in the sequential model translates to a space bound in the MPC model.
Counting k-cliques in a graph is an important problem in graph analysis with many applications. Counting k-cliques is typically done by traversing search trees starting at each vertex in the graph. An important optimization is to eliminate search tree branches that discover the same clique redundantly. Eliminating redundant clique discovery is typically done via graph orientation or pivoting. Parallel implementations for both of these approaches have demonstrated promising performance on CPUs. In this paper, we present our GPU implementations of k-clique counting for both the graph orientation and pivoting approaches. Our implementations explore both vertex-centric and edge-centric parallelization schemes, and replace recursive search tree traversal with iterative traversal based on an explicitly-managed shared stack. We also apply various optimizations to reduce memory consumption and improve the utilization of parallel execution resources. Our evaluation shows that our best GPU implementation outperforms the best state-of-the-art parallel CPU implementation by a geometric mean speedup of 12.39x, 6.21x, and 18.99x for k = 4, 7, and 10, respectively. We also evaluate the impact of the choice of parallelization scheme and the incremental speedup of each optimization. Our code will be open-sourced to enable further research on parallelizing k-clique counting on GPUs.
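As background for the two approaches, here is a minimal sequential Python sketch of orientation-based k-clique counting (illustrative code of ours, not the paper's GPU kernels). Orienting every edge from lower to higher rank in a fixed total order means each clique is enumerated exactly once, in increasing rank order; this is the redundancy elimination that the paper parallelizes, replacing the recursion below with iterative traversal over an explicitly-managed stack on the GPU.

    def count_k_cliques(adj, k):
        """Count k-cliques by traversing out-neighborhoods of an orientation.

        adj: dict mapping each vertex to the set of its neighbors.
        """
        # Rank vertices by (degree, id) and orient edges toward higher rank;
        # a degeneracy ordering gives the strongest theoretical bounds.
        order = sorted(adj, key=lambda v: (len(adj[v]), v))
        rank = {v: i for i, v in enumerate(order)}
        out = {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}

        def extend(cands, need):
            # cands: vertices adjacent to every clique vertex chosen so far,
            # all ranked above the most recently chosen vertex
            if need == 0:
                return 1
            if need == 1:
                return len(cands)
            return sum(extend(cands & out[v], need - 1) for v in cands)

        return sum(extend(out[v], k - 1) for v in adj)

On the complete graph on 4 vertices, count_k_cliques returns 4 for k = 3 and 1 for k = 4, with no clique discovered twice.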
This paper gives a new deterministic algorithm for the dynamic Minimum Spanning Forest (MSF) problem in the EREW PRAM model, where the goal is to maintain an MSF of a weighted graph with $n$ vertices and $m$ edges while supporting edge insertions and deletions. We show that one can solve the dynamic MSF problem using $O(\sqrt{n})$ processors and $O(\log n)$ worst-case update time, for a total of $O(\sqrt{n}\log n)$ work. This improves on the work of Ferragina [IPPS 1995], which costs $O(\log n)$ worst-case update time and $O(n^{2/3}\log\frac{m}{n})$ work.
We give the first Congested Clique algorithm that computes a sparse hopset with polylogarithmic hopbound in polylogarithmic time. Given a graph $G=(V,E)$, a $(\beta,\epsilon)$-hopset $H$ with hopbound $\beta$ is a set of edges added to $G$ such that for any pair of nodes $u$ and $v$ in $G$, there is a path with at most $\beta$ hops in $G \cup H$ whose length is within $(1+\epsilon)$ of the shortest path between $u$ and $v$ in $G$. Our hopsets are significantly sparser than the recent construction of Censor-Hillel et al. [6], which constructs a hopset of size $\tilde{O}(n^{3/2})$ but with a smaller polylogarithmic hopbound. On the other hand, the previously known constructions of sparse hopsets with polylogarithmic hopbound in the Congested Clique model, proposed by Elkin and Neiman [10], [11], [12], all require a polynomial number of rounds. One tool that we use is an efficient algorithm that constructs an $\ell$-limited neighborhood cover, which may be of independent interest. Finally, as a side result, we also give a hopset construction in a variant of the low-memory Massively Parallel Computation model, with improved running time over existing algorithms.
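Restated compactly: writing $d_G(u,v)$ for the shortest-path distance in $G$ and $d^{(\beta)}_{G \cup H}(u,v)$ for the shortest distance over paths with at most $\beta$ edges, the hopset guarantee from the definition above reads

    \[
      d_G(u,v) \;\le\; d^{(\beta)}_{G \cup H}(u,v) \;\le\; (1+\epsilon)\, d_G(u,v)
      \qquad \text{for all } u, v \in V.
    \]

The first inequality holds under the standard convention (assumed here) that each hopset edge $(u,v) \in H$ carries weight $d_G(u,v)$, so adding $H$ can never shorten distances below those in $G$.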