We implement exact triangle counting in graphs on the GPU using three different methodologies: subgraph matching to a triangle pattern; programmable graph analytics, with a set-intersection approach; and a matrix formulation based on sparse matrix-matrix multiplies. All three deliver best-of-class performance over CPU implementations and over comparable GPU implementations, with the graph-analytic approach achieving the best performance due to its ability to exploit efficient filtering steps to remove unnecessary work and its high-performance set-intersection core.
In this paper, we propose a novel method to compute triangle counting on GPUs. Unlike previous formulations of graph matching, our approach is BFS-based by traversing the graph in an all-source-BFS manner and thus can be mapped onto GPUs in a massively parallel fashion. Our implementation uses the Gunrock programming model and we evaluate our implementation in runtime and memory consumption compared with previous state-of-the-art work. We sustain a peak traversed-edges-per-second (TEPS) rate of nearly 10 GTEPS. Our algorithm is the most scalable and parallel among all existing GPU implementations and also outperforms all existing CPU distributed implementations. This work specifically focuses on leveraging our implementation on the triangle counting problem for the Subgraph Isomorphism Graph Challenge 2019, demonstrating a geometric mean speedup over the 2018 champion of 3.84x.
Graphics Processing Unit (GPU) computing is becoming an alternate computing platform for numerical simulations. However, it is not clear which numerical scheme will provide the highest computational efficiency for different types of problems. To this end, numerical accuracies and computational work of several numerical methods are compared using a GPU computing implementation. The Correction Procedure via Reconstruction (CPR), Discontinuous Galerkin (DG), Nodal Discontinuous Galerkin (NDG), Spectral Difference (SD), and Finite Volume (FV) methods are investigated using various reconstruction orders. Both smooth and discontinuous cases are considered for two-dimensional simulations. For discontinuous problems, MUSCL schemes are employed with FV, while CPR, DG, NDG, and SD use slope limiting. The computation time to reach a set error criteria and total time to complete solutions are compared across the methods. It is shown that while FV methods can produce solutions with low computational times, they produce larger errors than high-order methods for smooth problems at the same order of accuracy. For discontinuous problems, the methods show good agreement with one another in terms of solution profiles, and the total computational times between FV, CPR, and SD are comparable.
We study the problem of finding large cuts in $d$-regular triangle-free graphs. In prior work, Shearer (1992) gives a randomised algorithm that finds a cut of expected size $(1/2 + 0.177/sqrt{d})m$, where $m$ is the number of edges. We give a simpler algorithm that does much better: it finds a cut of expected size $(1/2 + 0.28125/sqrt{d})m$. As a corollary, this shows that in any $d$-regular triangle-free graph there exists a cut of at least this size. Our algorithm can be interpreted as a very efficient randomised distributed algorithm: each node needs to produce only one random bit, and the algorithm runs in one synchronous communication round. This work is also a case study of applying computational techniques in the design of distributed algorithms: our algorithm was designed by a computer program that searched for optimal algorithms for small values of $d$.
Triangle counting is a building block for a wide range of graph applications. Traditional wisdom suggests that i) hashing is not suitable for triangle counting, ii) edge-centric triangle counting beats vertex-centric design, and iii) communication-free and workload balanced graph partitioning is a grand challenge for triangle counting. On the contrary, we advocate that i) hashing can help the key operations for scalable triangle counting on Graphics Processing Units (GPUs), i.e., list intersection and graph partitioning, ii)vertex-centric design reduces both hash table construction cost and memory consumption, which is limited on GPUs. In addition, iii) we exploit graph and workload collaborative, and hashing-based 2D partitioning to scale vertex-centric triangle counting over 1,000 GPUswith sustained scalability. In this work, we present TRUST which performs triangle counting with the hash operation and vertex-centric mechanism at the core. To the best of our knowledge, TRUSTis the first work that achieves over one trillion Traversed Edges Per Second (TEPS) rate for triangle counting.
In this paper we describe the research and development activities in the Center for Efficient Exascale Discretization within the US Exascale Computing Project, targeting state-of-the-art high-order finite-element algorithms for high-order applications on GPU-accelerated platforms. We discuss the GPU developments in several components of the CEED software stack, including the libCEED, MAGMA, MFEM, libParanumal, and Nek projects. We report performance and capability improvements in several CEED-enabled applications on both NVIDIA and AMD GPU systems.