SubGraph2Vec: Highly-Vectorized Tree-likeSubgraph Counting

133 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Jiayu Li

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Langshi Chen - Jiayu Li - Ariful Azad

النظم الموزعة والتوازية والحوسبة العنقودية

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Subgraph counting aims to count occurrences of a template T in a given network G(V, E). It is a powerful graph analysis tool and has found real-world applications in diverse domains. Scaling subgraph counting problems is known to be memory bounded and computationally challenging with exponential complexity. Although scalable parallel algorithms are known for several graph problems such as Triangle Counting and PageRank, this is not common for counting complex subgraphs. Here we address this challenge and study connected acyclic graphs or trees. We propose a novel vectorized subgraph counting algorithm, named Subgraph2Vec, as well as both shared memory and distributed implementations: 1) reducing algorithmic complexity by minimizing neighbor traversal; 2) achieving a highly-vectorized implementation upon linear algebra kernels to significantly improve performance and hardware utilization. 3) Subgraph2Vec improves the overall performance over the state-of-the-art work by orders of magnitude and up to 660x on a single node. 4) Subgraph2Vec in distributed mode can scale up the template size to 20 and maintain good strong scalability. 5) enabling portability to both CPU and GPU.

قيم البحث

اقرأ أيضاً

Vectorized VByte Decoding

243 - Jeff Plaisance , Nathan Kurz , Daniel Lemire 2015

We consider the ubiquitous technique of VByte compression, which represents each integer as a variable length sequence of bytes. The low 7 bits of each byte encode a portion of the integer, and the high bit of each byte is reserved as a continuation flag. This flag is set to 1 for all bytes except the last, and the decoding of each integer is complete when a byte with a high bit of 0 is encountered. VByte decoding can be a performance bottleneck especially when the unpredictable lengths of the encoded integers cause frequent branch mispredictions. Previous attempts to accelerate VByte decoding using SIMD vector instructions have been disappointing, prodding search engines such as Google to use more complicated but faster-to-decode formats for performance-critical code. Our decoder (Masked VByte) is 2 to 4 times faster than a conventional scalar VByte decoder, making the format once again competitive with regard to speed.

استرجاع المعلومات

VHT: Vertical Hoeffding Tree

44 - Nicolas Kourtellis , Gianmarco De Francisci Morales , Albert Bifetn 2016

IoT Big Data requires new machine learning methods able to scale to large size of data arriving at high speed. Decision trees are popular machine learning models since they are very effective, yet easy to interpret and visualize. In the literature, w e can find distributed algorithms for learning decision trees, and also streaming algorithms, but not algorithms that combine both features. In this paper we present the Vertical Hoeffding Tree (VHT), the first distributed streaming algorithm for learning decision trees. It features a novel way of distributing decision trees via vertical parallelism. The algorithm is implemented on top of Apache SAMOA, a platform for mining distributed data streams, and thus able to run on real-world clusters. We run several experiments to study the accuracy and throughput performance of our new VHT algorithm, as well as its ability to scale while keeping its superior performance with respect to non-distributed decision trees.

النظم الموزعة والتوازية والحوسبة العنقودية الذكاء الاصطناعي قواعد البيانات

Efficient counting with optimal resilience

60 - Christoph Lenzen , Joel Rybicki , Jukka Suomela 2015

Consider a complete communication network of $n$ nodes, where the nodes receive a common clock pulse. We study the synchronous $c$-counting problem: given any starting state and up to $f$ faulty nodes with arbitrary behaviour, the task is to eventual ly have all correct nodes labeling the pulses with increasing values modulo $c$ in agreement. Thus, we are considering algorithms that are self-stabilising despite Byzantine failures. In this work, we give new algorithms for the synchronous counting problem that (1) are deterministic, (2) have optimal resilience, (3) have a linear stabilisation time in $f$ (asymptotically optimal), (4) use a small number of states, and consequently, (5) communicate a small number of bits per round. Prior algorithms either resort to randomisation, use a large number of states and need high communication bandwidth, or have suboptimal resilience. In particular, we achieve an exponential improvement in both state complexity and message size for deterministic algorithms. Moreover, we present two complementary approaches for reducing the number of bits communicated during and after stabilisation.

النظم الموزعة والتوازية والحوسبة العنقودية

K-Clique Counting on GPUs

309 - Mohammad Almasri , Izzat El Hajj , Rakesh Nagi 2021

Counting k-cliques in a graph is an important problem in graph analysis with many applications. Counting k-cliques is typically done by traversing search trees starting at each vertex in the graph. An important optimization is to eliminate search tre e branches that discover the same clique redundantly. Eliminating redundant clique discovery is typically done via graph orientation or pivoting. Parallel implementations for both of these approaches have demonstrated promising performance on CPUs. In this paper, we present our GPU implementations of k-clique counting for both the graph orientation and pivoting approaches. Our implementations explore both vertex-centric and edge-centric parallelization schemes, and replace recursive search tree traversal with iterative traversal based on an explicitly-managed shared stack. We also apply various optimizations to reduce memory consumption and improve the utilization of parallel execution resources. Our evaluation shows that our best GPU implementation outperforms the best state-of-the-art parallel CPU implementation by a geometric mean speedup of 12.39x, 6.21x, and 18.99x for k = 4, 7, and 10, respectively. We also evaluate the impact of the choice of parallelization scheme and the incremental speedup of each optimization. Our code will be open-sourced to enable further research on parallelizing k-clique counting on GPUs.

النظم الموزعة والتوازية والحوسبة العنقودية بنى وهياكل البيانات والخوارزميات

Fast BFS-Based Triangle Counting on GPUs

77 - Leyuan Wang , John D. Owens 2019

In this paper, we propose a novel method to compute triangle counting on GPUs. Unlike previous formulations of graph matching, our approach is BFS-based by traversing the graph in an all-source-BFS manner and thus can be mapped onto GPUs in a massive ly parallel fashion. Our implementation uses the Gunrock programming model and we evaluate our implementation in runtime and memory consumption compared with previous state-of-the-art work. We sustain a peak traversed-edges-per-second (TEPS) rate of nearly 10 GTEPS. Our algorithm is the most scalable and parallel among all existing GPU implementations and also outperforms all existing CPU distributed implementations. This work specifically focuses on leveraging our implementation on the triangle counting problem for the Subgraph Isomorphism Graph Challenge 2019, demonstrating a geometric mean speedup over the 2018 champion of 3.84x.

النظم الموزعة والتوازية والحوسبة العنقودية

سجل دخول لتتمكن من نشر تعليقات