
Cache-Efficient Fork-Processing Patterns on Large Graphs

Added by: Shengliang Lu
Publication date: 2021
Language: English





As large-scale graph processing emerges, we observe a costly fork-processing pattern (FPP) that is common to many graph algorithms. The distinguishing feature of an FPP is that it launches many independent queries from different source vertices on the same graph. For example, an algorithm that analyzes the network community profile can execute Personalized PageRank from tens of thousands of source vertices at the same time. We study the efficiency of handling FPPs in state-of-the-art graph processing systems on multi-core architectures and find that those systems suffer severe cache-miss penalties because of the irregular and uncoordinated memory accesses incurred while processing FPPs. In this paper, we propose ForkGraph, a cache-efficient FPP processing system for multi-core architectures. To improve cache reuse, we divide the graph into partitions, each sized to fit the last-level cache (LLC), and the queries in an FPP are buffered and executed on a per-partition basis. We further develop efficient intra- and inter-partition execution strategies. For intra-partition processing, since the graph partition fits into the LLC, we execute each graph query with an efficient sequential algorithm (in contrast to the parallel algorithms used in existing parallel graph processing systems) and present atomic-free query processing that consolidates contending operations onto the cache-resident graph partition. For inter-partition processing, we propose yielding and priority-based scheduling to reduce redundant work. Furthermore, we prove theoretically that ForkGraph is work-efficient: it performs the same amount of work, to within a constant factor, as the fastest known sequential algorithms for the queries in an FPP. Our evaluations on real-world graphs show that ForkGraph significantly outperforms state-of-the-art graph processing systems, with speedups of two orders of magnitude.
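
To make the partition-buffered idea concrete, here is a minimal single-threaded sketch (not ForkGraph's actual implementation) of running many unit-weight shortest-path queries with partition-at-a-time scheduling. The function name, data layout, and scheduling rule are illustrative assumptions; a label-correcting update keeps the distances correct regardless of the order in which partitions are drained.

from collections import deque

def run_fpp_sssp(adj, partition_of, num_partitions, sources):
    # adj: vertex -> list of neighbors; partition_of: vertex -> partition id.
    dist = {s: {s: 0} for s in sources}             # per-query distance maps
    buffers = [deque() for _ in range(num_partitions)]
    for s in sources:                               # seed every query at its source
        buffers[partition_of[s]].append((s, s))     # (query id, vertex)
    while any(buffers):
        # Drain the partition with the most buffered work first: a crude
        # stand-in for ForkGraph's priority-based scheduling.
        p = max(range(num_partitions), key=lambda i: len(buffers[i]))
        while buffers[p]:                           # sequential and atomic-free
            q, u = buffers[p].popleft()
            du = dist[q][u]
            for v in adj[u]:
                # Label-correcting update: stays correct under any drain order.
                if dist[q].get(v, float("inf")) > du + 1:
                    dist[q][v] = du + 1
                    buffers[partition_of[v]].append((q, v))
    return dist

The point of the structure is that every query touching a partition does its work while that partition is (in the real system) cache-resident, instead of each query wandering over the whole graph independently.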



Related research

We propose algorithms for performing multiway joins using a new type of coarse-grained reconfigurable hardware accelerator, Plasticine, which, compared with other accelerators, emphasizes high compute capability and high on-chip communication bandwidth. Joining three or more relations in a single step (a multiway join) is efficient when the join of any two relations would yield too large an intermediate relation. We show at least a 200X speedup for a sequence of binary hash joins executed on Plasticine over a CPU. We further show that in some realistic cases, a Plasticine-like accelerator can make three-way joins more efficient than a cascade of binary hash joins on the same hardware, by a factor of up to 45X.
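
The paper's contribution is the mapping onto Plasticine hardware, but the algorithmic contrast is easy to sketch in software. The toy Python below (the relation names R(a,b), S(b,c), T(c,d) are assumptions for illustration) shows a cascade of binary hash joins that materializes the intermediate relation versus a single three-way pass that probes both hash tables and never builds it.

from collections import defaultdict

def binary_join_cascade(R, S, T):
    s_by_b = defaultdict(list)
    for b, c in S:
        s_by_b[b].append(c)
    RS = [(a, b, c) for a, b in R for c in s_by_b[b]]   # intermediate materialized
    t_by_c = defaultdict(list)
    for c, d in T:
        t_by_c[c].append(d)
    return [(a, b, c, d) for a, b, c in RS for d in t_by_c[c]]

def multiway_join(R, S, T):
    s_by_b = defaultdict(list)
    for b, c in S:
        s_by_b[b].append(c)
    t_by_c = defaultdict(list)
    for c, d in T:
        t_by_c[c].append(d)
    # Stream R once, probing both hash tables; R join S is never built.
    return [(a, b, c, d)
            for a, b in R
            for c in s_by_b[b]
            for d in t_by_c[c]]

Both functions return the same tuples; the second trades the intermediate materialization for extra probes, which is exactly the trade-off that favors multiway joins when the pairwise join result is large.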
Large-scale graph-structured data arising from social networks, databases, knowledge bases, web graphs, and so on is now available for analysis and mining. Graph mining often involves relationship queries, which seek a ranked list of interesting interconnections among a given set of entities corresponding to nodes in the graph. While relationship queries have been studied for many years under various terminologies (e.g., keyword search, Steiner tree in a graph), the solutions proposed in the literature so far have not focused on scaling relationship queries to large graphs with billions of nodes and edges, such as those now publicly available as linked open data. In this paper, we present an algorithm for distributed keyword search (DKS) on large graphs, based on the graph-parallel computing paradigm Pregel. We also present an analytical proof that our algorithm produces an optimally ranked list of answers if run to completion. Even if terminated early, our algorithm produces approximate answers along with bounds. We describe an optimized implementation of our DKS algorithm along with a time-complexity analysis. Finally, we report and analyze experiments using an implementation of DKS on Giraph, the graph-parallel computing framework based on Pregel, and demonstrate that we can efficiently process relationship queries on large-scale subsets of linked open data.
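
The paper itself gives a Pregel-based distributed algorithm with ranking bounds; as a rough single-machine analogue of the underlying idea (all names here are illustrative assumptions, not the DKS algorithm), the sketch below scores candidate answer roots by their summed BFS distance to every keyword, which is one common formulation of ranked relationship queries.

from collections import deque

def keyword_search(adj, keyword_nodes, top_k=5):
    # adj: vertex -> iterable of neighbors (treated as undirected here);
    # keyword_nodes: one set of matching vertices per query keyword.
    dist = [dict() for _ in keyword_nodes]
    for i, seeds in enumerate(keyword_nodes):
        frontier = deque(seeds)
        for s in seeds:
            dist[i][s] = 0
        while frontier:                  # plain BFS from each keyword set
            u = frontier.popleft()
            for v in adj[u]:
                if v not in dist[i]:
                    dist[i][v] = dist[i][u] + 1
                    frontier.append(v)
    # A root is an answer only if every keyword reaches it; rank roots by
    # the sum of their distances to all keywords (smaller is better).
    common = set.intersection(*(set(d) for d in dist))
    scored = sorted((sum(d[v] for d in dist), v) for v in common)
    return scored[:top_k]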
Inspired by the success of Google's Pregel, many systems have been developed recently for iterative computation over big graphs. These systems provide a user-friendly vertex-centric programming interface, where a programmer only needs to specify the behavior of one generic vertex when developing a parallel graph algorithm. However, most existing systems require the input graph to reside in the memory of the machines in a cluster, and the few out-of-core systems suffer from problems such as poor efficiency on sparse computation workloads, high demand on network bandwidth, and the expensive cost of external-memory join and group-by operations. In this paper, we introduce GraphD, a system that lets users process very large graphs with ordinary computing resources. GraphD fully overlaps computation with communication by streaming edges and messages on local disks while transmitting messages in parallel. For the broad class of Pregel algorithms where a message combiner is applicable, GraphD eliminates the need for any expensive external-memory join or group-by. These key techniques allow GraphD to achieve performance comparable to in-memory Pregel-like systems without keeping edges and messages in memory. We prove that to process a graph G=(V, E) with n machines using GraphD, each machine requires only O(|V|/n) memory, allowing GraphD to scale to very large graphs with a small cluster. Extensive experiments show that GraphD beats existing out-of-core systems by orders of magnitude and achieves performance comparable to in-memory systems running with sufficient memory.
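
The key statement here is that a message combiner removes the external-memory group-by: messages bound for the same destination vertex can be folded into a single value as they are produced, so no message file ever needs sorting or grouping. The sketch below is an in-memory, single-machine analogue (not GraphD's disk-streaming implementation) showing the pattern for single-source shortest paths with a min-combiner.

INF = float("inf")

def pregel_sssp(adj, source, num_vertices):
    # adj: vertex id -> list of (neighbor id, edge weight).
    value = [INF] * num_vertices
    inbox = {source: 0.0}               # one combined message per vertex
    while inbox:                        # one iteration = one superstep
        outbox = {}
        for v, msg in inbox.items():    # only vertices with mail are active
            if msg < value[v]:
                value[v] = msg
                for u, w in adj[v]:
                    candidate = value[v] + w
                    # Combine on the fly: keep only the minimum message per
                    # destination instead of a list that needs a group-by.
                    if candidate < outbox.get(u, INF):
                        outbox[u] = candidate
        inbox = outbox
    return value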
The reachability query is a fundamental problem on graphs that has been extensively studied in academia and industry. Since graphs are subject to frequent updates in many applications, it is essential to support efficient graph updates while offering good reachability-query performance. Existing solutions compress the original graph into a Directed Acyclic Graph (DAG) and propose efficient query processing and index update techniques. However, they focus on optimizing scenarios where the Strongly Connected Components (SCCs) remain unchanged and have overlooked the prohibitively high cost of DAG maintenance when SCCs are updated. In this paper, we propose DBL, an efficient DAG-free index that supports reachability queries on dynamic graphs with insertion-only updates. DBL builds on two complementary indexes: the Dynamic Landmark (DL) label and the Bidirectional Leaf (BL) label. The former leverages landmark nodes to quickly determine reachable pairs, whereas the latter prunes unreachable pairs by indexing the leaf nodes in the graph. We evaluate DBL against state-of-the-art approaches to dynamic reachability indexing with extensive experiments on real-world datasets. The results demonstrate that DBL achieves orders-of-magnitude speedups in index updates while still delivering competitive query efficiency.
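
The query flow the abstract describes decomposes into three steps: one label can only confirm reachability, the other can only refute it, and an online traversal answers the residual cases. In the sketch below, dl_confirms and bl_refutes are hypothetical callables standing in for the DL and BL label lookups (the actual label definitions and set operations are in the paper); only the control flow is shown.

from collections import deque

def reachable(u, v, dl_confirms, bl_refutes, adj):
    # dl_confirms / bl_refutes: hypothetical stand-ins for the DL and BL
    # label lookups (constant-ish set operations over the labels).
    if dl_confirms(u, v):               # a landmark connects u to v
        return True
    if bl_refutes(u, v):                # the leaf label rules the pair out
        return False
    frontier, seen = deque([u]), {u}    # residual case: plain online BFS
    while frontier:
        x = frontier.popleft()
        if x == v:
            return True
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                frontier.append(y)
    return False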
Emerging user-centric graph applications such as route planning and personalized social network analysis have initiated a paradigm shift in modern graph processing systems towards multi-query analysis, i.e., processing multiple graph queries in parallel on a shared graph. These applications generate a dynamic number of localized queries around query hotspots such as popular urban areas. However, existing graph processing systems are not yet tailored to these properties: their methods for graph partitioning and synchronization management disregard query locality and dynamism, which leads to high query latency. To this end, we propose Q-Graph, a system for multi-query graph analysis that considers query locality on three levels. (i) The query-aware graph partitioning algorithm Q-cut maximizes query locality to reduce communication overhead. (ii) The method for synchronization management, called hybrid barrier synchronization, allows full exploitation of local queries spanning only a subset of partitions. (iii) Both methods adapt at runtime to changing query workloads in order to maintain and exploit locality. Our experiments show that Q-cut reduces average query latency by up to 57 percent compared with static query-agnostic partitioning algorithms.
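
Q-cut itself adapts at runtime; as a toy illustration of the query-locality objective only (not the Q-cut algorithm, and with all names assumed for the example), the sketch below greedily grows one partition around each query hotspot so that vertices touched by the same local queries tend to land in the same partition.

from collections import deque

def hotspot_partition(adj, hotspots, num_vertices):
    # One seed hotspot per desired partition; vertex ids are 0..num_vertices-1.
    part = {h: p for p, h in enumerate(hotspots)}   # vertex -> partition id
    frontiers = [deque([h]) for h in hotspots]
    while len(part) < num_vertices:
        progressed = False
        for p, frontier in enumerate(frontiers):    # grow partitions in turn
            if not frontier:
                continue
            u = frontier.popleft()
            for v in adj[u]:
                if v not in part:                   # claim unassigned neighbors
                    part[v] = p
                    frontier.append(v)
            progressed = True
        if not progressed:                          # unreachable leftovers
            for u in range(num_vertices):
                part.setdefault(u, 0)
    return part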