ترغب بنشر مسار تعليمي؟ اضغط هنا

Counting Triangles in Massive Graphs with MapReduce

58   0   0.0 ( 0 )
 نشر من قبل Tamara Kolda
 تاريخ النشر 2013
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

Graphs and networks are used to model interactions in a variety of contexts. There is a growing need to quickly assess the characteristics of a graph in order to understand its underlying structure. Some of the most useful metrics are triangle-based and give a measure of the connectedness of mutual friends. This is often summarized in terms of clustering coefficients, which measure the likelihood that two neighbors of a node are themselves connected. Computing these measures exactly for large-scale networks is prohibitively expensive in both memory and time. However, a recent wedge sampling algorithm has proved successful in efficiently and accurately estimating clustering coefficients. In this paper, we describe how to implement this approach in MapReduce to deal with massive graphs. We show results on publicly-available networks, the largest of which is 132M nodes and 4.7B edges, as well as artificially generated networks (using the Graph500 benchmark), the largest of which has 240M nodes and 8.5B edges. We can estimate the clustering coefficient by degree bin (e.g., we use exponential binning) and the number of triangles per bin, as well as the global clustering coefficient and total number of triangles, in an average of 0.33 seconds per million edges plus overhead (approximately 225 seconds total for our configuration). The technique can also be used to study triangle statistics such as the ratio of the highest and lowest degree, and we highlight differences between social and non-social networks. To the best of our knowledge, these are the largest triangle-based graph computations published to date.

قيم البحث

اقرأ أيضاً

Matching problems are ubiquitous. They occur in economic markets, labor markets, internet advertising, and elsewhere. In this paper we focus on an application of matching for social media. Our goal is to distribute content from information suppliers to information consumers. We seek to maximize the overall relevance of the matched content from suppliers to consumers while regulating the overall activity, e.g., ensuring that no consumer is overwhelmed with data and that all suppliers have chances to deliver their content. We propose two matching algorithms, GreedyMR and StackMR, geared for the MapReduce paradigm. Both algorithms have provable approximation guarantees, and in practice they produce high-quality solutions. While both algorithms scale extremely well, we can show that StackMR requires only a poly-logarithmic number of MapReduce steps, making it an attractive option for applications with very large datasets. We experimentally show the trade-offs between quality and efficiency of our solutions on two large datasets coming from real-world social-media web sites.
We study the problem of approximating the $3$-profile of a large graph. $3$-profiles are generalizations of triangle counts that specify the number of times a small graph appears as an induced subgraph of a large graph. Our algorithm uses the novel c oncept of $3$-profile sparsifiers: sparse graphs that can be used to approximate the full $3$-profile counts for a given large graph. Further, we study the problem of estimating local and ego $3$-profiles, two graph quantities that characterize the local neighborhood of each vertex of a graph. Our algorithm is distributed and operates as a vertex program over the GraphLab PowerGraph framework. We introduce the concept of edge pivoting which allows us to collect $2$-hop information without maintaining an explicit $2$-hop neighborhood list at each vertex. This enables the computation of all the local $3$-profiles in parallel with minimal communication. We test out implementation in several experiments scaling up to $640$ cores on Amazon EC2. We find that our algorithm can estimate the $3$-profile of a graph in approximately the same time as triangle counting. For the harder problem of ego $3$-profiles, we introduce an algorithm that can estimate profiles of hundreds of thousands of vertices in parallel, in the timescale of minutes.
Alon and Yuster proved that the number of orientations of any $n$-vertex graph in which every $K_3$ is transitively oriented is at most $2^{lfloor n^2/4rfloor}$ for $n geq 10^4$ and conjectured that the precise lower bound on $n$ should be $n geq 8$. We confirm their conjecture and, additionally, characterize the extremal families by showing that the balanced complete bipartite graph with $n$ vertices is the only $n$-vertex graph for which there are exactly $2^{lfloor n^2/4rfloor}$ such orientations.
With the wealth of high-throughput sequencing data generated by recent large-scale consortia, predictive gene expression modelling has become an important tool for integrative analysis of transcriptomic and epigenetic data. However, sequencing data-s ets are characteristically large, and previously modelling frameworks are typically inefficient and unable to leverage multi-core or distributed processing architectures. In this study, we detail an efficient and parallelised MapReduce implementation of gene expression modelling. We leverage the computational efficiency of this framework to provide an integrative analysis of over fifty histone modification data-sets across a variety of cancerous and non-cancerous cell-lines. Our results demonstrate that the genome-wide relationships between histone modifications and mRNA transcription are lineage, tissue and karyotype-invariant, and that models trained on matched epigenetic/transcriptomic data from non-cancerous cell-lines are able to predict cancerous expression with equivalent genome-wide fidelity.
Bollobas and Nikiforov [J. Combin. Theory, Ser. B. 97 (2007) 859--865] conjectured the following. If $G$ is a $K_{r+1}$-free graph on at least $r+1$ vertices and $m$ edges, then $lambda^2_1(G)+lambda^2_2(G)leq frac{r-1}{r}cdot2m$, where $lambda_1(G)$ and $lambda_2(G)$ are the largest and the second largest eigenvalues of the adjacency matrix $A(G)$, respectively. In this paper, we confirm the conjecture in the case $r=2$, by using tools from doubly stochastic matrix theory, and also characterize all families of extremal graphs. Motivated by classic theorems due to ErdH{o}s and Nosal respectively, we prove that every non-bipartite graph $G$ of order $n$ and size $m$ contains a triangle, if one of the following is true: (1) $lambda_1(G)geqsqrt{m-1}$ and $G eq C_5cup (n-5)K_1$; and (2) $lambda_1(G)geq lambda_1(S(K_{lfloorfrac{n-1}{2}rfloor,lceilfrac{n-1}{2}rceil}))$ and $G eq S(K_{lfloorfrac{n-1}{2}rfloor,lceilfrac{n-1}{2}rceil})$, where $S(K_{lfloorfrac{n-1}{2}rfloor,lceilfrac{n-1}{2}rceil})$ is obtained from $K_{lfloorfrac{n-1}{2}rfloor,lceilfrac{n-1}{2}rceil}$ by subdividing an edge. Both conditions are best possible. We conclude this paper with some open problems.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا