ترغب بنشر مسار تعليمي؟ اضغط هنا

The ground truth about metadata and community detection in networks

88   0   0.0 ( 0 )
 نشر من قبل Daniel Larremore
 تاريخ النشر 2016
والبحث باللغة English




اسأل ChatGPT حول البحث

Across many scientific domains, there is a common need to automatically extract a simplified view or coarse-graining of how a complex systems components interact. This general task is called community detection in networks and is analogous to searching for clusters in independent vector data. It is common to evaluate the performance of community detection algorithms by their ability to find so-called ground truth communities. This works well in synthetic networks with planted communities because such networks links are formed explicitly based on those known communities. However, there are no planted communities in real world networks. Instead, it is standard practice to treat some observed discrete-valued node attributes, or metadata, as ground truth. Here, we show that metadata are not the same as ground truth, and that treating them as such induces severe theoretical and practical problems. We prove that no algorithm can uniquely solve community detection, and we prove a general No Free Lunch theorem for community detection, which implies that there can be no algorithm that is optimal for all possible community detection tasks. However, community detection remains a powerful tool and node metadata still have value so a careful exploration of their relationship with network structure can yield insights of genuine worth. We illustrate this point by introducing two statistical techniques that can quantify the relationship between metadata and community structure for a broad class of models. We demonstrate these techniques using both synthetic and real-world networks, and for multiple types of metadata and community structure.



قيم البحث

اقرأ أيضاً

Bipartite networks are a common type of network data in which there are two types of vertices, and only vertices of different types can be connected. While bipartite networks exhibit community structure like their unipartite counterparts, existing ap proaches to bipartite community detection have drawbacks, including implicit parameter choices, loss of information through one-mode projections, and lack of interpretability. Here we solve the community detection problem for bipartite networks by formulating a bipartite stochastic block model, which explicitly includes vertex type information and may be trivially extended to $k$-partite networks. This bipartite stochastic block model yields a projection-free and statistically principled method for community detection that makes clear assumptions and parameter choices and yields interpretable results. We demonstrate this models ability to efficiently and accurately find community structure in synthetic bipartite networks with known structure and in real-world bipartite networks with unknown structure, and we characterize its performance in practical contexts.
We consider the network constraints on the bounds of the assortativity coefficient, which measures the tendency of nodes with the same attribute values to be interconnected. The assortativity coefficient is the Pearsons correlation coefficient of nod e attribute values across network edges and ranges between -1 and 1. We focus here on the assortativity of binary node attributes and show that properties of the network, such as degree distribution and the number of nodes with each attribute value place constraints upon the attainable values of the assortativity coefficient. We explore the assortativity in three different spaces, that is, ensembles of graph configurations and node-attribute assignments that are valid for a given set of network constraints. We provide means for obtaining bounds on the extremal values of assortativity for each of these spaces. Finally, we demonstrate that under certain conditions the network constraints severely limit the maximum and minimum values of assortativity, which may present issues in how we interpret the assortativity coefficient.
We develop a Bayesian hierarchical model to identify communities in networks for which we do not observe the edges directly, but instead observe a series of interdependent signals for each of the nodes. Fitting the model provides an end-to-end commun ity detection algorithm that does not extract information as a sequence of point estimates but propagates uncertainties from the raw data to the community labels. Our approach naturally supports multiscale community detection as well as the selection of an optimal scale using model comparison. We study the properties of the algorithm using synthetic data and apply it to daily returns of constituents of the S&P100 index as well as climate data from US cities.
Modularity based community detection encompasses a number of widely used, efficient heuristics for identification of structure in networks. Recently, a belief propagation approach to modularity optimization provided a useful guide for identifying non -trivial structure in single-layer networks in a way that other optimization heuristics have not. In this paper, we extend modularity belief propagation to multilayer networks. As part of this development, we also directly incorporate a resolution parameter. We show that adjusting the resolution parameter affects the convergence properties of the algorithm and yields different community structures than the baseline. We compare our approach with a widely used community detection tool, GenLouvain, across a range of synthetic, multilayer benchmark networks, demonstrating that our method performs comparably to the state of the art. Finally, we demonstrate the practical advantages of the additional information provided by our tool by way of two real-world network examples. We show how the convergence properties of the algorithm can be used in selecting the appropriate resolution and coupling parameters and how the node-level marginals provide an interpretation for the strength of attachment to the identified communities. We have released our tool as a Python package for convenient use.
Recent progress towards unraveling the hidden geometric organization of real multiplexes revealed significant correlations across the hyperbolic node coordinates in different network layers, which facilitated applications like trans-layer link predic tion and mutual navigation. But are geometric correlations alone sufficient to explain the topological relation between the layers of real systems? Here we provide the negative answer to this question. We show that connections in real systems tend to persist from one layer to another irrespectively of their hyperbolic distances. This suggests that in addition to purely geometric aspects the explicit link formation process in one layer impacts the topology of other layers. Based on this finding, we present a simple modification to the recently developed Geometric Multiplex Model to account for this effect, and show that the extended model can reproduce the behavior observed in real systems. We also find that link persistence is significant in all considered multiplexes and can explain their layers high edge overlap, which cannot be explained by coordinate correlations alone. Furthermore, by taking both link persistence and hyperbolic distance correlations into account we can improve trans-layer link prediction. These findings guide the development of multiplex embedding methods, suggesting that such methods should be accounting for both coordinate correlations and link persistence across layers.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا