No Arabic abstract
We show that the recently introduced label propagation method for detecting communities in complex networks is equivalent to find the local minima of a simple Potts model. Applying to empirical data, the number of such local minima was found to be very high, much larger than the number of nodes in the graph. The aggregation method for combining information from more local minima shows a tendency to fragment the communities into very small pieces.
Community detection is a fundamental and important problem in network science, as community structures often reveal both topological and functional relationships between different components of the complex system. In this paper, we first propose a gradient descent framework of modularity optimization called vector-label propagation algorithm (VLPA), where a node is associated with a vector of continuous community labels instead of one label. Retaining weak structural information in vector-label, VLPA outperforms some well-known community detection methods, and particularly improves the performance in networks with weak community structures. Further, we incorporate stochastic gradient strategies into VLPA to avoid stuck in the local optima, leading to the stochastic vector-label propagation algorithm (sVLPA). We show that sVLPA performs better than Louvain Method, a widely used community detection algorithm, on both artificial benchmarks and real-world networks. Our theoretical scheme based on vector-label propagation can be directly applied to high-dimensional networks where each node has multiple features, and can also be used for optimizing other partition measures such as modularity with resolution parameters.
The original design of Graph Convolution Network (GCN) couples feature transformation and neighborhood aggregation for node representation learning. Recently, some work shows that coupling is inferior to decoupling, which supports deep graph propagation better and has become the latest paradigm of GCN (e.g., APPNP and SGCN). Despite effectiveness, the working mechanisms of the decoupled GCN are not well understood. In this paper, we explore the decoupled GCN for semi-supervised node classification from a novel and fundamental perspective -- label propagation. We conduct thorough theoretical analyses, proving that the decoupled GCN is essentially the same as the two-step label propagation: first, propagating the known labels along the graph to generate pseudo-labels for the unlabeled nodes, and second, training normal neural network classifiers on the augmented pseudo-labeled data. More interestingly, we reveal the effectiveness of decoupled GCN: going beyond the conventional label propagation, it could automatically assign structure- and model- aware weights to the pseudo-label data. This explains why the decoupled GCN is relatively robust to the structure noise and over-smoothing, but sensitive to the label noise and model initialization. Based on this insight, we propose a new label propagation method named Propagation then Training Adaptively (PTA), which overcomes the flaws of the decoupled GCN with a dynamic and adaptive weighting strategy. Our PTA is simple yet more effective and robust than decoupled GCN. We empirically validate our findings on four benchmark datasets, demonstrating the advantages of our method. The code is available at https://github.com/DongHande/PT_propagation_then_training.
According to Fortunato and Barthelemy, modularity-based community detection algorithms have a resolution threshold such that small communities in a large network are invisible. Here we generalize their work and show that the q-state Potts community detection method introduced by Reichardt and Bornholdt also has a resolution threshold. The model contains a parameter by which this threshold can be tuned, but no a priori principle is known to select the proper value. Single global optimization criteria do not seem capable for detecting all communities if their size distribution is broad.
Networks in nature possess a remarkable amount of structure. Via a series of data-driven discoveries, the cutting edge of network science has recently progressed from positing that the random graphs of mathematical graph theory might accurately describe real networks to the current viewpoint that networks in nature are highly complex and structured entities. The identification of high order structures in networks unveils insights into their functional organization. Recently, Clauset, Moore, and Newman, introduced a new algorithm that identifies such heterogeneities in complex networks by utilizing the hierarchy that necessarily organizes the many levels of structure. Here, we anchor their algorithm in a general community detection framework and discuss the future of community detection.
Modern statistical modeling is an important complement to the more traditional approach of physics where Complex Systems are studied by means of extremely simple idealized models. The Minimum Description Length (MDL) is a principled approach to statistical modeling combining Occams razor with Information Theory for the selection of models providing the most concise descriptions. In this work, we introduce the Boltzmannian MDL (BMDL), a formalization of the principle of MDL with a parametric complexity conveniently formulated as the free-energy of an artificial thermodynamic system. In this way, we leverage on the rich theoretical and technical background of statistical mechanics, to show the crucial importance that phase transitions and other thermodynamic concepts have on the problem of statistical modeling from an information theoretic point of view. For example, we provide information theoretic justifications of why a high-temperature series expansion can be used to compute systematic approximations of the BMDL when the formalism is used to model data, and why statistically significant model selections can be identified with ordered phases when the BMDL is used to model models. To test the introduced formalism, we compute approximations of BMDL for the problem of community detection in complex networks, where we obtain a principled MDL derivation of the Girvan-Newman (GN) modularity and the Zhang-Moore (ZM) community detection method. Here, by means of analytical estimations and numerical experiments on synthetic and empirical networks, we find that BMDL-based correction terms of the GN modularity improve the quality of the detected communities and we also find an information theoretic justification of why the ZM criterion for estimation of the number of network communities is better than alternative approaches such as the bare minimization of a free energy.