Graph clustering is an important technique for understanding the relationships between the vertices in a big graph. In this paper, we propose a novel random-walk-based graph clustering method. The proposed method restricts the reach of the walking agent using an inflation function and a normalization function. We analyze the behavior of the limited random walk procedure and propose a novel algorithm for both global and local graph clustering problems. Previous random-walk-based algorithms depend on the chosen fitness function to find the clusters around a seed vertex. The proposed algorithm tackles the problem in an entirely different manner. We use the limited random walk procedure to find attracting vertices in a graph and use them as features to cluster the vertices. According to the experimental results on simulated graph data and real-world big graph data, the proposed method is superior to the state-of-the-art methods in solving graph clustering problems. Since the proposed method follows the embarrassingly parallel paradigm, it can be efficiently implemented and embedded in any parallel computing environment, such as a MapReduce framework. Given enough computing resources, we can cluster graphs with millions of vertices and hundreds of millions of edges in a reasonable time.
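To make the inflation-and-normalization idea concrete, here is a minimal Python sketch of one possible limited-random-walk iteration; it is not the authors' implementation, and the function name, inflation exponent, and stopping rule are illustrative assumptions.

```python
import numpy as np

def limited_random_walk(adjacency, seed, inflation=2.0, steps=100, tol=1e-9):
    # Column-stochastic transition matrix (assumes no isolated vertices).
    A = np.asarray(adjacency, dtype=float)
    P = A / A.sum(axis=0, keepdims=True)
    x = np.zeros(A.shape[0])
    x[seed] = 1.0                      # the walk starts at the seed vertex
    for _ in range(steps):
        y = P @ x                      # one random-walk transition
        y = y ** inflation             # inflation suppresses small probabilities
        y /= y.sum()                   # normalization keeps a probability vector
        if np.abs(y - x).sum() < tol:  # stop when the distribution stabilizes
            break
        x = y
    return x                           # mass concentrates on attracting vertices
```

Vertices on which the resulting mass concentrates can then serve as the features used for clustering, in the spirit of the abstract above.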
Measuring graph clustering quality remains an open problem. To address it, we introduce quality measures based on comparisons of intra- and inter-cluster densities, an accompanying statistical test of the significance of their differences, and a step-by-step routine for clustering quality assessment. Our null hypothesis does not rely on any generative model for the graph, unlike modularity, which uses the configuration model as a null model. Our measures are shown to meet the axioms of a good clustering quality function, unlike the very commonly used modularity measure. They also have an intuitive graph-theoretic interpretation, a formal statistical interpretation, and can easily be tested for significance. Our work is centered on the idea that well-clustered graphs will display a significantly larger intra-cluster density than inter-cluster density. We develop tests to validate the existence of such a cluster structure. We empirically explore the behavior of our measures under a number of stress test scenarios and compare their behavior to the commonly used modularity and conductance measures. Empirical stress test results confirm that our measures compare very favorably to the established ones. In particular, they are shown to be more responsive to graph structure, less sensitive to sample size, less prone to breakdowns during numerical implementation, and less sensitive to uncertainty in connectivity. These features are especially important in the context of larger data sets or when the data may contain errors in the connectivity patterns.
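For illustration only, a minimal Python sketch of the intra- versus inter-cluster density comparison underlying such measures; the function name and the pair-counting convention are assumptions, not the paper's exact definitions or statistical test.

```python
from collections import Counter
import networkx as nx

def intra_inter_density(G, labels):
    # labels: dict mapping each node to its cluster label.
    intra_edges = sum(1 for u, v in G.edges() if labels[u] == labels[v])
    inter_edges = G.number_of_edges() - intra_edges
    sizes = Counter(labels.values())
    intra_pairs = sum(s * (s - 1) // 2 for s in sizes.values())
    n = G.number_of_nodes()
    inter_pairs = n * (n - 1) // 2 - intra_pairs
    intra = intra_edges / intra_pairs if intra_pairs else 0.0
    inter = inter_edges / inter_pairs if inter_pairs else 0.0
    return intra, inter   # a well-clustered graph should have intra >> inter
```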
Random graph models are important constructs for data analytic applications as well as pure mathematical developments, as they provide capabilities for network synthesis and principled analysis. Several models have been developed with the aim of faithfully preserving important graph metrics and substructures. With the goal of capturing degree distribution, clustering coefficient, and communities in a single random graph model, we propose a new model to address shortcomings in a progression of network modeling capabilities. The Block Two-Level Erdős–Rényi (BTER) model of Seshadhri et al., designed to allow prescription of expected degree and clustering coefficient distributions, neglects community modeling, while the Generalized BTER (GBTER) model of Bridges et al., designed to add community modeling capabilities to BTER, struggles to faithfully represent all three characteristics simultaneously. In this work, we fit BTER and two GBTER configurations to several real-world networks and compare the results with those of our new model, the Extended GBTER (EGBTER) model. Our results support that EGBTER adds community-modeling flexibility to BTER while retaining a satisfactory level of accuracy in terms of degree and clustering coefficient. Our insights and empirical testing of previous models as well as the new model are novel contributions to the literature.
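As a rough illustration of the two-level structure these models build on (dense Erdős–Rényi blocks plus a sparse global background), here is a toy Python sketch; it is not the fitted BTER/GBTER/EGBTER procedure, and the block sizes and edge probabilities are arbitrary assumptions.

```python
import random
import networkx as nx

def toy_block_plus_background(block_sizes, p_in=0.5, p_out=0.01, seed=None):
    rng = random.Random(seed)
    G = nx.Graph()
    offset = 0
    for size in block_sizes:
        block = list(range(offset, offset + size))
        G.add_nodes_from(block)
        # Phase 1: dense Erdős–Rényi block (drives clustering coefficient).
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                if rng.random() < p_in:
                    G.add_edge(block[i], block[j])
        offset += size
    # Phase 2: sparse global Erdős–Rényi background (connects the blocks).
    nodes = list(G.nodes())
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if rng.random() < p_out:
                G.add_edge(nodes[i], nodes[j])
    return G
```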
Random walk-based sampling methods are gaining popularity and importance in characterizing large networks. While powerful, they suffer from the slow mixing problem when the graph is loosely connected, which results in poor estimation accuracy. Random walk with jumps (RWwJ) can address the slow mixing problem, but it is inapplicable if the graph does not support uniform vertex sampling (UNI). In this work, we develop methods that can efficiently sample a graph without the need for UNI but still enjoy benefits similar to those of RWwJ. We observe that many graphs under study, called target graphs, do not exist in isolation. In many situations, a target graph is related to an auxiliary graph and a bipartite graph, and together they form a better-connected two-layered network structure. This new viewpoint brings extra benefits to graph sampling: if directly sampling a target graph is difficult, we can sample it indirectly with the assistance of the other two graphs. We propose a series of new graph sampling techniques that exploit such a two-layered network structure to estimate target graph characteristics. Experiments conducted on both synthetic and real-world networks demonstrate the effectiveness and usefulness of these new techniques.
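For context, a minimal Python sketch of the baseline random walk with jumps (RWwJ) discussed above, which relies on uniform vertex sampling (UNI); the function and parameter names are illustrative, and this is not the proposed two-layered sampler.

```python
import random

def random_walk_with_jumps(neighbors, start, num_samples, jump_prob=0.15, seed=None):
    # neighbors: dict mapping each vertex to a list of adjacent vertices.
    rng = random.Random(seed)
    vertices = list(neighbors)
    samples, current = [], start
    for _ in range(num_samples):
        if rng.random() < jump_prob or not neighbors[current]:
            current = rng.choice(vertices)            # jump: uniform vertex sampling (UNI)
        else:
            current = rng.choice(neighbors[current])  # ordinary random-walk step
        samples.append(current)
    return samples
```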
Quantum algorithms have achieved enormous successes during the last decade. In this paper, we combine the quantum random walk (QRW) with the problem of data clustering and develop two clustering algorithms based on the one-dimensional QRW. We then investigate the probability distributions over positions induced by the QRW in these algorithms, which also indicates the possibility of obtaining better results. The experimental results demonstrate that data points in the datasets are clustered reasonably and efficiently, and that the clustering algorithms converge quickly. Moreover, a comparison with other algorithms also provides an indication of the effectiveness of the proposed approach.
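For illustration, a minimal Python sketch of the position distribution of a one-dimensional discrete-time quantum walk with a Hadamard coin; the coin choice and initial state are assumptions and the sketch does not reproduce the paper's specific clustering algorithms.

```python
import numpy as np

def hadamard_walk_distribution(steps):
    # Positions range over -steps..steps; amp[pos, c] holds the amplitude
    # of being at position pos with coin state c (0 = move left, 1 = move right).
    n = 2 * steps + 1
    amp = np.zeros((n, 2), dtype=complex)
    amp[steps, 0] = 1.0                               # start at the origin, coin |0>
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)      # Hadamard coin operator
    for _ in range(steps):
        coined = amp @ H.T                            # apply the coin at every position
        shifted = np.zeros_like(amp)
        shifted[:-1, 0] = coined[1:, 0]               # coin-0 component shifts left
        shifted[1:, 1] = coined[:-1, 1]               # coin-1 component shifts right
        amp = shifted
    return (np.abs(amp) ** 2).sum(axis=1)             # probability at each position
```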
Outliers are samples generated by mechanisms different from those of normal data samples. Graphs, in particular social network graphs, may contain nodes and edges created by scammers, malicious programs, or mistakenly by normal users. Detecting outlier nodes and edges is important for data mining and graph analytics. However, previous research in the field has mostly focused on detecting outlier nodes. In this article, we study the properties of edges and propose outlier edge detection algorithms using two random graph generation models. We found that the edge-ego-network, defined as the induced graph that contains the two end nodes of an edge, their neighboring nodes, and the edges that link these nodes, contains critical information for detecting outlier edges. We evaluated the proposed algorithms by injecting outlier edges into real-world graph data. Experimental results show that the proposed algorithms can effectively detect outlier edges. In particular, the algorithm based on the Preferential Attachment random graph generation model consistently gives good performance regardless of the test graph data. Furthermore, the proposed algorithms are not limited to outlier edge detection. We demonstrate three different applications that benefit from the proposed algorithms: 1) a preprocessing tool that improves the performance of graph clustering algorithms; 2) an outlier node detection algorithm; and 3) a novel noisy data clustering algorithm. These applications show the great potential of the proposed outlier edge detection techniques.
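A minimal networkx sketch of the edge-ego-network described above (the subgraph induced by an edge's two end nodes and all of their neighbors); the function name is illustrative, and the outlier scoring on top of this subgraph is not shown.

```python
import networkx as nx

def edge_ego_network(G, u, v):
    # Nodes of interest: the edge's endpoints plus every neighbor of either endpoint.
    nodes = {u, v} | set(G.neighbors(u)) | set(G.neighbors(v))
    return G.subgraph(nodes).copy()

# Example: the edge-ego-network of edge (0, 1) in Zachary's karate club graph.
G = nx.karate_club_graph()
ego = edge_ego_network(G, 0, 1)
print(ego.number_of_nodes(), ego.number_of_edges())
```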