ترغب بنشر مسار تعليمي؟ اضغط هنا

Outlier Edge Detection Using Random Graph Generation Models and Applications

96   0   0.0 ( 0 )
 نشر من قبل Honglei Zhang Honglei Zhang
 تاريخ النشر 2016
والبحث باللغة English




اسأل ChatGPT حول البحث

Outliers are samples that are generated by different mechanisms from other normal data samples. Graphs, in particular social network graphs, may contain nodes and edges that are made by scammers, malicious programs or mistakenly by normal users. Detecting outlier nodes and edges is important for data mining and graph analytics. However, previous research in the field has merely focused on detecting outlier nodes. In this article, we study the properties of edges and propose outlier edge detection algorithms using two random graph generation models. We found that the edge-ego-network, which can be defined as the induced graph that contains two end nodes of an edge, their neighboring nodes and the edges that link these nodes, contains critical information to detect outlier edges. We evaluated the proposed algorithms by injecting outlier edges into some real-world graph data. Experiment results show that the proposed algorithms can effectively detect outlier edges. In particular, the algorithm based on the Preferential Attachment Random Graph Generation model consistently gives good performance regardless of the test graph data. Further more, the proposed algorithms are not limited in the area of outlier edge detection. We demonstrate three different applications that benefit from the proposed algorithms: 1) a preprocessing tool that improves the performance of graph clustering algorithms; 2) an outlier node detection algorithm; and 3) a novel noisy data clustering algorithm. These applications show the great potential of the proposed outlier edge detection techniques.



قيم البحث

اقرأ أيضاً

Recently, many systems for graph analysis have been developed to address the growing needs of both industry and academia to study complex graphs. Insight into the practical uses of graph analysis will allow future developments of such systems to opti mize for real-world usage, instead of targeting single use cases or hypothetical workloads. This insight may be derived from surveys on the applications of graph analysis. However, existing surveys are limited in the variety of application domains, datasets, and/or graph analysis techniques they study. In this work we present and apply a systematic method for identifying practical use cases of graph analysis. We identify commonly used graph features and analysis methods and use our findings to construct a taxonomy of graph analysis applications. We conclude that practical use cases of graph analysis cover a diverse set of graph features and analysis methods. Furthermore, most applications combine multiple features and methods. Our findings motivate further development of graph analysis systems to support a broader set of applications and to facilitate the combination of multiple analysis methods in an (interactive) workflow.
We apply spectral clustering and multislice modularity optimization to a Los Angeles Police Department field interview card data set. To detect communities (i.e., cohesive groups of vertices), we use both geographic and social information about stops involving street gang members in the LAPD district of Hollenbeck. We then compare the algorithmically detected communities with known gang identifications and argue that discrepancies are due to sparsity of social connections in the data as well as complex underlying sociological factors that blur distinctions between communities.
Graph clustering is an important technique to understand the relationships between the vertices in a big graph. In this paper, we propose a novel random-walk-based graph clustering method. The proposed method restricts the reach of the walking agent using an inflation function and a normalization function. We analyze the behavior of the limited random walk procedure and propose a novel algorithm for both global and local graph clustering problems. Previous random-walk-based algorithms depend on the chosen fitness function to find the clusters around a seed vertex. The proposed algorithm tackles the problem in an entirely different manner. We use the limited random walk procedure to find attracting vertices in a graph and use them as features to cluster the vertices. According to the experimental results on the simulated graph data and the real-world big graph data, the proposed method is superior to the state-of-the-art methods in solving graph clustering problems. Since the proposed method uses the embarrassingly parallel paradigm, it can be efficiently implemented and embedded in any parallel computing environment such as a MapReduce framework. Given enough computing resources, we are capable of clustering graphs with millions of vertices and hundreds millions of edges in a reasonable time.
Graph models, like other machine learning models, have implicit and explicit biases built-in, which often impact performance in nontrivial ways. The models faithfulness is often measured by comparing the newly generated graph against the source graph using any number or combination of graph properties. Differences in the size or topology of the generated graph therefore indicate a loss in the model. Yet, in many systems, errors encoded in loss functions are subtle and not well understood. In the present work, we introduce the Infinity Mirror test for analyzing the robustness of graph models. This straightforward stress test works by repeatedly fitting a model to its own outputs. A hypothetically perfect graph model would have no deviation from the source graph; however, the models implicit biases and assumptions are exaggerated by the Infinity Mirror test, exposing potential issues that were previously obscured. Through an analysis of thousands of experiments on synthetic and real-world graphs, we show that several conventional graph models degenerate in exciting and informative ways. We believe that the observed degenerative patterns are clues to the future development of better graph models.
Random graph models are important constructs for data analytic applications as well as pure mathematical developments, as they provide capabilities for network synthesis and principled analysis. Several models have been developed with the aim of fait hfully preserving important graph metrics and substructures. With the goal of capturing degree distribution, clustering coefficient, and communities in a single random graph model, we propose a new model to address shortcomings in a progression of network modeling capabilities. The Block Two-Level Erd{H{o}}s-R{e}nyi (BTER) model of Seshadhri et al., designed to allow prescription of expected degree and clustering coefficient distributions, neglects community modeling, while the Generalized BTER (GBTER) model of Bridges et al., designed to add community modeling capabilities to BTER, struggles to faithfully represent all three characteristics simultaneously. In this work, we fit BTER and two GBTER configurations to several real-world networks and compare the results with that of our new model, the Extended GBTER (EGBTER) model. Our results support that EBGTER adds a community-modeling flexibility to BTER, while retaining a satisfactory level of accuracy in terms of degree and clustering coefficient. Our insights and empirical testing of previous models as well as the new model are novel contributions to the literature.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا