The unsupervised detection of anomalies in time series data has important applications in user behavioral modeling, fraud detection, and cybersecurity. Anomaly detection has, in fact, been extensively studied in categorical sequences. However, we often have access to time series data that represent paths through networks. Examples include transaction sequences in financial networks, click streams of users in networks of cross-referenced documents, or travel itineraries in transportation networks. To reliably detect anomalies, we must account for the fact that such data contain a large number of independent observations of paths constrained by a graph topology. Moreover, the heterogeneity of real systems rules out frequency-based anomaly detection techniques, which do not account for highly skewed edge and degree statistics. To address this problem, we introduce HYPA, a novel framework for the unsupervised detection of anomalies in large corpora of variable-length temporal paths in a graph. HYPA provides an efficient analytical method to detect paths with anomalous frequencies that result from nodes being traversed in unexpected chronological order.
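As a rough illustration of why raw path frequencies are not enough, the following minimal sketch scores two-step paths against a degree-aware baseline. It is a simplified stand-in and not HYPA's own null model, which is not described here: it assumes a first-order null hypothesis in which the next node depends only on the current node, and it uses a binomial tail probability as the score. The function names `binomial_upper_tail` and `score_two_step_paths` are hypothetical.

```python
from collections import Counter
from math import comb


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def score_two_step_paths(paths):
    """Score each observed two-step path (a, b, c) in a corpus of paths.

    Assumed null model (an illustration only): after arriving at b from a,
    the walker picks c with the same probability as any walker passing
    through b. A small score means (a, b, c) occurs more often than the
    topology and edge statistics alone would suggest.
    """
    triples = Counter()  # (a, b, c) -> observed count
    prefix = Counter()   # (a, b)    -> number of two-step paths starting a -> b
    via = Counter()      # b         -> number of two-step paths passing through b
    suffix = Counter()   # (b, c)    -> number of two-step paths ending b -> c
    for path in paths:
        for a, b, c in zip(path, path[1:], path[2:]):
            triples[(a, b, c)] += 1
            prefix[(a, b)] += 1
            via[b] += 1
            suffix[(b, c)] += 1

    scores = {}
    for (a, b, c), observed in triples.items():
        p_next = suffix[(b, c)] / via[b]  # null probability of stepping b -> c
        scores[(a, b, c)] = binomial_upper_tail(observed, prefix[(a, b)], p_next)
    return scores


# Toy corpus: walkers from A mostly continue to D, walkers from B mostly to E.
paths = [["A", "C", "D"]] * 9 + [["B", "C", "E"]] * 9 + [["A", "C", "E"], ["B", "C", "D"]]
scores = score_two_step_paths(paths)
```

In this toy corpus the edges C -> D and C -> E are equally frequent, so a purely frequency-based view of edges sees nothing unusual, while the path-level score flags A -> C -> D and B -> C -> E as over-represented.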
In many complex systems, networks and graphs arise naturally. Their time-evolving behavior can often be modeled using time-series methodology. Two common research problems in network analysis are community detection and change-point detection. Community detection aims to find specific sub-structures within a network, and change-point detection aims to find the time points at which those sub-structures change. We propose a novel methodology that detects community structures and change points simultaneously, based on a model selection framework in which the Minimum Description Length (MDL) principle serves as the objective criterion to be minimized. The promising practical performance of the proposed method is illustrated through a series of numerical experiments and real data analysis.
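To make the role of MDL as a selection criterion concrete, the following minimal sketch applies a two-part code to a much simpler problem than the one above: locating a single change point in a binary sequence (for example, the presence or absence of one edge over time). The coding scheme, the restriction to one change point, and the function names `segment_bits` and `best_single_change_point` are all assumptions made for illustration; community structure is not modeled.

```python
from math import log2


def segment_bits(segment):
    """Two-part code length (in bits) of a non-empty binary segment.

    Assumed scheme: 0.5 * log2(n) bits for the estimated Bernoulli
    parameter, plus the data cost under the maximum-likelihood estimate.
    """
    n = len(segment)
    ones = sum(segment)
    bits = 0.5 * log2(n)
    for count in (ones, n - ones):
        if count:  # a zero count contributes no data cost
            bits -= count * log2(count / n)
    return bits


def best_single_change_point(seq):
    """Return (t, bits): the split index minimizing the description length.

    t is None when the model without a change point is the shortest.
    A split at t costs log2(len(seq)) extra bits to encode its location.
    """
    n = len(seq)
    best_t, best_bits = None, segment_bits(seq)
    for t in range(1, n):
        bits = log2(n) + segment_bits(seq[:t]) + segment_bits(seq[t:])
        if bits < best_bits:
            best_t, best_bits = t, bits
    return best_t, best_bits


# An edge that is absent for 20 time steps and present for the next 20.
change_point, total_bits = best_single_change_point([0] * 20 + [1] * 20)
```

The extra bits charged for the change-point location and for each segment parameter are what keep the criterion from splitting a sequence that a single model already describes well.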
Heterogeneous Information Networks (HINs) have attracted much attention due to their wide applicability in a variety of data mining tasks, especially tasks involving multi-typed objects. A potentially large number of meta-paths can be extracted from a heterogeneous network, providing abundant semantic knowledge. Although a wide variety of meta-paths can be defined, many of them are redundant. Reducing the number of meta-paths can improve effectiveness, since some redundant meta-paths introduce linkage that interferes with the task. Moreover, the reduced set of meta-paths can reflect the characteristics of the heterogeneous network. Previous approaches reduce the number of meta-paths under the guidance of supervision information. However, supervision information is expensive and may not always be available. In this paper, we propose a novel algorithm, SPMR (Semantic Preserving Meta-path Reduction), to reduce a set of pre-defined meta-paths in an unsupervised setting. The proposed method evaluates a set of meta-paths so that the semantics of the original meta-paths are maximally preserved after reduction. Experimental results show that SPMR can select a succinct subset of meta-paths that achieves comparable or even better performance.
Detecting communities in large-scale networks is a challenging task when each vertex may belong to multiple communities, as is often the case in social networks. The multiple memberships of vertices, and thus the strong overlaps among communities, render many detection algorithms invalid. We develop a Partial Community Merger Algorithm (PCMA) for detecting communities with significant overlaps as well as slightly overlapping and disjoint ones. It is a bottom-up approach that reconstructs complete communities by properly reassembling the partial information about communities revealed in the ego networks of vertices. Noise control and merger order are the two key issues in implementing this idea. We propose a novel similarity measure between two merged communities that can suppress noise, and an efficient algorithm that recursively merges the most similar pair of communities. The validity and accuracy of PCMA are tested against two benchmarks and compared with four existing algorithms. PCMA is the most efficient of these, with linear complexity, and it outperforms the compared algorithms when vertices have multiple memberships. PCMA is applied to two huge online social networks, Friendster and Sina Weibo. Millions of communities are detected, and they are of higher quality than the corresponding metadata groups. We find that the latter should not be regarded as the ground truth for structural communities. The significant overlapping pattern found in the detected communities confirms the need for new algorithms, such as PCMA, to handle multiple memberships of vertices in social networks.
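The bottom-up idea of repeatedly merging the most similar pair of partial communities can be sketched as follows. This is only an illustration under stated assumptions: the Jaccard index stands in for the paper's similarity measure, which is not specified here, the stopping threshold is hypothetical, and the naive pairwise search is quadratic per step rather than linear. The names `jaccard` and `merge_partial_communities` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two non-empty vertex sets (a stand-in similarity)."""
    return len(a & b) / len(a | b)


def merge_partial_communities(partials, threshold=0.5):
    """Greedily merge the most similar pair until no pair reaches threshold.

    partials: iterable of non-empty vertex sets, e.g. groups found in the
    ego networks of individual vertices. Vertices may remain in several
    of the returned communities, so overlaps are preserved.
    """
    groups = [set(p) for p in partials]
    while len(groups) > 1:
        # Find the most similar pair (ties broken by the larger indices).
        sim, i, j = max(
            (jaccard(groups[i], groups[j]), i, j)
            for i in range(len(groups))
            for j in range(i + 1, len(groups))
        )
        if sim < threshold:
            break
        groups[i] |= groups[j]  # absorb group j into group i
        del groups[j]           # j > i, so index i stays valid
    return groups


# Partial views of two communities, as seen from different ego networks.
partials = [{1, 2, 3}, {2, 3, 4}, {1, 2, 4}, {7, 8, 9}, {8, 9, 10}]
communities = merge_partial_communities(partials)
```

Merging the most similar pair first matters because an early merge of two weakly related fragments would contaminate every later similarity computed against the merged group.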
Identifying communities in networks is a fundamental and challenging problem of practical importance in many fields of science. Current methods either ignore the heterogeneous distribution of nodal degrees or assume prior knowledge of the number of communities. Here we propose an efficient hypothesis test for community detection based on quantifying dissimilarities between graphs. Given a random graph, the null hypothesis is that it is of degree-corrected Erdős-Rényi type. We quantify the dissimilarity between the given graph and its null model with a measure that incorporates the vertex distance distribution, the clustering coefficient distribution, and the alpha-centrality distribution, and we use this measure in our hypothesis test. We design a two-stage bipartitioning algorithm to uncover the number of communities and the corresponding structure simultaneously. Experiments on synthetic and real networks show that our method outperforms state-of-the-art ones.
Many social and economic systems can be represented as attributed networks, which encode the relations between entities that are themselves described by node attributes. Finding anomalies in these systems is crucial for detecting abuses such as credit card fraud, web spam, or network intrusions. Intuitively, anomalous nodes are nodes whose attributes differ starkly from those of a reference set of nodes, called the context of the anomaly. While some methods have been proposed to spot anomalies locally, globally, or within a community context, the problem remains challenging due to the multi-scale composition of real networks and the heterogeneity of node metadata. Here, we propose a principled way to uncover outlier nodes simultaneously with the context with respect to which they are anomalous, at all relevant scales of the network. We characterize anomalous nodes in terms of the concentration retained for each node after smoothing specific signals localized on the vertices of the graph. In addition, we introduce a graph signal processing formulation of the Markov stability framework used in community detection, in order to find the context of anomalies. The performance of our method is assessed on synthetic and real-world attributed networks and shows superior results compared with state-of-the-art algorithms. Finally, we show the scalability of our approach to large networks by employing Chebyshev polynomial approximations.
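The scalability argument rests on a standard property of Chebyshev approximations: a spectral graph filter can be applied using only repeated Laplacian-vector products, with no eigendecomposition. The following minimal sketch illustrates this for a signal localized on one vertex. The heat kernel exp(-λ), the combinatorial Laplacian, the spectral bound of twice the maximum degree, and all function names are assumptions chosen for illustration, not the filters used in the work above.

```python
from math import cos, exp, pi


def laplacian_matvec(adj, x):
    """Apply the combinatorial Laplacian L = D - A (adj is an adjacency list)."""
    return [len(adj[i]) * x[i] - sum(x[j] for j in adj[i]) for i in range(len(x))]


def chebyshev_coefficients(kernel, lam_max, order, n_quad=64):
    """Chebyshev coefficients of kernel on [0, lam_max] via Gauss quadrature."""
    thetas = [pi * (j + 0.5) / n_quad for j in range(n_quad)]
    samples = [kernel(lam_max * (cos(t) + 1) / 2) for t in thetas]
    return [
        2 / n_quad * sum(s * cos(k * t) for s, t in zip(samples, thetas))
        for k in range(order + 1)
    ]


def chebyshev_filter(adj, x, kernel, order=20):
    """Approximate kernel(L) @ x with a Chebyshev expansion (order >= 1).

    Assumes the graph has at least one edge.
    """
    lam_max = 2 * max(len(nbrs) for nbrs in adj)  # upper bound on the spectrum
    coeffs = chebyshev_coefficients(kernel, lam_max, order)

    def shifted(v):
        # Laplacian rescaled so that its spectrum lies in [-1, 1].
        return [2 / lam_max * a - b for a, b in zip(laplacian_matvec(adj, v), v)]

    t_prev, t_curr = x, shifted(x)
    out = [0.5 * coeffs[0] * a + coeffs[1] * b for a, b in zip(t_prev, t_curr)]
    for k in range(2, order + 1):
        # Three-term recurrence: T_k = 2 * L_shifted * T_{k-1} - T_{k-2}.
        t_next = [2 * a - b for a, b in zip(shifted(t_curr), t_prev)]
        out = [o + coeffs[k] * t for o, t in zip(out, t_next)]
        t_prev, t_curr = t_curr, t_next
    return out


# Smooth a unit signal localized on vertex 0 of a four-vertex path graph.
adj = [[1], [0, 2], [1, 3], [2]]
smoothed = chebyshev_filter(adj, [1.0, 0.0, 0.0, 0.0], lambda lam: exp(-lam))
retained_at_source = smoothed[0]
```

Each term of the expansion costs one sparse matrix-vector product, so the total cost grows with the number of edges times the expansion order. The value left at the source vertex after smoothing gives one simple reading of the "concentration retained" by a node.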