Consistency of random-walk based network embedding algorithms

Published by Yichi Zhang

Publication date: 2021

Research field: Mathematical statistics, Informatics engineering

Research language: English

Authors: Yichi Zhang - Minh Tang

Machine learning, Social and information networks

Random-walk based network embedding algorithms like node2vec and DeepWalk are widely used to obtain Euclidean representations of the nodes in a network prior to performing downstream network inference tasks. Nevertheless, despite their impressive empirical performance, there is a lack of theoretical results explaining their behavior. In this paper we study the node2vec and DeepWalk algorithms through the perspective of matrix factorization. We analyze these algorithms in the setting of community detection for stochastic blockmodel graphs; in particular, we establish large-sample error bounds and prove consistent community recovery of node2vec/DeepWalk embedding followed by k-means clustering. Our theoretical results indicate a subtle interplay between the sparsity of the observed networks, the window sizes of the random walks, and the convergence rates of the node2vec/DeepWalk embedding toward the embedding of the true but unknown edge probabilities matrix. More specifically, as the network becomes sparser, our results suggest using larger window sizes, or equivalently, taking longer random walks, in order to attain better convergence rates for the resulting embeddings. The paper includes numerical experiments corroborating these observations.
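The pipeline described in the abstract (stochastic blockmodel graph, matrix-factorization view of DeepWalk, then k-means) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the exact form of the factorized matrix (an entrywise log of a window-averaged random-walk matrix), the zero-entry floor, the parameter name `neg`, the block probabilities, and the small farthest-point-initialised k-means are all choices made here for illustration.

```python
import numpy as np

def sample_sbm(block_sizes, B, rng):
    """Sample a symmetric adjacency matrix without self-loops from a stochastic blockmodel."""
    labels = np.repeat(np.arange(len(block_sizes)), block_sizes)
    P = B[labels][:, labels]                       # edge-probability matrix
    upper = np.triu(rng.random(P.shape) < P, k=1)  # independent edges above the diagonal
    A = (upper | upper.T).astype(float)
    return A, labels

def deepwalk_matrix(A, window, neg=1.0):
    """Entrywise log of a window-averaged random-walk matrix (assumed form).

    Assumes the graph has no isolated nodes. `neg` is a hypothetical name for
    the number of negative samples.
    """
    deg = A.sum(axis=1)
    W = A / deg[:, None]                           # transition matrix D^{-1} A
    S = np.zeros_like(A)
    Wr = np.eye(len(A))
    for _ in range(window):
        Wr = Wr @ W                                # r-step transition probabilities
        S += Wr
    M = (deg.sum() / (neg * window)) * S / deg[None, :]
    floor = M[M > 0].min()                         # practical guard against log(0), an assumption
    return np.log(np.maximum(M, floor))

def embed(M, dim):
    """Embedding from the leading singular vectors, scaled by root singular values."""
    U, s, _ = np.linalg.svd(M)
    return U[:, :dim] * np.sqrt(s[:dim])

def kmeans(X, k, iters=50):
    """Lloyd's algorithm with a deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = X[assign == c].mean(axis=0)
    return assign

rng = np.random.default_rng(0)
B = np.array([[0.30, 0.05], [0.05, 0.30]])         # illustrative block probabilities
A, labels = sample_sbm([100, 100], B, rng)
X = embed(deepwalk_matrix(A, window=3), dim=2)
pred = kmeans(X, k=2)
agree = np.mean(pred == labels)
accuracy = max(agree, 1.0 - agree)                 # accuracy up to label swap (k = 2)
print(f"clustering accuracy: {accuracy:.3f}")
```

To explore the interplay the abstract mentions, one could scale down the entries of `B` to make the graph sparser and vary `window`; the sketch itself makes no claim about what that experiment shows.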

75 - Jiawei Shen, Xincheng Shu, Hu Yang 2021

Network embedding aims to represent a network in a low-dimensional space where the network's structural information and inherent properties are maximally preserved. Random walk based network embedding methods such as DeepWalk and node2vec have shown outstanding performance in preserving the network's topological structure. However, these approaches either predict the distribution of a node's neighbors in both directions together, which makes them unable to capture any asymmetric relationship in a network, or preserve the asymmetric relationship in only one direction and hence lose the one in the other direction. To address these limitations, we propose a bidirectional group random walk based network embedding method (BiGRW), which treats the distributions of a node's neighbors in the forward and backward directions in random walks as two different kinds of asymmetric network structural information. The basic idea of BiGRW is to learn a representation for each node that is useful for predicting its distribution of neighbors in the forward and backward directions separately. Apart from that, a novel random walk sampling strategy is proposed with a parameter $\alpha$ to flexibly control the trade-off between breadth-first sampling (BFS) and depth-first sampling (DFS). To learn representations from node attributes, we design an attributed version of BiGRW (BiGRW-AT). Experimental results on several benchmark datasets demonstrate that the proposed methods significantly outperform the state-of-the-art plain and attributed network embedding methods on tasks of node classification and clustering.

Social and information networks

Delving Into Deep Walkers: A Convergence Analysis of Random-Walk-Based Vertex Embeddings

90 - Dominik Kloepfer, Angelica I. Aviles-Rivero, Daniel Heydecker 2021

Graph vertex embeddings based on random walks have become increasingly influential in recent years, showing good performance in several tasks as they efficiently transform a graph into a more computationally digestible format while preserving relevant information. However, the theoretical properties of such algorithms, in particular the influence of hyperparameters and of the graph structure on their convergence behaviour, have so far not been well understood. In this work, we provide a theoretical analysis of random-walk-based embedding techniques. Firstly, we prove that, under some weak assumptions, vertex embeddings derived from random walks do indeed converge both in the single limit of the number of random walks $N \to \infty$ and in the double limit of both $N$ and the length of each random walk $L \to \infty$. Secondly, we derive concentration bounds quantifying the convergence rate of the corpora for the single and double limits. Thirdly, we use these results to derive a heuristic for choosing the hyperparameters $N$ and $L$. We validate and illustrate the practical importance of our findings with a range of numerical and visual experiments on several graphs drawn from real-world applications.
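A toy check of the single limit $N \to \infty$ can make the idea concrete. The sketch below is not the paper's analysis: it assumes a small hypothetical graph (`GRAPH`), simple random walks started from the degree-proportional stationary distribution, and it tracks only consecutive-pair frequencies. Under those assumptions each directed edge should appear with frequency 1 / (number of directed edges), so the estimation error from many walks should be smaller than from few.

```python
import random
from collections import Counter

# Hypothetical undirected graph as adjacency lists (7 edges, 14 directed edges).
GRAPH = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4]}

def pair_frequencies(graph, num_walks, walk_len, rng):
    """Empirical frequencies of consecutive (v_t, v_{t+1}) pairs over all walks."""
    nodes = list(graph)
    degrees = [len(graph[v]) for v in nodes]
    counts = Counter()
    for _ in range(num_walks):
        # Start from the stationary distribution (proportional to degree).
        v = rng.choices(nodes, weights=degrees)[0]
        for _ in range(walk_len - 1):
            w = rng.choice(graph[v])
            counts[(v, w)] += 1
            v = w
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}

def max_error(graph, freqs):
    """Largest deviation from the stationary value 1 / (number of directed edges)."""
    target = 1.0 / sum(len(nbrs) for nbrs in graph.values())
    return max(abs(freqs.get((u, v), 0.0) - target) for u in graph for v in graph[u])

rng = random.Random(0)
err_small = max_error(GRAPH, pair_frequencies(GRAPH, num_walks=100, walk_len=10, rng=rng))
err_large = max_error(GRAPH, pair_frequencies(GRAPH, num_walks=20000, walk_len=10, rng=rng))
print(f"max error with N=100:   {err_small:.4f}")
print(f"max error with N=20000: {err_large:.4f}")
```

The walk length is held fixed here; varying `walk_len` alongside `num_walks` would be the analogue of the double limit.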

Machine learning, Probability

Investigating Extensions to Random Walk Based Graph Embedding

436 - Joerg Schloetterer, Martin Wehking, Fatemeh Salehi Rizi 2020

Graph embedding has recently gained momentum in the research community, in particular after the introduction of random walk and neural network based approaches. However, most embedding approaches focus on representing the local neighborhood of nodes and fail to capture the global graph structure, i.e. to retain the relations to distant nodes. To counter that problem, we propose a novel extension to random walk based graph embedding, which removes a percentage of the least frequent nodes from the walks at different levels. By this removal, we simulate more distant nodes residing in the close neighborhood of a node and hence explicitly represent their connection. Besides the common evaluation tasks for graph embeddings, such as node classification and link prediction, we evaluate and compare our approach against related methods on shortest path approximation. The results indicate that extensions to random walk based methods (including our own) improve the predictive performance only slightly, if at all.
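The removal step described above might look like the following sketch. The details are assumptions made for illustration: the percentage is taken over distinct nodes, ties are broken by node name, and "levels" are modelled as repeated application of the same function (`prune_least_frequent` is a hypothetical name, and the walks are made-up data).

```python
from collections import Counter

def prune_least_frequent(walks, fraction):
    """Drop the given fraction of distinct nodes that occur least often in the walks."""
    counts = Counter(node for walk in walks for node in walk)
    # Rarest nodes first; ties broken by node name so the result is deterministic.
    ranked = sorted(counts, key=lambda n: (counts[n], n))
    drop = set(ranked[: int(len(ranked) * fraction)])
    return [[n for n in walk if n not in drop] for walk in walks]

walks = [["a", "b", "c", "d"], ["a", "c", "d", "e"], ["a", "d", "c", "a"]]

level1 = prune_least_frequent(walks, 0.4)   # removes "b" and "e"
level2 = prune_least_frequent(level1, 0.4)  # removes "c" from the already pruned walks

print(level1)  # [['a', 'c', 'd'], ['a', 'c', 'd'], ['a', 'd', 'c', 'a']]
print(level2)  # [['a', 'd'], ['a', 'd'], ['a', 'd', 'a']]
```

In the first walk, "a" and "c" become adjacent once "b" is removed, which is the sense in which more distant nodes end up in a node's close neighborhood.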

Machine learning

Quantization Algorithms for Random Fourier Features

95 - Xiaoyun Li, Ping Li 2021

The method of random projection (RP) is the standard technique in machine learning and many other areas for dimensionality reduction, approximate near neighbor search, compressed sensing, etc. Basically, RP provides a simple and effective scheme for approximating pairwise inner products and Euclidean distances in massive data. Closely related to RP, the method of random Fourier features (RFF) has also become popular for approximating the Gaussian kernel. RFF applies a specific nonlinear transformation to the projected data from random projections. In practice, using the (nonlinear) Gaussian kernel often leads to better performance than the linear kernel (inner product), partly due to the tuning parameter $\gamma$ introduced in the Gaussian kernel. Recently, there has been a surge of interest in studying properties of RFF. After random projections, quantization is an important step for efficient data storage, computation, and transmission. Quantization for RP has also been extensively studied in the literature. In this paper, we focus on developing quantization algorithms for RFF. The task is in a sense challenging due to the tuning parameter $\gamma$ in the Gaussian kernel. For example, the quantizer and the quantized data might be tied to each specific tuning parameter $\gamma$. Our contribution begins with an interesting discovery: the marginal distribution of RFF is actually free of the Gaussian kernel parameter $\gamma$. This small finding significantly simplifies the design of the Lloyd-Max (LM) quantization scheme for RFF in that there would be only one LM quantizer for RFF (regardless of $\gamma$). We also develop a variant named LM$^2$-RFF quantizer, which in certain cases is more accurate. Experiments confirm that the proposed quantization schemes perform well.
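As a rough illustration of RFF and of the $\gamma$-free marginal, the sketch below assumes the kernel convention $k(x, y) = \exp(-\gamma^2 \|x - y\|^2 / 2)$ and features of the form $\sqrt{2}\cos(\gamma w^\top x + b)$ with $w$ standard normal and $b$ uniform on $[0, 2\pi)$; the paper may use a different parameterization. Because $b$ is uniform, the phase is uniform modulo $2\pi$ for any $\gamma$, so the share of features with $|\cos| < 0.5$ should be close to 1/3 in every case. The function names and test points are hypothetical.

```python
import math
import random

def rff_features(x, weights, offsets, gamma):
    """Random Fourier features sqrt(2) * cos(gamma * <w, x> + b), one per (w, b) pair."""
    return [
        math.sqrt(2.0) * math.cos(gamma * sum(wi * xi for wi, xi in zip(w, x)) + b)
        for w, b in zip(weights, offsets)
    ]

def gaussian_kernel(x, y, gamma):
    """Gaussian kernel under the assumed convention exp(-gamma^2 * ||x - y||^2 / 2)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-(gamma ** 2) * sq_dist / 2.0)

rng = random.Random(0)
dim, num_features = 3, 20000
weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_features)]
offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]

x, y = [0.2, -0.1, 0.4], [0.5, 0.3, -0.2]
results = {}
for gamma in (0.5, 2.0):
    zx = rff_features(x, weights, offsets, gamma)
    zy = rff_features(y, weights, offsets, gamma)
    approx = sum(a * b for a, b in zip(zx, zy)) / num_features
    exact = gaussian_kernel(x, y, gamma)
    # Share of features whose cosine lies in (-0.5, 0.5); about 1/3 for a uniform phase.
    frac = sum(abs(z) < 0.5 * math.sqrt(2.0) for z in zx) / num_features
    results[gamma] = (approx, exact, frac)
    print(f"gamma={gamma}: approx={approx:.3f} exact={exact:.3f} frac={frac:.3f}")
```

The kernel value changes with `gamma` while the marginal statistic does not, which is the property the abstract says allows a single quantizer to be designed regardless of $\gamma$.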

Machine learning

Ergodic Limits, Relaxations, and Geometric Properties of Random Walk Node Embeddings

153 - Christy Lin, Daniel Sussman, Prakash Ishwar 2021

Random walk based node embedding algorithms learn vector representations of nodes by optimizing an objective function of node embedding vectors and skip-bigram statistics computed from random walks on the network. They have been applied to many supervised learning problems such as link prediction and node classification and have demonstrated state-of-the-art performance. Yet, their properties remain poorly understood. This paper studies properties of random walk based node embeddings in the unsupervised setting of discovering hidden block structure in the network, i.e., learning node representations whose cluster structure in Euclidean space reflects their adjacency structure within the network. We characterize the ergodic limits of the embedding objective, its generalization, and related convex relaxations to derive corresponding non-randomiz

Machine learning