No Arabic abstract
An important challenge in the field of exponential random graphs (ERGs) is the fitting of non-trivial ERGs on large graphs. By utilizing fast matrix block-approximation techniques, we propose an approximative framework to such non-trivial ERGs that result in dyadic independence (i.e., edge independent) distributions, while being able to meaningfully model both local information of the graph (e.g., degrees) as well as global information (e.g., clustering coefficient, assortativity, etc.) if desired. This allows one to efficiently generate random networks with similar properties as an observed network, and the models can be used for several downstream tasks such as link prediction. Our methods are scalable to sparse graphs consisting of millions of nodes. Empirical evaluation demonstrates competitiveness in terms of both speed and accuracy with state-of-the-art methods -- which are typically based on embedding the graph into some low-dimensional space -- for link prediction, showcasing the potential of a more direct and interpretable probabalistic model for this task.
We provide the first information theoretic tight analysis for inference of latent community structure given a sparse graph along with high dimensional node covariates, correlated with the same latent communities. Our work bridges recent theoretical breakthroughs in the detection of latent community structure without nodes covariates and a large body of empirical work using diverse heuristics for combining node covariates with graphs for inference. The tightness of our analysis implies in particular, the information theoretical necessity of combining the different sources of information. Our analysis holds for networks of large degrees as well as for a Gaussian version of the model.
Node embedding is a powerful approach for representing the structural role of each node in a graph. $textit{Node2vec}$ is a widely used method for node embedding that works by exploring the local neighborhoods via biased random walks on the graph. However, $textit{node2vec}$ does not consider edge weights when computing walk biases. This intrinsic limitation prevents $textit{node2vec}$ from leveraging all the information in weighted graphs and, in turn, limits its application to many real-world networks that are weighted and dense. Here, we naturally extend $textit{node2vec}$ to $textit{node2vec+}$ in a way that accounts for edge weights when calculating walk biases, but which reduces to $textit{node2vec}$ in the cases of unweighted graphs or unbiased walks. We empirically show that $textit{node2vec+}$ is more robust to additive noise than $textit{node2vec}$ in weighted graphs using two synthetic datasets. We also demonstrate that $textit{node2vec+}$ significantly outperforms $textit{node2vec}$ on a commonly benchmarked multi-label dataset (Wikipedia). Furthermore, we test $textit{node2vec+}$ against GCN and GraphSAGE using various challenging gene classification tasks on two protein-protein interaction networks. Despite some clear advantages of GCN and GraphSAGE, they show comparable performance with $textit{node2vec+}$. Finally, $textit{node2vec+}$ can be used as a general approach for generating biased random walks, benefiting all existing methods built on top of $textit{node2vec}$. $textit{Node2vec+}$ is implemented as part of $texttt{PecanPy}$, which is available at https://github.com/krishnanlab/PecanPy .
Many real-world networks are intrinsically directed. Such networks include activation of genes, hyperlinks on the internet, and the network of followers on Twitter among many others. The challenge, however, is to create a network model that has many of the properties of real-world networks such as powerlaw degree distributions and the small-world property. To meet these challenges, we introduce the textit{Directed} Random Geometric Graph (DRGG) model, which is an extension of the random geometric graph model. We prove that it is scale-free with respect to the indegree distribution, has binomial outdegree distribution, has a high clustering coefficient, has few edges and is likely small-world. These are some of the main features of aforementioned real world networks. We empirically observe that word association networks have many of the theoretical properties of the DRGG model.
A variety of approaches have been proposed to automatically infer the profiles of users from their digital footprint in social media. Most of the proposed approaches focus on mining a single type of information, while ignoring other sources of available user-generated content (UGC). In this paper, we propose a mechanism to infer a variety of user characteristics, such as, age, gender and personality traits, which can then be compiled into a user profile. To this end, we model social media users by incorporating and reasoning over multiple sources of UGC as well as social relations. Our model is based on a statistical relational learning framework using Hinge-loss Markov Random Fields (HL-MRFs), a class of probabilistic graphical models that can be defined using a set of first-order logical rules. We validate our approach on data from Facebook with more than 5k users and almost 725k relations. We show how HL-MRFs can be used to develop a generic and extensible user profiling framework by leveraging textual, visual, and relational content in the form of status updates, profile pictures and Facebook page likes. Our experimental results demonstrate that our proposed model successfully incorporates multiple sources of information and outperforms competing methods that use only one source of information or an ensemble method across the different sources for modeling of users in social media.
Graph embedding techniques, which learn low-dimensional representations of a graph, are achieving state-of-the-art performance in many graph mining tasks. Most existing embedding algorithms assign a single vector to each node, implicitly assuming that a single representation is enough to capture all characteristics of the node. However, across many domains, it is common to observe pervasively overlapping community structure, where most nodes belong to multiple communities, playing different roles depending on the contexts. Here, we propose persona2vec, a graph embedding framework that efficiently learns multiple representations of nodes based on their structural contexts. Using link prediction-based evaluation, we show that our framework is significantly faster than the existing state-of-the-art model while achieving better performance.