Understanding the correlation between two different scores for the same set of items is a common problem in information retrieval, and the most commonly used statistic that quantifies this correlation is Kendall's $\tau$. However, the standard definition fails to capture that discordances between items with high rank are more important than those between items with low rank. Recently, a new measure of correlation based on average precision has been proposed to solve this problem, but like many alternative proposals in the literature it assumes that there are no ties in the scores. This is a major deficiency in a number of contexts, and in particular when comparing centrality scores on large graphs, as the obvious baseline, indegree, has a very large number of ties in web and social graphs. We propose to extend Kendall's definition in a natural way to take into account weights in the presence of ties. We prove a number of interesting mathematical properties of our generalization and describe an $O(n \log n)$ algorithm for its computation. We also validate the usefulness of our weighted measure of correlation using experimental data.
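As an illustration of the idea, the following is a minimal sketch of a weighted, tie-aware Kendall-style correlation. It assumes one common formulation: a weighted inner product over pairs of items, normalized by the weighted norms of the two score vectors, so that tied pairs contribute zero. The function names, the `hyperbolic_weight` helper, and the tie-breaking by index are assumptions made for this example. The code is a quadratic-time reference computation, not the $O(n \log n)$ algorithm mentioned in the abstract.

```python
import math


def sgn(x):
    """Sign of x as -1, 0, or 1."""
    return (x > 0) - (x < 0)


def weighted_inner(r, s, weight):
    """Sum of sgn(r_i - r_j) * sgn(s_i - s_j) * weight(i, j) over pairs i < j."""
    total = 0.0
    for i in range(len(r)):
        for j in range(i + 1, len(r)):
            total += sgn(r[i] - r[j]) * sgn(s[i] - s[j]) * weight(i, j)
    return total


def weighted_tau(r, s, weight=lambda i, j: 1.0):
    """Weighted correlation of two score lists; tied pairs contribute zero.

    With the default constant weight this reduces to Kendall's tau-b.
    """
    if len(r) != len(s):
        raise ValueError("score lists must have the same length")
    denom = math.sqrt(weighted_inner(r, r, weight) * weighted_inner(s, s, weight))
    if denom == 0:
        raise ValueError("correlation is undefined when a list is constant")
    return weighted_inner(r, s, weight) / denom


def hyperbolic_weight(scores):
    """Hypothetical weight: pairs involving top-ranked items count more.

    Rank 0 is the highest score; ties are broken by index (an assumption).
    """
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    rank = {item: pos for pos, item in enumerate(order)}
    return lambda i, j: 1.0 / (rank[i] + 1) + 1.0 / (rank[j] + 1)
```

With `w = hyperbolic_weight(r)`, swapping the two top-ranked items of `r` lowers `weighted_tau(r, s, w)` more than swapping the two bottom-ranked items, which is the behavior the unweighted statistic cannot express.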
Given a set of attributed subgraphs known to be from different classes, how can we discover their differences? There are many cases where collections of subgraphs may be contrasted against each other. For example, they may be assigned ground truth labels (spam/not-spam), or it may be desired to directly compare the biological networks of different species or compound networks of different chemicals. In this work we introduce the problem of characterizing the differences between attributed subgraphs that belong to different classes. We define this characterization problem as one of partitioning the attributes into as many groups as the number of classes, while maximizing the total attributed quality score of all the given subgraphs. We show that our attribute-to-class assignment problem is NP-hard and that an optimal $(1 - 1/e)$-approximation algorithm exists. We also propose two faster heuristics that are linear-time in the number of attributes and subgraphs. Unlike previous work where only attributes were taken into account for characterization, here we exploit both attributes and social ties (i.e., graph structure). Through extensive experiments, we compare our proposed algorithms and show findings that agree with human intuition on datasets from Amazon co-purchases, Congressional bill sponsorships, and DBLP co-authorships. We also show that our approach of characterizing subgraphs is better suited for sense-making than discriminative classification approaches.
In recent years, network embedding methods have garnered increasing attention because of their effectiveness in various information retrieval tasks. The goal is to learn low-dimensional representations of vertices in an information network while capturing and preserving the network structure. Critical to the performance of a network embedding method is how the edges and vertices of the network are sampled for the learning process. Many existing methods adopt a uniform sampling method to reduce learning complexity, but when the network is non-uniform (i.e., a weighted network) such uniform sampling incurs information loss. The goal of this paper is to present a generalized vertex sampling framework that works seamlessly with most existing network embedding methods to support weighted instead of uniform vertex/edge sampling. For efficiency, we propose a delicate sequential vertex-to-context graph data structure, such that sampling a training pair for learning takes only constant time. For scalability and memory efficiency, we design the graph data structure in a way that keeps space consumption low without requiring additional space. In addition to implementing existing network embedding methods, the proposed framework can be used to implement extensions that feature high-order proximity modeling and weighted relation modeling. Experiments conducted on three datasets, including a commercial large-scale one, verify the effectiveness and efficiency of the proposed weighted network embedding methods on a variety of tasks, including word similarity search, multi-label classification, and item recommendation.
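To make the notion of constant-time weighted sampling concrete, the sketch below uses the alias method, a standard technique for drawing from a fixed discrete distribution in $O(1)$ time after linear preprocessing. This is a swapped-in illustration: the abstract does not say how its vertex-to-context data structure achieves constant-time sampling, and the function names here are hypothetical.

```python
import random


def build_alias_table(weights):
    """Preprocess non-negative weights into (prob, alias) tables in O(n)."""
    n = len(weights)
    total = sum(weights)
    if n == 0 or total <= 0:
        raise ValueError("need at least one positive weight")
    # Scale so that the average entry is exactly 1.
    scaled = [w * n / total for w in weights]
    prob = [0.0] * n
    alias = list(range(n))
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]      # keep s with this probability
        alias[s] = l             # otherwise redirect the draw to l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    # Leftovers (including rounding cases) keep their own slot with certainty.
    for i in small + large:
        prob[i] = 1.0
    return prob, alias


def sample_index(prob, alias, rng):
    """Draw one index in O(1): pick a slot uniformly, then keep or redirect."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

For example, with edge weights `[1, 3]`, `build_alias_table` yields tables under which `sample_index(prob, alias, random.Random(0))` returns index 1 three times as often as index 0 in the long run.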
The problem of finding factors of a text string which are identical or similar to a given pattern string is a central problem in computer science. A generalised version of this problem consists in implementing an index over the text to support efficient on-line pattern queries. We study this problem in the case where the text is weighted: for every position of the text and every letter of the alphabet a probability of occurrence of this letter at this position is given. Sequences of this type, also called position weight matrices, are commonly used to represent imprecise or uncertain data. A weighted sequence may represent many different strings, each with probability of occurrence equal to the product of probabilities of its letters at subsequent positions. Given a probability threshold $1/z$, we say that a pattern string $P$ matches a weighted text at position $i$ if the product of probabilities of the letters of $P$ at positions $i,\ldots,i+|P|-1$ in the text is at least $1/z$. In this article, we present an $O(nz)$-time construction of an $O(nz)$-sized index that can answer pattern matching queries in a weighted text in optimal time, improving upon the state of the art by a factor of $z \log z$. Other applications of this data structure include an $O(nz)$-time construction of the weighted prefix table and an $O(nz)$-time computation of all covers of a weighted sequence, which improve upon the state of the art by the same factor.
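The matching condition defined above can be checked directly. The sketch below is a naive scan that multiplies the letter probabilities at each candidate position and compares the product with $1/z$. It is only meant to illustrate the definition and is not the $O(nz)$ index described in the abstract. The representation of the weighted text as a list of dictionaries, the function names, and the 0-based positions are assumptions made for this example.

```python
def occurrence_probability(weighted_text, pattern, i):
    """Product of the probabilities of pattern's letters at positions i, i+1, ..."""
    prob = 1.0
    for offset, letter in enumerate(pattern):
        # A letter missing from a position's dictionary has probability 0.
        prob *= weighted_text[i + offset].get(letter, 0.0)
    return prob


def match_positions(weighted_text, pattern, z):
    """All 0-based positions where pattern occurs with probability >= 1/z."""
    threshold = 1.0 / z
    last_start = len(weighted_text) - len(pattern)
    return [
        i
        for i in range(last_start + 1)
        if occurrence_probability(weighted_text, pattern, i) >= threshold
    ]


# Hypothetical weighted text of length 4 over the alphabet {a, b}.
text = [
    {"a": 1.0},
    {"a": 0.5, "b": 0.5},
    {"b": 1.0},
    {"a": 0.25, "b": 0.75},
]
print(match_positions(text, "ab", z=2))  # [0, 1]
```

Here the pattern `ab` occurs at position 0 with probability 1.0 * 0.5 and at position 1 with probability 0.5 * 1.0, both equal to the threshold 1/2, while at position 2 the letter `a` has probability 0.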
Network structure can affect when and how widely new ideas, products, and behaviors are adopted. In widely used models of biological contagion, interventions that randomly rewire edges (generally making them longer) accelerate spread. However, there are other models relevant to social contagion, such as those motivated by myopic best-response in games with strategic complements, in which an individual's behavior is described by a threshold number of adopting neighbors above which adoption occurs (i.e., complex contagions). Recent work has argued that highly clustered, rather than random, networks facilitate spread of these complex contagions. Here we show that minor modifications to this model, which make it more realistic, reverse this result: we allow very rare below-threshold adoption, i.e., on rare occasions adoption occurs when there is only one adopting neighbor. To model the trade-off between long and short edges we consider networks that are the union of cycle-power-$k$ graphs and random graphs on $n$ nodes. Allowing adoptions below threshold to occur with order $1/\sqrt{n}$ probability along some short cycle edges is enough to ensure that random rewiring accelerates spread. Simulations illustrate the robustness of these results to other commonly posited models for noisy best-response behavior. Hypothetical interventions that randomly rewire existing edges or add random edges (versus adding short, triad-closing edges) in hundreds of empirical social networks reduce time to spread. This revised conclusion suggests that those wanting to increase spread should induce formation of long ties, rather than triad-closing ties. More generally, this highlights the importance of noise in game-theoretic analyses of behavior.
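The model described above can be sketched as a small simulation. The code below builds a cycle-power-$k$ graph, optionally adds random long edges, and runs synchronous threshold updates in which a node with at least one but fewer than `threshold` adopting neighbors adopts with probability `q`. All function and parameter names are hypothetical, and the update rule is a simplification: here below-threshold adoption may occur along any edge, whereas the abstract restricts it to some short cycle edges.

```python
import random


def cycle_power_graph(n, k):
    """Adjacency sets where node i is linked to all nodes within distance k on an n-cycle."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for d in range(1, k + 1):
            j = (i + d) % n
            if j != i:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def add_random_edges(adj, m, rng):
    """Add m edges between uniformly chosen distinct node pairs (duplicates are absorbed)."""
    nodes = list(adj)
    for _ in range(m):
        u, v = rng.sample(nodes, 2)
        adj[u].add(v)
        adj[v].add(u)


def step(adj, adopted, threshold, q, rng):
    """One synchronous update; returns the new set of adopters."""
    new_adopted = set(adopted)
    for v in adj:
        if v in adopted:
            continue
        count = len(adj[v] & adopted)
        # At or above threshold: adopt. Below threshold with at least one
        # adopting neighbor: adopt with small probability q (assumed rule).
        if count >= threshold or (count >= 1 and rng.random() < q):
            new_adopted.add(v)
    return new_adopted


def time_to_full_spread(adj, seeds, threshold, q, rng, max_steps=10000):
    """Number of steps until every node has adopted, or None if not reached."""
    adopted = set(seeds)
    for t in range(max_steps + 1):
        if len(adopted) == len(adj):
            return t
        adopted = step(adj, adopted, threshold, q, rng)
    return None
```

A comparison in the spirit of the abstract would call `time_to_full_spread` on the same cycle-power graph with and without `add_random_edges`, using `threshold=2` and a small `q`, and average the result over many random seeds.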
Nutrition is a key determinant of long-term health, and social influence has long been theorized to be a key determinant of nutrition. It has been difficult to quantify the postulated role of social influence on nutrition using traditional methods such as surveys, due to the typically small scale and short duration of studies. To overcome these limitations, we leverage a novel source of data: logs of 38 million food purchases made over an 8-year period on the École Polytechnique Fédérale de Lausanne (EPFL) university campus, linked to anonymized individuals via the smartcards used to make on-campus purchases. In a longitudinal observational study, we ask: How is a person's food choice affected by eating with someone else whose own food choice is healthy vs. unhealthy? To estimate causal effects from the passively observed log data, we control confounds in a matched quasi-experimental design: we identify focal users who at first do not have any regular eating partners but then start eating with a fixed partner regularly, and we match focal users into comparison pairs such that paired users are nearly identical with respect to covariates measured before acquiring the partner, where the two focal users' new eating partners diverge in the healthiness of their respective food choice. A difference-in-differences analysis of the paired data yields clear evidence of social influence: focal users acquiring a healthy-eating partner change their habits significantly more toward healthy foods than focal users acquiring an unhealthy-eating partner. We further identify foods whose purchase frequency is impacted significantly by the healthiness of the eating partner's food choice. Beyond the main results, the work demonstrates the utility of passively sensed food purchase logs for deriving insights, with the potential of informing the design of public health interventions and food offerings.
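The difference-in-differences estimate mentioned above can be illustrated with a few lines of arithmetic. The sketch below compares the average before/after change of one group with that of the other. The function name, the outcome measure (percentage of purchases that are healthy), and all numbers are hypothetical and chosen only to show the computation; they are not results from the study.

```python
from statistics import mean


def diff_in_diff(treated, control):
    """Difference-in-differences from lists of (before, after) outcome pairs.

    Returns the mean change in the treated group minus the mean change
    in the control group.
    """
    treated_change = mean(after - before for before, after in treated)
    control_change = mean(after - before for before, after in control)
    return treated_change - control_change


# Hypothetical data: percentage of healthy purchases before and after
# a focal user starts eating regularly with a partner.
healthy_partner = [(40, 50), (30, 36)]    # mean change: +8 points
unhealthy_partner = [(40, 42), (30, 30)]  # mean change: +1 point

print(diff_in_diff(healthy_partner, unhealthy_partner))  # 7
```

Subtracting the change in the comparison group removes trends that affect both groups equally, so the remaining 7 points in this toy example would be attributed to the difference in partner type, under the usual assumption that the matched groups would otherwise have evolved in parallel.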