The ability to share social network data at the level of individual connections is beneficial to science: not only for reproducing results, but also for researchers who may wish to use it for purposes not foreseen by the data releaser. Sharing such data, however, can lead to serious privacy issues, because individuals could be re-identified, not only from possible node attributes, but also from the structure of the network around them. The risk associated with re-identification can be measured, and it is more serious in some networks than in others. Various optimization algorithms have been proposed to anonymize the network while keeping the number of changes minimal. However, existing algorithms do not provide guarantees on where the changes will be made, making it difficult to quantify their effect on various measures. Using network models and real data, we show that the average degree of a network is a crucial parameter for the severity of re-identification risk from node neighborhoods. Dense networks are more at risk, and, apart from a small band of average degree values, either almost all nodes are re-identifiable or they are all safe. Our results allow researchers to assess the privacy risk based on a small number of network statistics that are available even before the data is collected. As a rule of thumb, the privacy risks are high if the average degree is above 10. Guided by these results, we propose a simple method based on edge sampling to mitigate the re-identification risk of nodes. Our method can be implemented as early as the data collection phase. Its effect on various network measures can be estimated and corrected using sampling theory. These properties are in contrast with previous methods, which bias the data in arbitrary ways. In this sense, our work could help in sharing network data in a statistically tractable way.
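The edge-sampling idea above can be illustrated with a minimal sketch. It is not the authors' implementation: the neighborhood "signature" used here (a node's degree plus the number of edges among its neighbors) is a simplified stand-in for a full neighborhood-structure comparison, the random graph is a plain Erdős–Rényi model, and all function names are hypothetical. The sketch keeps each edge independently with probability `keep_prob` and corrects the average degree by dividing by that probability, which is the standard unbiased estimator under independent edge sampling.

```python
import random
from collections import Counter
from itertools import combinations


def erdos_renyi(n, avg_degree, rng):
    """Random graph on n nodes whose expected average degree is avg_degree."""
    p = avg_degree / (n - 1)
    return {(u, v) for u, v in combinations(range(n), 2) if rng.random() < p}


def adjacency(n, edges):
    """Build an adjacency-set dictionary from an undirected edge collection."""
    adj = {u: set() for u in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def unique_fraction(n, edges):
    """Fraction of nodes whose simplified neighborhood signature is unique.

    Assumption: the signature is (degree, number of edges among neighbors),
    a coarse proxy for the structure of the node's neighborhood.
    """
    adj = adjacency(n, edges)
    signature = {}
    for u in adj:
        links = sum(1 for a, b in combinations(sorted(adj[u]), 2) if b in adj[a])
        signature[u] = (len(adj[u]), links)
    counts = Counter(signature.values())
    return sum(1 for u in signature if counts[signature[u]] == 1) / n


def sample_edges(edges, keep_prob, rng):
    """Keep each edge independently with probability keep_prob."""
    return {e for e in sorted(edges) if rng.random() < keep_prob}


def average_degree(n, edges):
    """Average degree of an undirected graph with n nodes."""
    return 2 * len(edges) / n


def estimate_original_average_degree(n, sampled_edges, keep_prob):
    """Unbiased estimate of the pre-sampling average degree."""
    return average_degree(n, sampled_edges) / keep_prob
```

A typical use would be to generate graphs with `erdos_renyi` at several average degrees, compare `unique_fraction` before and after `sample_edges`, and check that `estimate_original_average_degree` recovers the original density on average.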
In-depth studies of sociotechnical systems are largely limited to single instances. Network surveys are expensive, and platforms vary in important ways, from interface design, to social norms, to historical contingencies. With single examples, we cannot in general know how much of the observed network structure is explained by historical accidents, random noise, or meaningful social processes, nor can we claim that network structure predicts outcomes, such as organizational success or ecosystem health. Here, I show how we can adopt a comparative approach for settings where we have, or can cleverly construct, multiple instances of a network to estimate the natural variability in social systems. The comparative approach makes previously untested theories testable. Drawing on examples from the social networks literature, I discuss emerging directions in the study of populations of sociotechnical systems using insights from organization theory and ecology.
We propose a stochastic model for the diffusion of topics entering a social network modeled by a Watts-Strogatz graph. Our model sets up an implicit competition between these topics as they vie for the attention of users in the network. The dynamics of our model are based on notions taken from real-world online social networks (OSNs) like Twitter, where users either adopt an exogenous topic or copy topics from their neighbors, leading to endogenous propagation. When instantiated correctly, the model achieves a viral regime in which a few topics garner an unusually strong response from the network, closely mimicking the behavior of real-world OSNs. Our main contribution is our description of how clusters of proximate users that have spoken on the topic merge to form a giant component, making a topic go viral. This demonstrates that it is not weak ties but strong ties that play a major part in virality. We further validate our model and our hypotheses about its behavior by comparing our simulation results with the results of a measurement study conducted on real data taken from Twitter.
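The dynamics described above can be sketched as a small simulation. This is only an illustration under stated assumptions, not the authors' model: at each step one random user either introduces a brand-new exogenous topic with probability `p_exogenous` or copies the current topic of a random neighbor, and the update rule, parameter names, and the hand-written Watts-Strogatz generator are all hypothetical choices. The helper `largest_cluster` measures the largest connected group of users who have spoken on a topic, which is the quantity the abstract links to virality.

```python
import random
from collections import defaultdict


def watts_strogatz(n, k, beta, rng):
    """Ring lattice (each node linked to k nearest neighbors) with rewiring.

    Each lattice edge (u, u+j) is rewired with probability beta to a node
    that is not yet a neighbor of u, so the edge count stays n*k/2.
    """
    adj = {u: set() for u in range(n)}
    for u in range(n):
        for j in range(1, k // 2 + 1):
            v = (u + j) % n
            adj[u].add(v)
            adj[v].add(u)
    for u in range(n):
        for j in range(1, k // 2 + 1):
            v = (u + j) % n
            if v in adj[u] and rng.random() < beta:
                candidates = [w for w in range(n) if w != u and w not in adj[u]]
                if candidates:
                    w = rng.choice(candidates)
                    adj[u].discard(v)
                    adj[v].discard(u)
                    adj[u].add(w)
                    adj[w].add(u)
    return adj


def simulate_topics(adj, steps, p_exogenous, rng):
    """Return a mapping topic -> set of users who ever adopted it."""
    current = {u: None for u in adj}   # topic each user is currently on
    adopters = defaultdict(set)
    next_topic = 0
    nodes = sorted(adj)
    for _ in range(steps):
        u = rng.choice(nodes)
        if rng.random() < p_exogenous:
            # Exogenous arrival: a fresh topic enters the network at u.
            topic = next_topic
            next_topic += 1
        else:
            # Endogenous propagation: copy a random neighbor's topic.
            neighbors = sorted(adj[u])
            if not neighbors:
                continue
            topic = current[rng.choice(neighbors)]
            if topic is None:
                continue
        current[u] = topic
        adopters[topic].add(u)
    return adopters


def largest_cluster(adj, users):
    """Size of the largest connected component induced by `users`."""
    users = set(users)
    seen = set()
    best = 0
    for start in users:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v in users and v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best
```

Running `simulate_topics` with a small `p_exogenous` and then applying `largest_cluster` to each topic's adopter set gives a rough way to see whether a few topics dominate and whether their adopters form one large connected cluster.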
Here, we review the research we have done on social contagion. We describe the methods we have employed (and the assumptions they have entailed) in order to examine several datasets with complementary strengths and weaknesses, including the Framingham Heart Study, the National Longitudinal Study of Adolescent Health, and other observational and experimental datasets that we and others have collected. We describe the regularities that led us to propose that human social networks may exhibit a three degrees of influence property, and we review statistical approaches we have used to characterize inter-personal influence with respect to phenomena as diverse as obesity, smoking, cooperation, and happiness. We do not claim that this work is the final word, but we do believe that it provides some novel, informative, and stimulating evidence regarding social contagion in longitudinally followed networks. Along with other scholars, we are working to develop new methods for identifying causal effects using social network data, and we believe that this area is ripe for statistical development as current methods have known and often unavoidable limitations.
People's personal social networks are big and cluttered, and currently there is no good way to automatically organize them. Social networking sites allow users to manually categorize their friends into social circles (e.g. circles on Google+, and lists on Facebook and Twitter); however, these circles are laborious to construct and must be updated whenever a user's network grows. In this paper, we study the novel task of automatically identifying users' social circles. We pose this task as a multi-membership node clustering problem on a user's ego-network, a network of connections between her friends. We develop a model for detecting circles that combines network structure with user profile information. For each circle, we learn its members and the circle-specific user profile similarity metric. Modeling node membership in multiple circles allows us to detect overlapping as well as hierarchically nested circles. Experiments show that our model accurately identifies circles on a diverse set of data from Facebook, Google+, and Twitter, for all of which we obtain hand-labeled ground truth.
We introduce a new paradigm that is important for community detection in the realm of network analysis. Networks contain a set of strong, dominant communities, which interfere with the detection of weak, natural community structure. When most of the members of the weak communities also belong to stronger communities, the weak communities are extremely hard to uncover. We call the weak communities the hidden community structure. We present a novel approach called HICODE (HIdden COmmunity DEtection) that identifies the hidden community structure as well as the dominant community structure. By weakening the strength of the dominant structure, one can uncover the hidden structure beneath. Likewise, by reducing the strength of the hidden structure, one can more accurately identify the dominant structure. In this way, HICODE tackles both tasks simultaneously. Extensive experiments on real-world networks demonstrate that HICODE outperforms several state-of-the-art community detection methods in uncovering both the dominant and the hidden structure. In the Facebook university social networks, we find multiple non-redundant sets of communities that are strongly associated with residential hall, year of registration, or career position of the faculty or students, while the state-of-the-art algorithms mainly locate the dominant ground truth category. Due to the difficulty of labeling all ground truth communities in real-world datasets, HICODE provides a promising approach to pinpoint the existing latent communities and uncover communities for which there is no ground truth. Finding this unknown structure is an extremely important community detection problem.
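The weakening step can be illustrated on a toy graph. The sketch below is an assumption-laden illustration, not the HICODE algorithm itself: it weakens the dominant structure in the bluntest possible way, by deleting every edge that lies inside a dominant community, and it uses Newman modularity as the measure of how visible a partition is. The graph, the two partitions, and the function names are all hypothetical.

```python
from collections import Counter


def modularity(edges, membership):
    """Newman modularity of a partition on an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 m))^2,
    where L_c is the number of internal edges and d_c the total degree.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = Counter()
    degree_sum = Counter()
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


def weaken(edges, membership):
    """Drop every edge whose endpoints share a community in `membership`."""
    return [(u, v) for u, v in edges if membership[u] != membership[v]]


# Toy graph: two dense dominant communities (4-cliques) overlaid with
# four weak hidden communities, each a single cross-clique pair.
dominant = {0: "A", 1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B", 7: "B"}
hidden = {0: 0, 4: 0, 1: 1, 5: 1, 2: 2, 6: 2, 3: 3, 7: 3}

clique_a = [(u, v) for u in range(4) for v in range(u + 1, 4)]
clique_b = [(u, v) for u in range(4, 8) for v in range(u + 1, 8)]
cross = [(0, 4), (1, 5), (2, 6), (3, 7)]
edges = clique_a + clique_b + cross

hidden_before = modularity(edges, hidden)
hidden_after = modularity(weaken(edges, dominant), hidden)
```

In this toy case the hidden partition has modularity 0 on the full graph and 0.75 once the dominant structure is removed, which mirrors the claim that weakening the dominant communities exposes the hidden ones. A practical method would alternate this step with an actual community detection algorithm rather than assume both partitions are known.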