As researchers use computational methods to study complex social behaviors at scale, the validity of this computational social science depends on the integrity of the data. On July 2, 2015, Jason Baumgartner published a dataset advertised to include ``every publicly available Reddit comment,'' which was quickly shared on Bittorrent and the Internet Archive. This data quickly became the basis of many academic papers on topics including machine learning, social behavior, politics, breaking news, and hate speech. We have discovered substantial gaps and limitations in this dataset which may contribute to bias in the findings of that research. In this paper, we document the dataset, substantial missing observations in the dataset, and the risks to research validity from those gaps. In summary, we identify strong risks to research that considers user histories or network analysis, moderate risks to research that compares counts of participation, and lower risks to machine learning research that avoids making representative claims about behavior and participation on Reddit.
Labor market institutions are central for modern economies, and their policies can directly affect unemployment rates and economic growth. At the individual level, unemployment often has a detrimental impact on people's well-being and health. At the national level, high employment is one of the central goals of any economic policy, due to its close association with national prosperity. The main goal of this thesis is to highlight the need for frameworks that take into account the complex structure of labor market interactions. In particular, we explore the benefits of leveraging tools from computational social science, network science, and data-driven theories to measure the flow of opportunities and information in the context of the labor market. First, we investigate our key hypothesis, which is that opportunities and information flow through weak ties, and that this flow is a key determinant of the length of unemployment. We then extend the idea of opportunity/information flow to clusters of other economic activities, where we expect the flow within clusters of related activities to be higher than within isolated activities. This captures the intuition that related activities involve more capital and require similar capabilities. Therefore, more extensive clusters of economic activities should generate greater growth by exploiting the greater flow of opportunities and information. We quantify the opportunity/information flow using a complexity measure of two economic activities (i.e., jobs and exports).
This paper describes the deployment of a large-scale study designed to measure human interactions across a variety of communication channels, with high temporal resolution and spanning multiple years - the Copenhagen Networks Study. Specifically, we collect data on face-to-face interactions, telecommunication, social networks, location, and background information (personality, demographic, health, politics) for a densely connected population of 1,000 individuals, using state-of-the-art smartphones as social sensors. Here we provide an overview of the related work and describe the motivation and research agenda driving the study. Additionally, the paper details the data types measured and the technical infrastructure in terms of both backend and phone software, as well as an outline of the deployment procedures. We document the participant privacy procedures and their underlying principles. The paper concludes with early results from data analysis, illustrating the importance of a multi-channel, high-resolution approach to data collection.
Population behaviours, such as voting and vaccination, depend on social networks. Social networks can differ depending on behaviour type and are typically hidden. However, we do often have large-scale behavioural data, albeit only snapshots taken at one timepoint. We present a method that jointly infers large-scale network structure and a networked model of human behaviour using only snapshot population behavioural data. This exploits the simplicity of a few-parameter geometric socio-demographic network model and a spin-based model of behaviour. We illustrate, for the EU Referendum and two London Mayoral elections, how the model offers both prediction and the interpretation of our homophilic inclinations. Beyond offering the extraction of behaviour-specific network structure from large-scale behavioural datasets, our approach yields a crude calculus linking inequalities and social preferences to behavioural outcomes. We give examples of potential network-sensitive policies: how changes to income inequality, a social temperature, and homophilic preferences might have reduced polarisation in a recent election.
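The two ingredients described above can be sketched in miniature: a geometric socio-demographic network whose link probability decays with attribute distance (tuned by a homophily parameter), coupled to a spin model of binary behaviour with a "social temperature." All names, parameter values, and the 1-D income attribute below are illustrative assumptions, not the paper's calibrated model.

```python
import math
import random

random.seed(0)

N = 50
# Hypothetical 1-D socio-demographic attribute (e.g. normalised income).
income = [random.random() for _ in range(N)]

def build_network(homophily: float):
    """Geometric network: link probability decays with attribute distance."""
    edges = [set() for _ in range(N)]
    for i in range(N):
        for j in range(i + 1, N):
            p = math.exp(-homophily * abs(income[i] - income[j]))
            if random.random() < p:
                edges[i].add(j)
                edges[j].add(i)
    return edges

def glauber_step(spins, edges, temperature: float):
    """Spin-model update: align with neighbours, noisier at high temperature."""
    i = random.randrange(N)
    field = sum(spins[j] for j in edges[i])  # neighbours' net opinion
    p_up = 1.0 / (1.0 + math.exp(-2.0 * field / temperature))
    spins[i] = 1 if random.random() < p_up else -1

edges = build_network(homophily=5.0)
spins = [random.choice([-1, 1]) for _ in range(N)]  # binary behaviour, e.g. vote
for _ in range(5000):
    glauber_step(spins, edges, temperature=1.0)

magnetisation = sum(spins) / N  # population-level behavioural outcome
```

Raising the temperature or the homophily parameter changes how readily local consensus (and hence polarisation) forms, which is the kind of counterfactual lever the policy examples refer to.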
The COVID-19 pandemic has affected people's lives around the world on an unprecedented scale. We intend to investigate hoarding behaviors in response to the pandemic using large-scale social media data. First, we collect hoarding-related tweets shortly after the outbreak of the coronavirus. Next, we analyze the hoarding and anti-hoarding patterns of over 42,000 unique Twitter users in the United States from March 1 to April 30, 2020, and dissect the hoarding-related tweets by age, gender, and geographic location. We find the percentage of females in both hoarding and anti-hoarding groups is higher than that of the general Twitter user population. Furthermore, using topic modeling, we investigate the opinions expressed towards the hoarding behavior by categorizing these topics according to demographic and geographic groups. We also calculate the anxiety scores for the hoarding and anti-hoarding related tweets using a lexical approach. By comparing their anxiety scores with the baseline Twitter anxiety score, we reveal further insights. The LIWC anxiety mean for the hoarding-related tweets is significantly higher than the baseline Twitter anxiety mean. Interestingly, beer has the highest calculated anxiety score compared to other hoarded items mentioned in the tweets.
The conventional notion of community that favors a high ratio of internal edges to outbound edges becomes invalid when each vertex participates in multiple communities. Such behavior is commonplace in social networks. The significant overlaps among communities make most existing community detection algorithms ineffective. The lack of effective and efficient tools has resulted in very few empirical studies on large-scale detection and analyses of overlapping community structure in real social networks. We recently developed a scalable and accurate method called the Partial Community Merger Algorithm (PCMA) with linear complexity and demonstrated its effectiveness by analyzing two online social networks, Sina Weibo and Friendster, with 79.4 and 65.6 million vertices, respectively. Here, we report in-depth analyses of the 2.9 million communities detected by PCMA to uncover their complex overlapping structure. Each community usually overlaps with a significant number of other communities and has far more outbound edges than internal edges. Yet, the communities remain well separated from each other. Most vertices in a community are multi-membership vertices, and they can be at the core or the periphery. Almost half of the entire network can be accounted for by an extremely dense network of communities, with the communities being the vertices and the overlaps being the edges. These empirical findings call for rethinking the notion of community, especially the boundary of a community. Since what matters is how the edges are organized, we suggest the f-core as a suitable concept for overlapping communities in social networks. The results shed new light on the understanding of overlapping communities.
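The "network of communities" described above can be sketched directly: detected communities become vertices, and two communities are linked whenever their member overlap is large enough. The overlap threshold and the tiny example communities below are illustrative assumptions, not PCMA's actual criterion.

```python
from itertools import combinations

def overlap_graph(communities: dict, min_shared: int = 2) -> set:
    """Return edges between community labels sharing >= min_shared members.

    communities maps a label to the set of member vertex ids.
    """
    edges = set()
    for (a, members_a), (b, members_b) in combinations(communities.items(), 2):
        if len(members_a & members_b) >= min_shared:
            edges.add(frozenset((a, b)))
    return edges

# Toy detected communities with multi-membership vertices 3 and 4.
communities = {
    "c1": {1, 2, 3, 4},
    "c2": {3, 4, 5, 6},  # shares {3, 4} with c1
    "c3": {7, 8, 9},     # no overlap: isolated in the community network
}
edges = overlap_graph(communities)
```

On real data this pairwise scan would be replaced by an inverted index from vertices to their communities, since comparing all 2.9 million communities pairwise is infeasible.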