Qualitative research provides methodological guidelines for observing and studying communities and cultures on online social media platforms. However, such methods demand considerable manual effort from researchers and may focus too narrowly on certain online groups. In this work, we propose a complete solution to accelerate qualitative analysis of problematic online speech, with a specific focus on opinions emerging from online communities, by leveraging machine learning algorithms. First, we employ qualitative methods of deep observation for understanding problematic online speech. This initial qualitative study constructs an ontology of problematic speech, which contains social media postings annotated with their underlying opinions. The qualitative study also constructs the set of opinions dynamically, simultaneously with labeling the postings. Next, we collect a large dataset from three online social media platforms (Facebook, Twitter and YouTube) using keywords. Finally, we introduce an iterative data exploration procedure to augment the dataset. It alternates between a data sampler, which balances exploration and exploitation of unlabeled data, the automatic labeling of the sampled data, the manual inspection by the qualitative mapping team and, finally, the retraining of the automatic opinion classifier. We present both qualitative and quantitative results. First, we present detailed case studies of the dynamics of problematic speech in a far-right Facebook group, exemplifying its mutation from conservative to extreme. Next, we show that our method successfully learns from the initial qualitatively labeled and narrowly focused dataset, and constructs a larger dataset. Using the latter, we examine the dynamics of opinion emergence and co-occurrence, and we hint at some of the pathways through which extreme opinions creep into the mainstream online discourse.
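The iterative exploration procedure described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the sampler scores posts, so "exploitation" is assumed here to mean picking the posts the current classifier is most confident about, and "exploration" to mean uniform random sampling from the rest. The names `sample_batch`, `exploration_loop`, `toy_train`, and `toy_review` are hypothetical.

```python
import random


def sample_batch(pool, confidence, batch_size, explore_frac, rng):
    """Mix exploitation (most confident posts) with exploration (random posts)."""
    n_explore = round(batch_size * explore_frac)
    n_exploit = batch_size - n_explore
    ranked = sorted(pool, key=confidence, reverse=True)
    exploit, rest = ranked[:n_exploit], ranked[n_exploit:]
    explore = rng.sample(rest, min(n_explore, len(rest)))
    return exploit + explore


def exploration_loop(labeled, unlabeled, train, review, rounds,
                     batch_size=2, explore_frac=0.5, seed=0):
    """Alternate sampling, automatic labeling, manual review, and retraining."""
    rng = random.Random(seed)
    labeled = dict(labeled)          # post -> opinion label
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        predict = train(labeled)     # retrain on everything labeled so far
        batch = sample_batch(pool, lambda p: predict(p)[1],
                             batch_size, explore_frac, rng)
        for post in batch:
            proposed, _ = predict(post)              # automatic label
            labeled[post] = review(post, proposed)   # human confirms or corrects
            pool.remove(post)
    return labeled, pool


# Hypothetical stand-ins for the opinion classifier and the review team.
def toy_train(labeled):
    flagged = {w for post, lab in labeled.items() if lab == "flagged"
               for w in post.split()}

    def predict(post):
        words = post.split()
        hits = sum(w in flagged for w in words)
        return ("flagged" if hits else "other", hits / max(1, len(words)))

    return predict


def toy_review(post, proposed):
    return proposed  # a real reviewer could override the proposed label
```

In practice `train` would fit a real text classifier and `review` would route each sampled post to a human annotator; the toy versions only keep the sketch self-contained.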
In an increasingly polarized world, demagogues who reduce complexity down to simple arguments based on emotion are gaining in popularity. Are opinions and online discussions falling into demagoguery? In this work, we aim to provide computational tools to investigate this question and, by doing so, explore the nature and complexity of online discussions and their space of opinions, uncovering where each participant lies. More specifically, we present a modeling framework to construct latent representations of opinions in online discussions which are consistent with human judgements, as measured by online voting. If two opinions are close in the resulting latent space of opinions, it is because humans think they are similar. Our modeling framework is theoretically grounded and establishes a surprising connection between opinions and voting models and the sign-rank of a matrix. Moreover, it also provides a set of practical algorithms to both estimate the dimension of the latent space of opinions and infer where opinions expressed by the participants of an online discussion lie in this space. Experiments on a large dataset from Yahoo! News, Yahoo! Finance, Yahoo! Sports, and the Newsroom app suggest that unidimensional opinion models may often be unable to accurately represent online discussions, provide insights into human judgements and opinions, and show that our framework is able to circumvent language nuances such as sarcasm or humor by relying on human judgements instead of textual analysis.
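To make the sign-rank idea concrete: the sign-rank of a matrix of +1/-1 votes is the smallest rank of any real matrix whose entries have the same signs. The sketch below checks only the simplest case, sign-rank one, which holds exactly when every row equals the first row or its negation. The vote-matrix layout (rows as participants, columns as comments, no missing votes) and the function name `has_sign_rank_one` are assumptions for illustration; whether rank one coincides with the paper's notion of a unidimensional model is not stated in the abstract.

```python
def has_sign_rank_one(votes):
    """Return True if a +1/-1 vote matrix has sign-rank one.

    A rank-one real matrix with no zero entries is an outer product a * b^T,
    so its sign pattern is sign(a_i) * sign(b_j): every row is either the
    sign vector of b or its negation.
    """
    first = list(votes[0])
    negated = [-v for v in first]
    return all(list(row) == first or list(row) == negated for row in votes)


# Two camps voting in exact opposition: one latent direction suffices.
two_camps = [[1, -1, 1], [-1, 1, -1], [1, -1, 1]]

# Participants agree on the first comment but split on the second:
# no single rank-one sign pattern reproduces these votes.
mixed = [[1, 1], [1, -1]]
```

Here `has_sign_rank_one(two_camps)` is true and `has_sign_rank_one(mixed)` is false; estimating higher sign-ranks on real, incomplete vote data requires the kind of algorithms the abstract refers to.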
In online debates, individual arguments support or attack each other, leading to some subset of arguments being considered more relevant than others. However, in large discussions readers are often forced to sample a subset of the arguments being put forth. Since such sampling is rarely done in a principled manner, users may not read all the relevant arguments to get a full picture of the debate. This paper asks how users should sample online conversations to selectively favour the currently justified or accepted positions in the debate. We apply techniques from argumentation theory and complex networks to build a model that predicts the probabilities of the normatively justified arguments given their location in online discussions. Our model shows that the proportion of replies that are supportive, the number of replies that comments receive, and the locations of un-replied comments all determine the probability that a comment is a justified argument. We show that when the degree distribution of the number of replies is homogeneous along the discussion, for acrimonious discussions, the distribution of justified arguments depends on the parity of the graph level. In supportive discussions, the probability of having justified comments increases as one moves away from the root. For discussion trees that have a non-homogeneous in-degree distribution, we observe the same behaviour as before for supportive discussions, while for acrimonious discussions we cannot observe the same parity-based distribution. This is verified with data obtained from the online debating platform Kialo. By predicting the locations of the justified arguments in reply trees, we can suggest which arguments readers should sample to grasp the currently accepted opinions in such discussions. Our models have important implications for the design of future online debating platforms.
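The parity effect in acrimonious discussions can be illustrated with a minimal sketch. It assumes a simplified setting that the abstract does not spell out: every reply attacks its parent (supportive replies are ignored), and a comment counts as justified when none of its replies is justified, so unanswered comments are always justified. The function name `justified` and the dictionary layout are hypothetical.

```python
def justified(replies, root):
    """Return the set of justified comments in an attack-only reply tree.

    replies maps each comment to the list of replies attacking it.
    A comment is justified when every reply attacking it is unjustified.
    """
    accepted = set()

    def visit(comment):
        # Evaluate every reply first; the list is built fully so that
        # all subtrees are visited even after one justified attacker is found.
        attackers = [visit(reply) for reply in replies.get(comment, [])]
        ok = not any(attackers)
        if ok:
            accepted.add(comment)
        return ok

    visit(root)
    return accepted


# A chain of four comments, each attacking the previous one.
chain = {"root": ["a"], "a": ["b"], "b": ["c"]}

# A root attacked by two replies, one of which is itself attacked.
branching = {"root": ["a", "b"], "a": ["c"]}
```

In the chain, the last comment `c` is unanswered and therefore justified, which defeats `b`, reinstates `a`, and defeats `root`, so justified comments alternate with depth. The recursion is fine for small trees; very deep threads would need an iterative traversal.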
This paper studies the dynamics of opinion formation and polarization in social media. We investigate whether users' stances on contentious subjects are influenced by the online discussions they are exposed to and by their interactions with users supporting different stances. We set up a series of predictive exercises based on machine learning models. Users are described using several posting-activity features capturing their overall activity levels, posting success, the reactions their posts attract from users of different stances, and the types of discussions in which they engage. Given a user's description at present, the goal is to predict their stance in the future. Using a dataset of Brexit discussions on the Reddit platform, we show that the activity features regularly outperform the textual baseline, confirming the link between exposure to discussion and opinion. We find that the most informative features relate to the stance composition of the discussions in which users prefer to engage.
After building a classifier with modern machine learning tools, we typically have a black box at hand that predicts well on unseen data. Thus, we get an answer to the question of what the most likely label of a given unseen data point is. However, most methods provide no answer as to why the model predicted a particular label for a single instance or which features were most influential for that instance. Decision trees are currently the only method able to provide such explanations. This paper proposes a procedure which (based on a set of assumptions) makes it possible to explain the decisions of any classification method.
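One generic way to ask which features were most influential for a single instance is to measure how the predicted class probability changes when each feature is nudged slightly. The sketch below does this with central finite differences on a black-box probability function. It is an illustrative assumption only: the abstract does not describe the proposed procedure, and `local_explanation` and `toy_model` are hypothetical names.

```python
import math


def local_explanation(predict_proba, x, eps=1e-4):
    """Estimate the sensitivity of a black-box probability to each feature.

    predict_proba takes a list of floats and returns one probability.
    The result is a central finite-difference estimate of its gradient at x.
    """
    sensitivities = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        sensitivities.append((predict_proba(up) - predict_proba(down)) / (2 * eps))
    return sensitivities


def toy_model(x):
    # Hypothetical classifier: logistic model with weights 2.0 and -0.5.
    return 1.0 / (1.0 + math.exp(-(2.0 * x[0] - 0.5 * x[1])))


scores = local_explanation(toy_model, [0.0, 0.0])
```

For the toy model at the origin, the first feature pushes the prediction up and the second pulls it down with a quarter of the magnitude, mirroring the ratio of the weights. The sketch assumes continuous features and a probability that varies smoothly around the instance.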
The novel coronavirus pandemic continues to ravage communities across the US. Opinion surveys have identified the importance of political ideology in shaping perceptions of the pandemic and compliance with preventive measures. Here, we use social media data to study the complexity of polarization. We analyze a large dataset of tweets related to the pandemic collected between January and May of 2020, and develop methods to classify the ideological alignment of users along the moderacy (hardline vs moderate), political (liberal vs conservative) and science (anti-science vs pro-science) dimensions. While polarization along the science and political dimensions is correlated, politically moderate users are more likely to be aligned with pro-science views, and politically hardline users with anti-science views. Contrary to expectations, we do not find that polarization grows over time; instead, we see increasing activity by moderate pro-science users. We also show that anti-science conservatives tend to tweet from the Southern US, while anti-science moderates tend to tweet from the Western states. Our findings shed light on the multi-dimensional nature of polarization, and the feasibility of tracking polarized opinions about the pandemic across time and space through social media data.