We present a method for accurately predicting the long-term popularity of online content from early measurements of user access. Using two content-sharing portals, YouTube and Digg, we show that by modeling the accrual of views and votes on content offered by these services we can predict the long-term dynamics of individual submissions from initial data. In the case of Digg, measuring access to given stories during the first two hours allows us to forecast their popularity 30 days ahead with remarkable accuracy, while downloads of YouTube videos need to be followed for 10 days to attain the same performance. The differing time scales of the predictions are shown to be due to differences in how content is consumed on the two portals: Digg stories quickly become outdated, while YouTube videos are still found long after they are initially submitted to the portal. We show that predictions are more accurate for submissions whose attention decays quickly, whereas predictions for evergreen content are prone to larger errors.
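The core idea above, extrapolating long-term popularity from early counts, can be sketched as a log-linear fit: final popularity is roughly proportional to early popularity up to multiplicative noise. This is a minimal illustration on synthetic data, not the paper's exact estimator; all variable names and parameters are our assumptions.

```python
import numpy as np

# A minimal sketch, assuming final counts are a noisy multiple of early counts
# on a log scale (synthetic data; not the authors' measured values).
rng = np.random.default_rng(0)
early = rng.lognormal(mean=3.0, sigma=1.0, size=500)          # e.g. views in first hours
final = early * rng.lognormal(mean=2.0, sigma=0.3, size=500)  # e.g. views after 30 days

# Fit ln(final) = a * ln(early) + b by least squares.
a, b = np.polyfit(np.log(early), np.log(final), deg=1)

def predict_final(early_views):
    """Extrapolate long-term popularity from an early measurement."""
    return np.exp(a * np.log(early_views) + b)

# Median relative error of the log-linear predictor on the sample.
rel_err = np.abs(predict_final(early) - final) / final
print(round(float(np.median(rel_err)), 2))
```

The fitted slope `a` is close to 1 here by construction; on real portals the residual noise, and hence the prediction error, depends on how quickly attention to a submission decays.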
Social media, the modern marketplace of ideas, is vulnerable to manipulation. Deceptive inauthentic actors impersonate humans to amplify misinformation and influence public opinion. Little is known about the large-scale consequences of such operations, due to the ethical challenges posed by online experiments that manipulate human behavior. Here we introduce a model of information spreading where agents prefer quality information but have limited attention. We evaluate the impact of manipulation strategies aimed at degrading the overall quality of the information ecosystem. The model reproduces empirical patterns of amplification of low-quality information. We find that infiltrating a critical fraction of the network is more damaging than generating attention-grabbing content or targeting influential users. We discuss countermeasures suggested by these insights to increase the resilience of social media users to manipulation, and legal issues arising from regulations aimed at protecting human speech from suppression by inauthentic actors.
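The mechanism described, quality-preferring agents with limited attention, degraded by inauthentic infiltration, can be illustrated with a toy simulation. This is a deliberately simplified stand-in for the paper's model; the parameters, feed mechanics, and network choices below are our assumptions.

```python
import random

# Toy limited-attention spreading model (our simplification, not the paper's).
def simulate(n_agents=200, bot_frac=0.0, feed_len=5, steps=5000, seed=1):
    rng = random.Random(seed)
    bots = set(rng.sample(range(n_agents), int(bot_frac * n_agents)))
    feeds = [[] for _ in range(n_agents)]   # each agent's limited-length feed
    qualities = []                          # quality of every post or reshare
    for _ in range(steps):
        agent = rng.randrange(n_agents)
        if agent in bots:
            meme = 0.0                      # inauthentic actors flood zero-quality content
        elif feeds[agent] and rng.random() < 0.5:
            # Humans reshare from their feed, preferring higher quality.
            weights = [q + 1e-9 for q in feeds[agent]]
            meme = rng.choices(feeds[agent], weights=weights)[0]
        else:
            meme = rng.random()             # create a meme of random quality
        qualities.append(meme)
        # Push to a few random followers; limited attention drops old memes.
        for follower in rng.sample(range(n_agents), 5):
            feeds[follower].append(meme)
            if len(feeds[follower]) > feed_len:
                feeds[follower].pop(0)
    return sum(qualities) / len(qualities)

# Infiltration degrades the average quality of circulating information.
print(simulate(bot_frac=0.0) > simulate(bot_frac=0.3))
```

Even this crude version shows the qualitative effect: average quality falls as the infiltrated fraction grows, because bot content both occupies posts directly and crowds limited feeds.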
We use sequential large-scale crawl data to empirically investigate and validate the dynamics that underlie the evolution of the structure of the web. We find that the overall structure of the web is defined by an intricate interplay between experience or entitlement of the pages (as measured by the number of inbound hyperlinks a page already has), inherent talent or fitness of the pages (as measured by the likelihood that someone visiting the page would give a hyperlink to it), and the continual high rates of birth and death of pages on the web. We find that the web is conservative in judging talent and the overall fitness distribution is exponential, showing low variability. The small variance in talent, however, is enough to lead to experience distributions with high variance: the preferential attachment mechanism amplifies these small biases and leads to heavy-tailed power-law inbound degree distributions over all pages, as well as over pages that are of the same age. The balancing act between experience and talent on the web allows newly introduced pages with novel and interesting content to grow quickly and surpass older pages. In this regard, it is much like what we observe in high-mobility and meritocratic societies: people with entitlement continue to have access to the best resources, but there is just enough screening for fitness that allows talented winners to emerge and join the ranks of the leaders. Finally, we show that the fitness estimates have potential practical applications in ranking query results.
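The amplification mechanism described, small exponential fitness differences turned into heavy-tailed degree distributions by preferential attachment, can be sketched with a toy growth model. This is an illustrative simulation under our own assumptions (no page birth/death, fixed links per page), not the paper's estimation procedure.

```python
import random

# Toy fitness-plus-preferential-attachment growth model (a sketch; the paper's
# actual birth/death dynamics and fitness estimation are not reproduced here).
def grow(n_pages=2000, links_per_page=3, seed=7):
    rng = random.Random(seed)
    # Exponential (low-variability) fitness, as the abstract reports for the web.
    fitness = [rng.expovariate(1.0) for _ in range(n_pages)]
    degree = [0] * n_pages
    for new in range(1, n_pages):
        # New page links to existing pages with prob ~ fitness * (degree + 1):
        # rich-get-richer amplifies the small fitness differences.
        weights = [fitness[i] * (degree[i] + 1) for i in range(new)]
        for tgt in rng.choices(range(new), weights=weights, k=links_per_page):
            degree[tgt] += 1
    return fitness, degree

fitness, degree = grow()
# Heavy tail: the maximum inbound degree dwarfs the mean.
print(max(degree) > 10 * (sum(degree) / len(degree)))
```

Despite the narrow exponential fitness distribution, the resulting inbound-degree distribution is highly skewed, and high-fitness latecomers can still overtake older pages, which is the qualitative balance the abstract describes.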
In online collaborative learning environments, students create content and construct their own knowledge through complex interactions over time. To facilitate effective social learning and inclusive participation in this context, insights are needed into the correspondence between student-contributed artifacts and their subsequent popularity among peers. In this study, we represent student artifacts by their (a) contextual action logs, (b) textual content, and (c) set of instructor-specified features, and use these representations to predict artifact popularity measures. Through a mixture of predictive analysis and visual exploration, we find that the neural embedding representation, learned from contextual action logs, yields the strongest predictions of popularity, ahead of instructor knowledge, which includes academic value and creativity ratings. Because this representation can be learnt without extensive human labeling effort, it opens up possibilities for shaping more inclusive student interactions on the fly in collaboration with instructors and students alike.
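The pipeline described, embedding artifacts from their action logs and regressing popularity on the embedding, can be sketched with a count-based stand-in. Here a truncated SVD of an artifact-by-action count matrix replaces the neural embedding, and the data are synthetic; every name and dimension below is our assumption.

```python
import numpy as np

# Simplified stand-in for the embed-then-predict pipeline (assumption: SVD of
# action counts replaces the learned neural embedding; data are synthetic).
rng = np.random.default_rng(3)
n_artifacts, n_actions, dim = 100, 30, 8

# Synthetic contextual action logs: how often each action type touched each artifact.
logs = rng.poisson(lam=2.0, size=(n_artifacts, n_actions)).astype(float)

# Embed artifacts: truncated SVD of the log-scaled artifact-action matrix.
U, S, _ = np.linalg.svd(np.log1p(logs), full_matrices=False)
embedding = U[:, :dim] * S[:dim]                 # one dim-d vector per artifact

# Predict a popularity measure (here a noisy function of activity) by least squares.
popularity = logs.sum(axis=1) + rng.normal(0, 2, n_artifacts)
X = np.hstack([embedding, np.ones((n_artifacts, 1))])
coef, *_ = np.linalg.lstsq(X, popularity, rcond=None)
pred = X @ coef
print(embedding.shape, round(float(np.corrcoef(pred, popularity)[0, 1]), 2))
```

Like the representation in the abstract, this embedding requires no human labeling; only the regression target (the popularity measure) is observed.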
Web sites where users create and rate content, as well as form networks with other users, display long-tailed distributions in many aspects of behavior. Using behavioral data from one such community site, Essembly, we propose and evaluate plausible mechanisms to explain these distributions. Unlike purely descriptive models, these mechanisms rely on user behaviors based on information available locally to each user. For Essembly, we find the long tails arise from large differences among user activity rates and the qualities of the rated content, as well as the extensive variability in the time users devote to the site. We show that the models not only explain overall behavior but also allow estimating the quality of content from early user reactions to it.
Online petitions are a cost-effective way for citizens to engage collectively with policy-makers in a democracy. Predicting the popularity of a petition, commonly measured by its signature count, from its textual content has utility for policy-makers as well as for those posting the petition. In this work, we model this task using CNN regression with an auxiliary ordinal regression objective. We demonstrate the effectiveness of the proposed approach on UK and US government petition datasets.
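The joint objective, a main regression loss plus an auxiliary ordinal term, can be illustrated independently of the CNN encoder. Below, the ordinal part uses the standard cumulative-binary encoding (one sigmoid per popularity threshold); the loss weighting and thresholds are our assumptions, not necessarily the paper's.

```python
import numpy as np

# Sketch of a regression loss with an auxiliary ordinal objective (assumption:
# the CNN encoder is omitted; exact weighting in the paper may differ).
def ordinal_targets(y, thresholds):
    """Encode a scalar target as cumulative binary labels: [y > t for each t]."""
    return (y[:, None] > thresholds[None, :]).astype(float)

def combined_loss(pred_value, pred_logits, y, thresholds, aux_weight=0.5):
    mse = np.mean((pred_value - y) ** 2)            # main regression loss
    probs = 1.0 / (1.0 + np.exp(-pred_logits))      # one sigmoid per threshold
    targets = ordinal_targets(y, thresholds)
    bce = -np.mean(targets * np.log(probs + 1e-9)
                   + (1 - targets) * np.log(1 - probs + 1e-9))
    return mse + aux_weight * bce                   # joint objective

# Example: log signature counts, with ordinal bins at 2 and 4 (log scale).
y = np.array([1.0, 3.0, 5.0])
thresholds = np.array([2.0, 4.0])
perfect_value = y.copy()
perfect_logits = np.where(ordinal_targets(y, thresholds) > 0, 10.0, -10.0)
print(round(float(combined_loss(perfect_value, perfect_logits, y, thresholds)), 4))  # prints 0.0
```

The auxiliary term rewards getting the coarse popularity bin right even when the exact count is off, which is the usual motivation for pairing ordinal supervision with a regression head.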