In a perfect world, every article would contain sufficient metadata to describe the resource. We know this is not the reality, so we are motivated to investigate how the metadata that authors and publishers do supply has evolved. Because applying metadata takes time and effort, each news article author effectively has a limited metadata budget. How are they spending this budget? What are the top metadata categories in use? How did they grow over time? What purpose do they serve? We also recognize that not all metadata fields are used equally. How have individual fields grown over time? Which fields experienced the fastest adoption? In this paper, we review 227,726 HTML news articles from 29 outlets captured by the Internet Archive between 1998 and 2016. Upon reviewing the metadata fields in each article, we discovered that a metadata renaissance began in 2010 as publishers embraced metadata for improved search engine ranking, search engine tracking, social media tracking, and social media sharing. When analyzing individual fields, we find that one application of metadata stands out above all others: social cards -- the previews generated by platforms like Twitter when one shares a URL. Once a metadata standard for cards was established in 2010, its fields were adopted by 20% of articles in the first year and reached more than 95% adoption by 2016. This rate of adoption surpasses efforts like Schema.org and Dublin Core by a fair margin. Confronted with these results on how news publishers spend their metadata budget, we must conclude that it is all about the cards.
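The card-related metadata discussed above takes the form of HTML `<meta>` elements in a page's head, using the Open Graph (`og:*`) and Twitter Card (`twitter:*`) namespaces. As an illustrative sketch (the article values below are invented, not taken from the studied corpus), Python's standard-library `html.parser` can extract such fields from a page:

```python
from html.parser import HTMLParser

# Hypothetical article head containing Open Graph and Twitter Card fields;
# the values are invented for illustration only.
HTML = """
<html><head>
<meta property="og:title" content="Example Article">
<meta property="og:image" content="https://example.com/lead.jpg">
<meta name="twitter:card" content="summary_large_image">
<meta name="description" content="A short summary.">
</head><body></body></html>
"""

class MetaCollector(HTMLParser):
    """Collects name/property -> content pairs from <meta> tags."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        key = a.get("property") or a.get("name")
        if key and "content" in a:
            self.fields[key] = a["content"]

parser = MetaCollector()
parser.feed(HTML)

# Card-related fields are those in the og:* or twitter:* namespaces.
card_fields = {k: v for k, v in parser.fields.items()
               if k.startswith(("og:", "twitter:"))}
print(sorted(card_fields))  # → ['og:image', 'og:title', 'twitter:card']
```

Counting how many such fields each archived article declares, year by year, is one way the adoption trend described above could be measured.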
Nowadays, researchers have moved to platforms like Twitter to spread information about their ideas and empirical evidence. Recent studies have shown that social media activity affects the scientific impact of a paper. However, these studies use only tweet counts to represent Twitter activity. In this paper, we propose TweetPap, a large-scale dataset that introduces temporal information about citations and tweets, along with tweet metadata, to quantify and understand the discourse around scientific papers on social media. The dataset is publicly available at https://github.com/lingo-iitgn/TweetPap.
To allow previewing a web page, social media platforms have developed social cards: visualizations consisting of vital information about the underlying resource. At a minimum, social cards often include features such as the web resource's title, text summary, striking image, and domain name. News and scholarly articles on the web are frequently subject to social card creation when shared on social media. However, we noticed that not all web resources supply sufficient metadata elements to enable appealing social cards. For example, the COVID-19 emergency has made it clear that scholarly articles, in particular, are at an aesthetic disadvantage on social media platforms when compared to their often more flashy disinformation rivals. Also, social cards are often not generated correctly for archived web resources, including pages that lack or predate standards for specifying striking images. With these observations, we are motivated to quantify the levels of inclusion of required metadata in web resources and their evolution over time for archived resources, and to create and evaluate an algorithm that automatically selects a striking image for social cards. We find that more than 40% of archived news articles sampled from the NEWSROOM dataset and 22% of scholarly articles sampled from the PubMed Central dataset fail to supply striking images. We demonstrate that we can automatically predict the striking image with a Precision@1 of 0.83 for news articles from NEWSROOM and 0.78 for scholarly articles from the open access journal PLOS ONE.
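Determining whether a page supplies a striking image comes down to checking for the metadata fields that card generators read. The following is a minimal sketch of such a check (not the paper's actual method); the field names reflect common Open Graph and Twitter Card conventions, and the sample snippets are invented:

```python
import re

# Common metadata fields that name a striking image for social cards.
IMAGE_FIELDS = ("og:image", "twitter:image", "twitter:image:src")

def has_striking_image(html: str) -> bool:
    """True if any <meta> tag declares a known striking-image field."""
    alternatives = "|".join(re.escape(f) for f in IMAGE_FIELDS)
    pattern = r'<meta\s[^>]*(?:property|name)\s*=\s*["\'](?:%s)["\']' % alternatives
    return re.search(pattern, html, flags=re.IGNORECASE) is not None

# Invented examples for illustration.
with_image = '<head><meta property="og:image" content="https://example.com/a.jpg"></head>'
without_image = '<head><meta name="description" content="No image here."></head>'

print(has_striking_image(with_image))     # → True
print(has_striking_image(without_image))  # → False
```

Applied over an archived corpus, a check like this yields the share of articles that fail to supply a striking image, which is the statistic reported above.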
Traditionally, scholarly impact and visibility have been measured by counting publications and citations in the scholarly literature. However, scholars are increasingly visible on the Web as well, establishing presences in a growing variety of social ecosystems. But how wide and established is this presence, and how do measures of social Web impact relate to their more traditional counterparts? To answer this, we sampled 57 presenters from the 2010 Leiden STI Conference, gathering publication and citation counts as well as data from the presenters' Web footprints. We found Web presence widespread and diverse: 84% of scholars had homepages, 70% were on LinkedIn, 23% had public Google Scholar profiles, and 16% were on Twitter. For the sampled scholars' publications, social reference manager bookmarks were compared to Scopus and Web of Science citations; we found that Mendeley covers more than 80% of sampled articles and that Mendeley bookmarks are significantly correlated (r = .45) with Scopus citation counts.
Large-scale image retrieval benchmarks invariably consist of images from the Web. Many of these benchmarks are derived from online photo sharing networks, like Flickr, which in addition to hosting images also provide a highly interactive social community. Such communities generate rich metadata that can naturally be harnessed for image classification and retrieval. Here we study four popular benchmark datasets, extending them with social-network metadata, such as the groups to which each image belongs, the comment thread associated with the image, who uploaded it, their location, and their network of friends. Since these types of data are inherently relational, we propose a model that explicitly accounts for the interdependencies between images sharing common properties. We model the task as a binary labeling problem on a network, and use structured learning techniques to learn model parameters. We find that social-network metadata are useful in a variety of classification tasks, in many cases outperforming methods based on image content.
Social media provides many opportunities to monitor and evaluate political phenomena such as referendums and elections. In this study, we propose a set of approaches to analyze long-running political events on social media with a real-world experiment: the debate about Brexit, i.e., the process through which the United Kingdom activated the option of leaving the European Union. We address the following research questions: Could Twitter-based stance classification be used to demonstrate public stance with respect to political events? What is the most efficient and comprehensive approach to measuring the impact of politicians on social media? Which of the polarized sides of the debate is more responsive to politician messages and the main issues of the Brexit process? What is the share of bot accounts in the Brexit discussion and which side are they for? By combining the user stance classification, topic discovery, sentiment analysis, and bot detection, we show that it is possible to obtain useful insights about political phenomena from social media data. We are able to detect relevant topics in the discussions, such as the demand for a new referendum, and to understand the position of social media users with respect to the different topics in the debate. Our comparative and temporal analysis of political accounts can detect the critical periods of the Brexit process and the impact they have on the debate.