
On the Use of ArXiv as a Dataset

Published by Colin B Clement
Publication date: 2019
Research field: Informatics Engineering
Research language: English





The arXiv has collected 1.5 million pre-print articles over 28 years, hosting literature from scientific fields including Physics, Mathematics, and Computer Science. Each pre-print features text, figures, authors, citations, categories, and other metadata. These rich, multi-modal features, combined with the natural graph structure created by citation, affiliation, and co-authorship, make the arXiv an exciting candidate for benchmarking next-generation models. Here we take the first necessary steps toward this goal by providing a pipeline that standardizes and simplifies access to the arXiv's publicly available data. We use this pipeline to extract and analyze a citation graph with 6.7 million edges, together with an 11-billion-word corpus of full-text research articles. We present some baseline classification results and motivate the application of more exciting generative graph models.
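As a rough illustration of what a citation graph looks like as data, the sketch below builds directed edges from a mapping of article identifiers to the identifiers they cite, then counts how often each article is cited. It is not the authors' pipeline: the function names, the input layout, and the toy identifiers are all assumptions made for this example.

```python
from collections import Counter


def build_citation_edges(references):
    """Return sorted directed edges (citing, cited) among known articles.

    `references` is a hypothetical mapping: article id -> list of cited ids.
    """
    known = set(references)
    edges = set()
    for citing, cited_ids in references.items():
        for cited in cited_ids:
            # Keep only citations inside the corpus and drop self-citations.
            if cited in known and cited != citing:
                edges.add((citing, cited))
    return sorted(edges)


def in_degree(edges):
    """Count how many distinct articles cite each article."""
    return Counter(cited for _, cited in edges)


# Made-up identifiers, for illustration only.
toy_references = {
    "1905.00001": ["1801.00002", "1703.00003", "9999.99999"],
    "1801.00002": ["1703.00003"],
    "1703.00003": [],
}

edges = build_citation_edges(toy_references)
citations = in_degree(edges)
print(len(edges), citations.most_common(1))
```

The citation to the unknown identifier is dropped here because the sketch restricts the graph to articles in the corpus; a real pipeline might instead keep such edges or resolve them another way.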




Read also

Model interpretability has become important for engendering appropriate user trust by providing insight into model predictions. However, most existing machine learning methods provide no interpretability for depression prediction, so their predictions are obscure to humans. In this work, we propose an interpretable Multi-Modal Depression Detection with Hierarchical Attention Network (MDHAN) for detecting depressed users on social media and explaining the model's predictions. We consider user posts along with Twitter-based multi-modal features. Specifically, we encode user posts using two levels of attention mechanisms, applied at the tweet level and the word level, calculate the importance of each tweet and word, and capture semantic sequence features from the user timelines (posts). Our experiments show that MDHAN outperforms several popular and robust baseline methods, demonstrating the effectiveness of combining deep learning with multi-modal features. We also show that our model helps improve predictive performance when detecting depression in users who post messages publicly on social media. MDHAN achieves excellent performance and provides adequate evidence to explain its predictions.
We present a framework to generate and evaluate thematic recommendations based on multilayer network representations of knowledge graphs (KGs). In this representation, each layer encodes a different type of relationship in the KG, and directed interlayer couplings connect the same entity in different roles. The relative importance of different types of connections is captured by an intuitive salience matrix that can be estimated from data, tuned to incorporate domain knowledge, address different use cases, or respect business logic. We apply an adaptation of the personalised PageRank algorithm to multilayer models of KGs to generate item-item recommendations. These recommendations reflect the knowledge we hold about the content and are suitable for thematic and/or cold-start recommendation settings. Evaluating thematic recommendations from user data presents unique challenges that we address by developing a method to evaluate recommendations relying on user-item ratings, yet respecting their thematic nature. We also show that the salience matrix can be estimated from user data. We demonstrate the utility of our methods by significantly improving consumption metrics in an AB test where collaborative filtering delivered subpar performance. We also apply our approach to movie recommendation using publicly-available data to ensure the reproducibility of our results. We demonstrate that our approach outperforms existing thematic recommendation methods and is even competitive with collaborative filtering approaches.
The damaging effects of hate speech on social media have been evident over the last few years, and several organizations, researchers, and social media platforms have tried to harness them in various ways. Despite these efforts, social media users are still affected by hate speech. The problem is even more apparent for social groups that promote public discourse, such as journalists. In this work, we focus on countering hate speech that is targeted at journalistic social media accounts. To accomplish this, a group of journalists assembled a definition of hate speech, taking into account the journalistic point of view and the types of hate speech that are usually targeted against journalists. We then compile a large pool of tweets referring to journalism-related accounts in multiple languages. In order to annotate the pool of unlabeled tweets according to the definition, we follow a concise annotation strategy that involves active learning annotation stages. The outcome of this paper is a novel, publicly available collection of Twitter datasets in five different languages. Additionally, we experiment with state-of-the-art deep learning architectures for hate speech detection and use our annotated datasets to train and evaluate them. Finally, we propose an ensemble detection model that outperforms all individual models.
Information Extraction (IE) from the tables present in scientific articles is challenging due to complicated tabular representations and complex embedded text. This paper presents TabLeX, a large-scale benchmark dataset comprising table images generated from scientific articles. TabLeX consists of two subsets, one for table structure extraction and the other for table content extraction. Each table image is accompanied by its corresponding LaTeX source code. To facilitate the development of robust table IE tools, TabLeX contains images in different aspect ratios and in a variety of fonts. Our analysis sheds light on the shortcomings of current state-of-the-art table extraction models and shows that they fail on even simple table images. Towards the end, we experiment with a transformer-based existing baseline to report performance scores. In contrast to the static benchmarks, we plan to augment this dataset with more complex and diverse tables at regular intervals.
In this article, we introduce the ContentWise Impressions dataset, a collection of implicit interactions and impressions of movies and TV series from an Over-The-Top media service, which delivers its media contents over the Internet. The dataset is distinguished from other already available multimedia recommendation datasets by the availability of impressions, i.e., the recommendations shown to the user, its size, and by being open-source. We describe the data collection process, the preprocessing applied, its characteristics, and statistics when compared to other commonly used datasets. We also highlight several possible use cases and research questions that can benefit from the availability of user impressions in an open-source dataset. Furthermore, we release software tools to load and split the data, as well as examples of how to use both user interactions and impressions in several common recommendation algorithms.
