ﻻ يوجد ملخص باللغة العربية
Production of news content is growing at an astonishing rate. To help manage and monitor the sheer amount of text, there is an increasing need to develop efficient methods that can provide insights into emerging content areas, and stratify unstructured corpora of text into `topics that stem intrinsically from content similarity. Here we present an unsupervised framework that brings together powerful vector embeddings from natural language processing with tools from multiscale graph partitioning that can reveal natural partitions at different resolutions without making a priori assumptions about the number of clusters in the corpus. We show the advantages of graph-based clustering through end-to-end comparisons with other popular clustering and topic modelling methods, and also evaluate different text vector embeddings, from classic Bag-of-Words to Doc2Vec to the recent transformers based model Bert. This comparative work is showcased through an analysis of a corpus of US news coverage during the presidential election year of 2016.
Political polarization in the US is on the rise. This polarization negatively affects the public sphere by contributing to the creation of ideological echo chambers. In this paper, we focus on addressing one of the factors that contributes to this po
In this work, we introduce a corpus for satire detection in Romanian news. We gathered 55,608 public news articles from multiple real and satirical news sources, composing one of the largest corpora for satire detection regardless of language and the
Online users today are exposed to misleading and propagandistic news articles and media posts on a daily basis. To counter thus, a number of approaches have been designed aiming to achieve a healthier and safer online news and media consumption. Auto
Machine translation is highly sensitive to the size and quality of the training data, which has led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we propose a new method for this task based on multilingu
Large-scale and high-quality corpora are necessary for evaluating machine reading comprehension models on a low-resource language like Vietnamese. Besides, machine reading comprehension (MRC) for the health domain offers great potential for practical