
FreSaDa: A French Satire Data Set for Cross-Domain Satire Detection

Submitted by Radu Tudor Ionescu
Publication date: 2021
Research field: Informatics Engineering
Paper language: English





In this paper, we introduce FreSaDa, a French Satire Data Set, which is composed of 11,570 articles from the news domain. In order to avoid reporting unreasonably high accuracy rates due to the learning of characteristics specific to publication sources, we divided our samples into training, validation and test, such that the training publication sources are distinct from the validation and test publication sources. This gives rise to a cross-domain (cross-source) satire detection task. We employ two classification methods as baselines for our new data set, one based on low-level features (character n-grams) and one based on high-level features (average of CamemBERT word embeddings). As an additional contribution, we present an unsupervised domain adaptation method based on regarding the pairwise similarities (given by the dot product) between the training samples and the validation samples as features. By including these domain-specific features, we attain significant improvements for both character n-grams and CamemBERT embeddings.
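To make the domain adaptation idea concrete, here is a minimal sketch that treats the dot products between each sample and the unlabeled validation samples as extra features. It assumes a TF-IDF character n-gram representation and a linear SVM purely as stand-ins (the paper's exact feature extraction and classifier may differ), and all variable names and placeholder texts are hypothetical.

```python
# Minimal sketch (not the authors' code): dot-product similarities to
# unlabeled target-source samples appended as extra features, in the
# spirit of the unsupervised domain adaptation described above.
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Placeholder data: in practice these would be FreSaDa articles.
train_texts = ["un article satirique ...", "un article factuel ..."]
train_labels = [1, 0]                          # 1 = satire, 0 = regular news
val_texts = ["un article de validation ..."]   # unlabeled, from other sources

# Low-level representation: character n-grams (3- to 5-grams).
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X_train = vectorizer.fit_transform(train_texts)
X_val = vectorizer.transform(val_texts)

# Domain-specific features: similarity (dot product) of each sample
# to every validation sample.
S_train = X_train @ X_val.T                    # shape: (n_train, n_val)
S_val = X_val @ X_val.T                        # shape: (n_val, n_val)

# Concatenate the original features with the similarity features.
X_train_aug = hstack([X_train, S_train])
X_val_aug = hstack([X_val, S_val])

clf = LinearSVC().fit(X_train_aug, train_labels)
print(clf.predict(X_val_aug))
```

The same augmentation carries over to the high-level baseline: replace the character n-gram matrix with a matrix of averaged CamemBERT word embeddings and compute the same pairwise dot products.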




Read also

In this work, we introduce a corpus for satire detection in Romanian news. We gathered 55,608 public news articles from multiple real and satirical news sources, composing one of the largest corpora for satire detection regardless of language and the only one for the Romanian language. We provide an official split of the text samples, such that training news articles belong to different sources than test news articles, thus ensuring that models do not achieve high performance simply due to overfitting. We conduct experiments with two state-of-the-art deep neural models, resulting in a set of strong baselines for our novel corpus. Our results show that the machine-level accuracy for satire detection in Romanian is quite low (under 73% on the test set) compared to the human-level accuracy (87%), leaving enough room for improvement in future research.
Simon Gabay (2020)
With the development of big corpora of various periods, it becomes crucial to standardise linguistic annotation (e.g. lemmas, POS tags, morphological annotation) to increase the interoperability of the data produced, despite diachronic variations. In the present paper, we describe both methodologically (by proposing annotation principles) and technically (by creating the required training data and the relevant models) the production of a linguistic tagger for (early) modern French (16th-18th c.), taking into account, as far as possible, the already existing standards for contemporary and, especially, medieval French.
Context. Solar spectral irradiance (SSI) variability is one of the key inputs to models of the Earth's climate. Understanding solar irradiance fluctuations also helps to place the Sun among other stars in terms of their brightness variability patterns and to set detectability limits for terrestrial exoplanets. Aims. One of the most successful and widely used models of solar irradiance variability is SATIRE-S. It uses spectra of the magnetic features and surrounding quiet Sun computed with the ATLAS9 spectral synthesis code under the assumption of Local Thermodynamic Equilibrium (LTE). SATIRE-S has been at the forefront of solar variability modelling, but due to the limitations of the LTE approximation its output SSI has to be empirically corrected below 300 nm, which reduces the physical consistency of its results. This shortcoming is addressed in the present paper. Methods. We replace the ATLAS9 spectra of all atmospheric components in SATIRE-S with the spectra calculated using the non-LTE Spectral Synthesis Code (NESSY). We also use Fontenla et al. (1999) temperature and density stratification models of the solar atmosphere to compute the spectrum of the quiet Sun and faculae. Results. We compute non-LTE contrasts of spots and faculae and combine them with the SDO/HMI filling factors of the active regions to calculate the total and spectral solar irradiance variability during solar cycle 24. Conclusions. The non-LTE contrasts result in total and spectral solar irradiance in good agreement with the empirically corrected output of the LTE version. This suggests that the empirical correction introduced into the SATIRE-S output is well judged and that the corrected total and spectral solar irradiance obtained from the SATIRE-S model in LTE is fully consistent with the results of non-LTE computations.
Retrieval question answering (ReQA) is the task of retrieving a sentence-level answer to a question from an open corpus (Ahmad et al., 2019). This paper presents MultiReQA, a new multi-domain ReQA evaluation suite composed of eight retrieval QA tasks drawn from publicly available QA datasets. We provide the first systematic retrieval-based evaluation over these datasets using two supervised neural models, based on fine-tuning BERT and USE-QA models respectively, as well as a surprisingly strong information retrieval baseline, BM25. Five of these tasks contain both training and test data, while three contain test data only. Performance on the five tasks with training data shows that, while a general model covering all domains is achievable, the best performance is often obtained by training exclusively on in-domain data.
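For readers unfamiliar with the BM25 baseline mentioned above, the sketch below shows sentence-level answer retrieval with the third-party rank_bm25 package. It illustrates the general idea only, not the MultiReQA evaluation code; the candidate sentences and query are invented.

```python
# Illustration only: BM25 ranking of candidate answer sentences for a
# question, loosely mirroring the ReQA setup described above.
from rank_bm25 import BM25Okapi  # pip install rank-bm25

candidate_answers = [
    "Paris is the capital of France.",
    "BM25 is a classical ranking function used in information retrieval.",
    "CamemBERT is a French language model based on RoBERTa.",
]
bm25 = BM25Okapi([s.lower().split() for s in candidate_answers])

query = "what is bm25".split()
scores = bm25.get_scores(query)          # one relevance score per sentence
best = max(range(len(scores)), key=scores.__getitem__)
print(candidate_answers[best])           # expected: the BM25 sentence
```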
With the rapid evolution of social media, fake news has become a significant social problem, which cannot be addressed in a timely manner using manual investigation. This has motivated numerous studies on automating fake news detection. Most studies explore supervised training models with different modalities (e.g., text, images, and propagation networks) of news records to identify fake news. However, the performance of such techniques generally drops if news records are coming from different domains (e.g., politics, entertainment), especially for domains that are unseen or rarely-seen during training. As motivation, we empirically show that news records from different domains have significantly different word usage and propagation patterns. Furthermore, due to the sheer volume of unlabelled news records, it is challenging to select news records for manual labelling so that the domain-coverage of the labelled dataset is maximized. Hence, this work: (1) proposes a novel framework that jointly preserves domain-specific and cross-domain knowledge in news records to detect fake news from different domains; and (2) introduces an unsupervised technique to select a set of unlabelled informative news records for manual labelling, which can be ultimately used to train a fake news detection model that performs well for many domains while minimizing the labelling cost. Our experiments show that the integration of the proposed fake news model and the selective annotation approach achieves state-of-the-art performance for cross-domain news datasets, while yielding notable improvements for rarely-appearing domains in news datasets.
