TREMoLo-Tweets: A Multi-Label Corpus of French Tweets for Language Register Characterization

Added by Association for Computational Linguistics (article)

Publication date 2021

Field: Artificial Intelligence

Research language: English

Created by Shamra Editor

Abstract

The casual, neutral, and formal language registers are highly perceptible in discourse productions. However, they are still poorly studied in Natural Language Processing (NLP), especially outside English and for new textual types such as tweets. To stimulate research, this paper introduces a large corpus of 228,505 French tweets (6M words) annotated with language registers. Labels are provided by a multi-label CamemBERT classifier trained and checked on a manually annotated subset of the corpus, while the tweets are selected to avoid undesired biases. Based on the corpus, an initial analysis of linguistic traits, drawn from either human annotators or automatic extraction, is provided to describe the corpus and pave the way for various NLP tasks. The corpus, annotation guide and classifier are available at http://tremolo.irisa.fr.
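As a rough illustration of what "multi-label" means for the three registers named in the abstract, the sketch below scores each register independently and keeps every label whose probability reaches a threshold, so a tweet can carry more than one register. The abstract does not describe how the classifier turns its outputs into labels; the sigmoid-plus-threshold rule, the 0.5 threshold, the function names and the example scores are all assumptions made for illustration.

```python
import math

# Register names taken from the abstract; their order here is an assumption.
REGISTERS = ("casual", "neutral", "formal")


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_registers(logits, threshold=0.5):
    """Return every register whose independent probability reaches the threshold.

    Hypothetical decision rule: each label is decided on its own, unlike a
    softmax classifier that would pick exactly one label.
    """
    if len(logits) != len(REGISTERS):
        raise ValueError("expected one logit per register")
    return [
        name
        for name, logit in zip(REGISTERS, logits)
        if sigmoid(logit) >= threshold
    ]


# Made-up scores for a single tweet, not model output.
example_logits = [2.1, 0.3, -1.7]
print(predict_registers(example_logits))  # ['casual', 'neutral']
```

With this kind of rule a tweet may receive several labels or none at all, which is the practical difference from single-label classification.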

References used

https://aclanthology.org/

BERTweetFR : Domain Adaptation of Pre-Trained Language Models for French Tweets

Association for Computational Linguistics, 2021 (article)

We introduce BERTweetFR, the first large-scale pre-trained language model for French tweets. Our model is initialised from CamemBERT, a general-domain French language model that follows the base architecture of BERT. Experiments show that BERTweetFR outperforms all previous general-domain French language models on two downstream Twitter NLP tasks: offensiveness identification and named entity recognition. The dataset used in the offensiveness identification task was newly created and annotated by our team, filling the gap in such analytic datasets for French. We make our model publicly available in the transformers library with the aim of promoting future research in analytic tasks for French tweets.

Sifting French Tweets to Investigate the Impact of Covid-19 in Triggering Intense Anxiety

Association for Computational Linguistics, 2021 (article)

Social media can be leveraged to understand public sentiment and feelings in real time, and to target public health messages based on user interests and emotions. In this paper, we investigate the impact of the COVID-19 pandemic in triggering intense anxiety, relying on messages exchanged on Twitter. More specifically, we provide: i) a quantitative and qualitative analysis of a corpus of tweets in French related to coronavirus, and ii) a pipeline approach (a filtering mechanism followed by neural network methods) to satisfactorily classify messages expressing intense anxiety on social media, considering the role played by emotions.

tWT--WT: A Dataset to Assert the Role of Target Entities for Detecting Stance of Tweets

Association for Computational Linguistics, 2021 (article)

The stance detection task aims at detecting the stance of a tweet or a text toward a target. These targets can be named entities or free-form sentences (claims). Though the task involves reasoning about the tweet with respect to a target, we find that it is possible to achieve high accuracy on several publicly available Twitter stance detection datasets without looking at the target sentence. Specifically, a simple tweet classification model achieved human-level performance on the WT--WT dataset and more than two-thirds accuracy on various other datasets. We investigate the existence of biases in such datasets to find potential spurious correlations between sentiment and stance, and between lexical choice and stance category. Furthermore, we propose a new large dataset free of such biases and demonstrate its aptness on existing stance detection systems. Our empirical findings show much scope for research on the stance detection task and suggest several considerations for creating future stance detection datasets.

FrenLyS: A Tool for the Automatic Simplification of French General Language Texts

Association for Computational Linguistics, 2021 (article)

Lexical simplification (LS) aims at replacing words considered complex in a sentence with simpler equivalents. In this paper, we present the first automatic LS service for French, FrenLyS, which offers different techniques to generate, select and rank substitutes. The paper describes the different methods offered by our tool, which include both classical approaches (e.g. generation of candidates from lexical resources, frequency filter, etc.) and more innovative approaches such as the exploitation of CamemBERT, a model for French based on the RoBERTa architecture. To evaluate the different methods, a new evaluation dataset for French is introduced.

Incorporating Domain Knowledge into Language Transformers for Multi-Label Classification of Chinese Medical Questions

Association for Computational Linguistics, 2021 (article)

In this paper, we propose a knowledge infusion mechanism to incorporate domain knowledge into language transformers. Weakly supervised data is regarded as the main source for knowledge acquisition. We pre-train the language models to capture masked knowledge of focuses and aspects and then fine-tune them to obtain better performance on the downstream tasks. Due to the lack of publicly available datasets for multi-label classification of Chinese medical questions, we crawled questions from medical question/answer forums and manually annotated them using eight predefined classes: persons and organizations, symptom, cause, examination, disease, information, ingredient, and treatment. In total, we collected 1,814 questions with 2,340 labels, an average of 1.29 labels per question. We used Baidu Medical Encyclopedia as the knowledge resource. Two transformers, BERT and RoBERTa, were implemented to compare performance on our constructed datasets. Experimental results showed that our proposed model with the knowledge infusion mechanism achieves better performance regardless of which evaluation metric (Macro F1, Micro F1, Weighted F1 or Subset Accuracy) is considered.
