The casual, neutral, and formal language registers are highly perceptible in discourse production. However, they remain poorly studied in Natural Language Processing (NLP), especially outside English and for new text types such as tweets. To stimulate research, this paper introduces a large corpus of 228,505 French tweets (6M words) annotated with language registers. Labels are provided by a multi-label CamemBERT classifier trained and checked on a manually annotated subset of the corpus, while the tweets are selected to avoid undesired biases. Based on the corpus, an initial analysis of linguistic traits, drawn from either human annotators or automatic extraction, is provided to describe the corpus and pave the way for various NLP tasks. The corpus, annotation guide, and classifier are available at http://tremolo.irisa.fr.
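To make the "multi-label" aspect concrete, the following is a minimal sketch of how per-register scores could be turned into labels. It is not the authors' implementation: the function name `decode_registers`, the example logits, and the 0.5 threshold are all assumptions made for illustration, and the scores would in practice come from a fine-tuned classifier such as the one the abstract describes. The point it shows is that each register gets an independent sigmoid score, so a tweet may receive several registers or none.

```python
import math

# The three registers named in the abstract.
REGISTERS = ("casual", "neutral", "formal")


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_registers(logits, threshold=0.5):
    """Turn one logit per register into a list of predicted labels.

    Hypothetical helper: in a multi-label setting each register is scored
    independently, so zero, one, or several labels can pass the threshold.
    The default threshold of 0.5 is an assumption, not a value from the paper.
    """
    if len(logits) != len(REGISTERS):
        raise ValueError("expected one logit per register")
    probs = {name: sigmoid(z) for name, z in zip(REGISTERS, logits)}
    labels = [name for name, p in probs.items() if p >= threshold]
    return labels, probs


# Made-up logits for a single tweet, purely for illustration.
labels, probs = decode_registers([2.0, 0.3, -3.0])
print(labels)
```

With these illustrative scores, both the casual and neutral registers exceed the threshold, which a single-label (softmax) decision could not express.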
References used
https://aclanthology.org/