DALC: the Dutch Abusive Language Corpus

Added by Association for Computational Linguistics (article)

Publication date 2021

Field: Artificial Intelligence

Research language: English

Created by Shamra Editor

abusive language corpus, Dutch abusive language, language corpus

As socially unacceptable language becomes pervasive on social media platforms, the need for automatic content moderation becomes more pressing. This contribution introduces the Dutch Abusive Language Corpus (DALC v1.0), a new dataset of tweets manually annotated for abusive language. The resource addresses a gap in language resources for Dutch and adopts a multi-layer annotation scheme modeling the explicitness and the target of the abusive messages. Baseline experiments on all annotation layers have been conducted, achieving a macro F1 score of 0.748 for binary classification of the explicitness layer and 0.489 for target classification.
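The abstract reports macro F1, which is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a frequent one. The following is a minimal sketch of that metric in plain Python; the function name `macro_f1`, the label strings, and the toy data are assumptions made for illustration and are not taken from DALC or from the authors' evaluation code.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Return the unweighted mean of per-class F1 scores."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")

    # Count true positives, false positives and false negatives per class.
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # gold class g was missed

    # F1 per class: 2*TP / (2*TP + FP + FN), then average over classes.
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels invented for illustration only (hypothetical, not DALC data).
gold = ["abusive", "abusive", "not", "not", "not", "not"]
pred = ["abusive", "not", "not", "not", "not", "abusive"]
score = macro_f1(gold, pred)
print(f"macro F1 = {score:.3f}")
```

On this toy data the minority class has F1 0.5 and the majority class 0.75, so macro F1 is 0.625 while plain accuracy would be 4/6. That gap is why macro averaging is commonly preferred when abusive messages are the minority class.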

References used

https://aclanthology.org/

Self-Contextualized Attention for Abusive Language Identification

Association for Computational Linguistics, 2021, article

The use of attention mechanisms in deep learning approaches has become popular in natural language processing due to their outstanding performance. These mechanisms allow one to manage the importance of the elements of a sequence in accordance with their context. However, this importance has been observed independently between the pairs of elements of a sequence (self-attention) and between the application domain of a sequence (contextual attention), leading to the loss of relevant information and limiting the representation of the sequences. To tackle these particular issues we propose the self-contextualized attention mechanism, which trades off the previous limitations by considering the internal and contextual relationships between the elements of a sequence. The proposed mechanism was evaluated on four standard collections for the abusive language identification task, achieving encouraging results. It outperformed the current attention mechanisms and showed competitive performance with respect to state-of-the-art approaches.

abusive language identification, language identification

Contextual-Lexicon Approach for Abusive Language Detection

Association for Computational Linguistics, 2021, article

Since a lexicon-based approach is more elegant scientifically, explaining the solution components and being easier to generalize to other applications, this paper provides a new approach for offensive language and hate speech detection on social media, which embodies a lexicon of implicit and explicit offensive and swearing expressions annotated with contextual information. Due to the severity of abusive social media comments in Brazil, and the lack of research in Portuguese, Brazilian Portuguese is the language used to validate the models. Nevertheless, our method may be applied to any other language. The conducted experiments show the effectiveness of the proposed approach, outperforming the current baseline methods for the Portuguese language.

misinformation, social media abusive, contextual-lexicon approach

Investigating Annotator Bias in Abusive Language Datasets

Association for Computational Linguistics, 2021, article

Nowadays, social media platforms use classification models to cope with hate speech and abusive language. The problem with these models is their vulnerability to bias. A prevalent form of bias in hate speech and abusive language datasets is annotator bias caused by the annotator's subjective perception and the complexity of the annotation task. In our paper, we develop a set of methods to measure annotator bias in abusive language datasets and to identify different perspectives on abusive language. We apply these methods to four different abusive language datasets. Our proposed approach supports annotation processes of such datasets and future research addressing different perspectives on the perception of abusive language.

abusive language datasets, language datasets

Unsupervised Domain Adaptation in Cross-corpora Abusive Language Detection

Association for Computational Linguistics, 2021, article

State-of-the-art abusive language detection models report great in-corpus performance, but underperform when evaluated on abusive comments that differ from the training scenario. As human annotation involves substantial time and effort, models that can adapt to newly collected comments can prove to be useful. In this paper, we investigate the effectiveness of several Unsupervised Domain Adaptation (UDA) approaches for the task of cross-corpora abusive language detection. In comparison, we adapt a variant of the BERT model, trained on large-scale abusive comments, using Masked Language Model (MLM) fine-tuning. Our evaluation shows that the UDA approaches result in sub-optimal performance, while the MLM fine-tuning does better in the cross-corpora setting. Detailed analysis reveals the limitations of the UDA approaches and emphasizes the need to build efficient adaptation methods for this task.

language identification, unsupervised domain adaptation, cross-corpora abusive language

HateBERT: Retraining BERT for Abusive Language Detection in English

Association for Computational Linguistics, 2021, article

We introduce HateBERT, a re-trained BERT model for abusive language detection in English. The model was trained on RAL-E, a large-scale dataset of Reddit comments in English from communities banned for being offensive, abusive, or hateful that we have curated and made available to the public. We present the results of a detailed comparison between a general pre-trained language model and the retrained version on three English datasets for offensive, abusive language and hate speech detection tasks. In all datasets, HateBERT outperforms the corresponding general BERT model. We also discuss a battery of experiments comparing the portability of the fine-tuned models across the datasets, suggesting that portability is affected by compatibility of the annotated phenomena.

abusive language detection, retraining BERT, abusive language
