Data Integration for Toxic Comment Classification: Making More Than 40 Datasets Easily Accessible in One Unified Format

Added by Association for Computational Linguistics (article)

Publication date 2021

Field: Artificial Intelligence

Research language: English

Created by Shamra Editor

Keywords: toxic comment classification; datasets; easily accessible

With the rise of research on toxic comment classification, more and more annotated datasets have been released. The wide variety of the task (different languages, different labeling processes and schemes) has led to a large number of heterogeneous datasets that can be used for training and testing very specific settings. Despite recent efforts to create web pages that provide an overview, most publications still use only a single dataset. The datasets are not stored in one central database, they come in many different data formats, and it is difficult to interpret their class labels and to reuse these labels in other projects. To overcome these issues, we present a collection of more than thirty datasets in the form of a software tool that automates downloading and processing of the data and presents them in a unified data format that also offers a mapping of compatible class labels. Another advantage of the tool is that it gives an overview of properties of available datasets, such as different languages, platforms, and class labels, to make it easier to select suitable training and test data.
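As a rough illustration of what a unified data format with a mapping of compatible class labels could look like, here is a minimal Python sketch. It is not the tool described in the abstract: the dataset names, label names, field names, and the `unify` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical unified label set; the abstract does not list the tool's labels.
UNIFIED_LABELS = {"toxic", "none"}

# Hypothetical per-dataset mappings from raw labels to unified labels.
LABEL_MAPS = {
    "dataset_a": {"OFFENSE": "toxic", "OTHER": "none"},
    "dataset_b": {"1": "toxic", "0": "none"},
}


@dataclass(frozen=True)
class Record:
    """One comment in the (assumed) unified format."""
    text: str
    label: str
    source: str
    language: str


def unify(rows, source, language, text_key, label_key):
    """Convert raw rows of one dataset into unified records.

    Rows whose raw label has no compatible unified class are skipped.
    """
    mapping = LABEL_MAPS[source]
    records = []
    for row in rows:
        raw_label = str(row[label_key])
        if raw_label not in mapping:
            continue
        records.append(Record(row[text_key], mapping[raw_label], source, language))
    return records


# Two toy datasets with different column names and label schemes.
rows_a = [
    {"comment": "example one", "class": "OFFENSE"},
    {"comment": "example two", "class": "OTHER"},
]
rows_b = [
    {"tweet": "example three", "toxic": 1},
    {"tweet": "example four", "toxic": 2},  # unmapped label, will be skipped
]

combined = (
    unify(rows_a, "dataset_a", "de", "comment", "class")
    + unify(rows_b, "dataset_b", "en", "tweet", "toxic")
)
```

Keeping `source` and `language` on every record is one way to support the kind of filtering by dataset properties that the abstract mentions.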

References used

https://aclanthology.org/

Universität Regensburg MaxS at GermEval 2021 Task 1: Synthetic Data in Toxic Comment Classification

Association for Computational Linguistics, 2021 (article)

We report on our submission to Task 1 of the GermEval 2021 challenge -- toxic comment classification. We investigate different ways of bolstering scarce training data to improve off-the-shelf model performance on a toxic comment classification task. To help address the limitations of a small dataset, we use data synthetically generated by a German GPT-2 model. The use of synthetic data has only recently been taking off as a possible solution to addressing training data sparseness in NLP, and initial results are promising. However, our model did not see measurable improvement through the use of synthetic data. We discuss possible reasons for this finding and explore future work in the field.

Keywords: toxic comment

Making Your Tweets More Fancy: Emoji Insertion to Texts

Association for Computational Linguistics, 2021 (article)

On social media, users frequently use small images called emojis in their posts. Although using emojis in texts plays a key role in recent communication systems, less attention has been paid to their positions in the given texts, even though users carefully choose and place an emoji that matches their post. Exploring positions of emojis in texts will enhance understanding of the relationship between emojis and texts. We extend an emoji label prediction task by taking into account the information of emoji positions, jointly learning the emoji position in a tweet to predict the emoji label. The results demonstrate that the position of emojis in texts is a good clue to boost the performance of emoji label prediction. Human evaluation validates that there exists a suitable emoji position in a tweet, and our proposed task is able to make tweets more fancy and natural. In addition, considering emoji position can further improve the performance for the irony detection task compared to the emoji label prediction. We also report the experimental results for the modified dataset, due to the problem of the original dataset for the first shared task to predict an emoji label in SemEval2018.

Keywords: emoji insertion; emoji label; emoji

DeTox at GermEval 2021: Toxic Comment Classification

Association for Computational Linguistics, 2021 (article)

In this work, we present our approaches on the toxic comment classification task (subtask 1) of the GermEval 2021 Shared Task. For this binary task, we propose three models: a German BERT transformer model; a multilayer perceptron, which was first trained in parallel on textual input and 14 additional linguistic features and then concatenated in an additional layer; and a multilayer perceptron with both feature types as input. We enhanced our pre-trained transformer model by re-training it with over 1 million tweets and fine-tuned it on two additional German datasets of similar tasks. The embeddings of the final fine-tuned German BERT were taken as the textual input features for our neural networks. Our best models on the validation data were both neural networks; however, our enhanced German BERT achieved a higher score on the test data, with an F1-score of 0.5895.

Keywords: pre-trained language model; comment classification task; toxic comment

Debiasing Methods in Natural Language Understanding Make Bias More Accessible

Association for Computational Linguistics, 2021 (article)

Model robustness to bias is often determined by the generalization on carefully designed out-of-distribution datasets. Recent debiasing methods in natural language understanding (NLU) improve performance on such datasets by pressuring models into making unbiased predictions. An underlying assumption behind such methods is that this also leads to the discovery of more robust features in the model's inner representations. We propose a general probing-based framework that allows for post-hoc interpretation of biases in language models, and use an information-theoretic approach to measure the extractability of certain biases from the model's representations. We experiment with several NLU datasets and known biases, and show that, counter-intuitively, the more a language model is pushed towards a debiased regime, the more bias is actually encoded in its inner representations.

Keywords: understanding make bias; language understanding make

Single Example Can Improve Zero-Shot Data Generation

Association for Computational Linguistics, 2021 (article)

Sub-tasks of intent classification, such as robustness to distribution shift, adaptation to specific user groups and personalization, and out-of-domain detection, require extensive and flexible datasets for experiments and evaluation. As collecting such datasets is time- and labor-consuming, we propose to use text generation methods to gather datasets. The generator should be trained to generate utterances that belong to the given intent. We explore two approaches to the generation of task-oriented utterances: in the zero-shot approach, the model is trained to generate utterances from seen intents and is further used to generate utterances for intents unseen during training. In the one-shot approach, the model is presented with a single utterance from a test intent. We perform a thorough automatic and human evaluation of the intrinsic properties of the two generation approaches. The attributes of the generated data are close to those of the original test sets, collected via crowd-sourcing.

Keywords: improve zero-shot data; improve zero-shot; generate utterances
