A common practice in building NLP datasets, especially using crowd-sourced annotations, involves obtaining multiple annotator judgements on the same data instances, which are then flattened to produce a single "ground truth" label or score, through majority voting, averaging, or adjudication. While these approaches may be appropriate in certain annotation tasks, such aggregations overlook the socially constructed nature of human perceptions that annotations for relatively more subjective tasks are meant to capture. In particular, systematic disagreements between annotators owing to their socio-cultural backgrounds and/or lived experiences are often obfuscated through such aggregations. In this paper, we empirically demonstrate that label aggregation may introduce representational biases of individual and group perspectives. Based on this finding, we propose a set of recommendations for increased utility and transparency of datasets for downstream use cases.
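As a minimal sketch (not taken from the paper), the contrast between aggregation and disagreement-preserving release can be illustrated as follows: majority voting collapses annotator judgements into one label, whereas keeping the full label distribution retains the minority perspective. The function names and toy labels here are illustrative assumptions.

```python
from collections import Counter

def majority_vote(labels):
    """Collapse multiple annotator judgements into a single label."""
    return Counter(labels).most_common(1)[0][0]

def label_distribution(labels):
    """Keep the full distribution of judgements instead of flattening it."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy example: 3 of 5 annotators judge a post non-toxic, 2 judge it toxic.
annotations = ["not_toxic", "not_toxic", "toxic", "not_toxic", "toxic"]
print(majority_vote(annotations))       # the minority view disappears entirely
print(label_distribution(annotations))  # the 40% "toxic" signal is preserved
```

Releasing per-annotator labels (or at least the distribution) rather than only the aggregated label is one way a dataset can expose, rather than obscure, systematic disagreement.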