Generation Challenges: Results of the Accuracy Evaluation Shared Task

Added by Association for Computational Linguistics (article)

Publication date 2021

Field: Artificial Intelligence

Research language: English

Created by Shamra Editor

Keywords: generation challenges, accuracy evaluation, shared task

Abstract

The Shared Task on Evaluating Accuracy focused on techniques (both manual and automatic) for evaluating the factual accuracy of texts produced by neural NLG systems, in a sports-reporting domain. Four teams submitted evaluation techniques for this task, using very different approaches and techniques. The best-performing submissions did encouragingly well at this difficult task. However, all automatic submissions struggled to detect factual errors which are semantically or pragmatically complex (for example, based on incorrect computation or inference).

References used

https://aclanthology.org/

Results of the WMT21 Metrics Shared Task: Evaluating Metrics with Expert-based Human Evaluations on TED and News Domain

Association for Computational Linguistics, 2021 (article)

This paper presents the results of the WMT21 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT21 News Translation Task with automatic metrics on two different domains: news and TED talks. All metrics were evaluated on how well they correlate at the system and segment level with human ratings. Contrary to previous years' editions, this year we acquired our own human ratings based on expert-based human evaluation via Multidimensional Quality Metrics (MQM). This setup had several advantages: (i) expert-based evaluation has been shown to be more reliable, (ii) we were able to evaluate all metrics on two different domains using translations of the same MT systems, (iii) we added 5 additional translations coming from the same system during system development. In addition, we designed three challenge sets that evaluate the robustness of all automatic metrics. We present an extensive analysis of how well metrics perform on three language pairs: English to German, English to Russian, and Chinese to English. We further show the impact of different reference translations on reference-based metrics and compare our expert-based MQM annotation with the DA scores acquired by WMT.

Keywords: metrics shared task

Co-Teaching Student-Model through Submission Results of Shared Task

Association for Computational Linguistics, 2021 (article)

Shared tasks have a long history and have become the mainstream of NLP research. Most shared tasks require participants to submit only system outputs and descriptions. It is uncommon for a shared task to request submission of the system itself because of license issues and implementation differences. Therefore, many systems are abandoned without being used in real applications or contributing to better systems. In this research, we propose a scheme to utilize all the systems that participated in a shared task. We use the outputs of all participating systems as task teachers in this scheme and develop a new model as a student aiming to learn the characteristics of each system. We call this scheme "Co-Teaching." This scheme creates a unified system that performs better than the task's single best system. It requires only the system outputs, and only slight extra effort is needed from the participants and organizers. We apply this scheme to the "SHINRA2019-JP" shared task, which has nine participants with various output accuracies, confirming that the unified system outperforms the best system. Moreover, the code used in our experiments has been released.

Keywords: submission results

Findings of the WMT Shared Task on Machine Translation Using Terminologies

Association for Computational Linguistics, 2021 (article)

Language domains that require very careful use of terminology are abundant and reflect a significant part of the translation industry. In this work we introduce a benchmark for evaluating the quality and consistency of terminology translation, focusing on the medical (and COVID-19 specifically) domain for five language pairs: English to French, Chinese, Russian, and Korean, as well as Czech to German. We report the descriptions and results of the participating systems, commenting on the need for further research efforts towards both more adequate handling of terminologies and a proper formulation and evaluation of the task.

Keywords: WMT shared task

The Eval4NLP Shared Task on Explainable Quality Estimation: Overview and Results

Association for Computational Linguistics, 2021 (article)

In this paper, we introduce the Eval4NLP-2021 shared task on explainable quality estimation. Given a source-translation pair, this shared task requires systems not only to provide a sentence-level score indicating the overall quality of the translation, but also to explain this score by identifying the words that negatively impact translation quality. We present the data, annotation guidelines and evaluation setup of the shared task, describe the six participating systems, and analyze the results. To the best of our knowledge, this is the first shared task on explainable NLP evaluation metrics. Datasets and results are available at https://github.com/eval4nlp/SharedTask2021.

Keywords: explainable quality estimation

Shared Task in Evaluating Accuracy: Leveraging Pre-Annotations in the Validation Process

Association for Computational Linguistics, 2021 (article)

We hereby present our submission to the Shared Task in Evaluating Accuracy at the INLG 2021 Conference. Our evaluation protocol relies on three main components: rules and text classifiers that pre-annotate the dataset, a human annotator who validates the pre-annotations, and a web interface that facilitates this validation. Our submission in fact consists of two submissions: we first analyze solely the performance of the rules and classifiers (pre-annotations), and then the human evaluation aided by those pre-annotations using the web interface (hybrid). The code for the web interface and the classifiers is publicly available.

Keywords: evaluating accuracy, shared task
