
To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation



Added by: Association for Computational Linguistics (article)

Publication date: 2021

Field: Artificial Intelligence

Research language: English

Created by Shamra Editor


Automatic metrics are commonly used as the exclusive tool for declaring the superiority of one machine translation system's quality over another. The community choice of automatic metric guides research directions and industrial developments by deciding which models are deemed better. Evaluating metric correlations with sets of human judgements has been limited by the size of these sets. In this paper, we corroborate how reliable metrics are in contrast to human judgements on -- to the best of our knowledge -- the largest collection of judgements reported in the literature. Arguably, pairwise rankings of two systems are the most common evaluation tasks in research or deployment scenarios. Taking human judgement as a gold standard, we investigate which metrics have the highest accuracy in predicting translation quality rankings for such system pairs. Furthermore, we evaluate the performance of various metrics across different language pairs and domains. Lastly, we show that the sole use of BLEU impeded the development of improved models, leading to bad deployment decisions. We release the collection of 2.3M sentence-level human judgements for 4380 systems for further analysis and replication of our work.
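The pairwise-ranking accuracy mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `pairwise_accuracy`, the system names, and all scores are hypothetical, and skipping pairs that humans score as tied is an assumption made here for simplicity. The idea shown is only that a metric "agrees" with humans on a system pair when both score differences have the same sign.

```python
from itertools import combinations


def pairwise_accuracy(human, metric):
    """Fraction of system pairs that the metric ranks the same way as humans.

    `human` and `metric` map a system name to its system-level score.
    Pairs with a zero human difference are skipped (an assumption of
    this sketch, not something stated in the abstract).
    """
    agree = 0
    total = 0
    for a, b in combinations(sorted(human), 2):
        human_delta = human[a] - human[b]
        metric_delta = metric[a] - metric[b]
        if human_delta == 0:
            continue  # no human preference for this pair
        total += 1
        if human_delta * metric_delta > 0:  # same sign means same ranking
            agree += 1
    return agree / total if total else float("nan")


# Hypothetical scores for three systems (made up for illustration).
human_scores = {"sysA": 0.30, "sysB": 0.10, "sysC": -0.20}
metric_scores = {"sysA": 27.0, "sysB": 28.5, "sysC": 21.0}

accuracy = pairwise_accuracy(human_scores, metric_scores)
print(f"pairwise accuracy: {accuracy:.3f}")
```

In this toy example the metric disagrees with humans only on the (sysA, sysB) pair, so it gets two of the three pairs right.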

References used

https://aclanthology.org/


To Block or not to Block: Experiments with Machine Learning for News Comment Moderation

Association for Computational Linguistics, 2021 (article)

Today, news media organizations regularly engage with readers by enabling them to comment on news articles. This creates the need for comment moderation and removal of disallowed comments -- a time-consuming task often performed by human moderators. In this paper we approach the problem of automatic news comment moderation as classification of comments into blocked and not blocked categories. We construct a novel dataset of annotated English comments, experiment with cross-lingual transfer of comment labels and evaluate several machine learning models on datasets of Croatian and Estonian news comments. Team name: SuperAdmin; Challenge: Detection of blocked comments; Tools/models: CroSloEn BERT, FinEst BERT, 24Sata comment dataset, Ekspress comment dataset.


Assessing Reference-Free Peer Evaluation for Machine Translation

Association for Computational Linguistics, 2021 (article)

Reference-free evaluation has the potential to make machine translation evaluation substantially more scalable, allowing us to pivot easily to new languages or domains. It has been recently shown that the probabilities given by a large, multilingual model can achieve state-of-the-art results when used as a reference-free metric. We experiment with various modifications to this model, and demonstrate that by scaling it up we can match the performance of BLEU. We analyze various potential weaknesses of the approach, and find that it is surprisingly robust and likely to offer reasonable performance across a broad spectrum of domains and different system qualities.


Stream-level Latency Evaluation for Simultaneous Machine Translation

Association for Computational Linguistics, 2021 (article)

Simultaneous machine translation has recently gained traction thanks to significant quality improvements and the advent of streaming applications. Simultaneous translation systems need to find a trade-off between translation quality and response time, and for this purpose multiple latency measures have been proposed. However, latency evaluations for simultaneous translation are estimated at the sentence level, not taking into account the sequential nature of a streaming scenario. Indeed, these sentence-level latency measures are not well suited for continuous stream translation, resulting in figures that are not coherent with the simultaneous translation policy of the system being assessed. This work proposes a stream-level adaptation of the current latency measures based on a re-segmentation approach applied to the output translation, which is successfully evaluated under streaming conditions for a reference IWSLT task.


BERT, mBERT, or BiBERT? A Study on Contextualized Embeddings for Neural Machine Translation

Association for Computational Linguistics, 2021 (article)

The success of bidirectional encoders using masked language models, such as BERT, on numerous natural language processing tasks has prompted researchers to attempt to incorporate these pre-trained models into neural machine translation (NMT) systems. However, proposed methods for incorporating pre-trained models are non-trivial and mainly focus on BERT, with no comparison of the impact that other pre-trained models may have on translation performance. In this paper, we demonstrate that simply using the output (contextualized embeddings) of a tailored and suitable bilingual pre-trained language model (dubbed BiBERT) as the input of the NMT encoder achieves state-of-the-art translation performance. Moreover, we also propose a stochastic layer selection approach and a concept of a dual-directional translation model to ensure the sufficient utilization of contextualized embeddings. Without using back translation, our best models achieve BLEU scores of 30.45 for En→De and 38.61 for De→En on the IWSLT'14 dataset, and 31.26 for En→De and 34.94 for De→En on the WMT'14 dataset, which exceed all published numbers.


Exploring the Importance of Source Text in Automatic Post-Editing for Context-Aware Machine Translation

Association for Computational Linguistics, 2021 (article)

Accurate translation requires document-level information, which is ignored by sentence-level machine translation. Recent work has demonstrated that document-level consistency can be improved with automatic post-editing (APE) using only target-language (TL) information. We study an extended APE model that additionally integrates source context. A human evaluation of fluency and adequacy in English--Russian translation reveals that the model with access to source context significantly outperforms monolingual APE in terms of adequacy, an effect largely ignored by automatic evaluation metrics. Our results show that TL-only modelling increases fluency without improving adequacy, demonstrating the need for conditioning on source text for automatic post-editing. They also highlight blind spots in automatic methods for targeted evaluation and demonstrate the need for human assessment to evaluate document-level translation quality reliably.

