New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Assessing Reference-Free Peer Evaluation for Machine Translation

تقييم تقييم الأقران الخالي من المرجعية للترجمة الآلية

377 0 0 0.0 ( 0 )

Download Cite

Added by Association for Computation Linguistics مقالة

Publication date 2021

fields Artificial Intelligence

and research's language is English

Created by Shamra Editor

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Reference-free evaluation has the potential to make machine translation evaluation substantially more scalable, allowing us to pivot easily to new languages or domains. It has been recently shown that the probabilities given by a large, multilingual model can achieve state of the art results when used as a reference-free metric. We experiment with various modifications to this model, and demonstrate that by scaling it up we can match the performance of BLEU. We analyze various potential weaknesses of the approach, and find that it is surprisingly robust and likely to offer reasonable performance across a broad spectrum of domains and different system qualities.

References used

https://aclanthology.org/

rate research

Reference-Free Word- and Sentence-Level Translation Evaluation with Token-Matching Metrics

700 - Association for Computation Linguistics 2021 مقالة

Many modern machine translation evaluation metrics like BERTScore, BLEURT, COMET, MonoTransquest or XMoverScore are based on black-box language models. Hence, it is difficult to explain why these metrics return certain scores. This year's Eval4NLP sh ared task tackles this challenge by searching for methods that can extract feature importance scores that correlate well with human word-level error annotations. In this paper we show that unsupervised metrics that are based on tokenmatching can intrinsically provide such scores. The submitted system interprets the similarities of the contextualized word-embeddings that are used to compute (X)BERTScore as word-level importance scores.

sentence-level translation evaluation reference-free word translation evaluation metrics تقييم مستوى الترجمة كلمة مجانية مقاييس تقييم الترجمة صناعة حمض الفوسفور المزيد..

Stream-level Latency Evaluation for Simultaneous Machine Translation

368 - Association for Computation Linguistics 2021 مقالة

Simultaneous machine translation has recently gained traction thanks to significant quality improvements and the advent of streaming applications. Simultaneous translation systems need to find a trade-off between translation quality and response time , and with this purpose multiple latency measures have been proposed. However, latency evaluations for simultaneous translation are estimated at the sentence level, not taking into account the sequential nature of a streaming scenario. Indeed, these sentence-level latency measures are not well suited for continuous stream translation, resulting in figures that are not coherent with the simultaneous translation policy of the system being assessed. This work proposes a stream level adaptation of the current latency measures based on a re-segmentation approach applied to the output translation, that is successfully evaluated on streaming conditions for a reference IWSLT task.

simultaneous machine translation stream-level latency evaluation simultaneous machine ترجمة آلية في وقت واحد تقييم كويد مستوى الدفق آلة في وقت واحد صناعة حمض الفوسفور المزيد..

To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation

317 - Association for Computation Linguistics 2021 مقالة

Automatic metrics are commonly used as the exclusive tool for declaring the superiority of one machine translation system's quality over another. The community choice of automatic metric guides research directions and industrial developments by decid ing which models are deemed better. Evaluating metrics correlations with sets of human judgements has been limited by the size of these sets. In this paper, we corroborate how reliable metrics are in contrast to human judgements on -- to the best of our knowledge -- the largest collection of judgements reported in the literature. Arguably, pairwise rankings of two systems are the most common evaluation tasks in research or deployment scenarios. Taking human judgement as a gold standard, we investigate which metrics have the highest accuracy in predicting translation quality rankings for such system pairs. Furthermore, we evaluate the performance of various metrics across different language pairs and domains. Lastly, we show that the sole use of BLEU impeded the development of improved models leading to bad deployment decisions. We release the collection of 2.3M sentence-level human judgements for 4380 systems for further analysis and replication of our work.

WMT البشرية الشرح البشري extensive evaluation تقييم واسع النطاق صناعة حمض الفوسفور

Almost Free Semantic Draft for Neural Machine Translation

274 - Association for Computation Linguistics 2021 مقالة

Translation quality can be improved by global information from the required target sentence because the decoder can understand both past and future information. However, the model needs additional cost to produce and consider such global information. In this work, to inject global information but also save cost, we present an efficient method to sample and consider a semantic draft as global information from semantic space for decoding with almost free of cost. Unlike other successful adaptations, we do not have to perform an EM-like process that repeatedly samples a possible semantic from the semantic space. Empirical experiments show that the presented method can achieve competitive performance in common language pairs with a clear advantage in inference efficiency. We will open all our source code on GitHub.

التنبؤات في الحكم صناعة حمض الفوسفور

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

305 - Association for Computation Linguistics 2021 مقالة

Image captioning has conventionally relied on reference-based automatic evaluations, where machine captions are compared against captions written by humans. This is in contrast to the reference-free manner in which humans assess caption quality. In t his paper, we report the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references. Experiments spanning several corpora demonstrate that our new reference-free metric, CLIPScore, achieves the highest correlation with human judgements, outperforming existing reference-based metrics like CIDEr and SPICE. Information gain experiments demonstrate that CLIPScore, with its tight focus on image-text compatibility, is complementary to existing reference-based metrics that emphasize text-text similarities. Thus, we also present a reference-augmented version, RefCLIPScore, which achieves even higher correlation. Beyond literal description tasks, several case studies reveal domains where CLIPScore performs well (clip-art images, alt-text rating), but also where it is relatively weaker in comparison to reference-based metrics, e.g., news captions that require richer contextual knowledge.

بناء اللغة التصويرية captioning التوضيحية صناعة حمض الفوسفور

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Assessing Reference-Free Peer Evaluation for Machine Translation

تقييم تقييم الأقران الخالي من المرجعية للترجمة الآلية

Ask ChatGPT about the research

Read More

suggested questions