Estimating the expected output quality of generation systems is central to NLG. This paper qualifies the notion that automatic metrics are not as good as humans in estimating system-level quality. Statistically, humans are unbiased, high-variance estimators, while metrics are biased, low-variance estimators. We compare these estimators by their error in pairwise prediction (which generation system is better?) using the bootstrap. Measuring this error is complicated: predictions are evaluated against noisy, human-predicted labels instead of the ground truth, and metric predictions fluctuate based on the test sets they were calculated on. By applying a bias-variance-noise decomposition, we adjust this error to a noise-free, infinite test set setting. Our analysis compares the adjusted error of metrics to humans and to a derived, perfect segment-level annotator, both of which are unbiased estimators dependent on the number of judgments collected. In MT, we identify two settings where metrics outperform humans due to a statistical advantage in variance: when the number of human judgments used is small, and when the quality difference between compared systems is small. The data and code to reproduce our analyses are available at https://github.com/johntzwei/metric-statistical-advantage.
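A minimal sketch of the bootstrap pairwise comparison described above, assuming synthetic segment-level metric scores and a known gold label; the function and variable names here are illustrative and do not come from the paper's released code:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_pairwise_error(scores_a, scores_b, gold_label, n_boot=1000):
    """Estimate how often an estimator (metric or human scores) picks the wrong
    system when system-level quality is the mean of segment-level scores over a
    resampled test set. gold_label is +1 if system A is truly better, -1 otherwise.
    Illustrative sketch only, not the authors' exact procedure."""
    n = len(scores_a)
    errors = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample the test set with replacement
        pred = np.sign(scores_a[idx].mean() - scores_b[idx].mean())
        errors += (pred != gold_label)
    return errors / n_boot

# Hypothetical segment-level metric scores for two systems on the same test set
metric_a = rng.normal(0.32, 0.10, size=500)
metric_b = rng.normal(0.30, 0.10, size=500)
print(bootstrap_pairwise_error(metric_a, metric_b, gold_label=+1))
```

With a small true quality gap (0.32 vs. 0.30) and per-segment noise, the estimated error illustrates how low-variance estimators gain an advantage as the gap shrinks or as fewer judgments are available.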
Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment of multiple desirable criteria, e.g., fluency, coherency, coverage, relevance, adequacy, overall quality, etc. Across existing datasets for 6 NLG tasks, we obser
Though generative dialogue modeling is widely seen as a language modeling task, the task requires an agent to have a complex natural language understanding of its input text to carry out a meaningful interaction with a user. The automatic metrics used ev
Many NLG tasks such as summarization, dialogue response, or open domain question answering focus primarily on a source text in order to generate a target response. This standard approach falls short, however, when a user's intent or context of work is
Standard automatic metrics (such as BLEU) are problematic for document-level MT evaluation. They can neither distinguish document-level improvements in translation quality from sentence-level ones, nor identify the specific discourse phenomen
Automatic metrics are commonly used as the exclusive tool for declaring the superiority of one machine translation system's quality over another. The community's choice of automatic metric guides research directions and industrial developments by decidi