Research papers and master's and doctoral theses about metrics

Is this the end of the gold standard? A straightforward reference-less grammatical error correction metric

374 - Association for Computational Linguistics 2021 Article

It is difficult to rank and evaluate the performance of grammatical error correction (GEC) systems, as a sentence can be rewritten in numerous correct ways. A number of GEC metrics have been used to evaluate proposed GEC systems; however, each system relies on either a comparison with one or more reference texts---in what is known as the gold standard for reference-based metrics---or a separate annotated dataset to fine-tune the reference-less metric. Reference-based systems have a low correlation with human judgement, cannot capture all the ways in which a sentence can be corrected, and require substantial work to develop a test dataset. We propose a reference-less GEC evaluation system that is strongly correlated with human judgement, solves the issues related to the use of a reference, and does not need another annotated dataset for fine-tuning. The proposed system relies solely on commonly available tools. Additionally, currently available reference-less metrics do not work properly when part of a sentence is repeated as opposed to reference-based metrics. In our proposed system, we look to address issues inherent in reference-less metrics and reference-based metrics.

transformer perspective, metrics

Results of the WMT21 Metrics Shared Task: Evaluating Metrics with Expert-based Human Evaluations on TED and News Domain

233 - Association for Computational Linguistics 2021 Article

This paper presents the results of the WMT21 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT21 News Translation Task with automatic metrics on two different domains: news and TED talks. All metrics were evaluated on how well they correlate at the system- and segment-level with human ratings. Contrary to previous years' editions, this year we acquired our own human ratings based on expert-based human evaluation via Multidimensional Quality Metrics (MQM). This setup had several advantages: (i) expert-based evaluation has been shown to be more reliable, (ii) we were able to evaluate all metrics on two different domains using translations of the same MT systems, (iii) we added 5 additional translations coming from the same system during system development. In addition, we designed three challenge sets that evaluate the robustness of all automatic metrics. We present an extensive analysis on how well metrics perform on three language pairs: English to German, English to Russian and Chinese to English. We further show the impact of different reference translations on reference-based metrics and compare our expert-based MQM annotation with the DA scores acquired by WMT.

metrics shared task, metrics shared
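The system-level evaluation mentioned in the WMT21 abstract above can be illustrated with a minimal sketch. It assumes per-segment scores are averaged per system and that Pearson correlation is the statistic; the abstract does not say which correlation statistics the shared task used, so both choices, along with the function names and the toy data layout, are assumptions made for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)

def system_level_correlation(metric_scores, human_scores):
    """Average segment scores per system, then correlate the averages.

    Both arguments map a system name to its list of per-segment scores
    (hypothetical layout, chosen for this sketch).
    """
    systems = sorted(metric_scores)
    metric_means = [sum(metric_scores[s]) / len(metric_scores[s]) for s in systems]
    human_means = [sum(human_scores[s]) / len(human_scores[s]) for s in systems]
    return pearson(metric_means, human_means)
```

A segment-level variant would call `pearson` directly on the per-segment scores instead of on the per-system averages.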

RoBLEURT Submission for WMT2021 Metrics Task

300 - Association for Computational Linguistics 2021 Article

In this paper, we present our submission to the Shared Metrics Task: RoBLEURT (Robustly Optimizing the training of BLEURT). After investigating recent advances in trainable metrics, we identify several aspects of vital importance for obtaining a well-performing metric model: 1) jointly leveraging the advantages of the source-included model and the reference-only model, 2) continuously pre-training the model with massive synthetic data pairs, and 3) fine-tuning the model with a data denoising strategy. Experimental results show that our model reaches state-of-the-art correlations with the WMT2020 human annotations on 8 out of 10 to-English language pairs.

shared metrics task, metrics task, robustly optimizing

Reference-Free Word- and Sentence-Level Translation Evaluation with Token-Matching Metrics

469 - Association for Computational Linguistics 2021 Article

Many modern machine translation evaluation metrics like BERTScore, BLEURT, COMET, MonoTransQuest or XMoverScore are based on black-box language models. Hence, it is difficult to explain why these metrics return certain scores. This year's Eval4NLP shared task tackles this challenge by searching for methods that can extract feature importance scores that correlate well with human word-level error annotations. In this paper we show that unsupervised metrics that are based on token-matching can intrinsically provide such scores. The submitted system interprets the similarities of the contextualized word-embeddings that are used to compute (X)BERTScore as word-level importance scores.

sentence-level translation evaluation, reference-free word, translation evaluation metrics
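The token-matching idea in the abstract above can be sketched as follows: each hypothesis token is matched to its most similar source token by cosine similarity, and that best similarity doubles as a word-level score. This is only an illustration under stated assumptions. The vectors in the usage note are hand-made toy values rather than contextualized embeddings, the function names are hypothetical, and the abstract does not specify how the submitted system handles subwords or rescales the similarities.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def token_matching_scores(hyp_vecs, src_vecs):
    """Greedy token matching in the style of BERTScore.

    Returns the best-match similarity of every hypothesis token, the
    best-match similarity of every source token, and a sentence-level F1.
    """
    hyp_best = [max(cosine(h, s) for s in src_vecs) for h in hyp_vecs]
    src_best = [max(cosine(s, h) for h in hyp_vecs) for s in src_vecs]
    precision = sum(hyp_best) / len(hyp_best)
    recall = sum(src_best) / len(src_best)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return hyp_best, src_best, f1
```

With toy vectors `hyp = [[1, 0], [0, 1], [1, 1]]` and `src = [[1, 0], [0, 1]]`, the third hypothesis token receives the lowest best-match similarity, so under this reading it would be the most likely candidate for a word-level error.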

SummEval: Re-evaluating Summarization Evaluation

166 - Association for Computational Linguistics 2021 Article

The scarcity of comprehensive up-to-date studies on evaluation metrics for text summarization and the lack of consensus regarding evaluation protocols continue to inhibit progress. We address the existing shortcomings of summarization evaluation methods along five dimensions: 1) we re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion using neural summarization model outputs along with expert and crowd-sourced human annotations; 2) we consistently benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics; 3) we assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset and share it in a unified format; 4) we implement and share a toolkit that provides an extensible and unified API for evaluating summarization models across a broad range of automatic metrics; and 5) we assemble and share the largest and most diverse, in terms of model types, collection of human judgments of model-generated summaries on the CNN/DailyMail dataset annotated by both expert judges and crowd-source workers. We hope that this work will help promote a more complete evaluation protocol for text summarization as well as advance research in developing evaluation metrics that better correlate with human judgments.

re-evaluating summarization evaluation, re-evaluating summarization, evaluation metrics

Weisfeiler-Leman in the Bamboo: Novel AMR Graph Metrics and a Benchmark for AMR Graph Similarity

410 - Association for Computational Linguistics 2021 Article

Several metrics have been proposed for assessing the similarity of (abstract) meaning representations (AMRs), but little is known about how they relate to human similarity ratings. Moreover, the current metrics have complementary strengths and weaknesses: Some emphasize speed, while others make the alignment of graph structures explicit, at the price of a costly alignment step. In this work we propose new Weisfeiler-Leman AMR similarity metrics that unify the strengths of previous metrics, while mitigating their weaknesses. Specifically, our new metrics are able to match contextualized substructures and induce n:m alignments between their nodes. Furthermore, we introduce a Benchmark for AMR Metrics based on Overt Objectives (Bamboo), the first benchmark to support empirical assessment of graph-based MR similarity metrics. Bamboo maximizes the interpretability of results by defining multiple overt objectives that range from sentence similarity objectives to stress tests that probe a metric's robustness against meaning-altering and meaning-preserving graph transformations. We show the benefits of Bamboo by profiling previous metrics and our own metrics. Results indicate that our novel metrics may serve as a strong baseline for future work.

AMR graph similarity, AMR graph, AMR graph metrics
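As a rough illustration of what "matching contextualized substructures" in the abstract above can mean, the sketch below applies plain Weisfeiler-Leman label refinement to two small labeled graphs and compares the resulting label counts with cosine similarity. It is not the paper's metric: the graph encoding, the use of outgoing edges only, the string-based relabeling, and the function names are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt

def wl_features(node_labels, edges, iterations=2):
    """Count node labels across rounds of Weisfeiler-Leman refinement.

    node_labels maps node id -> label; edges is a list of
    (source, relation, target) triples.
    """
    labels = dict(node_labels)
    features = Counter(labels.values())  # round 0: the original labels
    for _ in range(iterations):
        refined = {}
        for node, label in labels.items():
            # Sorted multiset of (relation, neighbour label) over outgoing edges.
            context = sorted((rel, labels[tgt]) for src, rel, tgt in edges if src == node)
            refined[node] = label + "[" + ";".join(rel + " " + lab for rel, lab in context) + "]"
        labels = refined
        features.update(labels.values())
    return features

def wl_similarity(graph_a, graph_b, iterations=2):
    """Cosine similarity of the two graphs' refined-label count vectors."""
    fa = wl_features(*graph_a, iterations=iterations)
    fb = wl_features(*graph_b, iterations=iterations)
    dot = sum(fa[key] * fb[key] for key in fa.keys() & fb.keys())
    norm_a = sqrt(sum(v * v for v in fa.values()))
    norm_b = sqrt(sum(v * v for v in fb.values()))
    return dot / (norm_a * norm_b)
```

For example, a graph for "the boy wants to go" can be written as `({"w": "want-01", "b": "boy", "g": "go-01"}, [("w", ":ARG0", "b"), ("w", ":ARG1", "g"), ("g", ":ARG0", "b")])`. Replacing `boy` with `girl` changes every refined label that has the changed node in its context, so the similarity drops below 1 while the unrefined `want-01` and `go-01` labels still match.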

A Statistical Analysis of Summarization Evaluation Metrics Using Resampling Methods

198 - Association for Computational Linguistics 2021 Article

The quality of a summarization evaluation metric is quantified by calculating the correlation between its scores and human annotations across a large number of summaries. Currently, it is unclear how precise these correlation estimates are, and whether differences between two metrics' correlations reflect a true difference or are due to mere chance. In this work, we address these two problems by proposing methods for calculating confidence intervals and running hypothesis tests for correlations using two resampling methods, bootstrapping and permutation. After evaluating which of the proposed methods is most appropriate for summarization through two simulation experiments, we analyze the results of applying these methods to several different automatic evaluation metrics across three sets of human annotations. We find that the confidence intervals are rather wide, demonstrating high uncertainty in the reliability of automatic metrics. Further, although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do so in some evaluation settings.

current English datasets, summarization evaluation, summarization evaluation metrics
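The two resampling ideas named in the abstract above, bootstrapping for confidence intervals and permutation for hypothesis tests, can be sketched in a few lines. This is a simplified illustration, not the paper's procedure: it assumes Pearson correlation, resamples individual summaries only, uses a percentile interval, and assumes the two metrics' scores are on a comparable scale (for example, standardized beforehand) so that swapping them is meaningful. All function names are hypothetical.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)

def bootstrap_ci(metric, human, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the metric-human correlation."""
    rng = random.Random(seed)
    n = len(metric)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample summaries with replacement
        try:
            stats.append(pearson([metric[i] for i in idx], [human[i] for i in idx]))
        except ValueError:
            continue  # skip degenerate resamples with constant values
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper

def permutation_test(metric_a, metric_b, human, n_permutations=1000, seed=0):
    """One-sided paired permutation test that metric A correlates better than B.

    For each summary, the two metrics' scores are swapped with probability 0.5.
    """
    rng = random.Random(seed)
    observed = pearson(metric_a, human) - pearson(metric_b, human)
    at_least_as_large = 0
    for _ in range(n_permutations):
        xs, ys = [], []
        for a, b in zip(metric_a, metric_b):
            if rng.random() < 0.5:
                a, b = b, a
            xs.append(a)
            ys.append(b)
        if pearson(xs, human) - pearson(ys, human) >= observed:
            at_least_as_large += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (at_least_as_large + 1) / (n_permutations + 1)
```

A wide interval from `bootstrap_ci` signals that the correlation estimate is imprecise, and a small value from `permutation_test` suggests the observed difference between the two metrics is unlikely to be due to chance alone.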

Operators approaches in studying the problem on normal oscillations of system of m capillary viscous fluid

896 - Tishreen University 2015 Research paper

The aim of this paper is to study the problem of normal oscillations of a system of capillary viscous fluids in a vessel. We prove results about the spectrum of the problem for a rotating vessel and prove that the system of root elements (eigenelements and associated elements) forms an Abel-Lidsky basis. We also use some results from the theory of J-self-adjoint operators to study the spectrum of the problem for a non-rotating vessel.

Hydrodynamical systems, Hilbert space, differential equation in Hilbert space, eigenvalue problems, operator spectrum, indefinite metrics
