In this paper, we introduce a new embedding-based metric relying on trainable ranking models to evaluate the semantic accuracy of neural data-to-text generators. This metric is especially well suited to semantically and factually assess the performance of a text generator when tables can be associated with multiple references and table values contain textual utterances. We first present how one can implement and further specialize the metric by training the underlying ranking models on a legal Data-to-Text dataset. We show how it may provide a more robust evaluation than other evaluation schemes in challenging settings using a dataset comprising paraphrases between the table values and their respective references. Finally, we evaluate its generalization capabilities on a well-known dataset, WebNLG, by comparing it with human evaluation and a recently introduced metric based on natural language inference. We then illustrate how it naturally characterizes, both quantitatively and qualitatively, omissions and hallucinations.
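To make the idea of an embedding-based adequacy check concrete, here is a minimal sketch, not the paper's trained ranking models: it scores a generated text against a table's textual values with off-the-shelf sentence embeddings and flags likely omissions and hallucinations. The model name, the 0.5 threshold, and the `semantic_coverage` helper are illustrative assumptions, not part of the proposed metric.

```python
# Minimal illustrative sketch (assumption: generic sentence embeddings stand in
# for the paper's trainable ranking models). A table value with no sufficiently
# similar generated sentence is treated as an omission; a generated sentence
# with no sufficiently similar table value is treated as a hallucination.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary choice of encoder

def semantic_coverage(table_values, generated_sentences, threshold=0.5):
    """Return the similarity matrix plus candidate omissions/hallucinations."""
    v = model.encode(table_values, normalize_embeddings=True)
    s = model.encode(generated_sentences, normalize_embeddings=True)
    sim = v @ s.T  # cosine similarities (values x sentences)
    omissions = [val for val, row in zip(table_values, sim)
                 if row.max() < threshold]        # value never verbalized
    hallucinations = [sent for sent, col in zip(generated_sentences, sim.T)
                      if col.max() < threshold]   # sentence unsupported by table
    return sim, omissions, hallucinations

# Toy usage with hypothetical legal table values
sim, om, hal = semantic_coverage(
    ["court: Paris Court of Appeal", "decision date: 12 March 2020"],
    ["The Paris Court of Appeal ruled in this case.",
     "The defendant was awarded damages."],
)
print("omissions:", om)
print("hallucinations:", hal)
```

In this toy example the date value would surface as an omission and the unsupported damages sentence as a hallucination, which mirrors, in a simplified form, the quantitative and qualitative characterization described above.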