Towards Human-Free Automatic Quality Evaluation of German Summarization

85 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Oleg Vasilyev

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Neslihan Iskender - Oleg Vasilyev - Tim Polzehl

الحساب واللغة

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Evaluating large summarization corpora using humans has proven to be expensive from both the organizational and the financial perspective. Therefore, many automatic evaluation metrics have been developed to measure the summarization quality in a fast and reproducible way. However, most of the metrics still rely on humans and need gold standard summaries generated by linguistic experts. Since BLANC does not require golden summaries and supposedly can use any underlying language model, we consider its application to the evaluation of summarization in German. This work demonstrates how to adjust the BLANC metric to a language other than English. We compare BLANC scores with the crowd and expert ratings, as well as with commonly used automatic metrics on a German summarization data set. Our results show that BLANC in German is especially good in evaluating informativeness.

قيم البحث

74 - Oleg Vasilyev , Vedant Dharnidharka , John Bohannon 2020

We present BLANC, a new approach to the automatic estimation of document summary quality. Our goal is to measure the functional performance of a summary with an objective, reproducible, and fully automated method. Our approach achieves this by measur ing the performance boost gained by a pre-trained language model with access to a document summary while carrying out its language understanding task on the documents text. We present evidence that BLANC scores have as good correlation with human evaluations as do the ROUGE family of summary quality measurements. And unlike ROUGE, the BLANC method does not require human-written reference summaries, allowing for fully human-free summary quality estimation.

الحساب واللغة

Towards automatic extractive text summarization of A-133 Single Audit reports with machine learning

189 - Vivian T. Chou , LeAnna Kent , Joel A. Gongora 2019

The rapid growth of text data has motivated the development of machine-learning based automatic text summarization strategies that concisely capture the essential ideas in a larger text. This study aimed to devise an extractive summarization method f or A-133 Single Audits, which assess if recipients of federal grants are compliant with program requirements for use of federal funding. Currently, these voluminous audits must be manually analyzed by officials for oversight, risk management, and prioritization purposes. Automated summarization has the potential to streamline these processes. Analysis focused on the Findings section of ~20,000 Single Audits spanning 2016-2018. Following text preprocessing and GloVe embedding, sentence-level k-means clustering was performed to partition sentences by topic and to establish the importance of each sentence. For each audit, key summary sentences were extracted by proximity to cluster centroids. Summaries were judged by non-expert human evaluation and compared to human-generated summaries using the ROUGE metric. Though the goal was to fully automate summarization of A-133 audits, human input was required at various stages due to large variability in audit writing style, content, and context. Examples of human inputs include the number of clusters, the choice to keep or discard certain clusters based on their content relevance, and the definition of a top sentence. Overall, this approach made progress towards automated extractive summaries of A-133 audits, with future work to focus on full automation and improving summary consistency. This work highlights the inherent difficulty and subjective nature of automated summarization in a real-world application.

الحساب واللغة التعلم الآلي التعلم الالي

A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy

61 - Wang Chen , Piji Li , Irwin King 2021

In recent years, reference-based and supervised summarization evaluation metrics have been widely explored. However, collecting human-annotated references and ratings are costly and time-consuming. To avoid these limitations, we propose a training-fr ee and reference-free summarization evaluation metric. Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score. The relevance score is computed between the pseudo reference built from the source document and the given summary, where the pseudo reference content is weighted by the sentence centrality to provide importance guidance. Besides an $F_1$-based relevance score, we also design an $F_beta$-based variant that pays more attention to the recall score. As for the redundancy score of the summary, we compute a self-masked similarity score with the summary itself to evaluate the redundant information in the summary. Finally, we combine the relevance and redundancy scores to produce the final evaluation score of the given summary. Extensive experiments show that our methods can significantly outperform existing methods on both multi-document and single-document summarization evaluation.

الحساب واللغة

Does External Knowledge Help Explainable Natural Language Inference? Automatic Evaluation vs. Human Ratings

127 - Hendrik Schuff , Hsiu-Yu Yang , Heike Adel 2021

Natural language inference (NLI) requires models to learn and apply commonsense knowledge. These reasoning abilities are particularly important for explainable NLI systems that generate a natural language explanation in addition to their label predic tion. The integration of external knowledge has been shown to improve NLI systems, here we investigate whether it can also improve their explanation capabilities. For this, we investigate different sources of external knowledge and evaluate the performance of our models on in-domain data as well as on special transfer datasets that are designed to assess fine-grained reasoning capabilities. We find that different sources of knowledge have a different effect on reasoning abilities, for example, implicit knowledge stored in language models can hinder reasoning on numbers and negations. Finally, we conduct the largest and most fine-grained explainable NLI crowdsourcing study to date. It reveals that even large differences in automatic performance scores do neither reflect in human ratings of label, explanation, commonsense nor grammar correctness.

الحساب واللغة تفاعل الإنسان والحاسوب

QuestEval: Summarization Asks for Fact-based Evaluation

78 - Thomas Scialom , Paul-Alexis Dray , Patrick Gallinari 2021

Summarization evaluation remains an open research problem: current metrics such as ROUGE are known to be limited and to correlate poorly with human judgments. To alleviate this issue, recent work has proposed evaluation metrics which rely on question answering models to assess whether a summary contains all the relevant information in its source document. Though promising, the proposed approaches have so far failed to correlate better than ROUGE with human judgments. In this paper, we extend previous approaches and propose a unified framework, named QuestEval. In contrast to established metrics such as ROUGE or BERTScore, QuestEval does not require any ground-truth reference. Nonetheless, QuestEval substantially improves the correlation with human judgments over four evaluation dimensions (consistency, coherence, fluency, and relevance), as shown in the extensive experiments we report.

الحساب واللغة

سجل دخول لتتمكن من نشر تعليقات

التعليقات

جاري جلب التعليقات

سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها

جامعة تشرين

تفاصيل إضافية المزيد من الجامعات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Towards Human-Free Automatic Quality Evaluation of German Summarization

اسأل ChatGPT حول البحث

ﻻ يوجد ملخص باللغة العربية

اقرأ أيضاً