Summarization systems are ultimately evaluated by human annotators and raters. Usually, annotators and raters do not reflect the demographics of end users, but are recruited through student populations or crowdsourcing platforms with skewed demographics. For two different evaluation scenarios -- evaluation against gold summaries and system output ratings -- we show that summary evaluation is sensitive to protected attributes. This can severely bias system development and evaluation, leading us to build models that cater for some groups rather than others.
References used
https://aclanthology.org/