This paper provides a quick overview of possible methods for detecting whether reference translations were actually created by post-editing the output of an MT system. Two methods based on automatic metrics are presented: the BLEU difference between the suspected MT system and some other good MT system, and the BLEU difference measured using additional references. These two methods raised the suspicion that the WMT 2020 Czech reference is based on MT. The suspicion was confirmed in a manual analysis by finding concrete evidence of the post-editing procedure in particular sentences. Finally, a typology of post-editing changes is presented, classifying the typical errors or changes made by the post-editor as well as the errors adopted from the MT output.
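As a rough illustration of the first metric-based check (not the authors' released code), the sketch below compares BLEU scores of two MT systems against the suspected reference using sacrebleu; an unusually large gap in favour of one system hints that the reference may have been post-edited from that system's output. The file names are hypothetical placeholders.

```python
# Minimal sketch of the BLEU-difference heuristic, assuming plain-text
# files with one sentence per line. File names are placeholders.
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

reference = read_lines("newstest2020.ref.cs")      # suspected post-edited reference
suspected_mt = read_lines("suspected_system.cs")   # output of the suspected MT system
other_mt = read_lines("other_good_system.cs")      # output of another strong MT system

# BLEU of each system's output measured against the suspected reference.
bleu_suspected = sacrebleu.corpus_bleu(suspected_mt, [reference]).score
bleu_other = sacrebleu.corpus_bleu(other_mt, [reference]).score

# A large positive difference is a warning sign that the reference was
# derived from the suspected system by post-editing.
print(f"BLEU (suspected system): {bleu_suspected:.1f}")
print(f"BLEU (other system):     {bleu_other:.1f}")
print(f"Difference:              {bleu_suspected - bleu_other:+.1f}")
```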