Selecting the best data filtering method for NMT training

published by Association for Computation Linguistics in 2021 in Artificial Intelligence and research's language is English Download

Abstract in English

Performance of NMT systems has been proven to depend on the quality of the training data. In this paper we explore different open-source tools that can be used to score the quality of translation pairs, with the goal of obtaining clean corpora for training NMT models. We measure the performance of these tools by correlating their scores with human scores, as well as rank models trained on the resulting filtered datasets in terms of their performance on different test sets and MT performance metrics.

References used

https://aclanthology.org/

Download