Adaptation of Machine Translation Models with Back-translated Data using Transductive Data Selection Methods

Published by: Alberto Poncelas

Publication date: 2019

Research field: Informatics Engineering

Language: English

Authors: Alberto Poncelas, Gideon Maillette de Buy Wenniger, Andy Way

Computation and Language

Abstract

Data selection has proven its merit for improving Neural Machine Translation (NMT) when applied to authentic data. However, the benefit of using synthetic data in NMT training, produced by the popular back-translation technique, raises the question of whether data selection could also be useful for synthetic data. In this work we use Infrequent N-gram Recovery (INR) and Feature Decay Algorithms (FDA), two transductive data selection methods, to obtain subsets of sentences from synthetic data. These methods ensure that the selected sentences share n-grams with the test set, so the NMT model can be adapted to translate it. Performing data selection on back-translated data creates new challenges, as the source side may contain noise introduced by the model used for back-translation. Hence, finding n-grams present in the test set becomes more difficult. Despite that, we show that adapting a model with a selection of synthetic data is a useful approach.
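To make the idea of transductive selection concrete, the following is a minimal sketch in the spirit of FDA, not the authors' implementation. It greedily picks sentence pairs whose synthetic (back-translated) source side shares n-grams with the test set, and decays the value of an n-gram each time it is selected so that later picks cover different n-grams. The function names, the `decay` factor, the length normalisation, and the whitespace tokenisation are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, max_n):
    """Return all n-grams of order 1..max_n as tuples."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def select_pairs(pairs, test_set, k, max_n=3, decay=0.5):
    """Greedy, FDA-style selection (simplified; hypothetical helper).

    pairs    -- list of (synthetic_source, authentic_target) sentence pairs
    test_set -- list of source-language test sentences
    k        -- maximum number of pairs to select
    """
    # Features are the n-grams that occur in the test set.
    features = set()
    for sentence in test_set:
        features.update(ngrams(sentence.split(), max_n))

    # Test-set n-grams found on the source side of each candidate pair.
    matches = [
        [g for g in ngrams(src.split(), max_n) if g in features]
        for src, _ in pairs
    ]
    times_selected = Counter()  # how often each n-gram was already picked

    def score(i):
        length = len(pairs[i][0].split())
        if length == 0:
            return 0.0
        # Each n-gram is worth decay ** (times already selected);
        # dividing by length avoids favouring long sentences.
        return sum(decay ** times_selected[g] for g in matches[i]) / length

    selected = []
    remaining = list(range(len(pairs)))
    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        if score(best) == 0.0:
            break  # no remaining candidate shares an n-gram with the test set
        selected.append(pairs[best])
        remaining.remove(best)
        times_selected.update(set(matches[best]))
    return selected


if __name__ == "__main__":
    test_set = ["the cat sat on the mat"]
    pairs = [
        ("the cat sat", "t1"),
        ("a dog barked loudly", "t2"),
        ("on the mat", "t3"),
        ("the cat sat on", "t4"),
    ]
    for src, tgt in select_pairs(pairs, test_set, k=2):
        print(src, "|", tgt)
```

Because matching is done on the synthetic source side, any noise introduced by the back-translation model directly reduces the number of test-set n-grams a candidate can match, which is the difficulty the abstract describes.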

Thuy-Trang Vu, Xuanli He, Dinh Phung (2021)

This paper considers the unsupervised domain adaptation problem for neural machine translation (NMT), where we assume access to only monolingual text in either the source or target language in the new domain. We propose a cross-lingual data selection method to extract in-domain sentences in the missing language side from a large generic monolingual corpus. Our proposed method trains an adaptive layer on top of multilingual BERT by contrastive learning to align the representation between the source and target language. This then enables the transferability of the domain classifier between the languages in a zero-shot manner. Once the in-domain data is detected by the classifier, the NMT model is then adapted to the new domain by jointly learning translation and domain discrimination tasks. We evaluate our cross-lingual data selection method on NMT across five diverse domains in three language pairs, as well as a real-world scenario of translation for COVID-19. The results show that our proposed method outperforms other selection baselines by up to +1.5 BLEU.

Computation and Language

Cross-model Back-translated Distillation for Unsupervised Machine Translation

Xuan-Phi Nguyen, Shafiq Joty, Thanh-Tung Nguyen (2020)

Recent unsupervised machine translation (UMT) systems usually employ three main principles: initialization, language modeling and iterative back-translation, though they may apply them differently. Crucially, iterative back-translation and denoising auto-encoding for language modeling provide data diversity to train the UMT systems. However, the gains from these diversification processes seem to have plateaued. We introduce a novel component to the standard UMT framework called Cross-model Back-translated Distillation (CBD), which aims to induce another level of data diversification that existing principles lack. CBD is applicable to all previous UMT approaches. In our experiments, CBD achieves the state of the art in the WMT14 English-French, WMT16 English-German and English-Romanian bilingual unsupervised translation tasks, with 38.2, 30.1, and 36.3 BLEU respectively. It also yields 1.5-3.3 BLEU improvements in IWSLT English-French and English-German tasks. Through extensive experimental analyses, we show that CBD is effective because it embraces data diversity while other similar variants do not.

Computation and Language, Machine Learning

Rapid Domain Adaptation for Machine Translation with Monolingual Data

Mahdis Mahdieh, Mia Xu Chen, Yuan Cao (2020)

One challenge of machine translation is how to quickly adapt to unseen domains in the face of surging events like COVID-19, in which case timely and accurate translation of in-domain information into multiple languages is critical but little parallel data is yet available. In this paper, we propose an approach that enables rapid domain adaptation from the perspective of unsupervised translation. Our proposed approach only requires in-domain monolingual data and can be quickly applied to a preexisting translation system trained on general-domain data, reaching significant gains in in-domain translation quality with little or no drop in general-domain quality. We also propose an effective procedure of simultaneous adaptation for multiple domains and languages. To the best of our knowledge, this is the first attempt to address unsupervised multilingual domain adaptation.

Computation and Language

Dynamically Composing Domain-Data Selection with Clean-Data Selection by Co-Curricular Learning for Neural Machine Translation

Wei Wang, Isaac Caswell, Ciprian Chelba (2019)

Noise and domain are important aspects of data quality for neural machine translation. Existing research focuses separately on domain-data selection, clean-data selection, or their static combination, leaving the dynamic interaction across them not explicitly examined. This paper introduces a co-curricular learning method to compose dynamic domain-data selection with dynamic clean-data selection, for transfer learning across both capabilities. We apply an EM-style optimization procedure to further refine the co-curriculum. Experimental results and analysis with two domains demonstrate the effectiveness of the method and the properties of data scheduled by the co-curriculum.

Computation and Language, Machine Learning

Revisiting Simple Domain Adaptation Methods in Unsupervised Neural Machine Translation

Haipeng Sun, Rui Wang, Kehai Chen (2019)

Domain adaptation has been well-studied in supervised neural machine translation (SNMT). However, it has not been well-studied for unsupervised neural machine translation (UNMT), although UNMT has recently achieved remarkable results in several domain-specific language pairs. Besides the inconsistent domains between training data and test data for SNMT, there sometimes exists an inconsistent domain between the two monolingual training datasets for UNMT. In this work, we empirically show different scenarios for unsupervised neural machine translation. Based on these scenarios, we revisit the effect of the existing domain adaptation methods, including batch weighting and fine-tuning methods, in UNMT. Finally, we propose modified methods to improve the performance of domain-specific UNMT systems.

Computation and Language
