FFR v1.1: Fon-French Neural Machine Translation

402 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Chris C. Emezue

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Bonaventure F. P. Dossou - Chris C. Emezue

الحساب واللغة التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

All over the world and especially in Africa, researchers are putting efforts into building Neural Machine Translation (NMT) systems to help tackle the language barriers in Africa, a continent of over 2000 different languages. However, the low-resourceness, diacritical, and tonal complexities of African languages are major issues being faced. The FFR project is a major step towards creating a robust translation model from Fon, a very low-resource and tonal language, to French, for research and public use. In this paper, we introduce FFR Dataset, a corpus of Fon-to-French translations, describe the diacritical encoding process, and introduce our FFR v1.1 model, trained on the dataset. The dataset and model are made publicly available at https://github.com/ bonaventuredossou/ffr-v1, to promote collaboration and reproducibility.

قيم البحث

93 - Bonaventure F. P. Dossou , Chris C. Emezue 2020

Africa has the highest linguistic diversity in the world. On account of the importance of language to communication, and the importance of reliable, powerful and accurate machine translation models in modern inter-cultural communication, there have b een (and still are) efforts to create state-of-the-art translation models for the many African languages. However, the low-resources, diacritical and tonal complexities of African languages are major issues facing African NLP today. The FFR is a major step towards creating a robust translation model from Fon, a very low-resource and tonal language, to French, for research and public use. In this paper, we describe our pilot project: the creation of a large growing corpora for Fon-to-French translations and our FFR v1.0 model, trained on this dataset. The dataset and model are made publicly available.

الحساب واللغة

Efficient Bidirectional Neural Machine Translation

137 - Xu Tan , Yingce Xia , Lijun Wu 2019

The encoder-decoder based neural machine translation usually generates a target sequence token by token from left to right. Due to error propagation, the tokens in the right side of the generated sequence are usually of poorer quality than those in t he left side. In this paper, we propose an efficient method to generate a sequence in both left-to-right and right-to-left manners using a single encoder and decoder, combining the advantages of both generation directions. Experiments on three translation tasks show that our method achieves significant improvements over conventional unidirectional approach. Compared with ensemble methods that train and combine two models with different generation directions, our method saves 50% model parameters and about 40% training time, and also improve inference speed.

الحساب واللغة التعلم الآلي

Echo State Neural Machine Translation

268 - Ankush Garg , Yuan Cao , 2020

We present neural machine translation (NMT) models inspired by echo state network (ESN), named Echo State NMT (ESNMT), in which the encoder and decoder layer weights are randomly generated then fixed throughout training. We show that even with this e xtremely simple model construction and training procedure, ESNMT can already reach 70-80% quality of fully trainable baselines. We examine how spectral radius of the reservoir, a key quantity that characterizes the model, determines the model behavior. Our findings indicate that randomized networks can work well even for complicated sequence-to-sequence prediction NLP tasks.

الحساب واللغة التعلم الآلي الحوسبة العصبية والتطورية

Crowdsourced Phrase-Based Tokenization for Low-Resourced Neural Machine Translation: The Case of Fon Language

109 - Bonaventure F. P. Dossou , Chris C. Emezue 2021

Building effective neural machine translation (NMT) models for very low-resourced and morphologically rich African indigenous languages is an open challenge. Besides the issue of finding available resources for them, a lot of work is put into preproc essing and tokenization. Recent studies have shown that standard tokenization methods do not always adequately deal with the grammatical, diacritical, and tonal properties of some African languages. That, coupled with the extremely low availability of training samples, hinders the production of reliable NMT models. In this paper, using Fon language as a case study, we revisit standard tokenization methods and introduce Word-Expressions-Based (WEB) tokenization, a human-involved super-words tokenization strategy to create a better representative vocabulary for training. Furthermore, we compare our tokenization strategy to others on the Fon-French and French-Fon translation tasks.

الحساب واللغة الذكاء الاصطناعي

On Compositionality in Neural Machine Translation

291 - Vikas Raunak , Vaibhav Kumar , Florian Metze 2019

We investigate two specific manifestations of compositionality in Neural Machine Translation (NMT) : (1) Productivity - the ability of the model to extend its predictions beyond the observed length in training data and (2) Systematicity - the ability of the model to systematically recombine known parts and rules. We evaluate a standard Sequence to Sequence model on tests designed to assess these two properties in NMT. We quantitatively demonstrate that inadequate temporal processing, in the form of poor encoder representations is a bottleneck for both Productivity and Systematicity. We propose a simple pre-training mechanism which alleviates model performance on the two properties and leads to a significant improvement in BLEU scores.

الحساب واللغة التعلم الآلي