Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages

Added by Association for Computational Linguistics (article)

Publication date 2021

Field: Artificial Intelligence

Research language: English

Abstract

Pretrained multilingual language models have been shown to work well on many languages for a variety of downstream NLP tasks. However, these models are known to require a lot of training data. This consequently leaves out a huge percentage of the world's languages as they are under-resourced. Furthermore, a major motivation behind these models is that lower-resource languages benefit from joint training with higher-resource languages. In this work, we challenge this assumption and present the first attempt at training a multilingual language model on only low-resource languages. We show that it is possible to train competitive multilingual language models on less than 1 GB of text. Our model, named AfriBERTa, covers 11 African languages, including the first language model for 4 of these languages. Evaluations on named entity recognition and text classification spanning 10 languages show that our model outperforms mBERT and XLM-R in several languages and is very competitive overall. Results suggest that our "small data" approach based on similar languages may sometimes work better than joint training on large datasets with high-resource languages. Code, data and models are released at https://github.com/keleog/afriberta.

References used

https://aclanthology.org/

CUNI systems for WMT21: Multilingual Low-Resource Translation for Indo-European Languages Shared Task

Association for Computational Linguistics, 2021 (article)

This paper describes Charles University's submission for the Terminology translation shared task at WMT21. The objective of this task is to design a system which translates certain terms based on a provided terminology database, while preserving high overall translation quality. We competed in the English-French language pair. Our approach is based on providing the desired translations alongside the input sentence and training the model to use these provided terms. We lemmatize the terms during both training and inference, to allow the model to learn how to produce correct surface forms of the words when they differ from the forms provided in the terminology database.

Attentive fine-tuning of Transformers for Translation of low-resourced languages @LoResMT 2021

Association for Computational Linguistics, 2021 (article)

This paper reports the Machine Translation (MT) systems submitted by the IIITT team for the English→Marathi and English⇔Irish language pairs of the LoResMT 2021 shared task. The task focuses on getting exceptional translations for rather low-resourced languages like Irish and Marathi. We fine-tune IndicTrans, a pretrained multilingual NMT model, for English→Marathi, using an external parallel corpus as input for additional training. We have used a pretrained Helsinki-NLP Opus MT English⇔Irish model for the latter language pair. Our approaches yield relatively promising results on the BLEU metrics. Under the team name IIITT, our systems ranked 1, 1, and 2 in English→Marathi, Irish→English, and English→Irish respectively. The code for our systems is published.

TenTrans Multilingual Low-Resource Translation System for WMT21 Indo-European Languages Task

Association for Computational Linguistics, 2021 (article)

This paper describes TenTrans' submission to the WMT21 Multilingual Low-Resource Translation shared task for the Romance language pairs. This task focuses on improving translation quality from Catalan to Occitan, Romanian and Italian, with the assistance of related high-resource languages. We mainly utilize back-translation, pivot-based methods, multilingual models, pre-trained model fine-tuning, and in-domain knowledge transfer to improve the translation quality. On the test set, our best-submitted system achieves an average case-sensitive BLEU score of 43.45 across all low-resource pairs. Our data, code, and pre-trained models used in this work are available in TenTrans evaluation examples.

Wikipedia Entities as Rendezvous across Languages: Grounding Multilingual Language Models by Predicting Wikipedia Hyperlinks

Association for Computational Linguistics, 2021 (article)

Masked language models have quickly become the de facto standard when processing text. Recently, several approaches have been proposed to further enrich word representations with external knowledge sources such as knowledge graphs. However, these models are devised and evaluated in a monolingual setting only. In this work, we propose a language-independent entity prediction task as an intermediate training procedure to ground word representations on entity semantics and bridge the gap across different languages by means of a shared vocabulary of entities. We show that our approach effectively injects new lexical-semantic knowledge into neural models, improving their performance on different semantic tasks in the zero-shot crosslingual setting. As an additional advantage, our intermediate training does not require any supplementary input, allowing our models to be applied to new datasets right away. In our experiments, we use Wikipedia articles in up to 100 languages and already observe consistent gains compared to strong baselines when predicting entities using only the English Wikipedia. Adding further languages leads to improvements in most tasks up to a certain point, but overall we found it non-trivial to scale improvements in model transferability by training on ever-increasing numbers of Wikipedia languages.

Language Model Pretraining and Transfer Learning for Very Low Resource Languages

Association for Computational Linguistics, 2021 (article)

This paper describes our submission for the shared task on Unsupervised MT and Very Low Resource Supervised MT at WMT 2021. We submitted systems for two language pairs: German ↔ Upper Sorbian (de ↔ hsb) and German ↔ Lower Sorbian (de ↔ dsb). For de ↔ hsb, we pretrain our system using the MASS (Masked Sequence to Sequence) objective and then fine-tune using iterative back-translation. Final fine-tuning is performed using the parallel data provided for the translation objective. For de ↔ dsb, no parallel data is provided in the task, so we use the final de ↔ hsb model as initialization of the de ↔ dsb model and train it further using iterative back-translation, with the same vocabulary as used in the de ↔ hsb model.
