Discovering Representation Sprachbund For Multilingual Pre-Training



Added by Association for Computational Linguistics (article)

Publication date 2021

Field: Artificial Intelligence

Research language: English

Created by Shamra Editor


Abstract

Multilingual pre-trained models have proven effective in many multilingual NLP tasks and enable zero-shot or few-shot transfer from high-resource languages to low-resource ones. However, because of significant typological differences and contradictions between some languages, such models usually perform poorly on many languages and cross-lingual settings, which shows how difficult it is for a single model to handle a large number of diverse languages well at the same time. To alleviate this issue, we present a new multilingual pre-training pipeline. We propose to generate language representations from a multilingual pre-trained model and conduct linguistic analysis to show that language representation similarity reflects linguistic similarity from multiple perspectives, including language family, geographical sprachbund, lexicostatistics, and syntax. We then cluster all the target languages into multiple groups and call each group a representation sprachbund. Languages in the same representation sprachbund are thus expected to boost each other in both pre-training and fine-tuning, as they share rich linguistic similarity. We pre-train one multilingual model for each representation sprachbund. Experiments on cross-lingual benchmarks show significant improvements over strong baselines.
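The clustering step described in the abstract can be illustrated with a minimal sketch. It assumes that a language representation is the mean of a few sentence embeddings for that language, and that languages are grouped by greedy average-linkage clustering on cosine similarity; the abstract specifies neither choice. The helper names (`mean_vector`, `cosine`, `cluster_languages`), the language codes, and the toy vectors are hypothetical and are not taken from the paper.

```python
import math

def mean_vector(vectors):
    """Average equal-length vectors into one language representation (assumed pooling)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def cluster_languages(lang_vecs, num_groups):
    """Merge the two most similar clusters (average linkage) until num_groups remain."""
    clusters = [[lang] for lang in sorted(lang_vecs)]

    def linkage(c1, c2):
        sims = [cosine(lang_vecs[a], lang_vecs[b]) for a in c1 for b in c2]
        return sum(sims) / len(sims)

    while len(clusters) > num_groups:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = max(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters

# Toy sentence embeddings per language (made-up numbers for illustration only).
toy_sentence_embeddings = {
    "es": [[0.9, 0.1, 0.0], [1.0, 0.2, 0.1]],
    "pt": [[0.8, 0.2, 0.0], [0.9, 0.1, 0.1]],
    "hi": [[0.1, 0.9, 0.2], [0.0, 1.0, 0.1]],
    "ur": [[0.2, 0.8, 0.1], [0.1, 0.9, 0.2]],
}
lang_vecs = {lang: mean_vector(vs) for lang, vs in toy_sentence_embeddings.items()}
sprachbunds = cluster_languages(lang_vecs, num_groups=2)
print(sprachbunds)  # prints [['es', 'pt'], ['hi', 'ur']]
```

In the pipeline the abstract describes, each resulting group would then get its own pre-trained multilingual model; that step is outside the scope of this sketch.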

References used

https://aclanthology.org/


Language-agnostic Representation from Multilingual Sentence Encoders for Cross-lingual Similarity Estimation

Association for Computational Linguistics, 2021 (article)

We propose a method to distill a language-agnostic meaning embedding from a multilingual sentence encoder. By removing language-specific information from the original embedding, we retrieve an embedding that fully represents the sentence's meaning. The proposed method relies only on parallel corpora without any human annotations. Our meaning embedding allows efficient cross-lingual sentence similarity estimation by simple cosine similarity calculation. Experimental results on both quality estimation of machine translation and cross-lingual semantic textual similarity tasks reveal that our method consistently outperforms the strong baselines using the original multilingual embedding. Our method consistently improves the performance of any pre-trained multilingual sentence encoder, even in low-resource language pairs where only tens of thousands of parallel sentence pairs are available.
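To make the idea of comparing sentences across languages by cosine similarity concrete, here is a rough sketch. It does not implement the distillation method from the abstract. As a much simpler stand-in, it subtracts a per-language mean vector from each embedding before computing cosine similarity. The function `remove_language_mean` and the two-dimensional toy embeddings are hypothetical, chosen so that each language carries a constant offset on top of a shared meaning component.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def mean_vector(vectors):
    """Average equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def remove_language_mean(vec, lang_mean):
    """Stand-in for removing language-specific information: subtract the language mean."""
    return [x - m for x, m in zip(vec, lang_mean)]

# Toy embeddings: sentence 0 in "en" and sentence 0 in "de" are translations of
# each other, but each language adds its own offset (made-up numbers).
embeddings = {
    "en": [[2.3, 0.1], [1.7, -0.1]],
    "de": [[0.3, 2.1], [-0.3, 1.9]],
}
lang_means = {lang: mean_vector(vs) for lang, vs in embeddings.items()}

en_sent, de_sent = embeddings["en"][0], embeddings["de"][0]
raw_sim = cosine(en_sent, de_sent)
centered_sim = cosine(
    remove_language_mean(en_sent, lang_means["en"]),
    remove_language_mean(de_sent, lang_means["de"]),
)
print(f"raw: {raw_sim:.2f}, centered: {centered_sim:.2f}")
```

On these toy values the raw similarity of the translation pair is low because the language offsets dominate, while the centered similarity is close to 1.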

Keywords: language-agnostic representation, multilingual sentence encoder, sentence encoder

ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora

Association for Computational Linguistics, 2021 (article)

Recent studies have demonstrated that pre-trained cross-lingual models achieve impressive performance in downstream cross-lingual tasks. This improvement benefits from learning a large amount of monolingual and parallel corpora. Although it is generally acknowledged that parallel corpora are critical for improving the model performance, existing methods are often constrained by the size of parallel corpora, especially for low-resource languages. In this paper, we propose Ernie-M, a new training method that encourages the model to align the representation of multiple languages with monolingual corpora, to overcome the constraint that the parallel corpus size places on the model performance. Our key insight is to integrate back-translation into the pre-training process. We generate pseudo-parallel sentence pairs on a monolingual corpus to enable the learning of semantic alignments between different languages, thereby enhancing the semantic modeling of cross-lingual models. Experimental results show that Ernie-M outperforms existing cross-lingual models and delivers new state-of-the-art results in various cross-lingual downstream tasks. The codes and pre-trained models will be made publicly available.

Keywords: enhanced multilingual representation, enhanced multilingual, aligning cross-lingual semantics

Effectiveness of Pre-training for Few-shot Intent Classification

Association for Computational Linguistics, 2021 (article)

This paper investigates the effectiveness of pre-training for few-shot intent classification. While existing paradigms commonly further pre-train language models such as BERT on a vast amount of unlabeled corpus, we find it highly effective and efficient to simply fine-tune BERT with a small set of labeled utterances from public datasets. Specifically, fine-tuning BERT with roughly 1,000 labeled data yields a pre-trained model -- IntentBERT, which can easily surpass the performance of existing pre-trained models for few-shot intent classification on novel domains with very different semantics. The high effectiveness of IntentBERT confirms the feasibility and practicality of few-shot intent detection, and its high generalization ability across different domains suggests that intent classification tasks may share a similar underlying structure, which can be efficiently learned from a small set of labeled data. The source code can be found at https://github.com/hdzhang-code/IntentBERT.

Keywords: few-shot intent

Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models

Association for Computational Linguistics, 2021 (article)

This paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextual multilingual multimodal embeddings. Under a zero-shot setting, we empirically demonstrate that performance degrades significantly when we query the multilingual text-video model with non-English sentences. To address this problem, we introduce a multilingual multimodal pre-training strategy, and collect a new multilingual instructional video dataset (Multi-HowTo100M) for pre-training. Experiments on VTT show that our method significantly improves video search in non-English languages without additional annotations. Furthermore, when multilingual annotations are available, our method outperforms recent baselines by a large margin in multilingual text-to-video search on VTT and VATEX, as well as in multilingual text-to-image search on Multi30K. Our model and Multi-HowTo100M are available at http://github.com/berniebear/Multi-HT100M.

Keywords: zero-shot cross-lingual transfer, cross-lingual transfer, transfer of vision-language

mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

Association for Computational Linguistics, 2021 (article)

The recent "Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We detail the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. We also describe a simple technique to prevent "accidental translation" in the zero-shot setting, where a generative model chooses to (partially) translate its prediction into the wrong language. All of the code and model checkpoints used in this work are publicly available.

Keywords: massively multilingual pre-trained, English-language NLP tasks, transfer transformer
