BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding


Added by Rifat Shahriyar

Publication date 2021

Fields Informatics Engineering

Research language English

Authors Abhik Bhattacharjee - Tahmid Hasan - Kazi Samin

Computation and Language


In this paper, we introduce the "Embedding Barrier", a phenomenon that limits the monolingual performance of multilingual models on low-resource languages having unique typologies. We build "BanglaBERT", a Bangla language model pretrained on 18.6 GB of Internet-crawled data, and benchmark it on five standard NLU tasks. We discover a significant drop in the performance of the state-of-the-art multilingual model (XLM-R) relative to BanglaBERT and attribute this to the Embedding Barrier through comprehensive experiments. We identify that a multilingual model's performance on a low-resource language is hurt when its writing script is not similar to that of any of the high-resource languages. To tackle the barrier, we propose a straightforward solution: transcribing languages to a common script, which can effectively improve the performance of a multilingual model for the Bangla language. As a by-product of the standard NLU benchmarks, we introduce a new downstream dataset on natural language inference (NLI) and show that BanglaBERT outperforms previous state-of-the-art results on all tasks by up to 3.5%. We are making the BanglaBERT language model and the new Bangla NLI dataset publicly available in the hope of advancing the community. The resources can be found at https://github.com/csebuetnlp/banglabert.
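
The idea of transcribing text to a common script can be illustrated with a minimal sketch. This is a toy example under stated assumptions, not the authors' procedure: the abstract does not describe the transcription scheme, and the names `TOY_TABLE` and `transliterate`, as well as the tiny character table itself, are hypothetical and cover only a handful of Bangla characters.

```python
from typing import Mapping

# Hypothetical toy table: a few Bangla code points mapped to Latin letters.
# This is NOT a real romanization standard and NOT the paper's scheme.
TOY_TABLE = {
    "\u0995": "k",   # KA
    "\u09ac": "b",   # BA
    "\u09b2": "l",   # LA
    "\u09ae": "m",   # MA
    "\u09a8": "n",   # NA
    "\u09be": "a",   # vowel sign AA
    "\u09bf": "i",   # vowel sign I
    "\u0982": "ng",  # anusvara
}


def transliterate(text: str, table: Mapping[str, str] = TOY_TABLE) -> str:
    """Replace each character found in the table; leave all others unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # The word "Bangla" written in Bangla script (BA, AA, anusvara, LA, AA).
    word = "\u09ac\u09be\u0982\u09b2\u09be"
    print(transliterate(word))
```

A character-by-character lookup like this ignores features of the script such as inherent vowels and conjuncts, so a realistic system would need a proper transliteration scheme. The sketch only shows the general shape of the step: text in one script goes in, and text in a script shared with high-resource languages comes out.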


AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages

94 - Abteen Ebrahimi , Manuel Mager , Arturo Oncevay 2021

Pretrained multilingual models are able to perform cross-lingual transfer in a zero-shot setting, even for languages unseen during pretraining. However, prior work evaluating performance on unseen languages has largely been limited to low-level, syntactic tasks, and it remains unclear if zero-shot learning of high-level, semantic tasks is possible for unseen languages. To explore this question, we present AmericasNLI, an extension of XNLI (Conneau et al., 2018) to 10 indigenous languages of the Americas. We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches. Additionally, we explore model adaptation via continued pretraining and provide an analysis of the dataset by considering hypothesis-only models. We find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%. Continued pretraining offers improvements, with an average accuracy of 44.05%. Surprisingly, training on poorly translated data by far outperforms all other methods with an accuracy of 48.72%.

Computation and Language

Phoneme Level Language Models for Sequence Based Low Resource ASR

79 - Siddharth Dalmia , Xinjian Li , Alan W Black 2019

Building multilingual and crosslingual models helps bring different languages together in a language-universal space. It allows models to share parameters and transfer knowledge across languages, enabling faster and better adaptation to a new language. These approaches are particularly useful for low-resource languages. In this paper, we propose a phoneme-level language model that can be used multilingually and for crosslingual adaptation to a target language. We show that our model performs almost as well as the monolingual models while using six times fewer parameters, and is capable of better adaptation to languages not seen during training in a low-resource scenario. We show that these phoneme-level language models can be used to decode sequence-based Connectionist Temporal Classification (CTC) acoustic model outputs to obtain word error rates comparable with Weighted Finite State Transducer (WFST) based decoding in Babel languages. We also show that these phoneme-level language models outperform WFST decoding in various low-resource conditions, such as adapting to a new language and domain mismatch between training and testing data.

Computation and Language

Multilingual LAMA: Investigating Knowledge in Multilingual Pretrained Language Models

180 - Nora Kassner , Philipp Dufter , Hinrich Schütze 2021

Recently, it has been found that monolingual English language models can be used as knowledge bases. Instead of structural knowledge base queries, masked sentences such as "Paris is the capital of [MASK]" are used as probes. We translate the established benchmarks TREx and GoogleRE into 53 languages. Working with mBERT, we investigate three questions. (i) Can mBERT be used as a multilingual knowledge base? Most prior work only considers English. Extending research to multiple languages is important for diversity and accessibility. (ii) Is mBERT's performance as a knowledge base language-independent or does it vary from language to language? (iii) A multilingual model is trained on more text, e.g., mBERT is trained on 104 Wikipedias. Can mBERT leverage this for better performance? We find that using mBERT as a knowledge base yields varying performance across languages and that pooling predictions across languages improves performance. Conversely, mBERT exhibits a language bias; e.g., when queried in Italian, it tends to predict Italy as the country of origin.

Computation and Language

Discrete Word Embedding for Logical Natural Language Understanding

111 - Masataro Asai , Zilu Tang 2020

We propose an unsupervised neural model for learning a discrete embedding of words. Unlike existing discrete embeddings, our binary embedding supports vector arithmetic operations similar to continuous embeddings. Our embedding represents each word as a set of propositional statements describing a transition rule in the classical/STRIPS planning formalism. This makes the embedding directly compatible with symbolic, state-of-the-art classical planning solvers.

Computation and Language; Artificial Intelligence

Adversarial Meta Sampling for Multilingual Low-Resource Speech Recognition

398 - Yubei Xiao , Ke Gong , Pan Zhou 2020

Low-resource automatic speech recognition (ASR) is challenging, as the limited target-language data cannot train an ASR model well. To solve this issue, meta-learning formulates ASR for each source language into many small ASR tasks and meta-learns a model initialization on all tasks from different source languages to achieve fast adaptation on unseen target languages. However, the quantity and difficulty of tasks vary greatly across source languages because of their different data scales and diverse phonological systems, which leads to task-quantity and task-difficulty imbalance issues and thus a failure of multilingual meta-learning ASR (MML-ASR). In this work, we solve this problem by developing a novel adversarial meta sampling (AMS) approach to improve MML-ASR. When sampling tasks in MML-ASR, AMS adaptively determines the task sampling probability for each source language. Specifically, if the query loss for a source language is large, its tasks have not been sampled well enough, in terms of quantity and difficulty, to train the ASR model, and thus should be sampled more frequently for extra learning. Inspired by this fact, we feed the historical task query losses of all source language domains into a network to learn a task sampling policy for adversarially increasing the current query loss of MML-ASR. Thus, the learnt task sampling policy can master the learning situation of each language and predict a good task sampling probability for each language for more effective learning. Finally, experimental results on two multilingual datasets show significant performance improvement when applying our AMS to MML-ASR, and also demonstrate the applicability of AMS to other low-resource speech tasks and transfer learning ASR approaches.

Computation and Language; Sound; Audio and Speech Processing
