BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding


Abstract

In this paper, we introduce the "Embedding Barrier", a phenomenon that limits the monolingual performance of multilingual models on low-resource languages with unique typologies. We build BanglaBERT, a Bangla language model pretrained on 18.6 GB of Internet-crawled data, and benchmark it on five standard NLU tasks. We observe a significant drop in the performance of the state-of-the-art multilingual model (XLM-R) relative to BanglaBERT and, through comprehensive experiments, attribute this drop to the Embedding Barrier. We find that a multilingual model's performance on a low-resource language suffers when the language's writing script is not similar to that of any high-resource language. To tackle the barrier, we propose a straightforward solution: transcribing languages to a common script, which effectively improves the performance of a multilingual model on Bangla. As a by-product of the standard NLU benchmarks, we introduce a new downstream dataset for natural language inference (NLI) and show that BanglaBERT outperforms previous state-of-the-art results on all tasks by up to 3.5%. We are making the BanglaBERT language model and the new Bangla NLI dataset publicly available in the hope of advancing the community. The resources can be found at https://github.com/csebuetnlp/banglabert.
