Code-switching (CS), a ubiquitous phenomenon owing to the ease of communication it offers in multilingual communities, remains an understudied problem in language processing. The primary reasons are: (1) minimal effort in leveraging large pretrained multilingual models, and (2) the lack of annotated data. What distinguishes CS as a case where multilingual models perform poorly is the intra-sentence mixing of languages, which gives rise to switch points. We first benchmark two sequence labeling tasks, POS tagging and NER, on 4 different language pairs with a suite of pretrained models to identify the problems and select the best-performing char-BERT model among them (addressing (1)). We then propose a self-training method that repurposes existing pretrained models using a switch-point bias by leveraging unannotated data (addressing (2)). Finally, we demonstrate that our approach performs well on both tasks by narrowing the gap between switch-point performance and overall performance while retaining the overall performance on two distinct language pairs in both tasks. We plan to release our models and the code for all our experiments.
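To make the notion of a switch point concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: it assumes every token carries a language ID, treats a switch point as any token whose language ID differs from that of the preceding token, and measures the gap as overall accuracy minus accuracy at switch points. The function names (`switch_points`, `accuracy`, `switch_point_gap`) and the example sentence are hypothetical.

```python
from typing import List, Optional, Sequence


def switch_points(lang_ids: Sequence[str]) -> List[int]:
    """Indices of tokens whose language differs from the previous token.

    Assumption: this is one simple definition of a switch point; it does not
    treat language-neutral tokens (punctuation, named entities) specially.
    """
    return [i for i in range(1, len(lang_ids)) if lang_ids[i] != lang_ids[i - 1]]


def accuracy(gold: Sequence[str], pred: Sequence[str],
             indices: Optional[Sequence[int]] = None) -> float:
    """Token-level accuracy, optionally restricted to the given indices."""
    idx = list(range(len(gold))) if indices is None else list(indices)
    if not idx:
        return float("nan")  # no tokens to score (e.g. a monolingual sentence)
    return sum(gold[i] == pred[i] for i in idx) / len(idx)


def switch_point_gap(lang_ids: Sequence[str], gold: Sequence[str],
                     pred: Sequence[str]) -> float:
    """Overall accuracy minus accuracy at switch points (hypothetical metric)."""
    return accuracy(gold, pred) - accuracy(gold, pred, switch_points(lang_ids))


# Hypothetical English-Hindi example with POS tags.
langs = ["en", "en", "hi", "hi", "en"]
gold = ["PRON", "VERB", "NOUN", "ADV", "ADV"]
pred = ["PRON", "VERB", "VERB", "ADV", "ADV"]

print(switch_points(langs))                 # indices where the language changes
print(switch_point_gap(langs, gold, pred))  # positive: worse at switch points
```

A positive gap means the tagger does worse at switch points than on the sentence as a whole, which is the disparity the proposed self-training method aims to reduce.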
References used
https://aclanthology.org/