The widespread presence of offensive language on social media has motivated the development of systems capable of recognizing such content automatically. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English. To address this shortcoming, we introduce MOLD, the Marathi Offensive Language Dataset. MOLD is the first dataset of its kind compiled for Marathi, thus opening a new domain for research in low-resource Indo-Aryan languages. We present results from several machine learning experiments on this dataset, including zero-shot and other transfer learning experiments with state-of-the-art cross-lingual transformers, using existing data in Bengali, English, and Hindi.
References used
https://aclanthology.org/
Cross-lingual word embeddings (CLWEs) have proven indispensable for various natural language processing tasks, e.g., bilingual lexicon induction (BLI). However, the lack of data often impairs the quality of representations. Various approaches requiri
In this paper, we explore different techniques for overcoming the challenges of low-resource Neural Machine Translation (NMT), focusing specifically on the case of English-Marathi NMT. NMT systems require a large amount of parallel corpora to
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus (e.g. a few hundred sentence pairs). Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously transla
In this work, we investigate methods for the challenging task of translating between low-resource language pairs that exhibit some level of similarity. In particular, we consider the utility of transfer learning for translating between several Indo-
This paper describes TenTrans' submission to WMT21 Multilingual Low-Resource Translation shared task for the Romance language pairs. This task focuses on improving translation quality from Catalan to Occitan, Romanian and Italian, with the assistance