This paper proposes the implementation of WordNets for five South African languages, namely Sepedi, Setswana, Tshivenda, isiZulu, and isiXhosa, to be added to the Open Multilingual Wordnet (OMW) in the Natural Language Toolkit (NLTK). The African WordNets are converted from Princeton WordNet (PWN) 2.0 to 3.0 so that they match the synsets in PWN 3.0. After conversion, there were 7157, 11972, 1288, 6380, and 9460 lemmas for Sepedi, Setswana, Tshivenda, isiZulu, and isiXhosa, respectively. Setswana, isiXhosa, and Sepedi each contain more lemmas than 8 of the languages in OMW, and isiZulu contains more lemmas than 7 of the languages in OMW. A library has been published for the continuous development of African WordNets in OMW using NLTK.
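The per-language lemma counts reported above can be approximated directly from a wordnet file. The following is a minimal sketch, assuming the data is in the tab-separated layout used by OMW 1.x files (`synset-id<TAB>lang:lemma<TAB>word`) and that a lemma is counted once per synset. The sample entries, the `zul` language code, and the `count_lemmas` helper are illustrative assumptions, not taken from the published library, and the paper's exact counting rule may differ.

```python
from collections import defaultdict

# Illustrative sample in the OMW tab-separated layout. The rows are made up
# for this example and are NOT taken from the published African WordNets.
SAMPLE_TAB = """\
# Sample Wordnet\tzul\thttp://example.org\tCC-BY
02084071-n\tzul:lemma\tinja
02121620-n\tzul:lemma\tikati
02084071-n\tzul:def\t0\tsome definition
02084071-n\tzul:lemma\tinja
"""


def count_lemmas(tab_text):
    """Return {language code: number of distinct (synset, lemma) pairs}."""
    seen = defaultdict(set)
    for line in tab_text.splitlines():
        # Skip blank lines and the '#' header line.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        synset_id, kind, lemma = fields[0], fields[1], fields[-1]
        lang, _, relation = kind.partition(":")
        # Only 'lang:lemma' rows carry lemmas; definitions etc. are ignored.
        if relation != "lemma":
            continue
        seen[lang].add((synset_id, lemma.strip()))
    return {lang: len(pairs) for lang, pairs in seen.items()}


counts = count_lemmas(SAMPLE_TAB)
print(counts)
```

Once a wordnet has been loaded into NLTK's OMW corpus, a comparable figure could presumably be obtained with `wn.all_lemma_names(lang=...)` from `nltk.corpus.wordnet`, using whichever language code the wordnet is registered under.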
References used
https://aclanthology.org/