Bengali is a low-resource language that lacks tools and resources for profane and obscene textual content detection. Until now, no lexicon exists for detecting obscenity in Bengali social media text. This study introduces a Bengali obscene lexicon consisting of over 200 Bengali terms, which can be considered filthy, slang, profane or obscene. A semi-automatic methodology is presented for developing the profane lexicon that leverages an obscene corpus, word embedding, and part-of-speech (POS) taggers. The developed lexicon achieves coverage of around 0.85 for obscene and profane content detection in an evaluation dataset. The experimental results imply that the developed lexicon is effective at identifying obscenity in Bengali social media content.
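The abstract reports a coverage of around 0.85 but does not say how coverage is computed. The following is a minimal sketch under one plausible reading: coverage as the fraction of obscene texts in an evaluation set that contain at least one lexicon term. The function names, the whitespace tokenizer, and the placeholder tokens are all assumptions for illustration and are not taken from the study.

```python
from typing import Iterable, List, Set


def tokenize(text: str) -> List[str]:
    # Assumption: plain whitespace tokenization with lowercasing.
    # Real Bengali social media text would need a proper tokenizer.
    return text.lower().split()


def lexicon_coverage(texts: Iterable[str], lexicon: Set[str]) -> float:
    """Return the fraction of texts containing at least one lexicon term."""
    texts = list(texts)
    if not texts:
        return 0.0
    hits = sum(
        1 for text in texts
        if any(token in lexicon for token in tokenize(text))
    )
    return hits / len(texts)


# Hypothetical placeholder tokens stand in for real lexicon entries.
lexicon = {"term_a", "term_b"}

# Hypothetical evaluation texts, all assumed to be labeled obscene.
obscene_texts = [
    "some text with term_a inside",
    "term_b appears here",
    "no listed word in this one",
    "TERM_A in upper case",
]

coverage = lexicon_coverage(obscene_texts, lexicon)
print(coverage)  # 3 of the 4 texts contain a lexicon term
```

Under this reading, a coverage of 0.85 would mean that roughly 85 of every 100 obscene texts in the evaluation data contain at least one term from the lexicon.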
References used
https://aclanthology.org/
Dictionary-based methods in sentiment analysis have received scholarly attention recently, the most comprehensive examples of which can be found in English. However, many other languages lack polarity dictionaries, or the existing ones are small in s
Quality Estimation (QE) for Machine Translation has been shown to reach relatively high accuracy in predicting sentence-level scores, relying on pretrained contextual embeddings and human-produced quality scores. However, the lack of explanations alo
The lexicon plays an essential role in natural language processing systems, especially machine translation systems, because it provides the system's components with the information needed for the translation process. Although there have been a number of studies in the field of natural language processing, not enough attention has been given to the importance of the lexicon, especially the Arabic lexicon.
This paper describes the model built for the SIGTYP 2021 Shared Task aimed at identifying 18 typologically different languages from speech recordings. Mel-frequency cepstral coefficients derived from audio files are transformed into spectrograms, whi
In this paper, we introduce FITAnnotator, a generic web-based tool for efficient text annotation. Benefiting from the fully modular architecture design, FITAnnotator provides a systematic solution for the annotation of a variety of natural language p