
Arabic documents classification system

Arabic title (translated): A system for classifying Arabic documents according to their content

Publication date: 2012
Research language: Arabic
Created by Shadi Saleh





No English abstract

References used
Larkey, L.S., L. Ballesteros, and M.E. Connell, Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis, in Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval. 2002, ACM: Tampere, Finland. p. 275-282.
Al-Shammari, E.T., Improving Arabic document categorization: Introducing local stem, in Intelligent Systems Design and Applications (ISDA), 2010 10th International Conference on. 2010.
Porter, M.F., An algorithm for suffix stripping, in Readings in information retrieval, K. Sparck Jones and P. Willett, Editors. 1997, Morgan Kaufmann Publishers Inc. p. 313-316.
Al-Shammari, E. and J. Lin, A new Arabic stemming algorithm, in Proceedings of the 2008 ISCA Workshop on Experimental Linguistics. 2008.

Related research

In this paper, we introduce an algorithm for grouping Arabic documents in order to build an ontology and its vocabulary. We ran the algorithm, implemented in Java, on five ontologies. Processing the documents yielded 338,667 words, each with a weight for its corresponding ontology. The algorithm proved effective at improving the performance of the classifiers tested in this study (SVM, NB) compared with earlier classifier results for Arabic.
In this paper, we present a Modern Standard Arabic (MSA) sentence difficulty classifier, which predicts the difficulty of sentences for language learners using either the CEFR proficiency levels or a binary classification as simple or complex. We compare the use of sentence embeddings of different kinds (fastText, mBERT, XLM-R and Arabic-BERT), as well as traditional language features such as POS tags, dependency trees, readability scores and frequency lists for language learners. Our best results were achieved using fine-tuned Arabic-BERT. Our 3-way CEFR classification reaches an F-1 of 0.80 with Arabic-BERT and 0.75 with XLM-R, and the regression reaches a Spearman correlation of 0.71. Our binary difficulty classifier reaches an F-1 of 0.94, and the sentence-pair semantic similarity classifier reaches an F-1 of 0.98.
Sentiment classification and sarcasm detection attract a lot of attention from the NLP research community. However, solving these two problems in Arabic and on the basis of social network data (i.e., Twitter) still receives less interest. In this paper we present dedicated solutions for the sentiment classification and sarcasm detection tasks that were introduced as part of a shared task by Abu Farha et al. (2021). We adapt existing state-of-the-art pretrained transformer models to our needs. In addition, we use a variety of machine-learning techniques such as down-sampling, augmentation, bagging, and meta-features to improve the models' performance. We achieve an F1-score of 0.75 on the sentiment classification problem, where the F1-score is calculated over the positive and negative classes (the neutral class is not taken into account). We achieve an F1-score of 0.66 on the sarcasm detection problem, where the F1-score is calculated over the sarcastic class only. In both cases, the reported results are evaluated on ArSarcasm-v2, an extended version of the ArSarcasm dataset (Farha and Magdy, 2020) that was introduced as part of the shared task. This is an improvement over the state-of-the-art results in both tasks.
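The two scores in the abstract above are computed over a subset of the classes. A minimal Python sketch of such a restricted F1 follows. It assumes the score is the unweighted average of per-class F1 over the chosen labels; the function names and the label strings are hypothetical and are not taken from the shared task's evaluation code.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 for a single class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def restricted_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 over `labels` only (assumed averaging)."""
    return sum(f1_for_class(y_true, y_pred, lab) for lab in labels) / len(labels)


# Toy data with hypothetical label names.
gold = ["pos", "neg", "neu", "pos"]
pred = ["pos", "neg", "pos", "neu"]

# Sentiment-style score: neutral examples still count as errors for the
# other classes, but no F1 is computed for the neutral class itself.
sentiment_score = restricted_f1(gold, pred, ["pos", "neg"])
```

Passing a single label, for example `restricted_f1(gold_sarcasm, pred_sarcasm, ["sarcastic"])`, gives the one-class variant described for sarcasm detection.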
Studies on the computational processing of Arabic are of great importance given how widely the language is used. In this study we chose to work on Arabic language processing through an information retrieval system for Arabic documents. The basic idea of the system is to analyze Arabic documents and texts, build indexes of the terms they contain, and then extract weight vectors that represent these documents, so that a query can later be processed and compared against these vectors to obtain the documents that match it. Stemming the terms in the documents yielded better retrieval efficiency, and we reviewed many of the stemming algorithms developed in previous studies. Document clustering is an important addition: it lets the user find documents that are similar to the search result and related to the entered query. In the practical application, we built a desktop information retrieval system that reads texts of different types and displays the results together with their corresponding clusters.
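The pipeline described in the abstract above (index the terms, derive a weight vector per document, compare the query against those vectors) can be illustrated with a small vector-space sketch in Python. It assumes TF-IDF weights and cosine similarity, which the abstract does not specify, and uses a whitespace tokenizer where a real Arabic system would also apply stemming. All function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization only; no stemming or normalization.
    return text.split()


def weight_vector(tokens, idf):
    """Sparse TF-IDF vector as {term: weight}; unknown terms are dropped."""
    tf = Counter(tokens)
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def build_index(docs):
    """Return (idf, vectors): smoothed IDF per term and one vector per document."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency counts each doc once
    idf = {term: math.log(len(docs) / n) + 1.0 for term, n in df.items()}
    return idf, [weight_vector(tokens, idf) for tokens in tokenized]


def cosine(u, v):
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, idf, vectors):
    """Indices of documents with non-zero similarity, best match first."""
    q = weight_vector(tokenize(query), idf)
    scored = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    return [i for score, i in sorted(scored, reverse=True) if score > 0.0]


# Toy corpus in English for readability; the code is language-agnostic.
docs = ["book computer science", "football is sport", "data science"]
idf, vectors = build_index(docs)
hits = search("science", idf, vectors)
```

The same document vectors could also feed a clustering step, as the abstract mentions, for instance by grouping documents whose pairwise cosine similarity exceeds a chosen threshold.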
An expert system was developed to determine the grammatical case of words in Arabic phrases without diacritics. First, the system obtains each word's morphology and tags using a Microsoft tool (ATK); it then relies on Arabic grammar rules to determine the grammatical case of words in nominal phrases. The system gave very good results when compared with the judgments of an Arabic language expert.