Do you want to publish a course? Click here

Dialect Identification in Nuanced Arabic Tweets Using Farasa Segmentation and AraBERT

تحديد الهياكل في تغريدات عربية محددة باستخدام تجزئة Farasa

194   0   0   0.0 ( 0 )
 Publication date 2021
and research's language is English
 Created by Shamra Editor




Ask ChatGPT about the research

This paper presents our approach to address the EACL WANLP-2021 Shared Task 1: Nuanced Arabic Dialect Identification (NADI). The task is aimed at developing a system that identifies the geographical location(country/province) from where an Arabic tweet in the form of modern standard Arabic or dialect comes from. We solve the task in two parts. The first part involves pre-processing the provided dataset by cleaning, adding and segmenting various parts of the text. This is followed by carrying out experiments with different versions of two Transformer based models, AraBERT and AraELECTRA. Our final approach achieved macro F1-scores of 0.216, 0.235, 0.054, and 0.043 in the four subtasks, and we were ranked second in MSA identification subtasks and fourth in DA identification subtasks.

References used
https://aclanthology.org/
rate research

Read More

This paper presents our strategy to tackle the EACL WANLP-2021 Shared Task 2: Sarcasm and Sentiment Detection. One of the subtasks aims at developing a system that identifies whether a given Arabic tweet is sarcastic in nature or not, while the other aims to identify the sentiment of the Arabic tweet. We approach the task in two steps. The first step involves pre processing the provided dataset by performing insertions, deletions and segmentation operations on various parts of the text. The second step involves experimenting with multiple variants of two transformer based models, AraELECTRA and AraBERT. Our final approach was ranked seventh and fourth in the Sarcasm and Sentiment Detection subtasks respectively.
The spread of COVID-19 has been accompanied with widespread misinformation on social media. In particular, Twitterverse has seen a huge increase in dissemination of distorted facts and figures. The present work aims at identifying tweets regarding CO VID-19 which contains harmful and false information. We have experimented with a number of Deep Learning-based models, including different word embeddings, such as Glove, ELMo, among others. BERTweet model achieved the best overall F1-score of 0.881 and secured the third rank on the above task.
We present the findings and results of theSecond Nuanced Arabic Dialect IdentificationShared Task (NADI 2021). This Shared Taskincludes four subtasks: country-level ModernStandard Arabic (MSA) identification (Subtask1.1), country-level dialect identi fication (Subtask1.2), province-level MSA identification (Subtask2.1), and province-level sub-dialect identifica-tion (Subtask 2.2). The shared task dataset cov-ers a total of 100 provinces from 21 Arab coun-tries, collected from the Twitter domain. A totalof 53 teams from 23 countries registered to par-ticipate in the tasks, thus reflecting the interestof the community in this area. We received 16submissions for Subtask 1.1 from five teams, 27submissions for Subtask 1.2 from eight teams,12 submissions for Subtask 2.1 from four teams,and 13 Submissions for subtask 2.2 from fourteams.
Irony and Sentiment detection is important to understand people's behavior and thoughts. Thus it has become a popular task in natural language processing (NLP). This paper presents results and main findings in WANLP 2021 shared tasks one and two. The task was based on the ArSarcasm-v2 dataset (Abu Farha et al., 2021). In this paper, we describe our system Multi-headed-LSTM-CNN-GRU and also MARBERT (Abdul-Mageed et al., 2021) submitted for the shared task, ranked 10 out of 27 in shared task one achieving 0.5662 F1-Sarcasm and ranked 3 out of 22 in shared task two achieving 0.7321 F1-PN under CodaLab username rematchka''. We experimented with various models and the two best performing models are a Multi-headed CNN-LSTM-GRU in which we used prepossessed text and emoji presented from tweets and MARBERT.
Over the past few months, there were huge numbers of circulating tweets and discussions about Coronavirus (COVID-19) in the Arab region. It is important for policy makers and many people to identify types of shared tweets to better understand public behavior, topics of interest, requests from governments, sources of tweets, etc. It is also crucial to prevent spreading of rumors and misinformation about the virus or bad cures. To this end, we present the largest manually annotated dataset of Arabic tweets related to COVID-19. We describe annotation guidelines, analyze our dataset and build effective machine learning and transformer based models for classification.

suggested questions

comments
Fetching comments Fetching comments
Sign in to be able to follow your search criteria
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا