Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Introducing A large Tunisian Arabizi Dialectal Dataset for Sentiment Analysis

تقديم مجموعة بيانات تونسية كبيرة تونسية لتحليل المعنويات

554 0 0 0.0 ( 0 )

Download Cite

Added by Association for Computation Linguistics مقالة

Publication date 2021

fields Artificial Intelligence

and research's language is English

Created by Shamra Editor

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

On various Social Media platforms, people, tend to use the informal way to communicate, or write posts and comments: their local dialects. In Africa, more than 1500 dialects and languages exist. Particularly, Tunisians talk and write informally using Latin letters and numbers rather than Arabic ones. In this paper, we introduce a large common-crawl-based Tunisian Arabizi dialectal dataset dedicated for Sentiment Analysis. The dataset consists of a total of 100k comments (about movies, politic, sport, etc.) annotated manually by Tunisian native speakers as Positive, negative and Neutral. We evaluate our dataset on sentiment analysis task using the Bidirectional Encoder Representations from Transformers (BERT) as a contextual language model in its multilingual version (mBERT) as an embedding technique then combining mBERT with Convolutional Neural Network (CNN) as classifier. The dataset is publicly available.

References used

https://aclanthology.org/

rate research

SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts

884 - Association for Computation Linguistics 2021 مقالة

In this paper, we propose an annotated sentiment analysis dataset made of informally written Bangla texts. This dataset comprises public comments on news and videos collected from social media covering 13 different domains, including politics, educat ion, and agriculture. These comments are labeled with one of the polarity labels, namely positive, negative, and neutral. One significant characteristic of the dataset is that each of the comments is noisy in terms of the mix of dialects and grammatical incorrectness. Our experiments to develop a benchmark classification system show that hand-crafted lexical features provide superior performance than neural network and pretrained language models. We have made the dataset and accompanying models presented in this paper publicly available at https://git.io/JuuNB.

noisy bangla texts bangla texts written bangla texts نصوص البنغالية الصاخبة نصوص البنغالية نصوص البنغالية المكتوبة صناعة حمض الفوسفور المزيد..

AraFacts: The First Large Arabic Dataset of Naturally Occurring Claims

532 - Association for Computation Linguistics 2021 مقالة

We introduce AraFacts, the first large Arabic dataset of naturally occurring claims collected from 5 Arabic fact-checking websites, e.g., Fatabyyano and Misbar, and covering claims since 2016. Our dataset consists of 6,121 claims along with their fac tual labels and additional metadata, such as fact-checking article content, topical category, and links to posts or Web pages spreading the claim. Since the data is obtained from various fact-checking websites, we standardize the original claim labels to provide a unified label rating for all claims. Moreover, we provide revealing dataset statistics and motivate its use by suggesting possible research applications. The dataset is made publicly available for the research community.

large arabic dataset naturally occurring claims large arabic مجموعة بيانات عربية كبيرة المطالبات التي تحدث بشكل طبيعي عربي كبير صناعة حمض الفوسفور المزيد..

A Large-Scale Dataset for Empathetic Response Generation

724 - Association for Computation Linguistics 2021 مقالة

Recent development in NLP shows a strong trend towards refining pre-trained models with a domain-specific dataset. This is especially the case for response generation where emotion plays an important role. However, existing empathetic datasets remain small, delaying research efforts in this area, for example, the development of emotion-aware chatbots. One main technical challenge has been the cost of manually annotating dialogues with the right emotion labels. In this paper, we describe a large-scale silver dataset consisting of 1M dialogues annotated with 32 fine-grained emotions, eight empathetic response intents, and the Neutral category. To achieve this goal, we have developed a novel data curation pipeline starting with a small seed of manually annotated data and eventually scaling it to a satisfactory size. We compare its quality against a state-of-the-art gold dataset using both offline experiments and visual validation methods. The resultant procedure can be used to create similar datasets in the same domain as well as in other domains.

قواعد المنطق empathetic response استجابة متعاطف صناعة حمض الفوسفور

MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization

842 - Association for Computation Linguistics 2021 مقالة

This paper introduces MediaSum, a large-scale media interview dataset consisting of 463.6K transcripts with abstractive summaries. To create this dataset, we collect interview transcripts from NPR and CNN and employ the overview and topic description s as summaries. Compared with existing public corpora for dialogue summarization, our dataset is an order of magnitude larger and contains complex multi-party conversations from multiple domains. We conduct statistical analysis to demonstrate the unique positional bias exhibited in the transcripts of televised and radioed interviews. We also show that MediaSum can be used in transfer learning to improve a model's performance on other dialogue summarization tasks.

large-scale media interview media interview dataset large-scale media مقابلة وسائط واسعة النطاق بيانات مقابلة مع وسائل الإعلام وسائل الإعلام واسعة النطاق صناعة حمض الفوسفور المزيد..

ONE: Toward ONE model, ONE algorithm, ONE corpus dedicated to sentiment analysis of Arabic/Arabizi and its dialects

616 - Association for Computation Linguistics 2021 مقالة

Arabic is the official language of 22 countries, spoken by more than 400 million speakers. Each one of this country use at least on dialect for daily life conversation. Then, Arabic has at least 22 dialects. Each dialect can be written in Arabic or A rabizi Scripts. The most recent researches focus on constructing a language model and a training corpus for each dialect, in each script. Following this technique means constructing 46 different resources (by including the Modern Standard Arabic, MSA) for handling only one language. In this paper, we extract ONE corpus, and we propose ONE algorithm to automatically construct ONE training corpus using ONE classification model architecture for sentiment analysis MSA and different dialects. After manually reviewing the training corpus, the obtained results outperform all the research literature results for the targeted test corpora.

كلمات الأغاني تنقل modern standard arabic corpus معيار الحديثة العربية جسم صناعة حمض الفوسفور

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Introducing A large Tunisian Arabizi Dialectal Dataset for Sentiment Analysis

تقديم مجموعة بيانات تونسية كبيرة تونسية لتحليل المعنويات

Ask ChatGPT about the research

Read More

suggested questions