بحث متقدم مدعوم من الذكاء الصنعي

مساحة جديدة

اشترك بالحزمة الذهبية واحصل على وصول غير محدود شمرا أكاديميا

تسجيل مستخدم جديد

Dataset for Identification of Homophobia and Transophobia in Multilingual YouTube Comments

60 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Bharathi Raja Chakravarthi

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Bharathi Raja Chakravarthi - Ruba Priyadharshini - Rahul Ponnusamy

الحساب واللغة

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

The increased proliferation of abusive content on social media platforms has a negative impact on online users. The dread, dislike, discomfort, or mistrust of lesbian, gay, transgender or bisexual persons is defined as homophobia/transphobia. Homophobic/transphobic speech is a type of offensive language that may be summarized as hate speech directed toward LGBT+ people, and it has been a growing concern in recent years. Online homophobia/transphobia is a severe societal problem that can make online platforms poisonous and unwelcome to LGBT+ people while also attempting to eliminate equality, diversity, and inclusion. We provide a new hierarchical taxonomy for online homophobia and transphobia, as well as an expert-labelled dataset that will allow homophobic/transphobic content to be automatically identified. We educated annotators and supplied them with comprehensive annotation rules because this is a sensitive issue, and we previously discovered that untrained crowdsourcing annotators struggle with diagnosing homophobia due to cultural and other prejudices. The dataset comprises 15,141 annotated multilingual comments. This paper describes the process of building the dataset, qualitative analysis of data, and inter-annotator agreement. In addition, we create baseline models for the dataset. To the best of our knowledge, our dataset is the first such dataset created. Warning: This paper contains explicit statements of homophobia, transphobia, stereotypes which may be distressing to some readers.

قيم البحث

58 - Tooba Tehreem 2021

Sentiment analysis is a vast area in the Machine learning domain. A lot of work is done on datasets and their analysis of the English Language. In Pakistan, a huge amount of data is in roman Urdu language, it is scattered all over the social sites in cluding Twitter, YouTube, Facebook and similar applications. In this study the focus domain of dataset gathering is YouTube comments. The Dataset contains the comments of people over different Pakistani dramas and TV shows. The Dataset contains multi-class classification that is grouped The comments into positive, negative and neutral sentiment. In this Study comparative analysis is done for five supervised learning Algorithms including linear regression, SVM, KNN, Multi layer Perceptron and Naive Bayes classifier. Accuracy, recall, precision and F-measure are used for measuring performance. Results show that accuracy of SVM is 64 percent, which is better than the rest of the list.

الحساب واللغة الذكاء الاصطناعي

A Dataset and Baselines for Multilingual Reply Suggestion

244 - Mozhi Zhang , Wei Wang , Budhaditya Deb 2021

Reply suggestion models help users process emails and chats faster. Previous work only studies English reply suggestion. Instead, we present MRS, a multilingual reply suggestion dataset with ten languages. MRS can be used to compare two families of m odels: 1) retrieval models that select the reply from a fixed set and 2) generation models that produce the reply from scratch. Therefore, MRS complements existing cross-lingual generalization benchmarks that focus on classification and sequence labeling tasks. We build a generation model and a retrieval model as baselines for MRS. The two models have different strengths in the monolingual setting, and they require different strategies to generalize across languages. MRS is publicly available at https://github.com/zhangmozhi/mrs.

الحساب واللغة التعلم الآلي

DiS-ReX: A Multilingual Dataset for Distantly Supervised Relation Extraction

117 - Abhyuday Bhartiya , Kartikeya Badola , Mausam 2021

Distant supervision (DS) is a well established technique for creating large-scale datasets for relation extraction (RE) without using human annotations. However, research in DS-RE has been mostly limited to the English language. Constraining RE to a single language inhibits utilization of large amounts of data in other languages which could allow extraction of more diverse facts. Very recently, a dataset for multilingual DS-RE has been released. However, our analysis reveals that the proposed dataset exhibits unrealistic characteristics such as 1) lack of sentences that do not express any relation, and 2) all sentences for a given entity pair expressing exactly one relation. We show that these characteristics lead to a gross overestimation of the model performance. In response, we propose a new dataset, DiS-ReX, which alleviates these issues. Our dataset has more than 1.5 million sentences, spanning across 4 languages with 36 relation classes + 1 no relation (NA) class. We also modify the widely used bag attention models by encoding sentences using mBERT and provide the first benchmark results on multilingual DS-RE. Unlike the competing dataset, we show that our dataset is challenging and leaves enough room for future research to take place in this field.

الحساب واللغة التعلم الآلي

Toxic Language Detection in Social Media for Brazilian Portuguese: New Dataset and Multilingual Analysis

110 - Jo~ao A. Leite , Diego F. Silva , Kalina Bontcheva 2020

Hate speech and toxic comments are a common concern of social media platform users. Although these comments are, fortunately, the minority in these platforms, they are still capable of causing harm. Therefore, identifying these comments is an importa nt task for studying and preventing the proliferation of toxicity in social media. Previous work in automatically detecting toxic comments focus mainly in English, with very few work in languages like Brazilian Portuguese. In this paper, we propose a new large-scale dataset for Brazilian Portuguese with tweets annotated as either toxic or non-toxic or in different types of toxicity. We present our dataset collection and annotation process, where we aimed to select candidates covering multiple demographic groups. State-of-the-art BERT models were able to achieve 76% macro-F1 score using monolingual data in the binary case. We also show that large-scale monolingual data is still needed to create more accurate models, despite recent advances in multilingual approaches. An error analysis and experiments with multi-label classification show the difficulty of classifying certain types of toxic comments that appear less frequently in our data and highlights the need to develop models that are aware of different categories of toxicity.

الحساب واللغة التعلم الآلي الشبكات الاجتماعية والمعلومات

MLQE-PE: A Multilingual Quality Estimation and Post-Editing Dataset

100 - Marina Fomicheva , Shuo Sun , Erick Fonseca 2020

We present MLQE-PE, a new dataset for Machine Translation (MT) Quality Estimation (QE) and Automatic Post-Editing (APE). The dataset contains eleven language pairs, with human labels for up to 10,000 translations per language pair in the following fo rmats: sentence-level direct assessments and post-editing effort, and word-level good/bad labels. It also contains the post-edited sentences, as well as titles of the articles where the sentences were extracted from, and the neural MT models used to translate the text.

الحساب واللغة

سجل دخول لتتمكن من نشر تعليقات

التعليقات

جاري جلب التعليقات

سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها

جامعة بابل

تفاصيل إضافية المزيد من الجامعات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Dataset for Identification of Homophobia and Transophobia in Multilingual YouTube Comments

اسأل ChatGPT حول البحث

ﻻ يوجد ملخص باللغة العربية

اقرأ أيضاً