Don't Discard All the Biased Instances: Investigating a Core Assumption in Dataset Bias Mitigation Techniques

Added by Association for Computational Linguistics (article)

Publication date 2021

Field: Artificial Intelligence

Language: English

Created by Shamra Editor

Abstract

Existing techniques for mitigating dataset bias often leverage a biased model to identify biased instances. The role of these biased instances is then reduced during the training of the main model to enhance its robustness to out-of-distribution data. A common core assumption of these techniques is that the main model handles biased instances similarly to the biased model, in that it will resort to biases whenever available. In this paper, we show that this assumption does not hold in general. We carry out a critical investigation on two well-known datasets in the domain, MNLI and FEVER, along with two biased instance detection methods, partial-input and limited-capacity models. Our experiments show that in around a third to a half of instances, the biased model is unable to predict the main model's behavior, highlighted by the significantly different parts of the input on which they base their decisions. Based on a manual validation, we also show that this estimate is highly in line with human interpretation. Our findings suggest that down-weighting of instances detected by bias detection methods, which is a widely-practiced procedure, is an unnecessary waste of training data. We release our code to facilitate reproducibility and future research.
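As an illustration of what "reducing the role" of biased instances can look like in practice, here is a minimal Python sketch of one common form of down-weighting, example reweighting. It is not taken from the paper's released code: the function names, the weighting rule (one minus the biased model's probability for the gold label), and the toy probabilities are all assumptions made for this example.

```python
import math


def example_weights(bias_probs, labels):
    """Per-example weights: 1 - biased model's probability of the gold label.

    Assumed rule for illustration: the more confidently the biased model
    gets an example right, the less that example counts in training.
    """
    return [1.0 - probs[y] for probs, y in zip(bias_probs, labels)]


def reweighted_loss(main_probs, bias_probs, labels):
    """Mean cross-entropy of the main model, scaled per example by its weight."""
    weights = example_weights(bias_probs, labels)
    total = 0.0
    for probs, y, w in zip(main_probs, labels, weights):
        total += w * -math.log(probs[y])
    return total / len(labels)


# Toy batch of two binary-classification examples, both with gold label 0.
main_probs = [[0.5, 0.5], [0.5, 0.5]]   # main model's predicted distributions
bias_probs = [[0.9, 0.1], [0.5, 0.5]]   # biased model's predicted distributions
labels = [0, 0]

weights = example_weights(bias_probs, labels)
loss = reweighted_loss(main_probs, bias_probs, labels)
```

In this toy batch the first example, which the biased model already solves confidently, contributes about a fifth as much to the loss as the second. The abstract's argument is that a weight like this can suppress an example even when the main model would not have relied on the bias to solve it.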

References used

https://aclanthology.org/

Mitigation of Diachronic Bias in Fake News Detection Dataset

247 - Association for Computational Linguistics 2021 article

Fake news causes significant damage to society. To deal with fake news, several studies on building detection models and arranging datasets have been conducted. Most fake news datasets depend on a specific time period. Consequently, detection models trained on such a dataset have difficulty detecting novel fake news generated by political and social changes; they may produce biased output for inputs that include specific person names and organizational names. We refer to this problem as Diachronic Bias because it is caused by the creation date of news in each dataset. In this study, we confirm the bias, especially for proper nouns including person names, from the deviation of phrase appearances in each dataset. Based on these findings, we propose masking methods using Wikidata to mitigate the influence of person names and validate whether they make fake news detection models robust through experiments with in-domain and out-of-domain data.

fake, diachronic bias, detection models

"Don't discuss": Investigating Semantic and Argumentative Features for Supervised Propagandist Message Detection and Classification

154 - Association for Computational Linguistics 2021 article

One of the mechanisms through which disinformation spreads online, in particular through social media, is the use of propaganda techniques. These include specific rhetorical and psychological strategies, ranging from leveraging emotions to exploiting logical fallacies. In this paper, our goal is to push forward research on propaganda detection based on text analysis, given the crucial role these methods may play in addressing this major societal issue. More precisely, we propose a supervised approach to classify textual snippets both as propaganda messages and according to the precise applied propaganda technique, as well as a detailed linguistic analysis of the features characterising propaganda information in text (e.g., semantic, sentiment and argumentation features). Extensive experiments conducted on two available propagandist resources (i.e., NLP4IF'19 and SemEval'20-Task 11 datasets) show that the proposed approach, leveraging different language models and the investigated linguistic features, achieves very promising results on propaganda classification, both at sentence- and at fragment-level.

investigating semantic, argumentative features, propagandist message detection

Investigating Annotator Bias in Abusive Language Datasets

312 - Association for Computational Linguistics 2021 article

Nowadays, social media platforms use classification models to cope with hate speech and abusive language. The problem with these models is their vulnerability to bias. A prevalent form of bias in hate speech and abusive language datasets is annotator bias caused by the annotator's subjective perception and the complexity of the annotation task. In our paper, we develop a set of methods to measure annotator bias in abusive language datasets and to identify different perspectives on abusive language. We apply these methods to four different abusive language datasets. Our proposed approach supports annotation processes of such datasets and future research addressing different perspectives on the perception of abusive language.

abusive language datasets, language datasets

Single-dataset Experts for Multi-dataset Question Answering

360 - Association for Computational Linguistics 2021 article

Many datasets have been created for training reading comprehension models, and a natural question is whether we can combine them to build models that (1) perform better on all of the training datasets and (2) generalize and transfer better to new datasets. Prior work has addressed this goal by training one network simultaneously on multiple datasets, which works well on average but is prone to over- or under-fitting different sub-distributions and might transfer worse compared to source models with more overlap with the target dataset. Our approach is to model multi-dataset question answering with an ensemble of single-dataset experts, by training a collection of lightweight, dataset-specific adapter modules (Houlsby et al., 2019) that share an underlying Transformer model. We find that these Multi-Adapter Dataset Experts (MADE) outperform all our baselines in terms of in-distribution accuracy, and simple methods based on parameter-averaging lead to better zero-shot generalization and few-shot transfer performance, offering a strong and versatile starting point for building new reading comprehension systems.

multi-dataset question answering, multi-dataset question

Data Compression Techniques

4915 - Tishreen University 2013 graduation project

During the last decade of the twentieth century, a range of advanced technological developments emerged in information systems related to computers, communications, and the compression and transmission of data over computer networks. Information systems moved from relying on text and some simple graphics to relying on multimedia, which delivers information in different forms by linking and integrating a diverse set of technologies (audio, images, text, video, etc.). The development of these systems was at first limited to single-user use, but given the importance of communication systems, the growth of the Internet, and the use of multimedia systems by multiple users in different geographic locations, sharing multimedia data became important, and exchanging it over computer networks became unavoidable. Hence the need arose for networks with special specifications that can handle multimedia elements with high efficiency. At the same time, it became important to have multimedia systems capable of working with computer networks. It follows that such systems are characterized by large data volumes, in addition to the real difficulty of transferring this data, especially over computer networks. Therefore, the problems of storing large volumes of data relative to the small capacity of storage devices, and of transferring large quantities of data over networks, led to the development of techniques for reducing (compressing) data sizes as much as possible, which saves storage space on the one hand and saves time when sending data on the other.

data compression, Huffman, static compression algorithms, dynamic compression algorithms, LZW, LZ77
