
Don't Discard All the Biased Instances: Investigating a Core Assumption in Dataset Bias Mitigation Techniques


Publication date: 2021
Language: English
 Created by Shamra Editor





Existing techniques for mitigating dataset bias often leverage a biased model to identify biased instances. The role of these biased instances is then reduced during the training of the main model to enhance its robustness to out-of-distribution data. A common core assumption of these techniques is that the main model handles biased instances similarly to the biased model, in that it will resort to biases whenever available. In this paper, we show that this assumption does not hold in general. We carry out a critical investigation on two well-known datasets in the domain, MNLI and FEVER, along with two biased instance detection methods, partial-input and limited-capacity models. Our experiments show that in around a third to a half of instances, the biased model is unable to predict the main model's behavior, highlighted by the significantly different parts of the input on which they base their decisions. Based on a manual validation, we also show that this estimate is highly in line with human interpretation. Our findings suggest that down-weighting of instances detected by bias detection methods, which is a widely-practiced procedure, is an unnecessary waste of training data. We release our code to facilitate reproducibility and future research.
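The down-weighting procedure described in the abstract can be illustrated with a minimal sketch. It assumes one common variant, example reweighting, in which each instance's loss is scaled by one minus the probability the biased model assigns to the gold label. The function name `reweighted_nll` and the toy probabilities are hypothetical and are not taken from the paper's released code.

```python
import math


def reweighted_nll(main_probs, bias_probs, labels):
    """Mean negative log-likelihood of the main model, with each instance
    scaled by (1 - biased model's probability of the gold label).

    Assumption: this is the "example reweighting" flavour of down-weighting;
    other debiasing methods combine the two models differently.
    """
    total = 0.0
    for p_main, p_bias, y in zip(main_probs, bias_probs, labels):
        weight = 1.0 - p_bias[y]  # confident biased model -> small weight
        total += -weight * math.log(p_main[y])
    return total / len(labels)


# Toy 3-class example (made-up numbers, e.g. an NLI-style label set).
main_probs = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]]
bias_probs = [[0.9, 0.05, 0.05], [0.34, 0.33, 0.33]]
labels = [0, 1]

# Per-instance weights: the first instance is flagged as biased and
# contributes far less to the loss than the second.
weights = [1.0 - b[y] for b, y in zip(bias_probs, labels)]
loss = reweighted_nll(main_probs, bias_probs, labels)
print(weights, loss)
```

In this sketch the first instance, which the biased model already predicts confidently, is nearly removed from training. The abstract's finding is that such an instance may still be useful, because the main model does not necessarily rely on the same bias.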



References used
https://aclanthology.org/

Read More

Fake news causes significant damage to society. To deal with fake news, several studies on building detection models and arranging datasets have been conducted. Most fake news datasets depend on a specific time period. Consequently, detection models trained on such a dataset have difficulty detecting novel fake news generated by political and social changes; they may produce biased output when the input includes specific person names and organization names. We refer to this problem as Diachronic Bias because it is caused by the creation date of news in each dataset. In this study, we confirm the bias, especially for proper nouns such as person names, from the deviation of phrase appearances in each dataset. Based on these findings, we propose masking methods using Wikidata to mitigate the influence of person names and validate whether they make fake news detection models robust through experiments with in-domain and out-of-domain data.
One of the mechanisms through which disinformation is spreading online, in particular through social media, is by employing propaganda techniques. These include specific rhetorical and psychological strategies, ranging from leveraging on emotions to exploiting logical fallacies. In this paper, our goal is to push forward research on propaganda detection based on text analysis, given the crucial role these methods may play to address this main societal issue. More precisely, we propose a supervised approach to classify textual snippets both as propaganda messages and according to the precise applied propaganda technique, as well as a detailed linguistic analysis of the features characterising propaganda information in text (e.g., semantic, sentiment and argumentation features). Extensive experiments conducted on two available propagandist resources (i.e., NLP4IF'19 and SemEval'20-Task 11 datasets) show that the proposed approach, leveraging different language models and the investigated linguistic features, achieves very promising results on propaganda classification, both at sentence- and at fragment-level.
Nowadays, social media platforms use classification models to cope with hate speech and abusive language. A problem with these models is their vulnerability to bias. A prevalent form of bias in hate speech and abusive language datasets is annotator bias caused by the annotator's subjective perception and the complexity of the annotation task. In our paper, we develop a set of methods to measure annotator bias in abusive language datasets and to identify different perspectives on abusive language. We apply these methods to four different abusive language datasets. Our proposed approach supports annotation processes of such datasets and future research addressing different perspectives on the perception of abusive language.
Many datasets have been created for training reading comprehension models, and a natural question is whether we can combine them to build models that (1) perform better on all of the training datasets and (2) generalize and transfer better to new datasets. Prior work has addressed this goal by training one network simultaneously on multiple datasets, which works well on average but is prone to over- or under-fitting different sub-distributions and might transfer worse compared to source models with more overlap with the target dataset. Our approach is to model multi-dataset question answering with an ensemble of single-dataset experts, by training a collection of lightweight, dataset-specific adapter modules (Houlsby et al., 2019) that share an underlying Transformer model. We find that these Multi-Adapter Dataset Experts (MADE) outperform all our baselines in terms of in-distribution accuracy, and simple methods based on parameter-averaging lead to better zero-shot generalization and few-shot transfer performance, offering a strong and versatile starting point for building new reading comprehension systems.
During the last decade of the twentieth century, a range of advanced technological changes emerged in information systems related to computers, communications, and the compression and transmission of data over computer networks. Information systems moved from relying on text and a few simple graphics to relying on multimedia, which delivers information in different forms by linking and integrating a diverse set of technologies (audio, images, text, video, etc.). The development of these systems was initially limited to single-user use. However, given the importance of communication systems, the growth of the Internet, and the use of multimedia systems by multiple users in different geographic locations, sharing multimedia data became important, and exchanging it over computer networks became unavoidable. Hence the need arose for networks with special characteristics that can handle multimedia elements efficiently. Conversely, it became important to have multimedia systems capable of working with computer networks. These systems are therefore characterized by large data volumes and by the real difficulty of transferring that data, especially over computer networks. The problems of storing large volumes of data on storage devices of limited capacity, and of transferring large amounts of data over networks, have led to the development of techniques that reduce (compress) data sizes as much as possible, saving storage space on the one hand and transmission time on the other.
