
Understanding the Origins of Bias in Word Embeddings

Published by Marc-Etienne Brunet
Publication date: 2018
Research field: Informatics engineering
Paper language: English





The power of machine learning systems not only promises great technical progress, but risks societal harm. As a recent example, researchers have shown that popular word embedding algorithms exhibit stereotypical biases, such as gender bias. The widespread use of these algorithms in machine learning systems, from automated translation services to curriculum vitae scanners, can amplify stereotypes in important contexts. Although methods have been developed to measure these biases and alter word embeddings to mitigate their biased representations, there is a lack of understanding in how word embedding bias depends on the training data. In this work, we develop a technique for understanding the origins of bias in word embeddings. Given a word embedding trained on a corpus, our method identifies how perturbing the corpus will affect the bias of the resulting embedding. This can be used to trace the origins of word embedding bias back to the original training documents. Using our method, one can investigate trends in the bias of the underlying corpus and identify subsets of documents whose removal would most reduce bias. We demonstrate our techniques on both a New York Times and Wikipedia corpus and find that our influence function-based approximations are very accurate.
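To make the pipeline concrete, the sketch below pairs a WEAT-style effect size (a standard embedding bias measure of the kind the paper traces back to documents) with a brute-force leave-one-out baseline. The word sets, function names, and the `train_embedding` callback are illustrative assumptions, not the authors' implementation; the point of the paper's influence-function approximation is precisely to estimate the per-document deltas below without retraining the embedding once per document.

```python
# Minimal sketch: WEAT-style bias metric plus a brute-force leave-one-out
# baseline for document influence (illustrative, not the authors' code).
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def association(w, A, B, emb):
    # s(w, A, B): mean similarity to attribute set A minus attribute set B.
    return (np.mean([cosine(emb[w], emb[a]) for a in A])
            - np.mean([cosine(emb[w], emb[b]) for b in B]))

def weat_effect_size(X, Y, A, B, emb):
    # Standardized difference of associations between target sets X and Y.
    s_X = [association(x, A, B, emb) for x in X]
    s_Y = [association(y, A, B, emb) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / (np.std(s_X + s_Y) + 1e-12)

def document_bias_deltas(docs, train_embedding, X, Y, A, B):
    # Brute-force baseline: retrain with each document held out and record
    # how the bias metric changes. Influence-function methods approximate
    # these deltas without one retraining run per document.
    base = weat_effect_size(X, Y, A, B, train_embedding(docs))
    deltas = []
    for i in range(len(docs)):
        emb = train_embedding(docs[:i] + docs[i + 1:])
        deltas.append(weat_effect_size(X, Y, A, B, emb) - base)
    return base, deltas
```

Ranking documents by these deltas would identify the subsets whose removal most reduces bias, which is the use case the abstract describes.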




Read also

In this paper, we quantify, analyze and mitigate gender bias exhibited in ELMo's contextualized word vectors. First, we conduct several intrinsic analyses and find that (1) training data for ELMo contains significantly more male than female entities, (2) the trained ELMo embeddings systematically encode gender information and (3) ELMo unequally encodes gender information about male and female entities. Then, we show that a state-of-the-art coreference system that depends on ELMo inherits its bias and demonstrates significant bias on the WinoBias probing corpus. Finally, we explore two methods to mitigate such gender bias and show that the bias demonstrated on WinoBias can be eliminated.
Gender bias strongly affects natural language processing applications. Word embeddings have been shown to both preserve and amplify gender biases present in current data sources. Recently, contextualized word embeddings have extended previous word embedding techniques by computing word vector representations that depend on the sentence in which the word appears. In this paper, we study the impact of this conceptual change in word embedding computation with respect to gender bias. Our analysis includes different measures previously applied in the literature to standard word embeddings. Our findings suggest that contextualized word embeddings are less biased than standard ones, even when the latter are debiased.
Yu Bai, Song Mei, Huan Wang (2021)
Estimating the data uncertainty in regression tasks is often done by learning a quantile function or a prediction interval of the true label conditioned on the input. It is frequently observed that quantile regression -- a vanilla algorithm for learning quantiles with asymptotic guarantees -- tends to under-cover relative to the desired coverage level in reality. While various fixes have been proposed, a more fundamental understanding of why this under-coverage bias happens in the first place remains elusive. In this paper, we present a rigorous theoretical study on the coverage of uncertainty estimation algorithms in learning quantiles. We prove that quantile regression suffers from an inherent under-coverage bias, in a vanilla setting where we learn a realizable linear quantile function and there is more data than parameters. More quantitatively, for $\alpha > 0.5$ and small $d/n$, the $\alpha$-quantile learned by quantile regression roughly achieves coverage $\alpha - (\alpha - 1/2)\cdot d/n$ regardless of the noise distribution, where $d$ is the input dimension and $n$ is the number of training data. Our theory reveals that this under-coverage bias stems from a certain high-dimensional parameter estimation error that is not implied by existing theories on quantile regression. Experiments on simulated and real data verify our theory and further illustrate the effect of various factors such as sample size and model capacity on the under-coverage bias in more practical setups.
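As a quick numerical illustration of that formula, the hedged sketch below fits an unregularized linear quantile regression with scikit-learn's QuantileRegressor on synthetic realizable data and compares the empirical test coverage against $\alpha - (\alpha - 1/2)\cdot d/n$. The dimensions, noise distribution, and seed are arbitrary choices for the demonstration, not the paper's experimental setup.

```python
# Simulation sketch of the under-coverage effect (illustrative parameters;
# uses scikit-learn's QuantileRegressor, not the paper's code).
import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.default_rng(0)
alpha, d, n, n_test = 0.9, 50, 500, 20000

w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(size=n)          # realizable linear model + noise

# Fit the alpha-quantile with an unregularized linear quantile regression.
qr = QuantileRegressor(quantile=alpha, alpha=0.0, solver="highs").fit(X, y)

X_test = rng.normal(size=(n_test, d))
y_test = X_test @ w_true + rng.normal(size=n_test)
coverage = np.mean(y_test <= qr.predict(X_test))

print(f"empirical coverage:  {coverage:.3f}")
print(f"predicted by theory: {alpha - (alpha - 0.5) * d / n:.3f}")  # 0.860
```

With $\alpha = 0.9$ and $d/n = 0.1$, the formula predicts coverage of roughly $0.9 - 0.4 \cdot 0.1 = 0.86$ rather than the nominal 0.9.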
Word embedding spaces are powerful tools for capturing latent semantic relationships between terms in corpora, and have become widely popular for building state-of-the-art natural language processing algorithms. However, studies have shown that societal biases present in text corpora may be incorporated into the word embedding spaces learned from them. Thus, there is an ethical concern that human-like biases contained in the corpora and their derived embedding spaces might be propagated, or even amplified, with the usage of the biased embedding spaces in downstream applications. In an attempt to quantify these biases so that they may be better understood and studied, several bias metrics have been proposed. We explore the statistical properties of these proposed measures in the context of their cited applications as well as their supposed utilities. We find that there are caveats to the simple interpretation of these metrics as proposed. We find that the bias metric proposed by Bolukbasi et al. (2016) is highly sensitive to embedding hyper-parameter selection, and that in many cases the variance due to the selection of some hyper-parameters is greater than the variance in the metric due to corpus selection, while in fewer cases the bias rankings of corpora vary with hyper-parameter selection. In light of these observations, it may be the case that bias estimates should not be thought to directly measure the properties of the underlying corpus, but rather the properties of the specific embedding spaces in question, particularly in the context of the hyper-parameter selections used to generate them. Hence, bias metrics of spaces generated with differing hyper-parameters should be compared only with explicit consideration of each embedding-learning algorithm's particular configuration.
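For context on the kind of metric under discussion, here is a hedged, simplified sketch of a Bolukbasi-style DirectBias computation: a gender direction is averaged from a few definitional pairs (the original 2016 method uses PCA over several pairs) and gender-neutral words are scored by their projection onto it. The word lists and simplifications are illustrative assumptions, not the metric's reference implementation.

```python
# Simplified Bolukbasi-style DirectBias sketch (word lists and the averaged
# gender direction are illustrative simplifications of the 2016 method).
import numpy as np

def unit(v):
    return v / (np.linalg.norm(v) + 1e-12)

def gender_direction(emb, pairs=(("she", "he"), ("woman", "man"), ("her", "him"))):
    # Average the normalized difference vectors of definitional pairs.
    # (The original method takes the top PCA component over such pairs.)
    diffs = [unit(emb[f] - emb[m]) for f, m in pairs]
    return unit(np.mean(diffs, axis=0))

def direct_bias(emb, neutral_words, strictness=1.0):
    # Mean |cos(w, g)|**c over a set of gender-neutral words.
    g = gender_direction(emb)
    return np.mean([abs(unit(emb[w]) @ g) ** strictness for w in neutral_words])
```

Re-running such a score across embeddings trained with different hyper-parameters (window size, dimensionality, number of negative samples, and so on) is essentially the sensitivity experiment the abstract describes.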
Multilingual representations embed words from many languages into a single semantic space such that words with similar meanings are close to each other regardless of the language. These embeddings have been widely used in various settings, such as cross-lingual transfer, where a natural language processing (NLP) model trained on one language is deployed to another language. While the cross-lingual transfer techniques are powerful, they carry gender bias from the source to target languages. In this paper, we study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications. We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations from both the intrinsic and extrinsic perspectives. Experimental results show that the magnitude of bias in the multilingual representations changes differently when we align the embeddings to different target spaces and that the alignment direction can also have an influence on the bias in transfer learning. We further provide recommendations for using the multilingual word representations for downstream tasks.
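The alignment step mentioned in that abstract is commonly implemented with orthogonal Procrustes, sketched below under stated assumptions (a small bilingual seed dictionary and a fixed source-to-target direction); this is a standard technique, not the paper's specific pipeline, but it is the knob whose choice of target space the authors find can change measured bias.

```python
# Orthogonal Procrustes alignment sketch: map a source embedding space onto
# a target space using a seed dictionary (illustrative, standard technique).
import numpy as np

def procrustes_align(src_vecs, tgt_vecs):
    # Orthogonal W minimizing ||src_vecs @ W - tgt_vecs||_F, via the SVD of
    # the cross-covariance of the (k, d) seed translation-pair matrices.
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt

def align_embeddings(src_emb, tgt_emb, seed_pairs):
    # Map every source-language vector into the target space using a small
    # bilingual seed dictionary of (source_word, target_word) pairs.
    S = np.stack([src_emb[s] for s, _ in seed_pairs])
    T = np.stack([tgt_emb[t] for _, t in seed_pairs])
    W = procrustes_align(S, T)
    return {word: vec @ W for word, vec in src_emb.items()}
```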
