No Arabic abstract
We survey 146 papers analyzing bias in NLP systems, finding that their motivations are often vague, inconsistent, and lacking in normative reasoning, despite the fact that analyzing bias is an inherently normative process. We further find that these papers proposed quantitative techniques for measuring or mitigating bias are poorly matched to their motivations and do not engage with the relevant literature outside of NLP. Based on these findings, we describe the beginnings of a path forward by proposing three recommendations that should guide work analyzing bias in NLP systems. These recommendations rest on a greater recognition of the relationships between language and social hierarchies, encouraging researchers and practitioners to articulate their conceptualizations of bias---i.e., what kinds of system behaviors are harmful, in what ways, to whom, and why, as well as the normative reasoning underlying these statements---and to center work around the lived experiences of members of communities affected by NLP systems, while interrogating and reimagining the power relations between technologists and such communities.
Advances in language modeling architectures and the availability of large text corpora have driven progress in automatic text generation. While this results in models capable of generating coherent texts, it also prompts models to internalize social biases present in the training corpus. This paper aims to quantify and reduce a particular type of bias exhibited by language models: bias in the sentiment of generated text. Given a conditioning context (e.g., a writing prompt) and a language model, we analyze if (and how) the sentiment of the generated text is affected by changes in values of sensitive attributes (e.g., country names, occupations, genders) in the conditioning context using a form of counterfactual evaluation. We quantify sentiment bias by adopting individual and group fairness metrics from the fair machine learning literature, and demonstrate that large-scale models trained on two different corpora (news articles, and Wikipedia) exhibit considerable levels of bias. We then propose embedding and sentiment prediction-derived regularization on the language models latent representations. The regularizations improve fairness metrics while retaining comparable levels of perplexity and semantic similarity.
Recent advances in reinforcement learning have shown its potential to tackle complex real-life tasks. However, as the dimensionality of the task increases, reinforcement learning methods tend to struggle. To overcome this, we explore methods for representing the semantic information embedded in the state. While previous methods focused on information in its raw form (e.g., raw visual input), we propose to represent the state using natural language. Language can represent complex scenarios and concepts, making it a favorable candidate for representation. Empirical evidence, within the domain of ViZDoom, suggests that natural language based agents are more robust, converge faster and perform better than vision based agents, showing the benefit of using natural language representations for reinforcement learning.
NLP systems rarely give special consideration to numbers found in text. This starkly contrasts with the consensus in neuroscience that, in the brain, numbers are represented differently from words. We arrange recent NLP work on numeracy into a comprehensive taxonomy of tasks and methods. We break down the subjective notion of numeracy into 7 subtasks, arranged along two dimensions: granularity (exact vs approximate) and units (abstract vs grounded). We analyze the myriad representational choices made by 18 previously published number encoders and decoders. We synthesize best practices for representing numbers in text and articulate a vision for holistic numeracy in NLP, comprised of design trade-offs and a unified evaluation.
Ideas from forensic linguistics are now being used frequently in Natural Language Processing (NLP), using machine learning techniques. While the role of forensic linguistics was more benign earlier, it is now being used for purposes which are questionable. Certain methods from forensic linguistics are employed, without considering their scientific limitations and ethical concerns. While we take the specific case of forensic linguistics as an example of such trends in NLP and machine learning, the issue is a larger one and present in many other scientific and data-driven domains. We suggest that such trends indicate that some of the applied sciences are exceeding their legal and scientific briefs. We highlight how carelessly implemented practices are serving to short-circuit the due processes of law as well breach ethical codes.
Recent years have seen many breakthroughs in natural language processing (NLP), transitioning it from a mostly theoretical field to one with many real-world applications. Noting the rising number of applications of other machine learning and AI techniques with pervasive societal impact, we anticipate the rising importance of developing NLP technologies for social good. Inspired by theories in moral philosophy and global priorities research, we aim to promote a guideline for social good in the context of NLP. We lay the foundations via the moral philosophy definition of social good, propose a framework to evaluate the direct and indirect real-world impact of NLP tasks, and adopt the methodology of global priorities research to identify priority causes for NLP research. Finally, we use our theoretical framework to provide some practical guidelines for future NLP research for social good. Our data and code are available at http://github.com/zhijing-jin/nlp4sg_acl2021. In addition, we curate a list of papers and resources on NLP for social good at https://github.com/zhijing-jin/NLP4SocialGood_Papers.