
Towards a Better Understanding of Noise in Natural Language Processing


Publication date: 2021
Research language: English
 Created by Shamra Editor





In this paper, we propose a definition and taxonomy of various types of non-standard textual content, generally referred to as "noise", in Natural Language Processing (NLP). While data pre-processing is undoubtedly important in NLP, especially when dealing with user-generated content, a broader understanding of different sources of noise and how to deal with them is an aspect that has been largely neglected. We provide a comprehensive list of potential sources of noise, categorise and describe them, and show the impact of a subset of standard pre-processing strategies on different tasks. Our main goal is to raise awareness of non-standard content (which should not always be considered as "noise") and of the need for careful, task-dependent pre-processing. This is an alternative to blanket, all-encompassing solutions generally applied by researchers through "standard" pre-processing pipelines. The intention is for this categorisation to serve as a point of reference to support NLP researchers in devising strategies to clean, normalise or embrace non-standard content.
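The idea of task-dependent pre-processing can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the step functions, the task names, and the choice of which steps each task applies are hypothetical and are not taken from the paper. The sketch only shows that the same input can be cleaned differently depending on the downstream task, rather than passing through one blanket pipeline.

```python
import re
from typing import Callable, Dict, List

# A pre-processing step maps a string to a string.
Step = Callable[[str], str]


def lowercase(text: str) -> str:
    """Fold case; this may discard emphasis such as 'GREAT'."""
    return text.lower()


def strip_urls(text: str) -> str:
    """Remove http/https URLs."""
    return re.sub(r"https?://\S+", "", text)


def squeeze_repeats(text: str) -> str:
    """Collapse runs of three or more identical characters to two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)


def normalise_whitespace(text: str) -> str:
    """Collapse whitespace runs to single spaces and trim the ends."""
    return " ".join(text.split())


# Hypothetical task-to-pipeline mapping (an assumption, not the paper's setup):
# casing and character elongation are kept for sentiment analysis, where they
# may carry signal, and normalised away for topic classification.
PIPELINES: Dict[str, List[Step]] = {
    "topic_classification": [
        strip_urls,
        lowercase,
        squeeze_repeats,
        normalise_whitespace,
    ],
    "sentiment_analysis": [strip_urls, normalise_whitespace],
}


def preprocess(text: str, task: str) -> str:
    """Apply the steps configured for `task`, in order."""
    for step in PIPELINES[task]:
        text = step(text)
    return text
```

Calling `preprocess(text, "sentiment_analysis")` leaves capitalisation and repeated characters intact, while `preprocess(text, "topic_classification")` normalises them. Whether a given feature counts as noise for a given task is an empirical question that would need to be checked on the task's own data.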



References used
https://aclanthology.org/

Read More

There are thousands of papers about natural language processing and computational linguistics, but very few textbooks. I describe the motivation and process for writing a college textbook on natural language processing, and offer advice and encouragement for readers who may be interested in writing a textbook of their own.
Knowledge Distillation (KD) is a model compression algorithm that helps transfer the knowledge in a large neural network into a smaller one. Even though KD has shown promise on a wide range of Natural Language Processing (NLP) applications, little is understood about how one KD algorithm compares to another and whether these approaches can be complementary to each other. In this work, we evaluate various KD algorithms on in-domain, out-of-domain and adversarial testing. We propose a framework to assess adversarial robustness of multiple KD algorithms. Moreover, we introduce a new KD algorithm, Combined-KD, which takes advantage of two promising approaches (better training scheme and more efficient data augmentation). Our extensive experimental results show that Combined-KD achieves state-of-the-art results on the GLUE benchmark, out-of-domain generalization, and adversarial robustness compared to competitive methods.
How can we design Natural Language Processing (NLP) systems that learn from human feedback? There is a growing body of research on Human-in-the-loop (HITL) NLP frameworks that continuously integrate human feedback to improve the model itself. HITL NLP research is nascent but multifarious, solving various NLP problems, collecting diverse feedback from different people, and applying different methods to learn from human feedback. We present a survey of HITL NLP work from both the Machine Learning (ML) and Human-Computer Interaction (HCI) communities that highlights its short yet inspiring history, and thoroughly summarize recent frameworks, focusing on their tasks, goals, human interactions, and feedback learning methods. Finally, we discuss future studies for integrating human feedback in the NLP development loop.
This article explores the potential for Natural Language Processing (NLP) to enable a more effective, prevention-focused and less confrontational policing model that has hitherto been too resource-consuming to implement at scale. Problem-Oriented Policing (POP) is a potential replacement, at least in part, for traditional policing, which adopts a reactive approach, relying heavily on the criminal justice system. By contrast, POP seeks to prevent crime by manipulating the underlying conditions that allow crimes to be committed. Identifying these underlying conditions requires a detailed understanding of crime events: tacit knowledge that is often held by police officers but which can be challenging to derive from structured police data. One potential source of insight exists in unstructured free text data commonly collected by police for the purposes of investigation or administration. Yet police agencies do not typically have the skills or resources to analyse these data at scale. In this article we argue that NLP offers the potential to unlock these unstructured data and, by doing so, allow police to implement more POP initiatives. However, we caution that using NLP models without adequate knowledge may either allow or perpetuate bias within the data, potentially leading to unfavourable outcomes.
Recent studies show that many NLP systems are sensitive and vulnerable to small perturbations of inputs and do not generalize well across different datasets. This lack of robustness derails the use of NLP systems in real-world applications. This tutorial aims at bringing awareness of practical concerns about NLP robustness. It targets NLP researchers and practitioners who are interested in building reliable NLP systems. In particular, we will review recent studies on analyzing the weakness of NLP systems when facing adversarial inputs and data with a distribution shift. We will provide the audience with a holistic view of 1) how to use adversarial examples to examine the weakness of NLP models and facilitate debugging; 2) how to enhance the robustness of existing NLP models and defend against adversarial inputs; and 3) how the consideration of robustness affects the real-world NLP applications used in our daily lives. We will conclude the tutorial by outlining future research directions in this area.
