Unsupervised Data Augmentation with Naive Augmentation and without Unlabeled Data

71 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل David Lowell

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف David Lowell - Brian E. Howard - Zachary C. Lipton

الحساب واللغة التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Unsupervised Data Augmentation (UDA) is a semi-supervised technique that applies a consistency loss to penalize differences between a models predictions on (a) observed (unlabeled) examples; and (b) corresponding noised examples produced via data augmentation. While UDA has gained popularity for text classification, open questions linger over which design decisions are necessary and over how to extend the method to sequence labeling tasks. This method has recently gained traction for text classification. In this paper, we re-examine UDA and demonstrate its efficacy on several sequential tasks. Our main contribution is an empirical study of UDA to establish which components of the algorithm confer benefits in NLP. Notably, although prior work has emphasized the use of clever augmentation techniques including back-translation, we find that enforcing consistency between predictions assigned to observed and randomly substituted words often yields comparable (or greater) benefits compared to these complex perturbation models. Furthermore, we find that applying its consistency loss affords meaningful gains without any unlabeled data at all, i.e., in a standard supervised setting. In short: UDA need not be unsupervised, and does not require complex data augmentation to be effective.

قيم البحث

135 - Sosuke Kobayashi 2018

We propose a novel data augmentation for labeled sentences called contextual augmentation. We assume an invariance that sentences are natural even if the words in the sentences are replaced with other words with paradigmatic relations. We stochastica lly replace words with other words that are predicted by a bi-directional language model at the word positions. Words predicted according to a context are numerous but appropriate for the augmentation of the original words. Furthermore, we retrofit a language model with a label-conditional architecture, which allows the model to augment sentences without breaking the label-compatibility. Through the experiments for six various different text classification tasks, we demonstrate that the proposed method improves classifiers based on the convolutional or recurrent neural networks.

الحساب واللغة التعلم الآلي

Data Augmentation for Text Generation Without Any Augmented Data

103 - Wei Bi , Huayang Li , Jiacheng Huang 2021

Data augmentation is an effective way to improve the performance of many neural text generation models. However, current data augmentation methods need to define or choose proper data mapping functions that map the original samples into the augmented samples. In this work, we derive an objective to formulate the problem of data augmentation on text generation tasks without any use of augmented data constructed by specific mapping functions. Our proposed objective can be efficiently optimized and applied to popular loss functions on text generation tasks with a convergence rate guarantee. Experiments on five datasets of two text generation tasks show that our approach can approximate or even surpass popular data augmentation methods.

الحساب واللغة الذكاء الاصطناعي

Data Augmentation for Hypernymy Detection

114 - Thomas Kober , Julie Weeds , Lorenzo Bertolini 2020

The automatic detection of hypernymy relationships represents a challenging problem in NLP. The successful application of state-of-the-art supervised approaches using distributed representations has generally been impeded by the limited availability of high quality training data. We have developed two novel data augmentation techniques which generate new training examples from existing ones. First, we combine the linguistic principles of hypernym transitivity and intersective modifier-noun composition to generate additional pairs of vectors, such as small dog - dog or small dog - animal, for which a hypernymy relationship can be assumed. Second, we use generative adversarial networks (GANs) to generate pairs of vectors for which the hypernymy relation can also be assumed. We furthermore present two complementary strategies for extending an existing dataset by leveraging linguistic resources such as WordNet. Using an evaluation across 3 different datasets for hypernymy detection and 2 different vector spaces, we demonstrate that both of the proposed automatic data augmentation and dataset extension strategies substantially improve classifier performance.

الحساب واللغة التعلم الآلي

SDA: Improving Text Generation with Self Data Augmentation

105 - Ping Yu , Ruiyi Zhang , Yang Zhao 2021

Data augmentation has been widely used to improve deep neural networks in many research fields, such as computer vision. However, less work has been done in the context of text, partially due to its discrete nature and the complexity of natural langu ages. In this paper, we propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation. Unlike most existing sentence-level augmentation strategies, which are only applied to specific models, our method is more general and could be easily adapted to any MLE-based training procedure. In addition, our framework allows task-specific evaluation metrics to be designed to flexibly control the generated sentences, for example, in terms of controlling vocabulary usage and avoiding nontrivial repetitions. Extensive experimental results demonstrate the superiority of our method on two synthetic and several standard real datasets, significantly improving related baselines.

الحساب واللغة التعلم الآلي

Sequence-Level Mixed Sample Data Augmentation

118 - Demi Guo , Yoon Kim , Alexander M. Rush 2020

Despite their empirical success, neural networks still have difficulty capturing compositional aspects of natural language. This work proposes a simple data augmentation approach to encourage compositional behavior in neural models for sequence-to-se quence problems. Our approach, SeqMix, creates new synthetic examples by softly combining input/output sequences from the training set. We connect this approach to existing techniques such as SwitchOut and word dropout, and show that these techniques are all approximating variants of a single objective. SeqMix consistently yields approximately 1.0 BLEU improvement on five different translation datasets over strong Transformer baselines. On tasks that require strong compositional generalization such as SCAN and semantic parsing, SeqMix also offers further improvements.

الحساب واللغة التعلم الآلي