Analysing the Noise Model Error for Realistic Noisy Label Data

198 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Michael A. Hedderich

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Michael A. Hedderich - Dawei Zhu - Dietrich Klakow

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Distant and weak supervision allow to obtain large amounts of labeled training data quickly and cheaply, but these automatic annotations tend to contain a high amount of errors. A popular technique to overcome the negative effects of these noisy labels is noise modelling where the underlying noise process is modelled. In this work, we study the quality of these estimated noise models from the theoretical side by deriving the expected error of the noise model. Apart from evaluating the theoretical results on commonly used synthetic noise, we also publish NoisyNER, a new noisy label dataset from the NLP domain that was obtained through a realistic distant supervision technique. It provides seven sets of labels with differing noise patterns to evaluate different noise levels on the same instances. Parallel, clean labels are available making it possible to study scenarios where a small amount of gold-standard data can be leveraged. Our theoretical results and the corresponding experiments give insights into the factors that influence the noise model estimation like the noise distribution and the sampling technique.

قيم البحث

338 - Keren Gu , Xander Masotto , Vandana Bachani 2021

We propose a simulation framework for generating realistic instance-dependent noisy labels via a pseudo-labeling paradigm. We show that the distribution of the synthetic noisy labels generated with our framework is closer to human labels compared to independent and class-conditional random flipping. Equipped with controllable label noise, we study the negative impact of noisy labels across a few realistic settings to understand when label noise is more problematic. We also benchmark several existing algorithms for learning with noisy labels and compare their behavior on our synthetic datasets and on the datasets with independent random label noise. Additionally, with the availability of annotator information from our simulation framework, we propose a new technique, Label Quality Model (LQM), that leverages annotator features to predict and correct against noisy labels. We show that by adding LQM as a label correction step before applying existing noisy label techniques, we can further improve the models performance.

التعلم الآلي تفاعل الإنسان والحاسوب

Measuring Compositional Generalization: A Comprehensive Method on Realistic Data

65 - Daniel Keysers , Nathanael Scharli , Nathan Scales 2019

State-of-the-art machine learning methods exhibit limited compositional generalization. At the same time, there is a lack of realistic benchmarks that comprehensively measure this ability, which makes it challenging to find and evaluate improvements. We introduce a novel method to systematically construct such benchmarks by maximizing compound divergence while guaranteeing a small atom divergence between train and test sets, and we quantitatively compare this method to other approaches for creating compositional generalization benchmarks. We present a large and realistic natural language question answering dataset that is constructed according to this method, and we use it to analyze the compositional generalization ability of three machine learning architectures. We find that they fail to generalize compositionally and that there is a surprisingly strong negative correlation between compound divergence and accuracy. We also demonstrate how our method can be used to create new compositionality benchmarks on top of the existing SCAN dataset, which confirms these findings.

التعلم الآلي الحساب واللغة التعلم الالي

Label Dependent Deep Variational Paraphrase Generation

94 - Siamak Shakeri , Abhinav Sethy 2019

Generating paraphrases that are lexically similar but semantically different is a challenging task. Paraphrases of this form can be used to augment data sets for various NLP tasks such as machine reading comprehension and question answering with non- trivial negative examples. In this article, we propose a deep variational model to generate paraphrases conditioned on a label that specifies whether the paraphrases are semantically related or not. We also present new training recipes and KL regularization techniques that improve the performance of variational paraphrasing models. Our proposed model demonstrates promising results in enhancing the generative power of the model by employing label-dependent generation on paraphrasing datasets.

التعلم الآلي الحساب واللغة التعلم الالي

Distilling Effective Supervision from Severe Label Noise

144 - Zizhao Zhang , Han Zhang , Sercan O. Arik 2019

Collecting large-scale data with clean labels for supervised training of neural networks is practically challenging. Although noisy labels are usually cheap to acquire, existing methods suffer a lot from label noise. This paper targets at the challen ge of robust training at high label noise regimes. The key insight to achieve this goal is to wisely leverage a small trusted set to estimate exemplar weights and pseudo labels for noisy data in order to reuse them for supervised training. We present a holistic framework to train deep neural networks in a way that is highly invulnerable to label noise. Our method sets the new state of the art on various types of label noise and achieves excellent performance on large-scale datasets with real-world label noise. For instance, on CIFAR100 with a $40%$ uniform noise ratio and only 10 trusted labeled data per class, our method achieves $80.2{pm}0.3%$ classification accuracy, where the error rate is only $1.4%$ higher than a neural network trained without label noise. Moreover, increasing the noise ratio to $80%$, our method still maintains a high accuracy of $75.5{pm}0.2%$, compared to the previous best accuracy $48.2%$. Source code available: https://github.com/google-research/google-research/tree/master/ieg

التعلم الآلي الرؤية الحاسوبية وتمييز الأنماط التعلم الالي

Part-dependent Label Noise: Towards Instance-dependent Label Noise

266 - Xiaobo Xia , Tongliang Liu , Bo Han 2020

Learning with the textit{instance-dependent} label noise is challenging, because it is hard to model such real-world noise. Note that there are psychological and physiological evidences showing that we humans perceive instances by decomposing them in to parts. Annotators are therefore more likely to annotate instances based on the parts rather than the whole instances, where a wrong mapping from parts to classes may cause the instance-dependent label noise. Motivated by this human cognition, in this paper, we approximate the instance-dependent label noise by exploiting textit{part-dependent} label noise. Specifically, since instances can be approximately reconstructed by a combination of parts, we approximate the instance-dependent textit{transition matrix} for an instance by a combination of the transition matrices for the parts of the instance. The transition matrices for parts can be learned by exploiting anchor points (i.e., data points that belong to a specific class almost surely). Empirical evaluations on synthetic and real-world datasets demonstrate our method is superior to the state-of-the-art approaches for learning from the instance-dependent label noise.

التعلم الآلي التعلم الالي