No Arabic abstract
Recently, a variety of regularization techniques have been widely applied in deep neural networks, such as dropout, batch normalization, data augmentation, and so on. These methods mainly focus on the regularization of weight parameters to prevent overfitting effectively. In addition, label regularization techniques such as label smoothing and label disturbance have also been proposed with the motivation of adding a stochastic perturbation to labels. In this paper, we propose a novel adaptive label regularization method, which enables the neural network to learn from the erroneous experience and update the optimal label representation online. On the other hand, compared with knowledge distillation, which learns the correlation of categories using teacher network, our proposed method requires only a minuscule increase in parameters without cumbersome teacher network. Furthermore, we evaluate our method on CIFAR-10/CIFAR-100/ImageNet datasets for image recognition tasks and AGNews/Yahoo/Yelp-Full datasets for text classification tasks. The empirical results show significant improvement under all experimental settings.
Recent studies on the memorization effects of deep neural networks on noisy labels show that the networks first fit the correctly-labeled training samples before memorizing the mislabeled samples. Motivated by this early-learning phenomenon, we propose a novel method to prevent memorization of the mislabeled samples. Unlike the existing approaches which use the model output to identify or ignore the mislabeled samples, we introduce an indicator branch to the original model and enable the model to produce a confidence value for each sample. The confidence values are incorporated in our loss function which is learned to assign large confidence values to correctly-labeled samples and small confidence values to mislabeled samples. We also propose an auxiliary regularization term to further improve the robustness of the model. To improve the performance, we gradually correct the noisy labels with a well-designed target estimation strategy. We provide the theoretical analysis and conduct the experiments on synthetic and real-world datasets, demonstrating that our approach achieves comparable results to the state-of-the-art methods.
Label Smoothing (LS) is an effective regularizer to improve the generalization of state-of-the-art deep models. For each training sample the LS strategy smooths the one-hot encoded training signal by distributing its distribution mass over the non ground-truth classes, aiming to penalize the networks from generating overconfident output distributions. This paper introduces a novel label smoothing technique called Pairwise Label Smoothing (PLS). The PLS takes a pair of samples as input. Smoothing with a pair of ground-truth labels enables the PLS to preserve the relative distance between the two truth labels while further soften that between the truth labels and the other targets, resulting in models producing much less confident predictions than the LS strategy. Also, unlike current LS methods, which typically require to find a global smoothing distribution mass through cross-validation search, PLS automatically learns the distribution mass for each input pair during training. We empirically show that PLS significantly outperforms LS and the baseline models, achieving up to 30% of relative classification error reduction. We also visually show that when achieving such accuracy gains the PLS tends to produce very low winning softmax scores.
Robust loss minimization is an important strategy for handling robust learning issue on noisy labels. Current robust loss functions, however, inevitably involve hyperparameter(s) to be tuned, manually or heuristically through cross validation, which makes them fairly hard to be generally applied in practice. Besides, the non-convexity brought by the loss as well as the complicated network architecture makes it easily trapped into an unexpected solution with poor generalization capability. To address above issues, we propose a meta-learning method capable of adaptively learning hyperparameter in robust loss functions. Specifically, through mutual amelioration between robust loss hyperparameter and network parameters in our method, both of them can be simultaneously finely learned and coordinated to attain solutions with good generalization capability. Four kinds of SOTA robust loss functions are attempted to be integrated into our algorithm, and comprehensive experiments substantiate the general availability and effectiveness of the proposed method in both its accuracy and generalization performance, as compared with conventional hyperparameter tuning strategy, even with carefully tuned hyperparameters.
The main goal of this work is equipping convex and nonconvex problems with Barzilai-Borwein (BB) step size. With the adaptivity of BB step sizes granted, they can fail when the objective function is not strongly convex. To overcome this challenge, the key idea here is to bridge (non)convex problems and strongly convex ones via regularization. The proposed regularization schemes are textit{simple} yet effective. Wedding the BB step size with a variance reduction method, known as SARAH, offers a free lunch compared with vanilla SARAH in convex problems. The convergence of BB step sizes in nonconvex problems is also established and its complexity is no worse than other adaptive step sizes such as AdaGrad. As a byproduct, our regularized SARAH methods for convex functions ensure that the complexity to find $mathbb{E}[| abla f(mathbf{x}) |^2]leq epsilon$ is ${cal O}big( (n+frac{1}{sqrt{epsilon}})ln{frac{1}{epsilon}}big)$, improving $epsilon$ dependence over existing results. Numerical tests further validate the merits of proposed approaches.
Real-world large-scale datasets are heteroskedastic and imbalanced -- labels have varying levels of uncertainty and label distributions are long-tailed. Heteroskedasticity and imbalance challenge deep learning algorithms due to the difficulty of distinguishing among mislabeled, ambiguous, and rare examples. Addressing heteroskedasticity and imbalance simultaneously is under-explored. We propose a data-dependent regularization technique for heteroskedastic datasets that regularizes different regions of the input space differently. Inspired by the theoretical derivation of the optimal regularization strength in a one-dimensional nonparametric classification setting, our approach adaptively regularizes the data points in higher-uncertainty, lower-density regions more heavily. We test our method on several benchmark tasks, including a real-world heteroskedastic and imbalanced dataset, WebVision. Our experiments corroborate our theory and demonstrate a significant improvement over other methods in noise-robust deep learning.