Overfitting frequently occurs in deep learning. In this paper, we propose a novel regularization method called Drop-Activation to reduce overfitting and improve generalization. The key idea is to drop nonlinear activation functions by setting them to be identity functions randomly during training time. During testing, we use a deterministic network with a new activation function to encode the average effect of dropping activations randomly. Our theoretical analyses support the regularization effect of Drop-Activation as implicit parameter reduction and verify its capability to be used together with Batch Normalization (Ioffe and Szegedy 2015). The experimental results on CIFAR-10, CIFAR-100, SVHN, EMNIST, and ImageNet show that Drop-Activation generally improves the performance of popular neural network architectures for the image classification task. Furthermore, as a regularizer Drop-Activation can be used in harmony with standard training and regularization techniques such as Batch Normalization and Auto Augment (Cubuk et al. 2019). The code is available at https://github.com/LeungSamWai/Drop-Activation.
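The mechanism described above can be sketched roughly as follows. This is only an illustration under stated assumptions: ReLU is taken as the nonlinearity, the drop decision is made independently per element, and the parameter name `keep_prob` and its default value are hypothetical rather than taken from the abstract. Under these assumptions, averaging over the random mask yields a leaky-ReLU-like function at test time.

```python
import numpy as np

def drop_activation(x, keep_prob=0.95, training=True, rng=None):
    """ReLU that is randomly replaced by the identity during training.

    keep_prob is an assumed name for the probability of keeping the
    nonlinearity; it is not specified in the abstract.
    """
    x = np.asarray(x, dtype=float)
    relu = np.maximum(x, 0.0)
    if training:
        rng = np.random.default_rng() if rng is None else rng
        # True -> keep the ReLU, False -> fall back to the identity.
        keep = rng.random(x.shape) < keep_prob
        return np.where(keep, relu, x)
    # Deterministic test-time activation: the expectation over the mask.
    return keep_prob * relu + (1.0 - keep_prob) * x

x = np.array([-2.0, 3.0])
test_out = drop_activation(x, keep_prob=0.75, training=False)
```

With `keep_prob=0.75`, negative inputs are scaled by 0.25 at test time while positive inputs pass through unchanged.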
Multi-task learning (MTL) is a common paradigm that seeks to improve the generalization performance of task learning by training related tasks simultaneously. However, it remains challenging to search for a flexible and accurate architecture that can be shared among multiple tasks. In this paper, we propose a novel deep learning model called Task Adaptive Activation Network (TAAN) that can automatically learn the optimal network architecture for MTL. The main principle of TAAN is to derive flexible activation functions for different tasks from the data, with all other parameters of the network fully shared. We further propose two functional regularization methods that improve the MTL performance of TAAN. The improved performance of both TAAN and the regularization methods is demonstrated by comprehensive experiments.
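To make the sharing pattern concrete, here is a hedged forward-pass sketch in which the weights are shared and only the activation differs per task. The abstract does not say how the activations are parameterized; the basis functions, the coefficient vectors, and the names `BASIS`, `task_activation`, and `forward` are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical basis set; the abstract does not specify the parameterization.
BASIS = (lambda z: np.maximum(z, 0.0), np.tanh, lambda z: z)

def task_activation(z, coeffs):
    """Task-specific activation as a weighted sum of the basis functions."""
    return sum(c * f(z) for c, f in zip(coeffs, BASIS))

def forward(x, shared_W, task_coeffs, task):
    """One layer: the weights are shared, only the activation depends on the task."""
    return task_activation(x @ shared_W, task_coeffs[task])

x = np.array([[1.0, -2.0]])
W = np.eye(2)  # shared across both tasks
coeffs = {"task_a": [1.0, 0.0, 0.0],   # behaves like ReLU
          "task_b": [0.0, 0.0, 1.0]}   # behaves like the identity
out_a = forward(x, W, coeffs, "task_a")
out_b = forward(x, W, coeffs, "task_b")
```

In a trainable version the coefficient vectors would be learned per task alongside the shared weights; that part is omitted here.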
The main goal of this work is to equip convex and nonconvex problems with the Barzilai-Borwein (BB) step size. Although BB step sizes are adaptive, they can fail when the objective function is not strongly convex. To overcome this challenge, the key idea here is to bridge (non)convex problems and strongly convex ones via regularization. The proposed regularization schemes are simple yet effective. Wedding the BB step size with a variance reduction method, known as SARAH, offers a free lunch compared with vanilla SARAH in convex problems. The convergence of BB step sizes in nonconvex problems is also established, and its complexity is no worse than that of other adaptive step sizes such as AdaGrad. As a byproduct, our regularized SARAH methods for convex functions ensure that the complexity to find $\mathbb{E}[\| \nabla f(\mathbf{x}) \|^2] \leq \epsilon$ is $\mathcal{O}\big( (n+\frac{1}{\sqrt{\epsilon}})\ln\frac{1}{\epsilon}\big)$, improving the $\epsilon$ dependence over existing results. Numerical tests further validate the merits of the proposed approaches.
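As a rough illustration of the idea of pairing a BB step size with regularization, the sketch below runs plain (deterministic) gradient descent with the classical BB step $\eta_k = \|s_k\|^2 / (s_k^\top y_k)$, where $s_k$ and $y_k$ are the differences of consecutive iterates and gradients. It is not the SARAH-based method of the paper. The added $\frac{\mu}{2}\|x\|^2$ term, the value of `mu`, and the function name `bb_gradient_descent` are assumptions chosen for this example; the abstract does not describe the actual regularization schemes.

```python
import numpy as np

def bb_gradient_descent(grad, x0, eta0=1e-3, iters=100, tol=1e-10):
    """Gradient descent with the BB step size eta = (s.s) / (s.y)."""
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad(x_prev)
    x = x_prev - eta0 * g_prev          # first step uses a fixed step size
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        s, y = x - x_prev, g - g_prev   # iterate and gradient differences
        sy = s @ y
        eta = (s @ s) / sy if sy > 0 else eta0
        x_prev, g_prev = x, g
        x = x - eta * g
    return x

# Least squares with a rank-deficient A: convex but not strongly convex.
A = np.array([[1.0, 2.0], [2.0, 4.0]])
b = np.array([1.0, 2.0])
mu = 0.1  # assumed regularization weight; makes the objective strongly convex

grad = lambda x: A.T @ (A @ x - b) + mu * x
x_bb = bb_gradient_descent(grad, x0=np.array([1.0, -1.0]))
x_ref = np.linalg.solve(A.T @ A + mu * np.eye(2), A.T @ b)
```

Without the `mu * x` term the Hessian has a zero eigenvalue, so `s @ y` can vanish and the BB ratio is no longer well defined; the regularizer keeps it bounded.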
Deep ReLU networks trained with the square loss have been observed to perform well in classification tasks. We provide here a theoretical justification based on analysis of the associated gradient flow. We show that convergence to a solution with the absolute minimum norm is expected when normalization techniques such as Batch Normalization (BN) or Weight Normalization (WN) are used together with Weight Decay (WD). The main property of the minimizers that bounds their expected error is the norm: we prove that among all the close-to-interpolating solutions, the ones associated with smaller Frobenius norms of the unnormalized weight matrices have better margin and better bounds on the expected classification error. With BN but in the absence of WD, the dynamical system is singular. Implicit dynamical regularization -- that is, zero-initial conditions biasing the dynamics towards high margin solutions -- is also possible in the no-BN and no-WD case. The theory yields several predictions, including the role of BN and weight decay, aspects of Papyan, Han and Donoho's Neural Collapse, and the constraints induced by BN on the network weights.
In an attempt to better understand generalization in deep learning, we study several possible explanations. We show that implicit regularization induced by the optimization method plays a key role in the generalization and success of deep learning models. Motivated by this view, we study how different complexity measures can ensure generalization and explain how optimization algorithms can implicitly regularize complexity measures. We empirically investigate the ability of these measures to explain different observed phenomena in deep learning. We further study the invariances in neural networks, suggest complexity measures and optimization algorithms that have similar invariances to those in neural networks, and evaluate them on a number of learning tasks.
It is well known that simple short-sighted algorithms, such as gradient descent, generalize well in over-parameterized learning tasks due to their implicit regularization. However, it is unknown whether the implicit regularization of these algorithms can be extended to robust learning tasks, where a subset of samples may be grossly corrupted with noise. In this work, we provide a positive answer to this question in the context of the robust matrix recovery problem. In particular, we consider the problem of recovering a low-rank matrix from a number of linear measurements, where a subset of measurements are corrupted with large noise. We show that a simple sub-gradient method converges to the true low-rank solution efficiently when it is applied to the over-parameterized $\ell_1$-loss function without any explicit regularization or rank constraint. Moreover, by building upon a new notion of restricted isometry property, called sign-RIP, we prove the robustness of the sub-gradient method against outliers in the over-parameterized regime. In particular, we show that, with Gaussian measurements, the sub-gradient method is guaranteed to converge to the true low-rank solution, even if an arbitrary fraction of the measurements are grossly corrupted with noise.
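The setting above can be sketched as follows, assuming a symmetric factorization $X = UU^\top$ with a full $d \times d$ factor (no rank constraint) and the loss $f(U) = \frac{1}{m}\sum_i |\langle A_i, UU^\top\rangle - y_i|$. The factorization, the small random initialization, and the geometrically decaying step size (`eta0`, `decay`, `init_scale`) are assumptions made for this illustration; the abstract does not state them, and the sketch makes no claim about convergence.

```python
import numpy as np

def l1_loss(U, A, y):
    """Mean absolute residual of the measurements <A_i, U U^T> against y_i."""
    residual = np.einsum("kij,ij->k", A, U @ U.T) - y
    return np.mean(np.abs(residual))

def l1_subgradient(U, A, y):
    """A sub-gradient of l1_loss with respect to U."""
    residual = np.einsum("kij,ij->k", A, U @ U.T) - y
    # Sign-weighted average of the measurement matrices.
    S = np.einsum("k,kij->ij", np.sign(residual), A) / len(y)
    return (S + S.T) @ U

def subgradient_method(A, y, d, steps=500, eta0=0.1, decay=0.99,
                       init_scale=1e-3, seed=0):
    """Sub-gradient descent from a small random start (assumed schedule)."""
    rng = np.random.default_rng(seed)
    U = init_scale * rng.standard_normal((d, d))  # over-parameterized factor
    for t in range(steps):
        U = U - eta0 * decay**t * l1_subgradient(U, A, y)
    return U

# Tiny hand-checkable case: f(U) = |U^2 - 1| at U = 2.
U1 = np.array([[2.0]])
A1 = np.array([[[1.0]]])
y1 = np.array([1.0])
loss1 = l1_loss(U1, A1, y1)          # |4 - 1|
sub1 = l1_subgradient(U1, A1, y1)    # d/dU (U^2 - 1) at U = 2
```

Because the update only uses the signs of the residuals, a grossly corrupted measurement contributes no more to the direction than a clean one, which is the intuition behind using the $\ell_1$-loss here.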