254 - Difan Zou, Yuan Cao, Yuanzhi Li (2021)
Adaptive gradient methods such as Adam have gained increasing popularity in deep learning optimization. However, it has been observed that, compared with (stochastic) gradient descent, Adam can converge to a different solution with a significantly worse test error in many deep learning applications such as image classification, even with fine-tuned regularization. In this paper, we provide a theoretical explanation for this phenomenon: we show that in the nonconvex setting of learning over-parameterized two-layer convolutional neural networks starting from the same random initialization, for a class of data distributions (inspired by image data), Adam and gradient descent (GD) can converge to different global solutions of the training objective with provably different generalization errors, even with weight decay regularization. In contrast, we show that if the training objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam and GD, will converge to the same solution if training is successful. This suggests that the inferior generalization performance of Adam is fundamentally tied to the nonconvex landscape of deep learning optimization.
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice, which have been hypothesized to play an important role in the generalization of modern machine learning approaches. In this work, we seek to understand these issues in the simpler setting of linear regression (including both the underparameterized and overparameterized regimes), where our goal is to make sharp instance-based comparisons of the implicit regularization afforded by (unregularized) averaged SGD with the explicit regularization of ridge regression. For a broad class of least squares problem instances (natural in high-dimensional settings), we show: (1) for every problem instance and every ridge parameter, (unregularized) SGD, when provided with logarithmically more samples than the ridge algorithm, generalizes no worse than the ridge solution (provided SGD uses a tuned constant stepsize); (2) conversely, there exist instances (in this wide problem class) where optimally tuned ridge regression requires quadratically more samples than SGD to achieve the same generalization performance. Taken together, our results show that, up to logarithmic factors, the generalization performance of SGD is never worse than that of ridge regression across a wide range of overparameterized problems, and can in fact be much better for some instances. More generally, our results show how algorithmic regularization has important consequences even in simpler (overparameterized) convex settings.
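The ridge-versus-averaged-SGD comparison can be sketched on a toy least-squares instance. All sizes, the stepsize, the ridge parameter, and the noise level below are illustrative choices for the sketch, not the paper's constructions:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy least-squares instance (sizes and noise level are illustrative).
n, d = 200, 50
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d) / np.sqrt(d)
y = X @ w_star + 0.1 * rng.normal(size=n)

def averaged_sgd(X, y, stepsize=0.01, epochs=20):
    """Unregularized constant-stepsize SGD with iterate averaging."""
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            g = (X[i] @ w - y[i]) * X[i]   # stochastic gradient of 0.5*(x_i.w - y_i)^2
            w -= stepsize * g
            w_sum += w
            t += 1
    return w_sum / t                        # the averaged iterate is what is compared to ridge

def ridge(X, y, lam=1.0):
    """Explicitly regularized least squares: argmin ||Xw - y||^2 + lam*||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_sgd = averaged_sgd(X, y)
w_ridge = ridge(X, y)

# Compare generalization via error on fresh noiseless data.
X_te = rng.normal(size=(1000, d))
y_te = X_te @ w_star
err = lambda w: np.mean((X_te @ w - y_te) ** 2)
print(err(w_sgd), err(w_ridge))
```

On easy instances like this one the two errors are comparable; the paper's point is a sharp instance-by-instance comparison over a much richer problem class.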
We consider a binary classification problem where the data come from a mixture of two rotationally symmetric distributions satisfying concentration and anti-concentration properties enjoyed by, among others, log-concave distributions. We show that there exists a universal constant $C_{\mathrm{err}}>0$ such that if a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ can achieve classification error at most $C_{\mathrm{err}}$, then for any $\varepsilon>0$, an iterative self-training algorithm initialized at $\boldsymbol{\beta}_0 := \boldsymbol{\beta}_{\mathrm{pl}}$, using pseudolabels $\hat y = \mathrm{sgn}(\langle \boldsymbol{\beta}_t, \mathbf{x}\rangle)$ and at most $\tilde O(d/\varepsilon^2)$ unlabeled examples, suffices to learn the Bayes-optimal classifier up to $\varepsilon$ error, where $d$ is the ambient dimension. That is, self-training converts weak learners to strong learners using only unlabeled examples. We additionally show that by running gradient descent on the logistic loss one can obtain a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ with classification error $C_{\mathrm{err}}$ using only $O(d)$ labeled examples (i.e., independent of $\varepsilon$). Together, our results imply that mixture models can be learned to within $\varepsilon$ of the Bayes-optimal accuracy using at most $O(d)$ labeled examples and $\tilde O(d/\varepsilon^2)$ unlabeled examples via a semi-supervised self-training algorithm.
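The iterative self-training loop can be illustrated on a simple Gaussian mixture. The mixture, the weak initial direction, and the moment-based refit below are illustrative stand-ins for the paper's general setting, chosen so the sketch stays short:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 20
mu = np.zeros(d); mu[0] = 2.0               # mixture mean (illustrative Gaussian mixture,
                                            # not the paper's general distribution class)

def sample(n):
    y = rng.choice([-1, 1], size=n)
    return y[:, None] * mu + rng.normal(size=(n, d)), y

# A deliberately weak (but better-than-random) initial pseudolabeler.
mu_hat = mu / np.linalg.norm(mu)
e = np.zeros(d); e[1] = 1.0                 # direction orthogonal to mu
beta = 0.2 * mu_hat + 0.98 * e
beta /= np.linalg.norm(beta)

x_u, y_u = sample(5000)                     # unlabeled pool (y_u kept only for evaluation)

for t in range(10):
    y_hat = np.sign(x_u @ beta)             # pseudolabels sgn(<beta_t, x>)
    beta = (y_hat[:, None] * x_u).mean(axis=0)   # simple moment-based refit on pseudolabels
    beta /= np.linalg.norm(beta)

acc = np.mean(np.sign(x_u @ beta) == y_u)   # accuracy vs. the (held-out) true labels
print(acc)
```

Starting from roughly 65% accuracy, a few self-training rounds drive the direction toward $\mu$ and the accuracy toward the Bayes-optimal level, using only unlabeled examples.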
We analyze the properties of adversarial training for learning adversarially robust halfspaces in the presence of agnostic label noise. Denoting by $\mathsf{OPT}_{p,r}$ the best robust classification error achieved by a halfspace that is robust to perturbations in $\ell_{p}$ balls of radius $r$, we show that adversarial training on the standard binary cross-entropy loss yields adversarially robust halfspaces up to (robust) classification error $\tilde O(\sqrt{\mathsf{OPT}_{2,r}})$ for $p=2$, and $\tilde O(d^{1/4} \sqrt{\mathsf{OPT}_{\infty, r}} + d^{1/2} \mathsf{OPT}_{\infty,r})$ when $p=\infty$. Our results hold for distributions satisfying anti-concentration properties enjoyed by, among others, log-concave isotropic distributions. We additionally show that if one instead uses a nonconvex sigmoidal loss, adversarial training yields halfspaces with an improved robust classification error of $O(\mathsf{OPT}_{2,r})$ for $p=2$, and $O(d^{1/4}\mathsf{OPT}_{\infty, r})$ when $p=\infty$. To the best of our knowledge, this is the first work to show that adversarial training provably yields robust classifiers in the presence of noise.
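For a linear model, the inner maximization of adversarial training has a closed form: the worst-case $\ell_2$ perturbation of radius $r$ shifts each point by $r$ against its margin, so the robust logistic loss is $\log(1 + \exp(-(y\langle w, x\rangle - r\|w\|)))$. The sketch below exploits this; the data model, noise level, and stepsizes are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, r = 10, 500, 0.3                      # dimension, samples, l2 perturbation radius (illustrative)
w_true = np.zeros(d); w_true[0] = 1.0
X = rng.normal(size=(n, d))
y = np.sign(X @ w_true)
y[rng.random(n) < 0.05] *= -1               # agnostic label noise (5%, illustrative)

def adversarial_train(X, y, r, stepsize=0.1, epochs=200):
    """Adversarial training of a halfspace against l2 perturbations of radius r,
    via gradient descent on the closed-form robust logistic loss."""
    n, d = X.shape
    w = 0.01 * rng.normal(size=d)
    for _ in range(epochs):
        margins = y * (X @ w) - r * np.linalg.norm(w)   # robust margins
        s = 1.0 / (1.0 + np.exp(margins))               # = -dloss/dmargin
        wn = w / max(np.linalg.norm(w), 1e-12)
        grad = -(s[:, None] * (y[:, None] * X - r * wn)).mean(axis=0)
        w -= stepsize * grad
    return w

w = adversarial_train(X, y, r)
rob_err = np.mean(y * (X @ w) - r * np.linalg.norm(w) <= 0)  # empirical robust error
print(rob_err)
```

Despite the label noise, the learned direction aligns closely with the true halfspace, and the residual robust error is dominated by points within distance $r$ of the boundary plus the flipped labels.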
There is an increasing realization that algorithmic inductive biases are central to preventing overfitting; empirically, we often observe a benign overfitting phenomenon in overparameterized settings for natural learning algorithms such as stochastic gradient descent (SGD), where little to no explicit regularization has been employed. This work considers this issue in arguably the most basic setting: constant-stepsize SGD (with iterate averaging) for linear regression in the overparameterized regime. Our main result provides a sharp excess risk bound, stated in terms of the full eigenspectrum of the data covariance matrix, that reveals a bias-variance decomposition characterizing when generalization is possible: (i) the variance bound is characterized in terms of an effective dimension (specific to SGD), and (ii) the bias bound provides a sharp geometric characterization in terms of the location of the initial iterate (and how it aligns with the data covariance matrix). We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD and that of ordinary least squares (minimum-norm interpolation) and ridge regression.
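A minimal sketch of the setting: averaged constant-stepsize SGD on an overparameterized regression problem with a decaying covariance eigenspectrum, compared against the minimum-norm interpolator. The spectrum, sizes, and stepsize are illustrative, not the paper's constructions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 400                             # overparameterized: d > n (sizes are illustrative)
lam = 1.0 / np.arange(1, d + 1) ** 2        # fast-decaying covariance eigenspectrum
X = rng.normal(size=(n, d)) * np.sqrt(lam)  # features with covariance diag(lam)
w_star = np.zeros(d); w_star[0] = 1.0       # signal in the top eigendirection
y = X @ w_star + 0.1 * rng.normal(size=n)

def averaged_sgd(X, y, stepsize=0.1, epochs=100):
    """Constant-stepsize SGD with iterate averaging, started at zero."""
    n, d = X.shape
    w = np.zeros(d)                          # initial iterate: the bias term depends on where this sits
    w_sum = np.zeros(d); t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            w -= stepsize * (X[i] @ w - y[i]) * X[i]
            w_sum += w; t += 1
    return w_sum / t

w_sgd = averaged_sgd(X, y)
w_mn = np.linalg.pinv(X) @ y                 # minimum-norm interpolator, for comparison
risk = lambda w: np.sum(lam * (w - w_star) ** 2)   # population excess risk under diag(lam)
print(risk(np.zeros(d)), risk(w_sgd), risk(w_mn))
```

With the signal concentrated in the top eigendirections, averaged SGD drives the excess risk far below that of the zero initializer; the paper's bounds make this precise via an SGD-specific effective dimension and the alignment of the initial iterate with the covariance.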
331 - Difan Zou, Ziniu Hu, Yewen Wang (2019)
Graph convolutional networks (GCNs) have recently received wide attention, due to their successful applications to different graph tasks in different domains. Training GCNs on a large graph, however, remains a challenge. Original full-batch GCN training requires calculating the representations of all nodes in the graph at every GCN layer, which incurs high computation and memory costs. To alleviate this issue, several sampling-based methods have been proposed to train GCNs on a subset of nodes. Among them, node-wise neighbor sampling recursively samples a fixed number of neighbors per node, so its computation cost suffers from an exponentially growing neighborhood size; layer-wise importance sampling discards the neighbor-dependent constraints, so the nodes sampled across layers suffer from a sparse-connection problem. To address both problems, we propose a new, effective sampling algorithm called LAyer-Dependent ImportancE Sampling (LADIES). Based on the nodes sampled in the upper layer, LADIES selects their neighborhood nodes, constructs a bipartite subgraph, and computes the importance probabilities accordingly. It then samples a fixed number of nodes according to these probabilities, and recursively conducts this procedure layer by layer to construct the whole computation graph. We show, both theoretically and experimentally, that the proposed sampling algorithm outperforms previous sampling methods in terms of both time and memory costs. Furthermore, LADIES achieves better generalization accuracy than original full-batch GCN training, due to its stochastic nature.
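The layer-dependent sampling loop can be sketched as follows, using a dense toy adjacency matrix; real implementations use sparse matrices, and the probability and reweighting details here are simplified:

```python
import numpy as np

rng = np.random.default_rng(3)

def ladies_sample(adj, batch_nodes, samples_per_layer, num_layers):
    """One mini-batch of layer-dependent importance sampling (a simplified sketch).

    adj: dense normalized adjacency (toy scale only).
    Returns, per layer, the sampled node ids and the reweighted bipartite block.
    """
    layers = []
    upper = np.asarray(batch_nodes)
    for _ in range(num_layers):
        rows = adj[upper]                        # restrict to the upper layer's rows
        p = (rows ** 2).sum(axis=0)              # importance ~ squared column norms,
        p = p / p.sum()                          #   nonzero only for neighbors of `upper`
        k = min(samples_per_layer, np.count_nonzero(p))
        lower = rng.choice(adj.shape[0], size=k, replace=False, p=p)
        block = rows[:, lower] / (k * p[lower])  # unbiased reweighting of the bipartite block
        layers.append((lower, block))
        upper = lower                            # recurse: next layer conditions on these nodes
    return layers

# Toy graph: ring of 12 nodes with self-loops, row-normalized.
n = 12
A = np.eye(n)
for i in range(n):
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1.0
A = A / A.sum(axis=1, keepdims=True)

layers = ladies_sample(A, batch_nodes=[0, 1], samples_per_layer=4, num_layers=2)
for nodes, block in layers:
    print(nodes, block.shape)
```

Because each layer draws only from neighbors of the layer above, the per-layer node count stays fixed (no exponential neighborhood growth) while the sampled layers remain densely connected to each other.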
Consider an ultraviolet (UV) scattering communication system in which the position of the transmitter is fixed and the receiver can move around on the ground. To obtain the link gain effectively and economically, we propose an algorithm based on one-dimensional (1D) numerical integration and an offline data library. Moreover, we analyze the 2D scattering intensity distributions for both an LED and a laser source, and observe that the contours are well fitted by elliptic models. The relationships between the characteristics of the fitted ellipses and the source parameters are provided via numerical results.
We characterize a practical photon-counting receiver in optical scattering communication with finite sampling rate and electrical noise. At the receiver side, the detected signal can be characterized as a series of pulses generated by a photomultiplier tube (PMT) detector and held by pulse-holding circuits, which are then sampled by an analog-to-digital converter (ADC) at a finite sampling rate and counted by a rising-edge pulse detector. However, the small but finite pulse width incurs a dead-time effect that may lead to a sub-Poisson distribution of the recorded pulses. We analyze the first- and second-order moments of the number of recorded pulses under finite sampling rate for two cases: the sampling period shorter than or equal to the pulse width, and longer than the pulse width. Moreover, we adopt maximum likelihood (ML) detection. To simplify the analysis, we approximate the number of recorded pulses in each slot by a binomial distribution. A tractable holding-time and decision-threshold selection rule is provided, aiming to maximize the minimal Kullback-Leibler (KL) distance between the two distributions. The proposed sub-Poisson distribution and its binomial approximation are verified by experimental results. The equivalent arrival rate and holding time predicted by the sub-Poisson model, and the associated binomial distribution accounting for finite sampling rate and electrical noise, are validated by simulation results. The proposed holding-time and decision-threshold selection rule performs close to the optimal one.
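The dead-time effect can be illustrated with a toy simulation of held pulses, finite-rate sampling, and rising-edge counting. All rates, widths, and periods below are illustrative, and this is not the paper's exact receiver model:

```python
import numpy as np

rng = np.random.default_rng(5)

def count_recorded_pulses(rate, T, pulse_width, sample_period):
    """Toy model: Poisson photon arrivals, each held high for pulse_width,
    sampled every sample_period, counted by a rising-edge detector."""
    n = rng.poisson(rate * T)
    arrivals = np.sort(rng.uniform(0, T, size=n))
    t_samples = np.arange(0, T, sample_period)
    high = np.zeros(len(t_samples), dtype=bool)
    for a in arrivals:                       # signal is high while any pulse is held
        high |= (t_samples >= a) & (t_samples < a + pulse_width)
    # count low -> high transitions (plus an initial high sample)
    return int(np.sum(high[1:] & ~high[:-1]) + (1 if len(high) and high[0] else 0))

# Overlapping held pulses merge, so recorded counts fall below the Poisson mean:
T, rate = 1.0, 200.0
counts = [count_recorded_pulses(rate, T, pulse_width=0.002, sample_period=0.001)
          for _ in range(200)]
print(np.mean(counts), rate * T)
```

The mean recorded count sits well below the true arrival mean, which is the sub-Poisson behavior the moment analysis above quantifies.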
In optical wireless scattering communication, the received signal in each symbol interval is captured by a photomultiplier tube (PMT) and then sampled over very short but finite sampling intervals. The resulting samples form a signal vector for symbol detection. Upper and lower bounds on the transmission rate of such a processing system are studied. It is shown that the gap between the two bounds approaches zero as the thermal-noise and shot-noise variances approach zero. Maximum a posteriori (MAP) signal detection is performed, and a receiver of low computational complexity is derived under a piecewise polynomial approximation. Meanwhile, threshold-based signal detection is also studied, where two threshold selection rules are proposed based on the detection error probability and the Kullback-Leibler (KL) distance. For the latter, it is shown that the KL distance is not sensitive to the threshold selection when the shot- and thermal-noise variances are small, so the threshold can be selected within a wide range without significant loss from the optimal KL distance. The performance of the transmission-rate bounds, the signal detection, and the threshold selection approaches is evaluated via numerical results.
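A sketch of a KL-based threshold rule on a simplified model where the per-symbol counts are Poisson with rates lam0 and lam1 (an assumption for illustration; the paper works with the sampled PMT signal model): pick the count threshold that maximizes the worst-direction KL distance between the two thresholded observation distributions.

```python
import math

def poisson_tail(lam, tau):
    """P(N >= tau) for N ~ Poisson(lam)."""
    cdf, term = 0.0, math.exp(-lam)
    for k in range(tau):                     # sum P(N = k) for k = 0..tau-1
        cdf += term
        term *= lam / (k + 1)
    return 1.0 - cdf

def bernoulli_kl(q0, q1):
    eps = 1e-12
    q0 = min(max(q0, eps), 1 - eps)
    q1 = min(max(q1, eps), 1 - eps)
    return q0 * math.log(q0 / q1) + (1 - q0) * math.log((1 - q0) / (1 - q1))

def select_threshold(lam0, lam1, max_tau=100):
    """KL-based rule (sketch): maximize the smaller of the two KL directions
    between the thresholded (Bernoulli) distributions under each hypothesis."""
    best_tau, best_kl = 1, -1.0
    for tau in range(1, max_tau):
        q0, q1 = poisson_tail(lam0, tau), poisson_tail(lam1, tau)
        kl = min(bernoulli_kl(q0, q1), bernoulli_kl(q1, q0))
        if kl > best_kl:
            best_tau, best_kl = tau, kl
    return best_tau

tau = select_threshold(lam0=2.0, lam1=10.0)
print(tau)
```

Consistent with the insensitivity result quoted above, the KL objective is flat near its maximizer here, so thresholds a count or two away from the selected one lose little.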