Algorithms and SQ Lower Bounds for PAC Learning One-Hidden-Layer ReLU Networks

72 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Ilias Diakonikolas

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Ilias Diakonikolas - Daniel M. Kane - Vasilis Kontonis

التعلم الآلي بنى وهياكل البيانات والخوارزميات نظرية الإحصاء

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We study the problem of PAC learning one-hidden-layer ReLU networks with $k$ hidden units on $mathbb{R}^d$ under Gaussian marginals in the presence of additive label noise. For the case of positive coefficients, we give the first polynomial-time algorithm for this learning problem for $k$ up to $tilde{O}(sqrt{log d})$. Previously, no polynomial time algorithm was known, even for $k=3$. This answers an open question posed by~cite{Kliv17}. Importantly, our algorithm does not require any assumptions about the rank of the weight matrix and its complexity is independent of its condition number. On the negative side, for the more general task of PAC learning one-hidden-layer ReLU networks with arbitrary real coefficients, we prove a Statistical Query lower bound of $d^{Omega(k)}$. Thus, we provide a separation between the two classes in terms of efficient learnability. Our upper and lower bounds are general, extending to broader families of activation functions.

قيم البحث

101 - Santosh Vempala , John Wilmes 2018

We study the complexity of training neural network models with one hidden nonlinear activation layer and an output weighted sum layer. We analyze Gradient Descent applied to learning a bounded target function on $n$ real-valued inputs. We give an agn ostic learning guarantee for GD: starting from a randomly initialized network, it converges in mean squared loss to the minimum error (in $2$-norm) of the best approximation of the target function using a polynomial of degree at most $k$. Moreover, for any $k$, the size of the network and number of iterations needed are both bounded by $n^{O(k)}log(1/epsilon)$. In particular, this applies to training networks of unbiased sigmoids and ReLUs. We also rigorously explain the empirical finding that gradient descent discovers lower frequency Fourier components before higher frequency components. We complement this result with nearly matching lower bounds in the Statistical Query model. GD fits well in the SQ framework since each training step is determined by an expectation over the input distribution. We show that any SQ algorithm that achieves significant improvement over a constant function with queries of tolerance some inverse polynomial in the input dimensionality $n$ must use $n^{Omega(k)}$ queries even when the target functions are restricted to a set of $n^{O(k)}$ degree-$k$ polynomials, and the input distribution is uniform over the unit sphere; for this class the information-theoretic lower bound is only $Theta(k log n)$. Our approach for both parts is based on spherical harmonics. We view gradient descent as an operator on the space of functions, and study its dynamics. An essential tool is the Funk-Hecke theorem, which explains the eigenfunctions of this operator in the case of the mean squared loss.

التعلم الآلي التعلم الالي

Near-Optimal SQ Lower Bounds for Agnostically Learning Halfspaces and ReLUs under Gaussian Marginals

76 - Ilias Diakonikolas , Daniel M. Kane , Nikos Zarifis 2020

We study the fundamental problems of agnostically learning halfspaces and ReLUs under Gaussian marginals. In the former problem, given labeled examples $(mathbf{x}, y)$ from an unknown distribution on $mathbb{R}^d times { pm 1}$, whose marginal distr ibution on $mathbf{x}$ is the standard Gaussian and the labels $y$ can be arbitrary, the goal is to output a hypothesis with 0-1 loss $mathrm{OPT}+epsilon$, where $mathrm{OPT}$ is the 0-1 loss of the best-fitting halfspace. In the latter problem, given labeled examples $(mathbf{x}, y)$ from an unknown distribution on $mathbb{R}^d times mathbb{R}$, whose marginal distribution on $mathbf{x}$ is the standard Gaussian and the labels $y$ can be arbitrary, the goal is to output a hypothesis with square loss $mathrm{OPT}+epsilon$, where $mathrm{OPT}$ is the square loss of the best-fitting ReLU. We prove Statistical Query (SQ) lower bounds of $d^{mathrm{poly}(1/epsilon)}$ for both of these problems. Our SQ lower bounds provide strong evidence that current upper bounds for these tasks are essentially best possible.

التعلم الآلي بنى وهياكل البيانات والخوارزميات نظرية الإحصاء

Minimax Lower Bounds for Transfer Learning with Linear and One-hidden Layer Neural Networks

114 - Seyed Mohammadreza Mousavi Kalan , Zalan Fabian , A. Salman Avestimehr 2020

Transfer learning has emerged as a powerful technique for improving the performance of machine learning models on new domains where labeled training data may be scarce. In this approach a model trained for a source task, where plenty of labeled train ing data is available, is used as a starting point for training a model on a related target task with only few labeled training data. Despite recent empirical success of transfer learning approaches, the benefits and fundamental limits of transfer learning are poorly understood. In this paper we develop a statistical minimax framework to characterize the fundamental limits of transfer learning in the context of regression with linear and one-hidden layer neural network models. Specifically, we derive a lower-bound for the target generalization error achievable by any algorithm as a function of the number of labeled source and target data as well as appropriate notions of similarity between the source and target tasks. Our lower bound provides new insights into the benefits and limitations of transfer learning. We further corroborate our theoretical finding with various experiments.

التعلم الآلي نظرية المعلومات نظرية المعلومات

Lower Bounds for Adversarially Robust PAC Learning

70 - Dimitrios I. Diochnos , Saeed Mahloujifar , Mohammad Mahmoody 2019

In this work, we initiate a formal study of probably approximately correct (PAC) learning under evasion attacks, where the adversarys goal is to emph{misclassify} the adversarially perturbed sample point $widetilde{x}$, i.e., $h(widetilde{x}) eq c(wi detilde{x})$, where $c$ is the ground truth concept and $h$ is the learned hypothesis. Previous work on PAC learning of adversarial examples have all modeled adversarial examples as corrupted inputs in which the goal of the adversary is to achieve $h(widetilde{x}) eq c(x)$, where $x$ is the original untampered instance. These two definitions of adversarial risk coincide for many natural distributions, such as images, but are incomparable in general. We first prove that for many theoretically natural input spaces of high dimension $n$ (e.g., isotropic Gaussian in dimension $n$ under $ell_2$ perturbations), if the adversary is allowed to apply up to a sublinear $o(||x||)$ amount of perturbations on the test instances, PAC learning requires sample complexity that is exponential in $n$. This is in contrast with results proved using the corrupted-input framework, in which the sample complexity of robust learning is only polynomially more. We then formalize hybrid attacks in which the evasion attack is preceded by a poisoning attack. This is perhaps reminiscent of trapdoor attacks in which a poisoning phase is involved as well, but the evasion phase here uses the error-region definition of risk that aims at misclassifying the perturbed instances. In this case, we show PAC learning is sometimes impossible all together, even when it is possible without the attack (e.g., due to the bounded VC dimension).

التعلم الآلي التعقيد الحسابي التشفير والأمن

Efficient Algorithms for Learning Depth-2 Neural Networks with General ReLU Activations

192 - Pranjal Awasthi , Alex Tang , Aravindan Vijayaraghavan 2021

We present polynomial time and sample efficient algorithms for learning an unknown depth-2 feedforward neural network with general ReLU activations, under mild non-degeneracy assumptions. In particular, we consider learning an unknown network of the form $f(x) = {a}^{mathsf{T}}sigma({W}^mathsf{T}x+b)$, where $x$ is drawn from the Gaussian distribution, and $sigma(t) := max(t,0)$ is the ReLU activation. Prior works for learning networks with ReLU activations assume that the bias $b$ is zero. In order to deal with the presence of the bias terms, our proposed algorithm consists of robustly decomposing multiple higher order tensors arising from the Hermite expansion of the function $f(x)$. Using these ideas we also establish identifiability of the network parameters under minimal assumptions.

التعلم الآلي بنى وهياكل البيانات والخوارزميات التعلم الالي