How does a 110-layer ResNet learn a high-complexity classifier using relatively few training examples and short training time? We present a theory towards explaining this in terms of Hierarchical Learning. By hierarchical learning, we mean that the learner learns to represent a complicated target function by decomposing it into a sequence of simpler functions, reducing sample and time complexity. We formally analyze how multi-layer neural networks can perform such hierarchical learning efficiently and automatically by applying SGD. On the conceptual side, we present, to the best of our knowledge, the FIRST theory result indicating how deep neural networks can still be sample and time efficient using SGD on certain hierarchical learning tasks, when NO KNOWN existing algorithm is efficient. We establish a new principle called backward feature correction, where training higher-level layers in the network can improve the features of lower-level ones. We believe this is the key to understanding the deep learning process in multi-layer neural networks. On the technical side, we show that for regression, and even for binary classification, for every input dimension $d>0$ there is a concept class of degree-$\omega(1)$ polynomials such that, using $\omega(1)$-layer neural networks as learners, SGD can learn any function from this class in $\mathsf{poly}(d)$ time and sample complexity to any $\frac{1}{\mathsf{poly}(d)}$ error, by learning to represent it as a composition of $\omega(1)$ layers of quadratic functions. In contrast, we do not know of any other simple algorithm (including layer-wise training or applying kernel methods sequentially) that can learn this concept class in $\mathsf{poly}(d)$ time, even to $d^{-0.01}$ error. As a side result, we prove $d^{\omega(1)}$ lower bounds for several non-hierarchical learners, including any kernel method and the neural tangent or neural compositional kernels.
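To make the setting concrete, the following is a minimal, self-contained sketch (assuming a PyTorch environment) of the kind of hierarchical learning task described above: the target is a composition of random quadratic maps, and the learner is a deep network with quadratic activations trained end-to-end by plain SGD, so that higher layers can keep refining the features produced by lower layers. The specific dimensions, widths, normalizations, and learning rate are illustrative assumptions, not the construction analyzed in the paper.

```python
# Illustrative sketch only: a hierarchically-composed target learned by a deep
# network with quadratic activations via plain SGD. All sizes are arbitrary choices.
import torch
import torch.nn as nn

torch.manual_seed(0)

d, L, n = 20, 4, 5000          # input dimension, depth of the composition, sample size

# Hierarchical target: G = q_L o ... o q_1, each q_i a random quadratic map.
weights = [torch.randn(d, d) / d for _ in range(L)]

def target(x):
    h = x
    for W in weights:
        h = (h @ W) ** 2                                  # coordinate-wise square of a random linear map
        h = h / (h.norm(dim=1, keepdim=True) + 1e-6)      # keep the scale under control
    return h.sum(dim=1, keepdim=True)

X = torch.randn(n, d)
y = target(X)

# Learner: a deep network with quadratic activations, trained end-to-end by SGD.
class Square(nn.Module):
    def forward(self, x):
        return x ** 2

width = 64
dims = [d] + [width] * L
layers = []
for i in range(L):
    layers += [nn.Linear(dims[i], dims[i + 1]), Square(), nn.LayerNorm(dims[i + 1])]
layers += [nn.Linear(width, 1)]
net = nn.Sequential(*layers)

opt = torch.optim.SGD(net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for step in range(2000):
    idx = torch.randint(0, n, (128,))
    loss = loss_fn(net(X[idx]), y[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(f"step {step:4d}  train MSE {loss.item():.4f}")
```

Training all layers jointly, rather than layer by layer, is what allows later layers to feed corrections back into the earlier features; freezing each layer once trained would correspond to the layer-wise baseline that the abstract contrasts against.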
Despite the empirical success of using Adversarial Training to defend deep learning models against adversarial perturbations, it still remains rather unclear what the principles are behind the existence of adversarial perturbations, and what
Deep neural networks (DNNs) have demonstrated dominant performance in many fields; since AlexNet, networks used in practice have been growing wider and deeper. On the theoretical side, a long line of work has focused on training neural networks with
Why and how deep learning works well on different tasks remains a mystery from a theoretical perspective. In this paper we draw a geometric picture of the deep learning system by finding its analogies with two existing geometric structures, the
We present a study using a class of post-hoc local explanation methods, i.e., feature importance methods, for understanding a deep learning (DL) emulator of climate. Specifically, we consider a multiple-input-single-output emulator that uses a DenseNet
We consider the problem of learning an unknown ReLU network with respect to Gaussian inputs and obtain the first nontrivial results for networks of depth more than two. We give an algorithm whose running time is a fixed polynomial in the ambient dimension