No Arabic abstract
Motivated by the resurgence of neural networks in being able to solve complex learning tasks we undertake a study of high depth networks using ReLU gates which implement the function $x mapsto max{0,x}$. We try to understand the role of depth in such neural networks by showing size lowerbounds against such network architectures in parameter regimes hitherto unexplored. In particular we show the following two main results about neural nets computing Boolean functions of input dimension $n$, 1. We use the method of random restrictions to show almost linear, $Omega(epsilon^{2(1-delta)}n^{1-delta})$, lower bound for completely weight unrestricted LTF-of-ReLU circuits to match the Andreev function on at least $frac{1}{2} +epsilon$ fraction of the inputs for $epsilon > sqrt{2frac{log^{frac {2}{2-delta}}(n)}{n}}$ for any $delta in (0,frac 1 2)$ 2. We use the method of sign-rank to show exponential in dimension lower bounds for ReLU circuits ending in a LTF gate and of depths upto $O(n^{xi})$ with $xi < frac{1}{8}$ with some restrictions on the weights in the bottom most layer. All other weights in these circuits are kept unrestricted. This in turns also implies the same lowerbounds for LTF circuits with the same architecture and the same weight restrictions on their bottom most layer. Along the way we also show that there exists a $mathbb{R}^ nrightarrow mathbb{R}$ Sum-of-ReLU-of-ReLU function which Sum-of-ReLU neural nets can never represent no matter how large they are allowed to be.
This paper considers the growth in the length of one-dimensional trajectories as they are passed through deep ReLU neural networks, which, among other things, is one measure of the expressivity of deep networks. We generalise existing results, providing an alternative, simpler method for lower bounding expected trajectory growth through random networks, for a more general class of weights distributions, including sparsely connected networks. We illustrate this approach by deriving bounds for sparse-Gaussian, sparse-uniform, and sparse-discrete-valued random nets. We prove that trajectory growth can remain exponential in depth with these new distributions, including their sparse variants, with the sparsity parameter appearing in the base of the exponent.
A recent line of work has analyzed the theoretical properties of deep neural networks via the Neural Tangent Kernel (NTK). In particular, the smallest eigenvalue of the NTK has been related to the memorization capacity, the global convergence of gradient descent algorithms and the generalization of deep nets. However, existing results either provide bounds in the two-layer setting or assume that the spectrum of the NTK matrices is bounded away from 0 for multi-layer networks. In this paper, we provide tight bounds on the smallest eigenvalue of NTK matrices for deep ReLU nets, both in the limiting case of infinite widths and for finite widths. In the finite-width setting, the network architectures we consider are fairly general: we require the existence of a wide layer with roughly order of $N$ neurons, $N$ being the number of data samples; and the scaling of the remaining layer widths is arbitrary (up to logarithmic factors). To obtain our results, we analyze various quantities of independent interest: we give lower bounds on the smallest singular value of hidden feature matrices, and upper bounds on the Lipschitz constant of input-output feature maps.
Let $mathcal{F}_{n}^*$ be the set of Boolean functions depending on all $n$ variables. We prove that for any $fin mathcal{F}_{n}^*$, $f|_{x_i=0}$ or $f|_{x_i=1}$ depends on the remaining $n-1$ variables, for some variable $x_i$. This existent result suggests a possible way to deal with general Boolean functions via its subfunctions of some restrictions. As an application, we consider the degree lower bound of representing polynomials over finite rings. Let $fin mathcal{F}_{n}^*$ and denote the exact representing degree over the ring $mathbb{Z}_m$ (with the integer $m>2$) as $d_m(f)$. Let $m=Pi_{i=1}^{r}p_i^{e_i}$, where $p_i$s are distinct primes, and $r$ and $e_i$s are positive integers. If $f$ is symmetric, then $mcdot d_{p_1^{e_1}}(f)... d_{p_r^{e_r}}(f) > n$. If $f$ is non-symmetric, by the second moment method we prove almost always $mcdot d_{p_1^{e_1}}(f)... d_{p_r^{e_r}}(f) > lg{n}-1$. In particular, as $m=pq$ where $p$ and $q$ are arbitrary distinct primes, we have $d_p(f)d_q(f)=Omega(n)$ for symmetric $f$ and $d_p(f)d_q(f)=Omega(lg{n}-1)$ almost always for non-symmetric $f$. Hence any $n$-variate symmetric Boolean function can have exact representing degree $o(sqrt{n})$ in at most one finite field, and for non-symmetric functions, with $o(sqrt{lg{n}})$-degree in at most one finite field.
We study ReLU deep neural networks (DNNs) by investigating their connections with the hierarchical basis method in finite element methods. First, we show that the approximation schemes of ReLU DNNs for $x^2$ and $xy$ are compositio
In this paper, we construct neural networks with ReLU, sine and $2^x$ as activation functions. For general continuous $f$ defined on $[0,1]^d$ with continuity modulus $omega_f(cdot)$, we construct ReLU-sine-$2^x$ networks that enjoy an approximation rate $mathcal{O}(omega_f(sqrt{d})cdot2^{-M}+omega_{f}left(frac{sqrt{d}}{N}right))$, where $M,Nin mathbb{N}^{+}$ denote the hyperparameters related to widths of the networks. As a consequence, we can construct ReLU-sine-$2^x$ network with the depth $5$ and width $maxleft{leftlceil2d^{3/2}left(frac{3mu}{epsilon}right)^{1/{alpha}}rightrceil,2leftlceillog_2frac{3mu d^{alpha/2}}{2epsilon}rightrceil+2right}$ that approximates $fin mathcal{H}_{mu}^{alpha}([0,1]^d)$ within a given tolerance $epsilon >0$ measured in $L^p$ norm $pin[1,infty)$, where $mathcal{H}_{mu}^{alpha}([0,1]^d)$ denotes the Holder continuous function class defined on $[0,1]^d$ with order $alpha in (0,1]$ and constant $mu > 0$. Therefore, the ReLU-sine-$2^x$ networks overcome the curse of dimensionality on $mathcal{H}_{mu}^{alpha}([0,1]^d)$. In addition to its supper expressive power, functions implemented by ReLU-sine-$2^x$ networks are (generalized) differentiable, enabling us to apply SGD to train.