ترغب بنشر مسار تعليمي؟ اضغط هنا

On the Power of Differentiable Learning versus PAC and SQ Learning

86   0   0.0 ( 0 )
 نشر من قبل Pritish Kamath
 تاريخ النشر 2021
والبحث باللغة English




اسأل ChatGPT حول البحث

We study the power of learning via mini-batch stochastic gradient descent (SGD) on the population loss, and batch Gradient Descent (GD) on the empirical loss, of a differentiable model or neural network, and ask what learning problems can be learnt using these paradigms. We show that SGD and GD can always simulate learning with statistical queries (SQ), but their ability to go beyond that depends on the precision $rho$ of the gradient calculations relative to the minibatch size $b$ (for SGD) and sample size $m$ (for GD). With fine enough precision relative to minibatch size, namely when $b rho$ is small enough, SGD can go beyond SQ learning and simulate any sample-based learning algorithm and thus its learning power is equivalent to that of PAC learning; this extends prior work that achieved this result for $b=1$. Similarly, with fine enough precision relative to the sample size $m$, GD can also simulate any sample-based learning algorithm based on $m$ samples. In particular, with polynomially many bits of precision (i.e. when $rho$ is exponentially small), SGD and GD can both simulate PAC learning regardless of the mini-batch size. On the other hand, when $b rho^2$ is large enough, the power of SGD is equivalent to that of SQ learning.



قيم البحث

اقرأ أيضاً

We study the problem of PAC learning one-hidden-layer ReLU networks with $k$ hidden units on $mathbb{R}^d$ under Gaussian marginals in the presence of additive label noise. For the case of positive coefficients, we give the first polynomial-time algo rithm for this learning problem for $k$ up to $tilde{O}(sqrt{log d})$. Previously, no polynomial time algorithm was known, even for $k=3$. This answers an open question posed by~cite{Kliv17}. Importantly, our algorithm does not require any assumptions about the rank of the weight matrix and its complexity is independent of its condition number. On the negative side, for the more general task of PAC learning one-hidden-layer ReLU networks with arbitrary real coefficients, we prove a Statistical Query lower bound of $d^{Omega(k)}$. Thus, we provide a separation between the two classes in terms of efficient learnability. Our upper and lower bounds are general, extending to broader families of activation functions.
The study of strategic or adversarial manipulation of testing data to fool a classifier has attracted much recent attention. Most previous works have focused on two extreme situations where any testing data point either is completely adversarial or a lways equally prefers the positive label. In this paper, we generalize both of these through a unified framework for strategic classification, and introduce the notion of strategic VC-dimension (SVC) to capture the PAC-learnability in our general strategic setup. SVC provably generalizes the recent concept of adversarial VC-dimension (AVC) introduced by Cullina et al. arXiv:1806.01471. We instantiate our framework for the fundamental strategic linear classification problem. We fully characterize: (1) the statistical learnability of linear classifiers by pinning down its SVC; (2) its computational tractability by pinning down the complexity of the empirical risk minimization problem. Interestingly, the SVC of linear classifiers is always upper bounded by its standard VC-dimension. This characterization also strictly generalizes the AVC bound for linear classifiers in arXiv:1806.01471.
Bayesian structure learning allows inferring Bayesian network structure from data while reasoning about the epistemic uncertainty -- a key element towards enabling active causal discovery and designing interventions in real world systems. In this wor k, we propose a general, fully differentiable framework for Bayesian structure learning (DiBS) that operates in the continuous space of a latent probabilistic graph representation. Contrary to existing work, DiBS is agnostic to the form of the local conditional distributions and allows for joint posterior inference of both the graph structure and the conditional distribution parameters. This makes DiBS directly applicable to posterior inference of nonstandard Bayesian network models, e.g., with nonlinear dependencies encoded by neural networks. Building on recent advances in variational inference, we use DiBS to devise an efficient general purpose method for approximating posteriors over structural models. In evaluations on simulated and real-world data, our method significantly outperforms related approaches to joint posterior inference.
We introduce a new and rigorously-formulated PAC-Bayes few-shot meta-learning algorithm that implicitly learns a prior distribution of the model of interest. Our proposed method extends the PAC-Bayes framework from a single task setting to the few-sh ot learning setting to upper-bound generalisation errors on unseen tasks and samples. We also propose a generative-based approach to model the shared prior and the posterior of task-specific model parameters more expressively compared to the usual diagonal Gaussian assumption. We show that the models trained with our proposed meta-learning algorithm are well calibrated and accurate, with state-of-the-art calibration and classification results on few-shot classification (mini-ImageNet and tiered-ImageNet) and regression (multi-modal task-distribution regression) benchmarks.
PAC-learning usually aims to compute a small subset ($varepsilon$-sample/net) from $n$ items, that provably approximates a given loss function for every query (model, classifier, hypothesis) from a given set of queries, up to an additive error $varep silonin(0,1)$. Coresets generalize this idea to support multiplicative error $1pmvarepsilon$. Inspired by smoothed analysis, we suggest a natural generalization: approximate the emph{average} (instead of the worst-case) error over the queries, in the hope of getting smaller subsets. The dependency between errors of different queries implies that we may no longer apply the Chernoff-Hoeffding inequality for a fixed query, and then use the VC-dimension or union bound. This paper provides deterministic and randomized algorithms for computing such coresets and $varepsilon$-samples of size independent of $n$, for any finite set of queries and loss function. Example applications include new and improved coreset constructions for e.g. streaming vector summarization [ICML17] and $k$-PCA [NIPS16]. Experimental results with open source code are provided.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا