ترغب بنشر مسار تعليمي؟ اضغط هنا

Faster PAC Learning and Smaller Coresets via Smoothed Analysis

134   0   0.0 ( 0 )
 نشر من قبل Alaa Maalouf
 تاريخ النشر 2020
والبحث باللغة English




اسأل ChatGPT حول البحث

PAC-learning usually aims to compute a small subset ($varepsilon$-sample/net) from $n$ items, that provably approximates a given loss function for every query (model, classifier, hypothesis) from a given set of queries, up to an additive error $varepsilonin(0,1)$. Coresets generalize this idea to support multiplicative error $1pmvarepsilon$. Inspired by smoothed analysis, we suggest a natural generalization: approximate the emph{average} (instead of the worst-case) error over the queries, in the hope of getting smaller subsets. The dependency between errors of different queries implies that we may no longer apply the Chernoff-Hoeffding inequality for a fixed query, and then use the VC-dimension or union bound. This paper provides deterministic and randomized algorithms for computing such coresets and $varepsilon$-samples of size independent of $n$, for any finite set of queries and loss function. Example applications include new and improved coreset constructions for e.g. streaming vector summarization [ICML17] and $k$-PCA [NIPS16]. Experimental results with open source code are provided.

قيم البحث

اقرأ أيضاً

The study of strategic or adversarial manipulation of testing data to fool a classifier has attracted much recent attention. Most previous works have focused on two extreme situations where any testing data point either is completely adversarial or a lways equally prefers the positive label. In this paper, we generalize both of these through a unified framework for strategic classification, and introduce the notion of strategic VC-dimension (SVC) to capture the PAC-learnability in our general strategic setup. SVC provably generalizes the recent concept of adversarial VC-dimension (AVC) introduced by Cullina et al. arXiv:1806.01471. We instantiate our framework for the fundamental strategic linear classification problem. We fully characterize: (1) the statistical learnability of linear classifiers by pinning down its SVC; (2) its computational tractability by pinning down the complexity of the empirical risk minimization problem. Interestingly, the SVC of linear classifiers is always upper bounded by its standard VC-dimension. This characterization also strictly generalizes the AVC bound for linear classifiers in arXiv:1806.01471.
We consider the problem of PAC-learning decision trees, i.e., learning a decision tree over the n-dimensional hypercube from independent random labeled examples. Despite significant effort, no polynomial-time algorithm is known for learning polynomia l-sized decision trees (even trees of any super-constant size), even when examples are assumed to be drawn from the uniform distribution on {0,1}^n. We give an algorithm that learns arbitrary polynomial-sized decision trees for {em most product distributions}. In particular, consider a random product distribution where the bias of each bit is chosen independently and uniformly from, say, [.49,.51]. Then with high probability over the parameters of the product distribution and the random examples drawn from it, the algorithm will learn any tree. More generally, in the spirit of smoothed analysis, we consider an arbitrary product distribution whose parameters are specified only up to a [-c,c] accuracy (perturbation), for an arbitrarily small positive constant c.
The theory of spectral filtering is a remarkable tool to understand the statistical properties of learning with kernels. For least squares, it allows to derive various regularization schemes that yield faster convergence rates of the excess risk than with Tikhonov regularization. This is typically achieved by leveraging classical assumptions called source and capacity conditions, which characterize the difficulty of the learning task. In order to understand estimators derived from other loss functions, Marteau-Ferey et al. have extended the theory of Tikhonov regularization to generalized self concordant loss functions (GSC), which contain, e.g., the logistic loss. In this paper, we go a step further and show that fast and optimal rates can be achieved for GSC by using the iterated Tikhonov regularization scheme, which is intrinsically related to the proximal point method in optimization, and overcomes the limitation of the classical Tikhonov regularization.
We study the power of learning via mini-batch stochastic gradient descent (SGD) on the population loss, and batch Gradient Descent (GD) on the empirical loss, of a differentiable model or neural network, and ask what learning problems can be learnt u sing these paradigms. We show that SGD and GD can always simulate learning with statistical queries (SQ), but their ability to go beyond that depends on the precision $rho$ of the gradient calculations relative to the minibatch size $b$ (for SGD) and sample size $m$ (for GD). With fine enough precision relative to minibatch size, namely when $b rho$ is small enough, SGD can go beyond SQ learning and simulate any sample-based learning algorithm and thus its learning power is equivalent to that of PAC learning; this extends prior work that achieved this result for $b=1$. Similarly, with fine enough precision relative to the sample size $m$, GD can also simulate any sample-based learning algorithm based on $m$ samples. In particular, with polynomially many bits of precision (i.e. when $rho$ is exponentially small), SGD and GD can both simulate PAC learning regardless of the mini-batch size. On the other hand, when $b rho^2$ is large enough, the power of SGD is equivalent to that of SQ learning.
We introduce a new and rigorously-formulated PAC-Bayes few-shot meta-learning algorithm that implicitly learns a prior distribution of the model of interest. Our proposed method extends the PAC-Bayes framework from a single task setting to the few-sh ot learning setting to upper-bound generalisation errors on unseen tasks and samples. We also propose a generative-based approach to model the shared prior and the posterior of task-specific model parameters more expressively compared to the usual diagonal Gaussian assumption. We show that the models trained with our proposed meta-learning algorithm are well calibrated and accurate, with state-of-the-art calibration and classification results on few-shot classification (mini-ImageNet and tiered-ImageNet) and regression (multi-modal task-distribution regression) benchmarks.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا