ترغب بنشر مسار تعليمي؟ اضغط هنا

ID3 Learns Juntas for Smoothed Product Distributions

183   0   0.0 ( 0 )
 نشر من قبل Eran Malach
 تاريخ النشر 2019
والبحث باللغة English




اسأل ChatGPT حول البحث

In recent years, there are many attempts to understand popular heuristics. An example of such a heuristic algorithm is the ID3 algorithm for learning decision trees. This algorithm is commonly used in practice, but there are very few theoretical works studying its behavior. In this paper, we analyze the ID3 algorithm, when the target function is a $k$-Junta, a function that depends on $k$ out of $n$ variables of the input. We prove that when $k = log n$, the ID3 algorithm learns in polynomial time $k$-Juntas, in the smoothed analysis model of Kalai & Teng. That is, we show a learnability result when the observed distribution is a noisy variant of the original distribution.

قيم البحث

اقرأ أيضاً

We consider the problem of PAC-learning decision trees, i.e., learning a decision tree over the n-dimensional hypercube from independent random labeled examples. Despite significant effort, no polynomial-time algorithm is known for learning polynomia l-sized decision trees (even trees of any super-constant size), even when examples are assumed to be drawn from the uniform distribution on {0,1}^n. We give an algorithm that learns arbitrary polynomial-sized decision trees for {em most product distributions}. In particular, consider a random product distribution where the bias of each bit is chosen independently and uniformly from, say, [.49,.51]. Then with high probability over the parameters of the product distribution and the random examples drawn from it, the algorithm will learn any tree. More generally, in the spirit of smoothed analysis, we consider an arbitrary product distribution whose parameters are specified only up to a [-c,c] accuracy (perturbation), for an arbitrarily small positive constant c.
Since its inception in the 1980s, ID3 has become one of the most successful and widely used algorithms for learning decision trees. However, its theoretical properties remain poorly understood. In this work, we introduce a novel metric of a decision tree algorithms performance, called mean iteration statistical consistency (MIC), which measures optimality of trees generated by ID3. As opposed to previous metrics, MIC can differentiate between different decision tree algorithms and compare their performance. We provide theoretical and empirical evidence that the TopDown variant of ID3, introduced by Kearns and Mansour (1996), has near-optimal MIC in various settings for learning read-once DNFs under product distributions. In contrast, another widely used variant of ID3 has MIC which is not near-optimal. We show that the MIC analysis predicts well the performance of these algorithms in practice. Our results present a novel view of decision tree algorithms which may lead to better and more practical guarantees for these algorithms.
We propose $tt RandUCB$, a bandit strategy that builds on theoretically derived confidence intervals similar to upper confidence bound (UCB) algorithms, but akin to Thompson sampling (TS), it uses randomization to trade off exploration and exploitati on. In the $K$-armed bandit setting, we show that there are infinitely many variants of $tt RandUCB$, all of which achieve the minimax-optimal $widetilde{O}(sqrt{K T})$ regret after $T$ rounds. Moreover, for a specific multi-armed bandit setting, we show that both UCB and TS can be recovered as special cases of $tt RandUCB$. For structured bandits, where each arm is associated with a $d$-dimensional feature vector and rewards are distributed according to a linear or generalized linear model, we prove that $tt RandUCB$ achieves the minimax-optimal $widetilde{O}(d sqrt{T})$ regret even in the case of infinitely many arms. Through experiments in both the multi-armed and structured bandit settings, we demonstrate that $tt RandUCB$ matches or outperforms TS and other randomized exploration strategies. Our theoretical and empirical results together imply that $tt RandUCB$ achieves the best of both worlds.
The Gaussian-smoothed optimal transport (GOT) framework, recently proposed by Goldfeld et al., scales to high dimensions in estimation and provides an alternative to entropy regularization. This paper provides convergence guarantees for estimating th e GOT distance under more general settings. For the Gaussian-smoothed $p$-Wasserstein distance in $d$ dimensions, our results require only the existence of a moment greater than $d + 2p$. For the special case of sub-gamma distributions, we quantify the dependence on the dimension $d$ and establish a phase transition with respect to the scale parameter. We also prove convergence for dependent samples, only requiring a condition on the pairwise dependence of the samples measured by the covariance of the feature map of a kernel space. A key step in our analysis is to show that the GOT distance is dominated by a family of kernel maximum mean discrepancy (MMD) distances with a kernel that depends on the cost function as well as the amount of Gaussian smoothing. This insight provides further interpretability for the GOT framework and also introduces a class of kernel MMD distances with desirable properties. The theoretical results are supported by numerical experiments.
A recent technique of randomized smoothing has shown that the worst-case (adversarial) $ell_2$-robustness can be transformed into the average-case Gaussian-robustness by smoothing a classifier, i.e., by considering the averaged prediction over Gaussi an noise. In this paradigm, one should rethink the notion of adversarial robustness in terms of generalization ability of a classifier under noisy observations. We found that the trade-off between accuracy and certified robustness of smoothed classifiers can be greatly controlled by simply regularizing the prediction consistency over noise. This relationship allows us to design a robust training objective without approximating a non-existing smoothed classifier, e.g., via soft smoothing. Our experiments under various deep neural network architectures and datasets show that the certified $ell_2$-robustness can be dramatically improved with the proposed regularization, even achieving better or comparable results to the state-of-the-art approaches with significantly less training costs and hyperparameters.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا