When are Deep Networks really better than Random Forests at small sample sizes?

121 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Haoyin Xu

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Haoyin Xu - Michael Ainsworth - Yu-Chung Peng

التعلم الآلي الذكاء الاصطناعي

قم بزيارة صفحتنا على فيسبوك

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Random forests (RF) and deep networks (DN) are two of the most popular machine learning methods in the current scientific literature and yield differing levels of performance on different data modalities. We wish to further explore and establish the conditions and domains in which each approach excels, particularly in the context of sample size and feature dimension. To address these issues, we tested the performance of these approaches across tabular, image, and audio settings using varying model parameters and architectures. Our focus is on datasets with at most 10,000 samples, which represent a large fraction of scientific and biomedical datasets. In general, we found RF to excel at tabular and structured data (image and audio) with small sample sizes, whereas DN performed better on structured data with larger sample sizes. Although we plan to continue updating this technical report in the coming months, we believe the current preliminary results may be of interest to others.

قيم البحث

68 - Chulhee Yun , Suvrit Sra , Ali Jadbabaie 2019

Recent results in the literature indicate that a residual network (ResNet) composed of a single residual block outperforms linear predictors, in the sense that all local minima in its optimization landscape are at least as good as the best linear pre dictor. However, these results are limited to a single residual block (i.e., shallow ResNets), instead of the deep ResNets composed of multiple residual blocks. We take a step towards extending this result to deep ResNets. We start by two motivating examples. First, we show that there exist datasets for which all local minima of a fully-connected ReLU network are no better than the best linear predictor, whereas a ResNet has strictly better local minima. Second, we show that even at the global minimum, the representation obtained from the residual block outputs of a 2-block ResNet do not necessarily improve monotonically over subsequent blocks, which highlights a fundamental difficulty in analyzing deep ResNets. Our main theorem on deep ResNets shows under simple geometric conditions that, any critical point in the optimization landscape is either (i) at least as good as the best linear predictor; or (ii) the Hessian at this critical point has a strictly negative eigenvalue. Notably, our theorem shows that a chain of multiple skip-connections can improve the optimization landscape, whereas existing results study direct skip-connections to the last hidden layer or output layer. Finally, we complement our results by showing benign properties of the near-identity regions of deep ResNets, showing depth-independent upper bounds for the risk attained at critical points as well as the Rademacher complexity.

التعلم الآلي التحسين والتحكم التعلم الالي

EigenGame Unloaded: When playing games is better than optimizing

187 - Ian Gemp , Brian McWilliams , Claire Vernade 2021

We build on the recently proposed EigenGame that views eigendecomposition as a competitive game. EigenGames updates are biased if computed using minibatches of data, which hinders convergence and more sophisticated parallelism in the stochastic setti ng. In this work, we propose an unbiased stochastic update that is asymptotically equivalent to EigenGame, enjoys greater parallelism allowing computation on datasets of larger sample sizes, and outperforms EigenGame in experiments. We present applications to finding the principal components of massive datasets and performing spectral clustering of graphs. We analyze and discuss our proposed update in the context of EigenGame and the shift in perspective from optimization to games.

التعلم الالي الذكاء الاصطناعي التعلم الآلي

Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning

92 - Pan Zhou , Jiashi Feng , Chao Ma 2020

It is not clear yet why ADAM-alike adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed. This work aims to provide understandings on this generalization gap by analyzing their local co nvergence behaviors. Specifically, we observe the heavy tails of gradient noise in these algorithms. This motivates us to analyze these algorithms through their Levy-driven stochastic differential equations (SDEs) because of the similar convergence behaviors of an algorithm and its SDE. Then we establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both SGD and ADAM~depends on the Radon measure of the basin positively and the heaviness of gradient noise negatively; (2) for the same basin, SGD enjoys smaller escaping time than ADAM, mainly because (a) the geometry adaptation in ADAM~via adaptively scaling each gradient coordinate well diminishes the anisotropic structure in gradient noise and results in larger Radon measure of a basin; (b) the exponential gradient average in ADAM~smooths its gradient and leads to lighter gradient noise tails than SGD. So SGD is more locally unstable than ADAM~at sharp minima defined as the minima whose local basins have small Radon measure, and can better escape from them to flatter ones with larger Radon measure. As flat minima here which often refer to the minima at flat or asymmetric basins/valleys often generalize better than sharp ones~cite{keskar2016large,he2019asymmetric}, our result explains the better generalization performance of SGD over ADAM. Finally, experimental results confirm our heavy-tailed gradient noise assumption and theoretical affirmation.

التعلم الآلي الذكاء الاصطناعي التحسين والتحكم

Why M Heads are Better than One: Training a Diverse Ensemble of Deep Networks

73 - Stefan Lee , Senthil Purushwalkam , Michael Cogswell 2015

Convolutional Neural Networks have achieved state-of-the-art performance on a wide range of tasks. Most benchmarks are led by ensembles of these powerful learners, but ensembling is typically treated as a post-hoc procedure implemented by averaging i ndependently trained models with model variation induced by bagging or random initialization. In this paper, we rigorously treat ensembling as a first-class problem to explicitly address the question: what are the best strategies to create an ensemble? We first compare a large number of ensembling strategies, and then propose and evaluate novel strategies, such as parameter sharing (through a new family of models we call TreeNets) as well as training under ensemble-aware and diversity-encouraging losses. We demonstrate that TreeNets can improve ensemble performance and that diverse ensembles can be trained end-to-end under a unified loss, achieving significantly higher oracle accuracies than classical ensembles.

الرؤية الحاسوبية وتمييز الأنماط التعلم الآلي الحوسبة العصبية والتطورية

Manifold Oblique Random Forests: Towards Closing the Gap on Convolutional Deep Networks

183 - Ronan Perry , Adam Li , Chester Huynh 2019

Decision forests (Forests), in particular random forests and gradient boosting trees, have demonstrated state-of-the-art accuracy compared to other methods in many supervised learning scenarios. In particular, Forests dominate other methods in tabula r data, that is, when the feature space is unstructured, so that the signal is invariant to a permutation of the feature indices. However, in structured data lying on a manifold (such as images, text, and speech) deep networks (Networks), specifically convolutional deep networks (ConvNets), tend to outperform Forests. We conjecture that at least part of the reason for this is that the input to Networks is not simply the feature magnitudes, but also their indices. In contrast, naive Forest implementations fail to explicitly consider feature indices. A recently proposed Forest approach demonstrates that Forests, for each node, implicitly sample a random matrix from some specific distribution. These Forests, like some classes of Networks, learn by partitioning the feature space into convex polytopes corresponding to linear functions. We build on that approach and show that one can choose distributions in a manifold-aware fashion to incorporate feature locality. We demonstrate the empirical performance on data whose features live on three different manifolds: a torus, images, and time-series. Moreover, we demonstrate its strength in multivariate simulated settings and also show superiority in predicting surgical outcome in epilepsy patients and predicting movement direction from raw stereotactic EEG data from non-motor brain regions. In all simulations and real data, Manifold Oblique Random Forest (MORF) algorithm outperforms approaches that ignore feature space structure and challenges the performance of ConvNets. Moreover, MORF runs fast and maintains interpretability and theoretical justification.

التعلم الآلي التعلم الالي