Breiman's classic paper casts data analysis as a choice between two cultures: data modelers and algorithmic modelers. Stated broadly, data modelers use simple, interpretable models with well-understood theoretical properties to analyze data, whereas algorithmic modelers prioritize predictive accuracy and use more flexible function approximations. This dichotomy overlooks a third set of models: mechanistic models derived from scientific theories (e.g., ODE/SDE simulators), which encode application-specific scientific knowledge about the data. While these categories represent extreme points in model space, modern computational and algorithmic tools enable us to interpolate between them, producing flexible, interpretable, and scientifically informed hybrids that can enjoy accurate and robust predictions and resolve issues with data analysis that Breiman describes, such as the Rashomon effect and Occam's dilemma. Challenges remain in finding an appropriate point in model space, with many choices about how to compose model components and about the degree to which each component informs inferences.
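To make the idea of interpolating between mechanistic and algorithmic models concrete, here is a minimal sketch, assuming a toy exponential-decay ODE as the mechanistic component and a ridge-regressed polynomial correction as the flexible component; the decay model, the basis, and all names are illustrative assumptions, not anything taken from the paper itself.

```python
# Minimal hybrid-model sketch: mechanistic backbone + data-driven residual.
# Everything here (the decay ODE, the polynomial basis, the penalty) is an
# illustrative assumption chosen to keep the example self-contained.
import numpy as np

rng = np.random.default_rng(0)

def mechanistic_model(t, x0=1.0, k=0.8):
    """Scientific prior: exponential decay dx/dt = -k x, solved analytically."""
    return x0 * np.exp(-k * t)

# Synthetic "observations": the true system deviates from the simple theory.
t = np.linspace(0.0, 5.0, 60)
y = mechanistic_model(t) + 0.15 * np.sin(3.0 * t) + 0.02 * rng.standard_normal(t.size)

# Flexible component: fit the residual (data minus theory) with ridge regression
# on polynomial features -- a stand-in for any algorithmic learner.
residual = y - mechanistic_model(t)
Phi = np.vander(t, N=6, increasing=True)          # polynomial basis
lam = 1e-2                                        # ridge penalty
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ residual)

def hybrid_model(t_new):
    """Mechanistic backbone plus learned correction."""
    return mechanistic_model(t_new) + np.vander(t_new, N=6, increasing=True) @ w

print("mechanistic RMSE:", np.sqrt(np.mean((y - mechanistic_model(t)) ** 2)))
print("hybrid RMSE:     ", np.sqrt(np.mean((y - hybrid_model(t)) ** 2)))
```

The division of labor here is the point: the mechanistic part carries the scientific knowledge and interpretability, while the flexible part absorbs whatever the theory misses, and the penalty controls how far the hybrid moves toward the purely algorithmic end of model space.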
Breiman challenged statisticians to think more broadly and to step into the unknown world of model-free learning, with him paving the way forward. The statistics community responded with slight optimism, some skepticism, and plenty of disbelief. Today, we are
Twenty years ago, Breiman (2001) called our attention to a significant cultural division in modeling and data analysis between stochastic data models and algorithmic models. Out of his deep concern that the statistical community was so deeply
We propose a two-sample testing procedure based on learned deep neural network representations. To this end, we define two test statistics that perform an asymptotic location test on data samples mapped onto a hidden layer. The tests are consistent a
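As a rough illustration of this kind of procedure, the sketch below maps both samples through a fixed nonlinear feature map, standing in for a pretrained network's hidden layer, and then performs a location test on the mapped samples; using a random feature map and a permutation null instead of the paper's asymptotic test are simplifying assumptions.

```python
# Sketch of a two-sample location test in a learned representation space.
# A fixed random tanh map stands in for a pretrained hidden layer, and a
# permutation test replaces the asymptotic null -- both are assumptions.
import numpy as np

rng = np.random.default_rng(1)

def hidden_layer(x, W, b):
    """Map raw samples onto a (here: random, fixed) hidden representation."""
    return np.tanh(x @ W + b)

def mean_diff_statistic(hx, hy):
    """Squared distance between mean representations (a location statistic)."""
    return float(np.sum((hx.mean(axis=0) - hy.mean(axis=0)) ** 2))

def representation_two_sample_test(x, y, n_features=128, n_perm=500):
    d = x.shape[1]
    W = rng.standard_normal((d, n_features)) / np.sqrt(d)
    b = rng.standard_normal(n_features)
    hx, hy = hidden_layer(x, W, b), hidden_layer(y, W, b)

    observed = mean_diff_statistic(hx, hy)
    pooled = np.vstack([hx, hy])
    n = hx.shape[0]
    null_stats = []
    for _ in range(n_perm):
        idx = rng.permutation(pooled.shape[0])
        null_stats.append(mean_diff_statistic(pooled[idx[:n]], pooled[idx[n:]]))
    p_value = (1 + np.sum(np.array(null_stats) >= observed)) / (1 + n_perm)
    return observed, p_value

# Toy usage: P and Q share their mean but differ in spread of one coordinate,
# so the difference only becomes a location difference after the feature map.
x = rng.standard_normal((200, 5))
y = rng.standard_normal((200, 5))
y[:, 0] *= 1.5
print(representation_two_sample_test(x, y))
```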
We propose a class of kernel-based two-sample tests, which aim to determine whether two sets of samples are drawn from the same distribution. Our tests are constructed from kernels parameterized by deep neural nets, trained to maximize test power. Th
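The sketch below illustrates the "choose the kernel to maximize test power" idea in its simplest form, with a single Gaussian bandwidth standing in for a deep kernel's parameters; the crude power proxy, the bandwidth grid, and all function names are illustrative assumptions rather than the paper's actual training objective.

```python
# Sketch of power-directed kernel selection for an MMD two-sample test:
# tune a Gaussian bandwidth on a training split by maximizing a rough
# signal-to-noise proxy, then run a permutation test on the held-out split.
import numpy as np

rng = np.random.default_rng(2)

def gaussian_kernel(a, b, bandwidth):
    d2 = np.sum(a ** 2, 1)[:, None] + np.sum(b ** 2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-d2 / (2 * bandwidth ** 2))

def mmd2_biased(x, y, bandwidth):
    """Biased (V-statistic) estimate of squared MMD; simple and non-negative."""
    kxx = gaussian_kernel(x, x, bandwidth).mean()
    kyy = gaussian_kernel(y, y, bandwidth).mean()
    kxy = gaussian_kernel(x, y, bandwidth).mean()
    return kxx + kyy - 2 * kxy

def power_proxy(x, y, bandwidth, n_boot=50):
    """Crude stand-in for test power: mean MMD^2 over resamples / its std."""
    stats = []
    for _ in range(n_boot):
        ix = rng.integers(0, len(x), len(x))
        iy = rng.integers(0, len(y), len(y))
        stats.append(mmd2_biased(x[ix], y[iy], bandwidth))
    stats = np.array(stats)
    return stats.mean() / (stats.std() + 1e-12)

def select_bandwidth(x_tr, y_tr, grid):
    return max(grid, key=lambda bw: power_proxy(x_tr, y_tr, bw))

def permutation_test(x_te, y_te, bandwidth, n_perm=200):
    observed = mmd2_biased(x_te, y_te, bandwidth)
    pooled = np.vstack([x_te, y_te])
    n = len(x_te)
    null = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        null.append(mmd2_biased(pooled[idx[:n]], pooled[idx[n:]], bandwidth))
    p_value = (1 + np.sum(np.array(null) >= observed)) / (1 + n_perm)
    return observed, p_value

# Toy usage: same mean, different spread; tune on one half, test on the other.
x = rng.standard_normal((200, 3))
y = rng.standard_normal((200, 3)) * 1.4
bw = select_bandwidth(x[:100], y[:100], grid=[0.25, 0.5, 1.0, 2.0, 4.0])
print("selected bandwidth:", bw, "->", permutation_test(x[100:], y[100:], bw))
```

Splitting the data before selecting the kernel matters: tuning and testing on the same samples would invalidate the permutation null, which is why the selection step above only sees the first half of each sample.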
Large datasets have been crucial to the success of deep learning models in recent years, which keep performing better as they are trained with more labelled data. While there have been sustained efforts to make these models more data-efficient, t