Obtaining accurate estimates of machine learning model uncertainties on newly predicted data is essential for understanding the accuracy of the model and whether its predictions can be trusted. A common approach to such uncertainty quantification is to estimate the variance from an ensemble of models, which are often generated by the generally applicable bootstrap method. In this work, we demonstrate that the direct bootstrap ensemble standard deviation is not an accurate estimate of uncertainty and propose a calibration method that dramatically improves its accuracy. We then demonstrate the effectiveness of this calibration method on both synthetic data and physical datasets from the field of Materials Science and Engineering. The approach is motivated by applications in physical and biological science but is quite general and should be applicable for uncertainty quantification in a wide range of machine learning regression models.
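The bootstrap-ensemble idea described above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the calibration method developed in the paper: the cubic-polynomial model, the noise level, and the crude rescaling step at the end are all assumptions made for the example, and a real calibration would be fit on held-out validation data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression problem (illustrative only).
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Bootstrap ensemble: refit a cubic polynomial on resampled data.
n_models = 50
preds = np.empty((n_models, x.size))
for i in range(n_models):
    idx = rng.integers(0, x.size, size=x.size)   # resample with replacement
    coef = np.polyfit(x[idx], y[idx], deg=3)
    preds[i] = np.polyval(coef, x)

mean_pred = preds.mean(axis=0)
raw_std = preds.std(axis=0)          # direct ensemble standard deviation

# Crude rescaling as a stand-in for a real calibration (which would use
# held-out data): scale the std so standardized residuals have unit variance.
scale = np.std((y - mean_pred) / raw_std)
calibrated_std = scale * raw_std
```

The point of the sketch is that `raw_std` reflects only model-refit variability across resamples, which is why a downstream recalibration against observed residuals is needed at all.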
Reliable models of the thermodynamic properties of materials are critical for industrially relevant applications that require a good understanding of equilibrium phase diagrams, thermal and chemical transport, and microstructure evolution. The goal of thermodynamic models is to capture data from both experimental and computational studies and then make reliable predictions when extrapolating to new regions of parameter space. These predictions will be affected by artifacts present in real data sets such as outliers, systematic errors, and unreliable or missing uncertainty bounds. Such issues increase the probability of the thermodynamic model producing erroneous predictions. We present a Bayesian framework for the selection and calibration of thermodynamic property models and the quantification of their uncertainty. The modular framework addresses numerous concerns regarding thermodynamic models, including thermodynamic consistency and robustness to outliers and systematic errors, through hyperparameter weightings and robust choices of likelihood and prior distributions. Furthermore, the framework's inherent transparency (e.g., our choice of probability functions and associated parameters) enables insights into the complex process of thermodynamic assessment. We introduce these concepts through examples where the true property model is known. In addition, we demonstrate the utility of the framework by creating a property model from a large set of experimental specific heat and enthalpy measurements of hafnium metal from 0 to 4900 K.
Mass cytometry technology enables the simultaneous measurement of over 40 proteins on single cells. This has helped immunologists increase their understanding of the heterogeneity, complexity, and lineage relationships of white blood cells. Current statistical methods often collapse the rich single-cell data into summary statistics before proceeding with downstream analysis, discarding the information contained in these multivariate datasets. In this article, our aim is to exhibit the use of statistical analyses on the raw, uncompressed data, thus improving replicability and exposing multivariate patterns and their associated uncertainty profiles. We show that multivariate generative models are a valid alternative to univariate hypothesis testing. We propose two models: a multivariate Poisson log-normal mixed model and a logistic linear mixed model. We show that these models are complementary and that either model can account for different confounders. We use Hamiltonian Monte Carlo to provide Bayesian uncertainty quantification. Our models applied to a recent pregnancy study successfully reproduce key findings while quantifying increased overall protein-to-protein correlations between the first and third trimesters.
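To make the first model class concrete, here is a minimal generative sketch of a Poisson log-normal model: counts are Poisson-distributed with log-normally distributed rates, which induces both over-dispersion and protein-to-protein correlation through the latent covariance. All dimensions, means, and covariances below are invented for illustration; the authors' mixed model additionally includes fixed and random effects, and inference proceeds by Hamiltonian Monte Carlo rather than by forward simulation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative Poisson log-normal generative model (made-up parameters):
# each cell's protein counts are Poisson with a rate whose log is drawn
# from a correlated multivariate normal, so counts are over-dispersed and
# correlated across proteins via Sigma.
n_cells, n_proteins = 500, 4
mu = np.array([1.0, 0.5, 1.5, 0.8])              # log-scale mean expression
A = 0.2 * rng.normal(size=(n_proteins, n_proteins))
Sigma = A @ A.T + 0.05 * np.eye(n_proteins)      # latent covariance (PSD)

latent = rng.multivariate_normal(mu, Sigma, size=n_cells)
counts = rng.poisson(np.exp(latent))             # observed count matrix
```

Fitting such a model backwards (recovering `mu` and `Sigma` from `counts`) is what requires the Bayesian machinery described in the abstract, since the latent rates are never observed.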
Statistical inference in high dimensional settings has recently attracted enormous attention within the literature. However, most published work focuses on the parametric linear regression problem. This paper considers an important extension of this problem: statistical inference for high dimensional sparse nonparametric additive models. To be more precise, this paper develops a methodology for constructing a probability density function on the set of all candidate models. This methodology can also be applied to construct confidence intervals for various quantities of interest (such as the noise variance) and confidence bands for the additive functions. The methodology is derived using a generalized fiducial inference framework. It is shown that results produced by the proposed methodology enjoy correct asymptotic frequentist properties. Empirical results obtained from numerical experimentation verify this theoretical claim. Lastly, the methodology is applied to a gene expression data set and yields new findings that most existing methods based on parametric linear modeling failed to observe.
Due to their accuracy, methods based on ensembles of regression trees are a popular approach for making predictions. Some common examples include Bayesian additive regression trees, boosting, and random forests. This paper focuses on honest random forests, which add honesty to the original form of random forests and have been proved to have better statistical properties. The main contribution is a new method that quantifies the uncertainties of the estimates and predictions produced by honest random forests. The proposed method is based on the generalized fiducial methodology and provides a fiducial density function that measures how likely it is that each individual honest tree is the true model. With such a density function, estimates and predictions, as well as their confidence/prediction intervals, can be obtained. The promising empirical properties of the proposed method are demonstrated by numerical comparisons with several state-of-the-art methods and by applications to a few real data sets. Lastly, the proposed method is backed by a strong asymptotic theoretical guarantee.
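The honesty property referred to above can be illustrated with a deliberately tiny example: a single-split "tree" in which one half of the data chooses the split point and a disjoint half estimates the leaf means, so the leaf values are never influenced by the split search. This is a sketch of the honesty idea only, on invented data, and not the paper's fiducial procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 300)
y = (x > 0).astype(float) + rng.normal(scale=0.1, size=300)

# Honesty: structure sample picks the split, estimation sample fills leaves.
x_s, y_s = x[:150], y[:150]          # structure sample (split search only)
x_e, y_e = x[150:], y[150:]          # estimation sample (leaf means only)

# Choose the split minimizing within-leaf squared error on the structure sample.
cands = np.quantile(x_s, np.linspace(0.1, 0.9, 17))

def sse(c):
    left, right = y_s[x_s <= c], y_s[x_s > c]
    return ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()

split = min(cands, key=sse)

# Leaf means come exclusively from the held-out estimation sample.
left_mean = y_e[x_e <= split].mean()
right_mean = y_e[x_e > split].mean()

def predict(t):
    return np.where(t <= split, left_mean, right_mean)
```

Because the estimation sample never participates in choosing `split`, the leaf means are unbiased conditional on the tree structure, which is the key ingredient behind the better statistical properties the abstract mentions.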
This work affords new insights into Bayesian CART in the context of structured wavelet shrinkage. The main thrust is to develop a formal inferential framework for Bayesian tree-based regression. We reframe Bayesian CART as a g-type prior which departs from the typical wavelet product priors by harnessing correlation induced by the tree topology. The Bayesian CART priors used in practice are shown to attain adaptive near rate-minimax posterior concentration in the supremum norm in regression models. For the fundamental goal of uncertainty quantification, we construct adaptive confidence bands for the regression function with uniform coverage under self-similarity. In addition, we show that tree posteriors enable optimal inference in the form of efficient confidence sets for smooth functionals of the regression function.