
Efficient nonparametric statistical inference on population feature importance using Shapley values

Posted by Brian Williamson
Publication date: 2020
Research field: Mathematical Statistics
Paper language: English





The true population-level importance of a variable in a prediction task provides useful knowledge about the underlying data-generating mechanism and can help in deciding which measurements to collect in subsequent experiments. Valid statistical inference on this importance is a key component in understanding the population of interest. We present a computationally efficient procedure for estimating and obtaining valid statistical inference on the Shapley Population Variable Importance Measure (SPVIM). Although the computational complexity of the true SPVIM scales exponentially with the number of variables, we propose an estimator based on randomly sampling only $\Theta(n)$ feature subsets given $n$ observations. We prove that our estimator converges at an asymptotically optimal rate. Moreover, by deriving the asymptotic distribution of our estimator, we construct valid confidence intervals and hypothesis tests. Our procedure has good finite-sample performance in simulations, and for an in-hospital mortality prediction task it produces similar variable importance estimates when different machine learning algorithms are applied.
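To make the subset-sampling idea concrete, here is a minimal Python sketch of Monte Carlo Shapley importance in its random-permutation form, using the in-sample $R^2$ of a refit linear model as the predictiveness measure $v(S)$. This illustrates the general technique only, not the authors' SPVIM estimator (which, per the abstract, samples $\Theta(n)$ subsets and supplies valid inference); the `predictiveness` function and all parameter choices are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def predictiveness(X, y, subset):
    """v(S): in-sample R^2 of a linear model refit on feature subset S
    (defined as 0 for the empty set)."""
    if not subset:
        return 0.0
    model = LinearRegression().fit(X[:, subset], y)
    return r2_score(y, model.predict(X[:, subset]))

def sampled_shapley(X, y, n_draws=200, seed=0):
    """Monte Carlo Shapley importance: average each feature's marginal gain
    in predictiveness over randomly drawn feature orderings."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    phi = np.zeros(p)
    for _ in range(n_draws):
        order = rng.permutation(p)
        prev, subset = 0.0, []
        for j in order:
            subset.append(j)
            cur = predictiveness(X, y, subset)
            phi[j] += cur - prev
            prev = cur
    return phi / n_draws
```

Unlike the paper's procedure, this toy version gives point estimates only, with no confidence intervals or hypothesis tests.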




Read also

The distribution function is essential in statistical inference, and is connected with samples to form a directed closed loop by the correspondence theorem in measure theory and the Glivenko-Cantelli and Donsker properties. This connection creates a paradigm for statistical inference. However, existing distribution functions are defined in Euclidean spaces and are no longer convenient to use for rapidly evolving data objects of complex nature. It is imperative to develop the concept of the distribution function in a more general space to meet emerging needs. Note that linearity allows us to use hypercubes to define the distribution function in a Euclidean space, but without linearity in a metric space, we must work with the metric to investigate the probability measure. We introduce a class of metric distribution functions through the metric between random objects and a fixed location in metric spaces. We overcome this challenging step by proving the correspondence theorem and the Glivenko-Cantelli theorem for metric distribution functions in metric spaces, which lays the foundation for conducting rational statistical inference for metric space-valued data. We then develop a homogeneity test and a mutual independence test for non-Euclidean random objects, and present comprehensive empirical evidence to support the performance of our proposed methods.
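As a hedged illustration, the empirical analogue of a metric distribution function can be computed directly from a sample, assuming the definition reduces to $F(u, r) = P(d(X, u) \le r)$ for a fixed location $u$ and radius $r$ (a plausible reading of the abstract, not the authors' exact notation):

```python
import numpy as np

def empirical_mdf(sample, u, r, metric):
    """Empirical metric distribution function: the fraction of observed
    objects within distance r of the fixed location u, estimating
    F(u, r) = P(d(X, u) <= r)."""
    dists = np.array([metric(x, u) for x in sample])
    return float(np.mean(dists <= r))

# Example in R^2 with the Euclidean metric; any metric space works the same way.
rng = np.random.default_rng(0)
sample = rng.normal(size=(500, 2))
print(empirical_mdf(sample, u=np.zeros(2), r=1.0,
                    metric=lambda a, b: np.linalg.norm(a - b)))
```

Only the metric enters the computation, which is what lets the construction carry over from Euclidean spaces to general metric spaces.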
Most epidemiologic cohorts are composed of volunteers who do not represent the general population. To enable population inference from cohorts, we and others have proposed utilizing probability survey samples as external references to develop a propensity score (PS) for membership in the cohort versus the survey. Herein we develop a unified framework for PS-based weighting (such as inverse PS weighting (IPSW)) and matching methods (such as the kernel-weighting (KW) method). We identify a fundamental Strong Exchangeability Assumption (SEA) underlying existing PS-based matching methods whose failure invalidates inference even if the PS model is correctly specified. We relax the SEA to a Weak Exchangeability Assumption (WEA) for the matching method. Also, we propose IPSW.S and KW.S methods that reduce the variance of PS-based estimators by scaling the survey weights used in the PS estimation. We prove consistency of the IPSW.S and KW.S estimators of population means and prevalences under the WEA, and provide asymptotic variances and consistent variance estimators. In simulations, the KW.S and IPSW.S estimators had the smallest MSE. In our data example, the original KW estimates had large bias, whereas the KW.S estimates had the smallest MSE.
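For intuition, a plain (unscaled) IPSW estimator of a population mean can be sketched as follows: fit a cohort-versus-survey membership model using the survey design weights, then weight cohort outcomes by the fitted odds. This is a simplified sketch of the general IPSW idea, not the authors' IPSW.S or KW.S procedures, and all function and variable names are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipsw_mean(x_cohort, y_cohort, x_survey, survey_weights):
    """Plain IPSW sketch: model membership in the cohort (z=1) versus the
    reference survey (z=0), then weight cohort outcomes by the fitted odds
    (1 - p) / p to estimate a population mean."""
    X = np.vstack([x_cohort, x_survey])
    z = np.concatenate([np.ones(len(x_cohort)), np.zeros(len(x_survey))])
    # Survey units carry their design weights; cohort units get weight 1.
    w = np.concatenate([np.ones(len(x_cohort)), survey_weights])
    ps = LogisticRegression().fit(X, z, sample_weight=w).predict_proba(x_cohort)[:, 1]
    pseudo_w = (1.0 - ps) / ps
    return np.sum(pseudo_w * y_cohort) / np.sum(pseudo_w)
```

The paper's IPSW.S variant additionally rescales the survey weights inside the PS estimation step to reduce the variance of this kind of estimator.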
Daniil Ryabko, 2012
In this work a method for statistical analysis of time series is proposed, which is used to obtain solutions to some classical problems of mathematical statistics under the sole assumption that the process generating the data is stationary ergodic. Namely, three problems are considered: goodness-of-fit (or identity) testing, process classification, and the change point problem. For each of the problems, a test is constructed that is asymptotically accurate when the data are generated by stationary ergodic processes. The tests are based on empirical estimates of distributional distance.
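As a rough sketch of what an empirical distributional distance can look like for discrete (e.g. quantized) sequences, one can take a weighted sum, over word lengths, of total-variation gaps between empirical word frequencies. The paper's actual definition involves a sequence of partitions with specific weights; the simplified version below, including the geometric weights $2^{-l}$, is an assumption for illustration:

```python
import numpy as np
from collections import Counter

def word_freqs(seq, l):
    """Empirical frequencies of length-l words in a discrete sequence."""
    words = [tuple(seq[i:i + l]) for i in range(len(seq) - l + 1)]
    return {w: c / len(words) for w, c in Counter(words).items()}

def empirical_dist_distance(x, y, max_len=5):
    """Weighted sum over word lengths l of the total-variation gap between
    empirical word frequencies, with geometric weights 2^-l."""
    d = 0.0
    for l in range(1, max_len + 1):
        fx, fy = word_freqs(x, l), word_freqs(y, l)
        d += 2.0 ** -l * sum(abs(fx.get(k, 0.0) - fy.get(k, 0.0))
                             for k in set(fx) | set(fy))
    return d
```

A goodness-of-fit or change-point test can then threshold such a distance between two sample segments, which is the flavor of the constructions in the paper.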
Ramin Okhrati, Aldo Lipani, 2020
Shapley values are powerful analytical tools in game theory for measuring the importance of a player in a game. Due to their desirable axiomatic properties, such as efficiency, they have become popular for feature importance analysis in data science and machine learning. However, the time complexity of computing Shapley values from the original formula is exponential, which becomes infeasible as the number of features grows. Castro et al. [1] developed a sampling algorithm to estimate Shapley values. In this work, we propose a new sampling method based on a multilinear extension technique as applied in game theory. The aim is to provide a more efficient sampling method for estimating Shapley values. Our method is applicable to any machine learning model, in particular to both multi-class classification and regression problems. We apply the method to estimate Shapley values for multilayer perceptrons (MLPs), and through experimentation on two datasets we demonstrate that our method provides more accurate estimates of the Shapley values by reducing the variance of the sampling statistics.
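The multilinear-extension idea can be sketched as follows: every feature other than $j$ joins a random coalition independently with probability $q$, and $j$'s Shapley value equals the integral over $q \in [0, 1]$ of its expected marginal contribution. The Monte Carlo sketch below samples both $q$ and the coalitions; the interface and sample sizes are illustrative assumptions, not the authors' exact estimator:

```python
import numpy as np

def multilinear_shapley(value, p, n_samples=5000, seed=0):
    """Shapley estimation via the multilinear extension: each feature other
    than j joins a random coalition independently with probability q, and
    phi_j is the integral over q in [0, 1] of j's expected marginal
    contribution. Both q and the coalitions are sampled here."""
    rng = np.random.default_rng(seed)
    phi, counts = np.zeros(p), np.zeros(p)
    for _ in range(n_samples):
        q = rng.uniform()
        include = rng.uniform(size=p) < q
        j = int(rng.integers(p))              # feature whose marginal we probe
        s = frozenset(np.flatnonzero(include)) - {j}
        phi[j] += value(s | {j}) - value(s)
        counts[j] += 1
    return phi / np.maximum(counts, 1)

# Toy additive game v(S) = sum of weights: estimates should recover the weights.
w = np.array([0.5, 0.3, 0.2])
print(multilinear_shapley(lambda s: sum(w[i] for i in s), p=3))
```

In the toy additive game at the end, each player's Shapley value equals its own weight, so the printed estimates should be close to $w$.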
Classification is the task of assigning a new instance to one of a set of predefined categories based on the attributes of the instance. A classification tree is one of the most commonly used techniques in the area of classification. In this paper, we introduce a novel classification tree algorithm which we call the Direct Nonparametric Predictive Inference (D-NPI) classification algorithm. The D-NPI algorithm is based entirely on the Nonparametric Predictive Inference (NPI) approach and does not use any other assumption or information. NPI is a statistical methodology which learns from data in the absence of prior knowledge and uses only a few modelling assumptions, enabled by the use of lower and upper probabilities to quantify uncertainty. Due to its predictive nature, NPI is well suited for classification, as the nature of classification is explicitly predictive as well. The D-NPI algorithm uses a new split criterion called Correct Indication (CI). The CI reports the strength of the evidence, based on the data, that the attribute variables will be indicative: if an attribute is very informative, it yields high lower and upper probabilities for CI. The CI is based entirely on NPI and does not use any additional concepts such as entropy. The performance of the D-NPI classification algorithm is tested against several classification algorithms using classification accuracy, in-sample accuracy, and tree size on different datasets from the UCI machine learning repository. The experimental results indicate that the D-NPI classification algorithm performs well in terms of classification accuracy and in-sample accuracy.