ترغب بنشر مسار تعليمي؟ اضغط هنا

Relating the Partial Dependence Plot and Permutation Feature Importance to the Data Generating Process

124   0   0.0 ( 0 )
 نشر من قبل Christoph Molnar
 تاريخ النشر 2021
والبحث باللغة English




اسأل ChatGPT حول البحث

Scientists and practitioners increasingly rely on machine learning to model data and draw conclusions. Compared to statistical modeling approaches, machine learning makes fewer explicit assumptions about data structures, such as linearity. However, their model parameters usually cannot be easily related to the data generating process. To learn about the modeled relationships, partial dependence (PD) plots and permutation feature importance (PFI) are often used as interpretation methods. However, PD and PFI lack a theory that relates them to the data generating process. We formalize PD and PFI as statistical estimators of ground truth estimands rooted in the data generating process. We show that PD and PFI estimates deviate from this ground truth due to statistical biases, model variance and Monte Carlo approximation errors. To account for model variance in PD and PFI estimation, we propose the learner-PD and the learner-PFI based on model refits, and propose corrected variance and confidence interval estimators.



قيم البحث

اقرأ أيضاً

Going beyond correlations, the understanding and identification of causal relationships in observational time series, an important subfield of Causal Discovery, poses a major challenge. The lack of access to a well-defined ground truth for real-world data creates the need to rely on synthetic data for the evaluation of these methods. Existing benchmarks are limited in their scope, as they either are restricted to a static selection of data sets, or do not allow for a granular assessment of the methods performance when commonly made assumptions are violated. We propose a flexible and simple to use framework for generating time series data, which is aimed at developing, evaluating, and benchmarking time series causal discovery methods. In particular, the framework can be used to fine tune novel methods on vast amounts of data, without overfitting them to a benchmark, but rather so they perform well in real-world use cases. Using our framework, we evaluate prominent time series causal discovery methods and demonstrate a notable degradation in performance when their assumptions are invalidated and their sensitivity to choice of hyperparameters. Finally, we propose future research directions and how our framework can support both researchers and practitioners.
104 - Zhengze Zhou , Giles Hooker 2019
We propose a modification that corrects for split-improvement variable importance measures in Random Forests and other tree-based methods. These methods have been shown to be biased towards increasing the importance of features with more potential sp lits. We show that by appropriately incorporating split-improvement as measured on out of sample data, this bias can be corrected yielding better summaries and screening tools.
We propose a procedure for supervised classification that is based on potential functions. The potential of a class is defined as a kernel density estimate multiplied by the classs prior probability. The method transforms the data to a potential-pote ntial (pot-pot) plot, where each data point is mapped to a vector of potentials. Separation of the classes, as well as classification of new data points, is performed on this plot. For this, either the $alpha$-procedure ($alpha$-P) or $k$-nearest neighbors ($k$-NN) are employed. For data that are generated from continuous distributions, these classifiers prove to be strongly Bayes-consistent. The potentials depend on the kernel and its bandwidth used in the density estimate. We investigate several variants of bandwidth selection, including joint and separate pre-scaling and a bandwidth regression approach. The new method is applied to benchmark data from the literature, including simulated data sets as well as 50 sets of real data. It compares favorably to known classification methods such as LDA, QDA, max kernel density estimates, $k$-NN, and $DD$-plot classification using depth functions.
In recent years, a large amount of model-agnostic methods to improve the transparency, trustability and interpretability of machine learning models have been developed. We introduce local feature importance as a local version of a recent model-agnost ic global feature importance method. Based on local feature importance, we propose two visual tools: partial importance (PI) and individual conditional importance (ICI) plots which visualize how changes in a feature affect the model performance on average, as well as for individual observations. Our proposed methods are related to partial dependence (PD) and individual conditional expectation (ICE) plots, but visualize the expected (conditional) feature importance instead of the expected (conditional) prediction. Furthermore, we show that averaging ICI curves across observations yields a PI curve, and integrating the PI curve with respect to the distribution of the considered feature results in the global feature importance. Another contribution of our paper is the Shapley feature importance, which fairly distributes the overall performance of a model among the features according to the marginal contributions and which can be used to compare the feature importance across different models.
The interpretation of feature importance in machine learning models is challenging when features are dependent. Permutation feature importance (PFI) ignores such dependencies, which can cause misleading interpretations due to extrapolation. A possibl e remedy is more advanced conditional PFI approaches that enable the assessment of feature importance conditional on all other features. Due to this shift in perspective and in order to enable correct interpretations, it is therefore important that the conditioning is transparent and humanly comprehensible. In this paper, we propose a new sampling mechanism for the conditional distribution based on permutations in conditional subgroups. As these subgroups are constructed using decision trees (transformation trees), the conditioning becomes inherently interpretable. This not only provides a simple and effective estimator of conditional PFI, but also local PFI estimates within the subgroups. In addition, we apply the conditional subgroups approach to partial dependence plots (PDP), a popular method for describing feature effects that can also suffer from extrapolation when features are dependent and interactions are present in the model. We show that PFI and PDP based on conditional subgroups often outperform methods such as conditional PFI based on knockoffs, or accumulated local effect plots. Furthermore, our approach allows for a more fine-grained interpretation of feature effects and importance within the conditional subgroups.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا