The selection of grouped variables using the random forest algorithm is considered. First a new importance measure adapted for groups of variables is proposed. Theoretical insights into this criterion are given for additive regression models. Second, an original method for selecting functional variables based on the grouped variable importance measure is developed. Using a wavelet basis, it is proposed to regroup all of the wavelet coefficients for a given functional variable and use a wrapper selection algorithm with these groups. Various other groupings which take advantage of the frequency and time localization of the wavelet basis are proposed. An extensive simulation study is performed to illustrate the use of the grouped importance measure in this context. The method is applied to a real life problem coming from aviation safety.
Random forests are one of the most popular machine learning methods due to their accuracy and variable importance assessment. However, random forests only provide variable importance in a global sense. There is an increasing need for such assessments at a local level, motivated by applications in personalized medicine, policy-making, and bioinformatics. We propose a new nonparametric estimator that pairs the flexible random forest kernel with local sufficient dimension reduction to adapt to a regression functions local structure. This allows us to estimate a meaningful directional local variable importance measure at each prediction point. We develop a computationally efficient fitting procedure and provide sufficient conditions for the recovery of the splitting directions. We demonstrate significant accuracy gains of our proposed estimator over competing methods on simulated and real regression problems. Finally, we apply the proposed method to seasonal particulate matter concentration data collected in Beijing, China, which yields meaningful local importance measures. The methods presented here are available in the drforest Python package.
Smart metering infrastructures collect data almost continuously in the form of fine-grained long time series. These massive time series often have common daily patterns that are repeated between similar days or seasons and shared between grouped meters. Within this context, we propose a method to highlight individuals with abnormal daily dependency patterns, which we term evolution outliers. To this end, we approach the problem from the standpoint of Functional Data Analysis (FDA), by treating each daily record as a function or curve. We then focus on the morphological aspects of the observed curves, such as daily magnitude, daily shape, derivatives, and inter-day evolution. The proposed method for evolution outliers relies on the concept of functional depth, which has been a cornerstone in the literature of FDA to build shape and magnitude outlier detection methods. In conjunction with our evolution outlier proposal, these methods provide an outlier detection toolbox for smart meter data that covers a wide palette of functional outliers classes. We illustrate the outlier identification ability of this toolbox using actual smart metering data corresponding to photovoltaic energy generation and circuit voltage records.
We characterize and study variable importance (VIMP) and pairwise variable associations in binary regression trees. A key component involves the node mean squared error for a quantity we refer to as a maximal subtree. The theory naturally extends from single trees to ensembles of trees and applies to methods like random forests. This is useful because while importance values from random forests are used to screen variables, for example they are used to filter high throughput genomic data in Bioinformatics, very little theory exists about their properties.
We propose a multivariate functional responses low rank regression model with possible high dimensional functional responses and scalar covariates. By expanding the slope functions on a set of sieve basis, we reconstruct the basis coefficients as a matrix. To estimate these coefficients, we propose an efficient procedure using nuclear norm regularization. We also derive error bounds for our estimates and evaluate our method using simulations. We further apply our method to the Human Connectome Project neuroimaging data to predict cortical surface motor task-evoked functional magnetic resonance imaging signals using various clinical covariates to illustrate the usefulness of our results.
Motivated by modeling and analysis of mass-spectrometry data, a semi- and nonparametric model is proposed that consists of a linear parametric component for individual location and scale and a nonparametric regression function for the common shape. A multi-step approach is developed that simultaneously estimates the parametric components and the nonparametric function. Under certain regularity conditions, it is shown that the resulting estimators is consistent and asymptotic normal for the parametric part and achieve the optimal rate of convergence for the nonparametric part when the bandwidth is suitably chosen. Simulation results are presented to demonstrate the effectiveness and finite-sample performance of the method. The method is also applied to a SELDI-TOF mass spectrometry data set from a study of liver cancer patients.
Baptiste Gregorutti
,Bertrand Michel
,Philippe Saint-Pierre
.
(2014)
.
"Grouped variable importance with random forests and application to multiple functional data analysis"
.
Baptiste Gregorutti
هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا