No Arabic abstract
We consider the counting rate estimation of an unknown radioactive source, which emits photons at times modeled by an homogeneous Poisson process. A spectrometer converts the energy of incoming photons into electrical pulses, whose number provides a rough estimate of the intensity of the Poisson process. When the activity of the source is high, a physical phenomenon known as pileup effect distorts direct measurements, resulting in a significant bias to the standard estimators of the source activities used so far in the field. We show in this paper that the problem of counting rate estimation can be interpreted as a sparse regression problem. We suggest a post-processed, non-negative, version of the Least Absolute Shrinkage and Selection Operator (LASSO) to estimate the photon arrival times. The main difficulty in this problem is that no theoretical conditions can guarantee consistency in sparsity of LASSO, because the dictionary is not ideal and the signal is sampled. We therefore derive theoretical conditions and bounds which illustrate that the proposed method can none the less provide a good, close to the best attainable, estimate of the counting rate activity. The good performances of the proposed approach are studied on simulations and real datasets.
Quantile regression is studied in combination with a penalty which promotes structured (or group) sparsity. A mixed $ell_{1,infty}$-norm on the parameter vector is used to impose structured sparsity on the traditional quantile regression problem. An algorithm is derived to calculate the piece-wise linear solution path of the corresponding minimization problem. A Matlab implementation of the proposed algorithm is provided and some applications of the methods are also studied.
Bayesian Additive Regression Trees (BART) are non-parametric models that can capture complex exogenous variable effects. In any regression problem, it is often of interest to learn which variables are most active. Variable activity in BART is usually measured by counting the number of times a tree splits for each variable. Such one-way counts have the advantage of fast computations. Despite their convenience, one-way counts have several issues. They are statistically unjustified, cannot distinguish between main effects and interaction effects, and become inflated when measuring interaction effects. An alternative method well-established in the literature is Sobol indices, a variance-based global sensitivity analysis technique. However, these indices often require Monte Carlo integration, which can be computationally expensive. This paper provides analytic expressions for Sobol indices for BART posterior samples. These expressions are easy to interpret and are computationally feasible. Furthermore, we will show a fascinating connection between first-order (main-effects) Sobol indices and one-way counts. We also introduce a novel ranking method, and use this to demonstrate that the proposed indices preserve the Sobol-based rank order of variable importance. Finally, we compare these methods using analytic test functions and the En-ROADS climate impacts simulator.
This paper investigates the high-dimensional linear regression with highly correlated covariates. In this setup, the traditional sparsity assumption on the regression coefficients often fails to hold, and consequently many model selection procedures do not work. To address this challenge, we model the variations of covariates by a factor structure. Specifically, strong correlations among covariates are explained by common factors and the remaining variations are interpreted as idiosyncratic components of each covariate. This leads to a factor-adjusted regression model with both common factors and idiosyncratic components as covariates. We generalize the traditional sparsity assumption accordingly and assume that all common factors but only a small number of idiosyncratic components contribute to the response. A Bayesian procedure with a spike-and-slab prior is then proposed for parameter estimation and model selection. Simulation studies show that our Bayesian method outperforms its lasso analogue, manifests insensitivity to the overestimates of the number of common factors, pays a negligible price in the no correlation case, and scales up well with increasing sample size, dimensionality and sparsity. Numerical results on a real dataset of U.S. bond risk premia and macroeconomic indicators lend strong support to our methodology.
Motivation: Gene selection has become a common task in most gene expression studies. The objective of such research is often to identify the smallest possible set of genes that can still achieve good predictive performance. The problem of assigning tumours to a known class is a particularly important example that has received considerable attention in the last ten years. Many of the classification methods proposed recently require some form of dimension-reduction of the problem. These methods provide a single model as an output and, in most cases, rely on the likelihood function in order to achieve variable selection. Results: We propose a prediction-based objective function that can be tailored to the requirements of practitioners and can be used to assess and interpret a given problem. The direct optimization of such a function can be very difficult because the problem is potentially discontinuous and nonconvex. We therefore propose a general procedure for variable selection that resembles importance sampling to explore the feature space. Our proposal compares favorably with competing alternatives when applied to two cancer data sets in that smaller models are obtained for better or at least comparable classification errors. Furthermore by providing a set of selected models instead of a single one, we construct a network of possible models for a target prediction accuracy level.