No Arabic abstract
In this paper we study the kernel multiple ridge regression framework, which we refer to as multi-task regression, using penalization techniques. The theoretical analysis of this problem shows that the key element appearing for an optimal calibration is the covariance matrix of the noise between the different tasks. We present a new algorithm to estimate this covariance matrix, based on the concept of minimal penalty, which was previously used in the single-task regression framework to estimate the variance of the noise. We show, in a non-asymptotic setting and under mild assumptions on the target function, that this estimator converges towards the covariance matrix. Then plugging this estimator into the corresponding ideal penalty leads to an oracle inequality. We illustrate the behavior of our algorithm on synthetic examples.
In high-dimensional regression, we attempt to estimate a parameter vector ${boldsymbol beta}_0in{mathbb R}^p$ from $nlesssim p$ observations ${(y_i,{boldsymbol x}_i)}_{ile n}$ where ${boldsymbol x}_iin{mathbb R}^p$ is a vector of predictors and $y_i$ is a response variable. A well-estabilished approach uses convex regularizers to promote specific structures (e.g. sparsity) of the estimate $widehat{boldsymbol beta}$, while allowing for practical algorithms. Theoretical analysis implies that convex penalization schemes have nearly optimal estimation properties in certain settings. However, in general the gaps between statistically optimal estimation (with unbounded computational resources) and convex methods are poorly understood. We show that, in general, a large gap exists between the best performance achieved by emph{any convex regularizer} and the optimal statistical error. Remarkably, we demonstrate that this gap is generic as soon as we try to incorporate very simple structural information about the empirical distribution of the entries of ${boldsymbol beta}_0$. Our results follow from a detailed study of standard Gaussian designs, a setting that is normally considered particularly friendly to convex regularization schemes such as the Lasso. We prove a lower bound on the estimation error achieved by any convex regularizer which is invariant under permutations of the coordinates of its argument. This bound is expected to be generally tight, and indeed we prove tightness under certain conditions. Further, it implies a gap with respect to Bayes-optimal estimation that can be precisely quantified and persists if the prior distribution of the signal ${boldsymbol beta}_0$ is known to the statistician. Our results provide rigorous evidence towards a broad conjecture regarding computational-statistical gaps in high-dimensional estimation.
In this paper we study multi-task kernel ridge regression and try to understand when the multi-task procedure performs better than the single-task one, in terms of averaged quadratic risk. In order to do so, we compare the risks of the estimators with perfect calibration, the emph{oracle risk}. We are able to give explicit settings, favorable to the multi-task procedure, where the multi-task oracle performs better than the single-task one. In situations where the multi-task procedure is conjectured to perform badly, we also show the oracle does so. We then complete our study with simulated examples, where we can compare both oracle risks in more natural situations. A consequence of our result is that the multi-task ridge estimator has a lower risk than any single-task estimator, in favorable situations.
We consider a sparse multi-task regression framework for fitting a collection of related sparse models. Representing models as nodes in a graph with edges between related models, a framework that fuses lasso regressions with the total variation penalty is investigated. Under a form of restricted eigenvalue assumption, bounds on prediction and squared error are given that depend upon the sparsity of each model and the differences between related models. This assumption relates to the smallest eigenvalue restricted to the intersection of two cone sets of the covariance matrix constructed from each of the agents covariances. We show that this assumption can be satisfied if the constructed covariance matrix satisfies a restricted isometry property. In the case of a grid topology high-probability bounds are given that match, up to log factors, the no-communication setting of fitting a lasso on each model, divided by the number of agents. A decentralised dual method that exploits a convex-concave formulation of the penalised problem is proposed to fit the models and its effectiveness demonstrated on simulations against the group lasso and variants.
Penalization procedures often suffer from their dependence on multiplying factors, whose optimal values are either unknown or hard to estimate from the data. We propose a completely data-driven calibration algorithm for this parameter in the least-squares regression framework, without assuming a particular shape for the penalty. Our algorithm relies on the concept of minimal penalty, recently introduced by Birge and Massart (2007) in the context of penalized least squares for Gaussian homoscedastic regression. On the positive side, the minimal penalty can be evaluated from the data themselves, leading to a data-driven estimation of an optimal penalty which can be used in practice; on the negative side, their approach heavily relies on the homoscedastic Gaussian nature of their stochastic framework. The purpose of this paper is twofold: stating a more general heuristics for designing a data-driven penalty (the slope heuristics) and proving that it works for penalized least-squares regression with a random design, even for heteroscedastic non-Gaussian data. For technical reasons, some exact mathematical results will be proved only for regressogram bin-width selection. This is at least a first step towards further results, since the approach and the method that we use are indeed general.
This paper is about optimal estimation of the additive components of a nonparametric, additive isotone regression model. It is shown that asymptotically up to first order, each additive component can be estimated as well as it could be by a least squares estimator if the other components were known. The algorithm for the calculation of the estimator uses backfitting. Convergence of the algorithm is shown. Finite sample properties are also compared through simulation experiments.