No Arabic abstract
Modern multivariate machine learning and statistical methodologies estimate parameters of interest while leveraging prior knowledge of the association between outcome variables. The methods that do allow for estimation of relationships do so typically through an error covariance matrix in multivariate regression which does not scale to other types of models. In this article we proposed the MinPEN framework to simultaneously estimate regression coefficients associated with the multivariate regression model and the relationships between outcome variables using mild assumptions. The MinPen framework utilizes a novel penalty based on the minimum function to exploit detected relationships between responses. An iterative algorithm that generalizes current state of the art methods is proposed as a solution to the non-convex optimization that is required to obtain estimates. Theoretical results such as high dimensional convergence rates, model selection consistency, and a framework for post selection inference are provided. We extend the proposed MinPen framework to other exponential family loss functions, with a specific focus on multiple binomial responses. Tuning parameter selection is also addressed. Finally, simulations and two data examples are presented to show the finite sample properties of this framework.
A new family of penalty functions, adaptive to likelihood, is introduced for model selection in general regression models. It arises naturally through assuming certain types of prior distribution on the regression parameters. To study stability properties of the penalized maximum likelihood estimator, two types of asymptotic stability are defined. Theoretical properties, including the parameter estimation consistency, model selection consistency, and asymptotic stability, are established under suitable regularity conditions. An efficient coordinate-descent algorithm is proposed. Simulation results and real data analysis show that the proposed method has competitive performance in comparison with existing ones.
The true population-level importance of a variable in a prediction task provides useful knowledge about the underlying data-generating mechanism and can help in deciding which measurements to collect in subsequent experiments. Valid statistical inference on this importance is a key component in understanding the population of interest. We present a computationally efficient procedure for estimating and obtaining valid statistical inference on the Shapley Population Variable Importance Measure (SPVIM). Although the computational complexity of the true SPVIM scales exponentially with the number of variables, we propose an estimator based on randomly sampling only $Theta(n)$ feature subsets given $n$ observations. We prove that our estimator converges at an asymptotically optimal rate. Moreover, by deriving the asymptotic distribution of our estimator, we construct valid confidence intervals and hypothesis tests. Our procedure has good finite-sample performance in simulations, and for an in-hospital mortality prediction task produces similar variable importance estimates when different machine learning algorithms are applied.
In modern applications, statisticians are faced with integrating heterogeneous data modalities relevant for an inference, prediction, or decision problem. In such circumstances, it is convenient to use a graphical model to represent the statistical dependencies, via a set of connected modules, each relating to a specific data modality, and drawing on specific domain expertise in their development. In principle, given data, the conventional statistical update then allows for coherent uncertainty quantification and information propagation through and across the modules. However, misspecification of any module can contaminate the estimate and update of others, often in unpredictable ways. In various settings, particularly when certain modules are trusted more than others, practitioners have preferred to avoid learning with the full model in favor of approaches that restrict the information propagation between modules, for example by restricting propagation to only particular directions along the edges of the graph. In this article, we investigate why these modular approaches might be preferable to the full model in misspecified settings. We propose principled criteria to choose between modular and full-model approaches. The question arises in many applied settings, including large stochastic dynamical systems, meta-analysis, epidemiological models, air pollution models, pharmacokinetics-pharmacodynamics, and causal inference with propensity scores.
In clinical trials, minimum clinically important difference (MCID) has attracted increasing interest as an important supportive clinical and statistical inference tool. Many estimation methods have been developed based on various intuitions, while little theoretical justification has been established. This paper proposes a new estimation framework of MCID using both diagnostic measurements and patient-reported outcomes (PROs). It first provides a precise definition of population-based MCID so that estimating such a MCID can be formulated as a large margin classification problem. The framework is then extended to personalized MCID to allow individualized thresholding value for patients whose clinical profiles may affect their PRO responses. More importantly, we show that the proposed estimation framework is asymptotically consistent, and a finite-sample upper bound is established for its prediction accuracy compared against the ideal MCID. The advantage of our proposed method is also demonstrated in a variety of simulated experiments as well as applications to two benchmark datasets and two phase-3 clinical trials.
We consider a finite-horizon multi-armed bandit (MAB) problem in a Bayesian setting, for which we propose an information relaxation sampling framework. With this framework, we define an intuitive family of control policies that include Thompson sampling (TS) and the Bayesian optimal policy as endpoints. Analogous to TS, which, at each decision epoch pulls an arm that is best with respect to the randomly sampled parameters, our algorithms sample entire future reward realizations and take the corresponding best action. However, this is done in the presence of penalties that seek to compensate for the availability of future information. We develop several novel policies and performance bounds for MAB problems that vary in terms of improving performance and increasing computational complexity between the two endpoints. Our policies can be viewed as natural generalizations of TS that simultaneously incorporate knowledge of the time horizon and explicitly consider the exploration-exploitation trade-off. We prove associated structural results on performance bounds and suboptimality gaps. Numerical experiments suggest that this new class of policies perform well, in particular in settings where the finite time horizon introduces significant exploration-exploitation tension into the problem. Finally, inspired by the finite-horizon Gittins index, we propose an index policy that builds on our framework that particularly outperforms the state-of-the-art algorithms in our numerical experiments.