No Arabic abstract
Multi-task learning is increasingly used to investigate the association structure between multiple responses and a single set of predictor variables in many applications. In the era of big data, the coexistence of incomplete outcomes, large number of responses, and high dimensionality in predictors poses unprecedented challenges in estimation, prediction, and computation. In this paper, we propose a scalable and computationally efficient procedure, called PEER, for large-scale multi-response regression with incomplete outcomes, where both the numbers of responses and predictors can be high-dimensional. Motivated by sparse factor regression, we convert the multi-response regression into a set of univariate-response regressions, which can be efficiently implemented in parallel. Under some mild regularity conditions, we show that PEER enjoys nice sampling properties including consistency in estimation, prediction, and variable selection. Extensive simulation studies show that our proposal compares favorably with several existing methods in estimation accuracy, variable selection, and computation efficiency.
We propose a distributed quadratic inference function framework to jointly estimate regression parameters from multiple potentially heterogeneous data sources with correlated vector outcomes. The primary goal of this joint integrative analysis is to estimate covariate effects on all outcomes through a marginal regression model in a statistically and computationally efficient way. We develop a data integration procedure for statistical estimation and inference of regression parameters that is implemented in a fully distributed and parallelized computational scheme. To overcome computational and modeling challenges arising from the high-dimensional likelihood of the correlated vector outcomes, we propose to analyze each data source using Qu, Lindsay and Li (2000)s quadratic inference functions, and then to jointly reestimate parameters from each data source by accounting for correlation between data sources using a combined meta-estimator in a similar spirit to Hansen (1982)s generalised method of moments. We show both theoretically and numerically that the proposed method yields efficiency improvements and is computationally fast. We illustrate the proposed methodology with the joint integrative analysis of the association between smoking and metabolites in a large multi-cohort study and provide an R package for ease of implementation.
In many health domains such as substance-use, outcomes are often counts with an excessive number of zeros (EZ) - count data having zero counts at a rate significantly higher than that expected of a standard count distribution (e.g., Poisson). However, an important gap exists in sample size estimation methodology for planning sequential multiple assignment randomized trials (SMARTs) for comparing dynamic treatment regimens (DTRs) using longitudinal count data. DTRs, also known as treatment algorithms or adaptive interventions, mimic the individualized and evolving nature of patient care through the specification of decision rules guiding the type, timing and modality of delivery, and dosage of treatments to address the unique and changing needs of individuals. To close this gap, we develop a Monte Carlo-based approach to sample size estimation. A SMART for engaging alcohol and cocaine-dependent patients in treatment is used as motivation.
Statistical techniques used in air pollution modelling usually lack the possibility to understand which predictors affect air pollution in which functional form; and are not able to regress on exceedances over certain thresholds imposed by authorities directly. The latter naturally induce conditional quantiles and reflect the seriousness of particular events. In the present paper we focus on this important aspect by developing quantile regression models further. We propose a general Bayesian effect selection approach for additive quantile regression within a highly interpretable framework. We place separate normal beta prime spike and slab priors on the scalar importance parameters of effect parts and implement a fast Gibbs sampling scheme. Specifically, it enables to study quantile-specific covariate effects, allows these covariates to be of general functional form using additive predictors, and facilitates the analysts decision whether an effect should be included linearly, non-linearly or not at all in the quantiles of interest. In a detailed analysis on air pollution data in Madrid (Spain) we find the added value of modelling extreme nitrogen dioxide (NO2) concentrations and how thresholds are driven differently by several climatological variables and traffic as a spatial proxy. Our results underpin the need of enhanced statistical models to support short-term decisions and enable local authorities to mitigate or even prevent exceedances of NO2 concentration limits.
Corrupted data sets containing noisy or missing observations are prevalent in various contemporary applications such as economics, finance and bioinformatics. Despite the recent methodological and algorithmic advances in high-dimensional multi-response regression, how to achieve scalable and interpretable estimation under contaminated covariates is unclear. In this paper, we develop a new methodology called convex conditioned sequential sparse learning (COSS) for error-in-variables multi-response regression under both additive measurement errors and random missing data. It combines the strengths of the recently developed sequential sparse factor regression and the nearest positive semi-definite matrix projection, thus enjoying stepwise convexity and scalability in large-scale association analyses. Comprehensive theoretical guarantees are provided and we demonstrate the effectiveness of the proposed methodology through numerical studies.
In this paper, a functional partial quantile regression approach, a quantile regression analog of the functional partial least squares regression, is proposed to estimate the function-on-function linear quantile regression model. A partial quantile covariance function is first used to extract the functional partial quantile regression basis functions. The extracted basis functions are then used to obtain the functional partial quantile regression components and estimate the final model. In our proposal, the functional forms of the discretely observed random variables are first constructed via a finite-dimensional basis function expansion method. The functional partial quantile regression constructed using the functional random variables is approximated via the partial quantile regression constructed using the basis expansion coefficients. The proposed method uses an iterative procedure to extract the partial quantile regression components. A Bayesian information criterion is used to determine the optimum number of retained components. The proposed functional partial quantile regression model allows for more than one functional predictor in the model. However, the true form of the proposed model is unspecified, as the relevant predictors for the model are unknown in practice. Thus, a forward variable selection procedure is used to determine the significant predictors for the proposed model. Moreover, a case-sampling-based bootstrap procedure is used to construct pointwise prediction intervals for the functional response. The predictive performance of the proposed method is evaluated using several Monte Carlo experiments under different data generation processes and error distributions. Through an empirical data example, air quality data are analyzed to demonstrate the effectiveness of the proposed method.