Quantitatively predicting phenotype variables from expression changes in a set of candidate genes is of great interest in molecular biology, but it is also a challenging task for several reasons. First, the collected biological observations might be heterogeneous and correspond to different biological mechanisms. Second, the gene expression variables used to predict the phenotype are potentially highly correlated, since genes interact through unknown regulatory networks. In this paper, we present a novel approach designed to predict a quantitative trait from transcriptomic data, taking into account the heterogeneity in biological samples and the hidden gene regulatory networks underlying different biological mechanisms. The proposed model performs well on prediction and is also fully parametric, which facilitates downstream biological interpretation. The model provides clusters of individuals based on the relation between gene expression data and the phenotype, and also allows a gene regulatory network specific to each cluster of individuals to be inferred. We perform numerical simulations to demonstrate that our model is competitive with other prediction models, and we demonstrate its predictive performance and interpretability by predicting alcohol sensitivity from real transcriptomic data from the Drosophila melanogaster Genetic Reference Panel (DGRP).
One of the classic concerns in statistics is determining whether two samples come from the same population, i.e. homogeneity testing. In this paper, we propose a homogeneity test in the context of Functional Data Analysis, adopting an idea from multivariate data analysis: the data depth plot (DD-plot). This DD-plot is a generalization of the univariate Q-Q plot (quantile-quantile plot). We propose some statistics based on these DD-plots, and we use bootstrapping techniques to estimate their distributions. We estimate the finite-sample size and power of our test via simulation, obtaining better results than other homogeneity tests proposed in the literature. Finally, we illustrate the procedure on samples of real heterogeneous data and obtain consistent results.
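As a rough illustration of the DD-plot idea, the sketch below computes a depth-versus-depth point cloud for two samples of discretized curves. It makes several assumptions that are not taken from the abstract: the depth is a Fraiman-Muniz-type integrated depth, the curves are observed on a common grid, and the summary `mean_diagonal_gap` is a hypothetical statistic chosen for illustration, not one of the statistics proposed in the paper. The bootstrap step is omitted.

```python
def fm_depth(curve, sample):
    """Integrated (Fraiman-Muniz-type) depth of `curve` with respect to `sample`.

    `curve` is a list of values on a common grid; `sample` is a list of such curves.
    At each grid point the univariate depth is 1 - |1/2 - F_n(x)|, where F_n is the
    empirical CDF of the sample values at that grid point.
    """
    n = len(sample)
    total = 0.0
    for j, x in enumerate(curve):
        ecdf = sum(1 for s in sample if s[j] <= x) / n
        total += 1.0 - abs(0.5 - ecdf)
    return total / len(curve)


def dd_plot(sample_x, sample_y):
    """Return DD-plot points (depth w.r.t. X, depth w.r.t. Y) for the pooled curves."""
    pooled = sample_x + sample_y
    return [(fm_depth(c, sample_x), fm_depth(c, sample_y)) for c in pooled]


def mean_diagonal_gap(points):
    """Hypothetical summary: mean absolute gap between the two depths of each point."""
    return sum(abs(a - b) for a, b in points) / len(points)


if __name__ == "__main__":
    x = [[0, 0], [1, 1], [2, 2]]
    y_same = [[0, 0], [1, 1], [2, 2]]
    y_shift = [[10, 10], [11, 11], [12, 12]]
    print(mean_diagonal_gap(dd_plot(x, y_same)))   # identical samples lie on the diagonal
    print(mean_diagonal_gap(dd_plot(x, y_shift)))  # shifted samples move off the diagonal
```

When the two samples come from the same population, the points are expected to concentrate near the diagonal, which is what a DD-plot-based statistic would measure.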
We develop a new methodology for spatial regression of aggregated outputs on multi-resolution covariates. Such problems often occur with spatial data, for example in crop yield prediction, where the output is spatially-aggregated over an area and the covariates may be observed at multiple resolutions. Building upon previous work on aggregated output regression, we propose a regression framework to synthesise the effects of the covariates at different resolutions on the output and provide uncertainty estimation. We show that, for a crop yield prediction problem, our approach is more scalable, via variational inference, than existing multi-resolution regression models. We also show that our framework yields good predictive performance, compared to existing multi-resolution crop yield models, whilst being able to provide estimation of the underlying spatial effects.
In the process of clinical diagnosis and treatment, the restricted mean survival time (RMST), which reflects the life expectancy of patients up to a specified time, can be used as an appropriate outcome measure. However, the RMST only calculates the mean survival time of patients within a period of time after the start of follow-up and may not accurately portray the change in a patient's life expectancy over time. The life expectancy can be adjusted for the time the patient has already survived and defined as the conditional restricted mean survival time (cRMST). A dynamic RMST model based on the cRMST can be established by incorporating time-dependent covariates and covariates with time-varying effects. We analysed data from a study of primary biliary cirrhosis (PBC) to illustrate the use of the dynamic RMST model. The predictive performance was evaluated using the C-index and the prediction error. The proposed dynamic RMST model, which can explore the dynamic effects of prognostic factors on survival time, has better predictive performance than the RMST model. Three PBC patient examples were used to illustrate how the predicted cRMST changed at different prediction times during follow-up. The use of the dynamic RMST model based on the cRMST allows for optimization of evidence-based decision-making by updating personalized dynamic life expectancy for patients.
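To make the difference between the RMST and the cRMST concrete, here is a minimal nonparametric sketch. It assumes the RMST up to a horizon tau is the area under a Kaplan-Meier curve on [0, tau], and that the cRMST at a prediction time s with window w is the area under S(t)/S(s) on [s, s + w]. That definition, and the function names, are illustrative assumptions; the regression model with time-dependent covariates described above is not implemented here.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier steps as a list of (event_time, survival_just_after)."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    steps = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        while i < len(data) and data[i][0] == t:
            deaths += int(data[i][1] == 1)  # event indicator: 1 = event, 0 = censored
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            steps.append((t, surv))
        n_at_risk -= removed
    return steps


def surv_at(steps, t):
    """Right-continuous step-function value S(t)."""
    s = 1.0
    for u, v in steps:
        if u <= t:
            s = v
        else:
            break
    return s


def area_under_surv(steps, a, b):
    """Exact integral of the step function S over [a, b]."""
    knots = [a] + [u for u, _ in steps if a < u < b] + [b]
    return sum(surv_at(steps, lo) * (hi - lo) for lo, hi in zip(knots, knots[1:]))


def rmst(steps, tau):
    """Restricted mean survival time up to tau."""
    return area_under_surv(steps, 0.0, tau)


def crmst(steps, s, w):
    """Conditional RMST over the window (s, s + w], given survival beyond s (assumed definition)."""
    s_s = surv_at(steps, s)
    if s_s <= 0.0:
        raise ValueError("estimated survival at s is zero; cRMST is undefined")
    return area_under_surv(steps, s, s + w) / s_s


if __name__ == "__main__":
    km = kaplan_meier([1, 2, 3, 4], [1, 1, 1, 1])
    print(rmst(km, 4.0))        # mean of min(T, 4) in this uncensored toy sample
    print(crmst(km, 2.0, 2.0))  # mean of min(T - 2, 2) among those with T > 2
```

In the toy data, two of the four subjects are still at risk after time 2, so the conditional quantity is computed from the remaining part of the curve only, which is why it can change as the prediction time moves forward.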
We develop a distribution-free, unsupervised anomaly detection method called ECAD, which wraps around any regression algorithm and sequentially detects anomalies. Rooted in conformal prediction, ECAD does not require data exchangeability but approximately controls the Type-I error when data are normal. Computationally, it involves no data-splitting and efficiently trains ensemble predictors to increase statistical power. We demonstrate the superior performance of ECAD on detecting anomalous spatio-temporal traffic flow.
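The sketch below illustrates the general conformal idea of turning prediction residuals into p-values and flagging small ones. It is a generic residual-based conformal p-value, not the ECAD procedure itself: the ensemble training and the handling of non-exchangeable data are not reproduced, and the function names, the `warmup` parameter, and the choice of absolute residuals as nonconformity scores are assumptions made for illustration.

```python
def conformal_p_value(past_residuals, new_residual):
    """Fraction of scores at least as extreme as the new one, counting the new one itself."""
    n = len(past_residuals)
    at_least = sum(1 for r in past_residuals if abs(r) >= abs(new_residual))
    return (at_least + 1) / (n + 1)


def flag_anomalies(residuals, warmup, alpha=0.1):
    """Sequentially flag time steps whose conformal p-value is at most alpha.

    `residuals` are observed-minus-predicted values from any regression model;
    the first `warmup` residuals serve only as reference scores.
    """
    flags = []
    for t in range(warmup, len(residuals)):
        p = conformal_p_value(residuals[:t], residuals[t])
        flags.append((t, p <= alpha))
    return flags


if __name__ == "__main__":
    res = [0.1, -0.2, 0.3, -0.1, 0.2, -0.3, 0.1, 0.2, -0.2, 9.0, 0.1]
    for t, is_anomaly in flag_anomalies(res, warmup=9, alpha=0.1):
        print(t, is_anomaly)
```

With n reference scores the smallest attainable p-value is 1/(n + 1), so the warm-up length bounds how small a level alpha can usefully be.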
Competing risks data are common in medical studies, and the sub-distribution hazard (SDH) ratio is considered an appropriate measure. However, because the hazard itself is not easy to interpret clinically and because the SDH ratio is valid only under the proportional SDH assumption, this article introduces an alternative index under competing risks, named restricted mean time lost (RMTL). Several test procedures are also constructed based on RMTL. First, we introduce the definition and estimation of RMTL based on Aalen-Johansen cumulative incidence functions. Then, we consider several combined tests based on the SDH and the RMTL difference (RMTLd). The statistical properties of the methods are evaluated using simulations, and the methods are applied to two examples. The type I errors of the combined tests are close to the nominal level. All combined tests show acceptable power in all situations. In conclusion, RMTL can meaningfully summarize treatment effects for clinical decision making, and three combined tests have robust power under various conditions, which can be considered for statistical inference in real data analysis.
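A minimal sketch of the RMTL estimate, assuming it is the area under the Aalen-Johansen cumulative incidence function of the cause of interest on [0, tau]. The function names and the coding of causes (0 for censoring, positive integers for event types) are illustrative assumptions; the combined tests and the variance estimation are not covered.

```python
def aalen_johansen(times, causes, cause):
    """Aalen-Johansen cumulative incidence steps (time, CIF) for one cause.

    `causes` uses 0 for censored observations and 1, 2, ... for event types.
    """
    data = sorted(zip(times, causes))
    n_at_risk = len(data)
    surv_before = 1.0  # overall event-free survival just before the current time
    cif = 0.0
    steps = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d_all = 0
        d_cause = 0
        removed = 0
        while i < len(data) and data[i][0] == t:
            c = data[i][1]
            d_all += int(c != 0)
            d_cause += int(c == cause)
            removed += 1
            i += 1
        if d_cause > 0:
            cif += surv_before * d_cause / n_at_risk
            steps.append((t, cif))
        surv_before *= 1.0 - d_all / n_at_risk
        n_at_risk -= removed
    return steps


def rmtl(steps, tau):
    """Restricted mean time lost: area under the step-function CIF on [0, tau]."""
    area = 0.0
    prev_t, prev_f = 0.0, 0.0
    for t, f in steps:
        if t >= tau:
            break
        area += prev_f * (t - prev_t)
        prev_t, prev_f = t, f
    area += prev_f * (tau - prev_t)
    return area


if __name__ == "__main__":
    times = [1, 2, 3, 4]
    causes = [1, 2, 1, 0]  # toy data: two events of cause 1, one of cause 2, one censored
    print(rmtl(aalen_johansen(times, causes, cause=1), tau=4.0))
    print(rmtl(aalen_johansen(times, causes, cause=2), tau=4.0))
```

An RMTL difference between two treatment arms would then be the difference of two such areas, one per arm, for the same cause and the same tau.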