We propose Robust Lasso-Zero, an extension of the Lasso-Zero methodology [Descloux and Sardy, 2018], initially introduced for sparse linear models, to the sparse corruptions problem. We give theoretical guarantees on the sign recovery of the parameters for a slightly simplified version of the estimator, called Thresholded Justice Pursuit. The use of Robust Lasso-Zero is showcased for variable selection with missing values in the covariates. The method requires neither a model for the covariates nor an estimate of their covariance matrix or of the noise variance, and it has the great advantage of handling missing-not-at-random values without specifying a parametric model. Numerical experiments and a medical application underline the relevance of Robust Lasso-Zero in such a context, where few competitors are available. The method is easy to use and is implemented in the R library lass0.
The ocean is filled with microscopic algae called phytoplankton, which together are responsible for as much photosynthesis as all plants on land combined. Our ability to predict their response to the warming ocean relies on understanding how the dynamics of phytoplankton populations are influenced by changes in environmental conditions. One powerful technique to study the dynamics of phytoplankton is flow cytometry, which measures the optical properties of thousands of individual cells per second. Today, oceanographers are able to collect flow cytometry data in real time onboard a moving ship, providing them with fine-scale resolution of the distribution of phytoplankton across thousands of kilometers. One of the current challenges is to understand how these small- and large-scale variations relate to environmental conditions, such as nutrient availability, temperature, light and ocean currents. In this paper, we propose a novel sparse mixture of multivariate regressions model to estimate the time-varying phytoplankton subpopulations while simultaneously identifying the specific environmental covariates that are predictive of the observed changes to these subpopulations. We demonstrate the usefulness and interpretability of the approach using both synthetic data and real observations collected on an oceanographic cruise conducted in the north-east Pacific in the spring of 2017.
Microbiome data analyses require statistical models that can simultaneously decode microbes' reactions to the environment and interactions among microbes. While a multiresponse linear regression model seems like a straightforward solution, we argue that treating it as a graphical model is flawed: the regression coefficient matrix is not the adjacency matrix and therefore does not encode the conditional dependence structure between response and predictor nodes. This observation is especially important in biological settings when we have prior knowledge on the edges from specific experimental interventions that can only be properly encoded under a conditional dependence model. Here, we propose a chain graph model with two sets of nodes (predictors and responses) whose solution yields a graph with edges that indeed represent conditional dependence and thus agrees with the experimenter's intuition on the average behavior of nodes under treatment. The solution to our model is sparse via Bayesian LASSO and is also guaranteed to be the sparse solution to a Conditional Auto-Regressive (CAR) model. In addition, we propose an adaptive extension so that different shrinkage can be applied to different edges to incorporate edge-specific prior knowledge. Our model is computationally inexpensive through an efficient Gibbs sampling algorithm and can account for binary, count, and compositional responses via appropriate hierarchical structure. We apply our model to human gut and soil microbial compositional datasets and we highlight that CAR-LASSO can estimate biologically meaningful network structures in the data. The CAR-LASSO software is available as an R package at https://github.com/YunyiShen/CAR-LASSO.
We develop a distribution-free, unsupervised anomaly detection method called ECAD, which wraps around any regression algorithm and sequentially detects anomalies. Rooted in conformal prediction, ECAD does not require data exchangeability but approximately controls the Type-I error when data are normal. Computationally, it involves no data-splitting and efficiently trains ensemble predictors to increase statistical power. We demonstrate the superior performance of ECAD on detecting anomalous spatio-temporal traffic flow.
This paper studies the problem of accurately recovering a sparse vector $\beta^{\star}$ from highly corrupted linear measurements $y = X \beta^{\star} + e^{\star} + w$, where $e^{\star}$ is a sparse error vector whose nonzero entries may be unbounded and $w$ is a bounded noise. We propose a so-called extended Lasso optimization which takes into consideration sparse prior information of both $\beta^{\star}$ and $e^{\star}$. Our first result shows that the extended Lasso can faithfully recover both the regression vector and the corruption vector. Our analysis relies on the notion of extended restricted eigenvalue for the design matrix $X$. Our second set of results applies to a general class of Gaussian design matrix $X$ with i.i.d. rows $\mathcal{N}(0, \Sigma)$, for which we can establish a surprising result: the extended Lasso can recover exact signed supports of both $\beta^{\star}$ and $e^{\star}$ from only $\Omega(k \log p \log n)$ observations, even when the fraction of corruption is arbitrarily close to one. Our analysis also shows that this number of observations required to achieve exact signed support recovery is indeed optimal.
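The abstract above does not spell out the extended Lasso objective. A minimal sketch, assuming the commonly used form that minimizes 0.5 * ||y - X beta - e||_2^2 + lam_beta * ||beta||_1 + lam_e * ||e||_1 jointly over the regression vector and the corruption vector, is given below. The function name `extended_lasso`, the penalty parameters `lam_beta` and `lam_e`, and the choice of solver (plain proximal gradient, i.e. ISTA, on the augmented design [X, I]) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def extended_lasso(X, y, lam_beta, lam_e, n_iter=2000):
    """Hypothetical helper: jointly estimate a sparse beta and a sparse e.

    Assumed objective (not taken from the abstract):
        0.5 * ||y - X beta - e||_2^2 + lam_beta * ||beta||_1 + lam_e * ||e||_1
    solved by proximal gradient descent on the stacked variable z = (beta, e).
    """
    n, p = X.shape
    # Augmented design: y is modeled as A @ z with z = (beta, e).
    A = np.hstack([X, np.eye(n)])
    # Step size 1/L, where L is the squared spectral norm of A.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    # Separate thresholds for the beta block and the e block.
    thresh = step * np.concatenate([np.full(p, lam_beta), np.full(n, lam_e)])
    z = np.zeros(p + n)
    for _ in range(n_iter):
        grad = A.T @ (A @ z - y)          # gradient of the squared loss
        z = soft_threshold(z - step * grad, thresh)
    return z[:p], z[p:]


if __name__ == "__main__":
    # Synthetic example: 3 active coefficients, 10 grossly corrupted rows.
    rng = np.random.default_rng(0)
    n, p = 100, 20
    X = rng.standard_normal((n, p))
    beta_true = np.zeros(p)
    beta_true[:3] = [3.0, -3.0, 3.0]
    e_true = np.zeros(n)
    e_true[:10] = 20.0
    y = X @ beta_true + e_true + 0.1 * rng.standard_normal(n)

    beta_hat, e_hat = extended_lasso(X, y, lam_beta=5.0, lam_e=1.0)
    print("estimated beta support:", np.flatnonzero(np.abs(beta_hat) > 1e-6))
    print("largest corruption estimates at rows:",
          np.sort(np.argsort(-np.abs(e_hat))[:10]))
```

In this sketch the corruption vector is simply treated as extra coefficients with an identity design, so any Lasso solver could be substituted for the loop; the two penalty levels are chosen by hand for the synthetic example and would need tuning on real data.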
High-dimensional data has introduced challenges that are difficult to address with classical approaches to statistical process control, which has made it an active research topic in recent years. However, in many cases, data sets have underlying structures, such as in advanced manufacturing systems. If these structures are extracted correctly, efficient methods for process control can be developed. This paper proposes a robust sparse dimensionality reduction approach for correlated high-dimensional process monitoring to address the aforementioned issues. The developed monitoring technique uses robust sparse probabilistic PCA to reduce the dimensionality of the data stream while retaining interpretability. The proposed methodology utilizes Bayesian variational inference to obtain the estimates of a probabilistic representation of PCA. Simulation studies were conducted to verify the efficacy of the proposed methodology. Furthermore, we conducted a case study on change detection for in-line Raman spectroscopy to validate the efficiency of our proposed method in a practical scenario.