No Arabic abstract
The ocean is filled with microscopic microalgae called phytoplankton, which together are responsible for as much photosynthesis as all plants on land combined. Our ability to predict their response to the warming ocean relies on understanding how the dynamics of phytoplankton populations is influenced by changes in environmental conditions. One powerful technique to study the dynamics of phytoplankton is flow cytometry, which measures the optical properties of thousands of individual cells per second. Today, oceanographers are able to collect flow cytometry data in real-time onboard a moving ship, providing them with fine-scale resolution of the distribution of phytoplankton across thousands of kilometers. One of the current challenges is to understand how these small and large scale variations relate to environmental conditions, such as nutrient availability, temperature, light and ocean currents. In this paper, we propose a novel sparse mixture of multivariate regressions model to estimate the time-varying phytoplankton subpopulations while simultaneously identifying the specific environmental covariates that are predictive of the observed changes to these subpopulations. We demonstrate the usefulness and interpretability of the approach using both synthetic data and real observations collected on an oceanographic cruise conducted in the north-east Pacific in the spring of 2017.
We propose Robust Lasso-Zero, an extension of the Lasso-Zero methodology [Descloux and Sardy, 2018], initially introduced for sparse linear models, to the sparse corruptions problem. We give theoretical guarantees on the sign recovery of the parameters for a slightly simplified version of the estimator, called Thresholded Justice Pursuit. The use of Robust Lasso-Zero is showcased for variable selection with missing values in the covariates. In addition to not requiring the specification of a model for the covariates, nor estimating their covariance matrix or the noise variance, the method has the great advantage of handling missing not-at random values without specifying a parametric model. Numerical experiments and a medical application underline the relevance of Robust Lasso-Zero in such a context with few available competitors. The method is easy to use and implemented in the R library lass0.
Flow cytometry is a technology that rapidly measures antigen-based markers associated to cells in a cell population. Although analysis of flow cytometry data has traditionally considered one or two markers at a time, there has been increasing interest in multidimensional analysis. However, flow cytometers are limited in the number of markers they can jointly observe, which is typically a fraction of the number of markers of interest. For this reason, practitioners often perform multiple assays based on different, overlapping combinations of markers. In this paper, we address the challenge of imputing the high dimensional jointly distributed values of marker attributes based on overlapping marginal observations. We show that simple nearest neighbor based imputation can lead to spurious subpopulations in the imputed data, and introduce an alternative approach based on nearest neighbor imputation restricted to a cells subpopulation. This requires us to perform clustering with missing data, which we address with a mixture model approach and novel EM algorithm. Since mixture model fitting may be ill-posed, we also develop techniques to initialize the EM algorithm using domain knowledge. We demonstrate our approach on real flow cytometry data.
We present a flexible branching process model for cell population dynamics in synchrony/time-series experiments used to study important cellular processes. Its formulation is constructive, based on an accounting of the unique cohorts in the population as they arise and evolve over time, allowing it to be written in closed form. The model can attribute effects to subsets of the population, providing flexibility not available using the models historically applied to these populations. It provides a tool for in silico synchronization of the population and can be used to deconvolve population-level experimental measurements, such as temporal expression profiles. It also allows for the direct comparison of assay measurements made from multiple experiments. The model can be fit either to budding index or DNA content measurements, or both, and is easily adaptable to new forms of data. The ability to use DNA content data makes the model applicable to almost any organism. We describe the model and illustrate its utility and flexibility in a study of cell cycle progression in the yeast Saccharomyces cerevisiae.
High dimensional data has introduced challenges that are difficult to address when attempting to implement classical approaches of statistical process control. This has made it a topic of interest for research due in recent years. However, in many cases, data sets have underlying structures, such as in advanced manufacturing systems. If extracted correctly, efficient methods for process control can be developed. This paper proposes a robust sparse dimensionality reduction approach for correlated high-dimensional process monitoring to address the aforementioned issues. The developed monitoring technique uses robust sparse probabilistic PCA to reduce the dimensionality of the data stream while retaining interpretability. The proposed methodology utilizes Bayesian variational inference to obtain the estimates of a probabilistic representation of PCA. Simulation studies were conducted to verify the efficacy of the proposed methodology. Furthermore, we conducted a case study for change detection for in-line Raman spectroscopy to validate the efficiency of our proposed method in a practical scenario.
We propose a general Bayesian approach to modeling epidemics such as COVID-19. The approach grew out of specific analyses conducted during the pandemic, in particular an analysis concerning the effects of non-pharmaceutical interventions (NPIs) in reducing COVID-19 transmission in 11 European countries. The model parameterizes the time varying reproduction number $R_t$ through a regression framework in which covariates can e.g be governmental interventions or changes in mobility patterns. This allows a joint fit across regions and partial pooling to share strength. This innovation was critical to our timely estimates of the impact of lockdown and other NPIs in the European epidemics, whose validity was borne out by the subsequent course of the epidemic. Our framework provides a fully generative model for latent infections and observations deriving from them, including deaths, cases, hospitalizations, ICU admissions and seroprevalence surveys. One issue surrounding our models use during the COVID-19 pandemic is the confounded nature of NPIs and mobility. We use our framework to explore this issue. We have open sourced an R package epidemia implementing our approach in Stan. Versions of the model are used by New York State, Tennessee and Scotland to estimate the current situation and make policy decisions.