Data from NASA's Orbiting Carbon Observatory-2 (OCO-2) satellite are essential to many carbon management strategies. A retrieval algorithm estimates CO2 concentration from the radiance data measured by OCO-2. However, due to factors such as cloud cover and cosmic rays, the spatial coverage of the retrieval algorithm is limited in some areas of critical importance for carbon cycle science. Mixed land/water pixels along the coastline are also excluded from retrieval processing because valid ancillary variables, including land fraction, are lacking. We propose an approach for modeling spatial spectral data that addresses these two problems through radiance imputation and land fraction estimation. The spectral observations are modeled as spatially indexed functional data with footprint-specific parameters and are reduced to much lower dimensions by functional principal component analysis. The principal component scores are modeled as random fields to account for spatial dependence, and the missing spectral observations are imputed by kriging the principal component scores. The proposed method is shown to impute spectral radiance with high accuracy for observations over the Pacific Ocean. An unmixing approach based on this model provides much more accurate land fraction estimates in our validation study along the Greek coastline.
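The imputation idea in this abstract (reduce the spectra with functional principal component analysis, then krige the scores) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' model: the function names are hypothetical, the score fields are treated as independent zero-mean fields with an exponential covariance and a fixed range, the footprint-specific parameters are omitted, and the data are synthetic.

```python
import numpy as np

def fpca(curves, n_components):
    """FPCA on densely sampled curves (one per row) via SVD of the centered matrix."""
    mean = curves.mean(axis=0)
    _, _, vt = np.linalg.svd(curves - mean, full_matrices=False)
    components = vt[:n_components]            # discretised eigenfunctions
    scores = (curves - mean) @ components.T   # principal component scores
    return mean, components, scores

def exp_cov(a, b, sill, length):
    """Exponential covariance between two sets of 2-D locations (assumed model)."""
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sill * np.exp(-dist / length)

def krige_scores(obs_xy, obs_scores, new_xy, length, nugget=1e-6):
    """Simple kriging of each score field, treated as independent and zero-mean."""
    pred = np.empty((new_xy.shape[0], obs_scores.shape[1]))
    for k in range(obs_scores.shape[1]):
        sill = obs_scores[:, k].var() + 1e-12
        cov = exp_cov(obs_xy, obs_xy, sill, length)
        cov += nugget * sill * np.eye(len(obs_xy))
        cross = exp_cov(new_xy, obs_xy, sill, length)
        pred[:, k] = cross @ np.linalg.solve(cov, obs_scores[:, k])
    return pred

def impute_spectra(obs_xy, obs_curves, new_xy, n_components=2, length=5.0):
    """Impute curves at new locations from kriged principal component scores."""
    mean, components, scores = fpca(obs_curves, n_components)
    return mean + krige_scores(obs_xy, scores, new_xy, length) @ components

# Synthetic example: two smooth score fields driving two spectral shapes.
rng = np.random.default_rng(0)
xy = rng.uniform(0.0, 10.0, size=(200, 2))
wl = np.linspace(0.0, 1.0, 50)
s1 = np.sin(xy[:, 0] / 3.0)
s2 = np.cos(xy[:, 1] / 4.0)
curves = (1.0 + np.outer(s1, np.sin(2 * np.pi * wl))
          + 0.5 * np.outer(s2, np.cos(2 * np.pi * wl)))

held, kept = np.arange(20), np.arange(20, 200)   # hold out 20 locations
imputed = impute_spectra(xy[kept], curves[kept], xy[held])
rmse = np.sqrt(np.mean((imputed - curves[held]) ** 2))
baseline_rmse = np.sqrt(np.mean((curves[kept].mean(axis=0) - curves[held]) ** 2))
print(f"kriged-score RMSE: {rmse:.4f}, mean-curve RMSE: {baseline_rmse:.4f}")
```

In practice the covariance parameters would be estimated from the scores rather than fixed, and the number of components chosen from the explained variance.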
Motivated by the modeling and analysis of mass-spectrometry data, a semiparametric model is proposed that consists of a linear parametric component for individual location and scale and a nonparametric regression function for the common shape. A multi-step approach is developed that simultaneously estimates the parametric components and the nonparametric function. Under certain regularity conditions, it is shown that the resulting estimators are consistent and asymptotically normal for the parametric part and achieve the optimal rate of convergence for the nonparametric part when the bandwidth is suitably chosen. Simulation results are presented to demonstrate the effectiveness and finite-sample performance of the method. The method is also applied to a SELDI-TOF mass spectrometry data set from a study of liver cancer patients.
Statistical analysis of large data sets offers new opportunities to better understand many processes. Yet, data accumulation often implies relaxing acquisition procedures or compounding diverse sources. As a consequence, such data sets often contain mixed data, i.e. both quantitative and qualitative variables, and many missing values. Furthermore, aggregated data present a natural multilevel structure, where individuals or samples are nested within different sites, such as countries or hospitals. Imputation of multilevel data has therefore drawn some attention recently, but current solutions are not designed to handle mixed data and suffer from important drawbacks such as their computational cost. In this article, we propose a single imputation method for multilevel data, which can be used to complete quantitative, categorical, or mixed data. The method is based on multilevel singular value decomposition (SVD), which consists of decomposing the variability of the data into two components, the between-group and within-group variability, and performing SVD on both parts. We show in a simulation study that, in comparison to competitors, the method has the advantages of handling data sets of various sizes and being computationally faster. Furthermore, it is the first so far to handle mixed data. We apply the method to impute a medical data set resulting from the aggregation of several data sets coming from different hospitals. This application falls in the framework of a larger project on Trauma patients. To overcome obstacles associated with the aggregation of medical data, we turn to distributed computation. The method is implemented in an R package.
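The between/within decomposition described in this abstract can be sketched for the purely quantitative case as an iterative low-rank imputation. This is a simplified illustration under stated assumptions, not the authors' algorithm or their R package: the function names, the fixed ranks, the absence of regularisation, and the synthetic data are all hypothetical, and categorical or mixed variables are not handled.

```python
import numpy as np

def low_rank(matrix, rank):
    """Best rank-`rank` approximation of a matrix via truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def multilevel_svd_impute(x, groups, rank_between=1, rank_within=1,
                          n_iter=200, tol=1e-8):
    """Iteratively impute NaNs using low-rank between- and within-group parts."""
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)
    filled = np.where(missing, np.nanmean(x, axis=0), x)  # start from column means
    labels = np.unique(groups)
    for _ in range(n_iter):
        grand = filled.mean(axis=0)
        between = np.empty_like(filled)
        for g in labels:
            rows = groups == g
            between[rows] = filled[rows].mean(axis=0) - grand  # group-mean offsets
        within = filled - grand - between                      # deviations from group means
        fitted = (grand + low_rank(between, rank_between)
                  + low_rank(within, rank_within))
        updated = np.where(missing, fitted, x)  # observed entries are never changed
        change = np.sum((updated - filled) ** 2)
        filled = updated
        if change < tol:
            break
    return filled

# Synthetic example: 4 groups of 30 rows, rank-1 between and within structure.
rng = np.random.default_rng(1)
groups = np.repeat(np.arange(4), 30)
v_between = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
v_within = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
group_scores = rng.normal(0.0, 2.0, size=4)
truth = (np.outer(group_scores[groups], v_between)
         + np.outer(rng.normal(size=120), v_within)
         + rng.normal(0.0, 0.1, size=(120, 6)))

mask = rng.random(truth.shape) < 0.1            # remove about 10% of the entries
data_missing = np.where(mask, np.nan, truth)
completed = multilevel_svd_impute(data_missing, groups)
mean_filled = np.where(mask, np.nanmean(data_missing, axis=0), data_missing)

rmse_multilevel = np.sqrt(np.mean((completed[mask] - truth[mask]) ** 2))
rmse_mean = np.sqrt(np.mean((mean_filled[mask] - truth[mask]) ** 2))
print(f"multilevel SVD RMSE: {rmse_multilevel:.3f}, column-mean RMSE: {rmse_mean:.3f}")
```

Splitting the fit into two low-rank parts lets the group-level structure and the individual-level structure use different numbers of dimensions, which a single SVD of the pooled data cannot do.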
This work is motivated by the French Obepine system for monitoring SARS-CoV-2 viral load in wastewater. The objective is to identify, from time series of noisy measurements, the underlying auto-regressive signals, in a context where the measurements contain numerous missing values, censored values, and outliers. We propose a method based on an auto-regressive model adapted to censored data with outliers. Inference and prediction are produced via a discretised smoother. The method is validated both on simulations and on real data from Obepine. It is used to denoise measurements from the quantification of the SARS-CoV-2 E gene in wastewater by RT-qPCR. The resulting smoothed signal correlates well with other epidemiological indicators, and an estimate of the noise of the whole system is produced.
One of the classic concerns in statistics is determining whether two samples come from the same population, i.e. homogeneity testing. In this paper, we propose a homogeneity test in the context of Functional Data Analysis, adopting an idea from multivariate data analysis: the data depth plot (DD-plot). The DD-plot is a generalization of the univariate Q-Q plot (quantile-quantile plot). We propose some statistics based on these DD-plots, and we use bootstrapping techniques to estimate their distributions. We estimate the finite-sample size and power of our test via simulation, obtaining better results than other homogeneity tests proposed in the literature. Finally, we illustrate the procedure on samples of real heterogeneous data and get consistent results.
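A DD-plot based homogeneity test of this kind can be sketched as follows. The abstract does not say which depth or which statistics are used, so the choices here are assumptions for illustration only: the Fraiman-Muniz integrated depth, the mean absolute distance of the DD-plot points from the diagonal as the statistic, and a null distribution obtained by resampling both samples from the pooled data. All function names are hypothetical.

```python
import numpy as np

def fm_depth(curves, reference):
    """Fraiman-Muniz integrated depth of each curve w.r.t. a reference sample."""
    # Pointwise empirical CDF of the reference sample, evaluated at each curve.
    ecdf = (reference[None, :, :] <= curves[:, None, :]).mean(axis=1)
    return (1.0 - np.abs(0.5 - ecdf)).mean(axis=1)

def dd_statistic(x, y):
    """Mean distance of the DD-plot points from the diagonal (assumed statistic)."""
    pooled = np.vstack([x, y])
    # Each pooled curve gives one DD-plot point (depth w.r.t. x, depth w.r.t. y).
    return np.mean(np.abs(fm_depth(pooled, x) - fm_depth(pooled, y)))

def dd_bootstrap_test(x, y, n_boot=500, seed=0):
    """Bootstrap p-value: resample both samples from the pooled curves."""
    rng = np.random.default_rng(seed)
    observed = dd_statistic(x, y)
    pooled = np.vstack([x, y])
    n = len(x)
    null = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, len(pooled), size=len(pooled))
        null[b] = dd_statistic(pooled[idx[:n]], pooled[idx[n:]])
    p_value = (1 + np.sum(null >= observed)) / (n_boot + 1)
    return observed, p_value

# Synthetic example: sinusoidal curves with a random level and pointwise noise.
rng = np.random.default_rng(42)
t = np.linspace(0.0, 1.0, 30)

def sample(n, shift):
    level = rng.normal(shift, 0.5, size=(n, 1))
    return np.sin(2 * np.pi * t) + level + 0.2 * rng.normal(size=(n, t.size))

stat_same, p_same = dd_bootstrap_test(sample(40, 0.0), sample(40, 0.0))
stat_shift, p_shift = dd_bootstrap_test(sample(40, 0.0), sample(40, 2.0))
print(f"same population:    statistic={stat_same:.3f}, p={p_same:.3f}")
print(f"shifted population: statistic={stat_shift:.3f}, p={p_shift:.3f}")
```

Under homogeneity the two depths of every pooled curve are close, so the DD-plot points concentrate near the diagonal and the statistic is small; a location shift pushes the points away from it.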
Understanding electrical energy demand at the consumer level plays an important role in planning the distribution of electrical networks and in offering off-peak tariffs, but observing individual consumption patterns is still expensive. On the other hand, aggregated load curves are normally available at the substation level. The proposed methodology separates substation aggregated loads into estimated mean consumption curves, called typical curves, incorporating information given by explanatory variables. In addition, a model-based clustering approach for substations is proposed based on the similarity of their consumers' typical curves and covariance structures. The methodology is applied to a real substation load monitoring dataset from the United Kingdom and tested in eight simulated scenarios.
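The separation of aggregated loads into typical curves can be illustrated with a deliberately simplified sketch. It assumes, beyond what the abstract states, that each substation's load is the sum over consumer types of the number of consumers of that type times the type's typical curve, plus noise, and that these counts are known. The pointwise least-squares estimator, the function name, and the synthetic curves are hypothetical; explanatory variables, covariance structures, and the clustering step are not modeled.

```python
import numpy as np

def estimate_typical_curves(loads, counts):
    """Least-squares typical curves from aggregated loads.

    loads:  (n_substations, n_times) aggregated load curves
    counts: (n_substations, n_types) number of consumers of each type
    Returns an (n_types, n_times) array with one typical curve per type.
    """
    curves, *_ = np.linalg.lstsq(counts, loads, rcond=None)
    return curves

# Synthetic example: two consumer types observed through 30 substations.
rng = np.random.default_rng(7)
hours = np.linspace(0.0, 24.0, 48, endpoint=False)
true_curves = np.vstack([
    0.5 + 0.4 * np.exp(-0.5 * ((hours - 19.0) / 2.0) ** 2),  # evening peak
    1.0 + 0.8 * np.exp(-0.5 * ((hours - 13.0) / 3.0) ** 2),  # midday peak
])
counts = rng.integers(50, 201, size=(30, 2)).astype(float)
loads = counts @ true_curves + rng.normal(0.0, 0.5, size=(30, 48))

estimated = estimate_typical_curves(loads, counts)
max_error = np.max(np.abs(estimated - true_curves))
print(f"largest pointwise error: {max_error:.4f}")
```

Because the same design matrix of counts is used at every time point, a single least-squares solve recovers all points of all typical curves at once.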