No Arabic abstract
Count data are collected in many scientific and engineering tasks including image processing, single-cell RNA sequencing and ecological studies. Such data sets often contain missing values, for example because some ecological sites cannot be reached in a certain year. In addition, in many instances, side information is also available, for example covariates about ecological sites or species. Low-rank methods are popular to denoise and impute count data, and benefit from a substantial theoretical background. Extensions accounting for covariates have been proposed, but to the best of our knowledge their theoretical and empirical properties have not been thoroughly studied, and few softwares are available for practitioners. We propose a complete methodology called LORI (Low-Rank Interaction), including a Poisson model, an algorithm, and automatic selection of the regularization parameter, to analyze count tables with covariates. We also derive an upper bound on the estimation error. We provide a simulation study with synthetic data, revealing empirically that LORI improves on state of the art methods in terms of estimation and imputation of the missing values. We illustrate how the method can be interpreted through visual displays with the analysis of a well-know plant abundance data set, and show that the LORI outputs are consistent with known results. Finally we demonstrate the relevance of the methodology by analyzing a water-birds abundance table from the French national agency for wildlife and hunting management (ONCFS). The method is available in the R package lori on the Comprehensive Archive Network (CRAN).
Canonical correlation analysis (CCA) is a classical and important multivariate technique for exploring the relationship between two sets of continuous variables. CCA has applications in many fields, such as genomics and neuroimaging. It can extract meaningful features as well as use these features for subsequent analysis. Although some sparse CCA methods have been developed to deal with high-dimensional problems, they are designed specifically for continuous data and do not consider the integer-valued data from next-generation sequencing platforms that exhibit very low counts for some important features. We propose a model-based probabilistic approach for correlation and canonical correlation estimation for two sparse count data sets (PSCCA). PSCCA demonstrates that correlations and canonical correlations estimated at the natural parameter level are more appropriate than traditional estimation methods applied to the raw data. We demonstrate through simulation studies that PSCCA outperforms other standard correlation approaches and sparse CCA approaches in estimating the true correlations and canonical correlations at the natural parameter level. We further apply the PSCCA method to study the association of miRNA and mRNA expression data sets from a squamous cell lung cancer study, finding that PSCCA can uncover a large number of strongly correlated pairs than standard correlation and other sparse CCA approaches.
In employing spatial regression models for counts, we usually meet two issues. First, ignoring the inherent collinearity between covariates and the spatial effect would lead to causal inferences. Second, real count data usually reveal over or under-dispersion where the classical Poisson model is not appropriate to use. We propose a flexible Bayesian hierarchical modeling approach by joining non-confounding spatial methodology and a newly reconsidered dispersed count modeling from the renewal theory to control the issues. Specifically, we extend the methodology for analyzing spatial count data based on the gamma distribution assumption for waiting times. The model can be formulated as a latent Gaussian model, and consequently, we can carry out the fast computation using the integrated nested Laplace approximation method. We also examine different popular approaches for handling spatial confounding and compare their performances in the presence of dispersion. We use the proposed methodology to analyze a clinical dataset related to stomach cancer incidence in Slovenia and perform a simulation study to understand the proposed approachs merits better.
Research on Poisson regression analysis for dependent data has been developed rapidly in the last decade. One of difficult problems in a multivariate case is how to construct a cross-correlation structure and at the meantime make sure that the covariance matrix is positive definite. To address the issue, we propose to use convolved Gaussian process (CGP) in this paper. The approach provides a semi-parametric model and offers a natural framework for modeling common mean structure and covariance structure simultaneously. The CGP enables the model to define different covariance structure for each component of the response variables. This flexibility ensures the model to cope with data coming from different resources or having different data structures, and thus to provide accurate estimation and prediction. In addition, the model is able to accommodate large-dimensional covariates. The definition of the model, the inference and the implementation, as well as its asymptotic properties, are discussed. Comprehensive numerical examples with both simulation studies and real data are presented.
Statistical agencies are often asked to produce small area estimates (SAEs) for positively skewed variables. When domain sample sizes are too small to support direct estimators, effects of skewness of the response variable can be large. As such, it is important to appropriately account for the distribution of the response variable given available auxiliary information. Motivated by this issue and in order to stabilize the skewness and achieve normality in the response variable, we propose an area-level log-measurement error model on the response variable. Then, under our proposed modeling framework, we derive an empirical Bayes (EB) predictor of positive small area quantities subject to the covariates containing measurement error. We propose a corresponding mean squared prediction error (MSPE) of EB predictor using both a jackknife and a bootstrap method. We show that the order of the bias is $O(m^{-1})$, where $m$ is the number of small areas. Finally, we investigate the performance of our methodology using both design-based and model-based simulation studies.
Rank data arises frequently in marketing, finance, organizational behavior, and psychology. Most analysis of rank data reported in the literature assumes the presence of one or more variables (sometimes latent) based on whose values the items are ranked. In this paper we analyze rank data using a purely probabilistic model where the observed ranks are assumed to be perturbe