No Arabic abstract
As with the advancement of geographical information systems, non-Gaussian spatial data sets are getting larger and more diverse. This study develops a general framework for fast and flexible non-Gaussian regression, especially for spatial/spatiotemporal modeling. The developed model, termed the compositionally-warped additive mixed model (CAMM), combines an additive mixed model (AMM) and the compositionally-warped Gaussian process to model a wide variety of non-Gaussian continuous data including spatial and other effects. A specific advantage of the proposed CAMM is that it requires no explicit assumption of data distribution unlike existing AMMs. Monte Carlo experiments show the estimation accuracy and computational efficiency of CAMM for modeling non-Gaussian data including fat-tailed and/or skewed distributions. Finally, the model is applied to crime data to examine the empirical performance of the regression analysis and prediction. The result shows that CAMM provides intuitively reasonable coefficient estimates and outperforms AMM in terms of prediction accuracy. CAMM is verified to be a fast and flexible model that potentially covers a wide variety of non-Gaussian data modeling. The proposed approach is implemented in an R package spmoran.
Generalized Gaussian processes (GGPs) are highly flexible models that combine latent GPs with potentially non-Gaussian likelihoods from the exponential family. GGPs can be used in a variety of settings, including GP classification, nonparametric count regression, modeling non-Gaussian spatial data, and analyzing point patterns. However, inference for GGPs can be analytically intractable, and large datasets pose computational challenges due to the inversion of the GP covariance matrix. We propose a Vecchia-Laplace approximation for GGPs, which combines a Laplace approximation to the non-Gaussian likelihood with a computationally efficient Vecchia approximation to the GP, resulting in a simple, general, scalable, and accurate methodology. We provide numerical studies and comparisons on simulated and real spatial data. Our methods are implemented in a freely available R package.
Clustering task of mixed data is a challenging problem. In a probabilistic framework, the main difficulty is due to a shortage of conventional distributions for such data. In this paper, we propose to achieve the mixed data clustering with a Gaussian copula mixture model, since copulas, and in particular the Gaussian ones, are powerful tools for easily modelling the distribution of multivariate variables. Indeed, considering a mixing of continuous, integer and ordinal variables (thus all having a cumulative distribution function), this copula mixture model defines intra-component dependencies similar to a Gaussian mixture, so with classical correlation meaning. Simultaneously, it preserves standard margins associated to continuous, integer and ordered features, namely the Gaussian, the Poisson and the ordered multinomial distributions. As an interesting by-product, the proposed mixture model generalizes many well-known ones and also provides tools of visualization based on the parameters. At a practical level, the Bayesian inference is retained and it is achieved with a Metropolis-within-Gibbs sampler. Experiments on simulated and real data sets finally illustrate the expected advantages of the proposed model for mixed data: flexible and meaningful parametrization combined with visualization features.
A rapid growth in spatial open datasets has led to a huge demand for regression approaches accommodating spatial and non-spatial effects in big data. Regression model selection is particularly important to stably estimate flexible regression models. However, conventional methods can be slow for large samples. Hence, we develop a fast and practical model-selection approach for spatial regression models, focusing on the selection of coefficient types that include constant, spatially varying, and non-spatially varying coefficients. A pre-processing approach, which replaces data matrices with small inner products through dimension reduction dramatically accelerates the computation speed of model selection. Numerical experiments show that our approach selects the model accurately and computationally efficiently, highlighting the importance of model selection in the spatial regression context. Then, the present approach is applied to open data to investigate local factors affecting crime in Japan. The results suggest that our approach is useful not only for selecting factors influencing crime risk but also for predicting crime events. This scalable model selection will be key to appropriately specifying flexible and large-scale spatial regression models in the era of big data. The developed model selection approach was implemented in the R package spmoran.
A multivariate distribution can be described by a triangular transport map from the target distribution to a simple reference distribution. We propose Bayesian nonparametric inference on the transport map by modeling its components using Gaussian processes. This enables regularization and accounting for uncertainty in the map estimation, while still resulting in a closed-form and invertible posterior map. We then focus on inferring the distribution of a nonstationary spatial field from a small number of replicates. We develop specific transport-map priors that are highly flexible and are motivated by the behavior of a large class of stochastic processes. Our approach is scalable to high-dimensional fields due to data-dependent sparsity and parallel computations. We also discuss extensions, including Dirichlet process mixtures for marginal non-Gaussianity. We present numerical results to demonstrate the accuracy, scalability, and usefulness of our methods, including statistical emulation of non-Gaussian climate-model output.
Functional principal component analysis (FPCA) could become invalid when data involve non-Gaussian features. Therefore, we aim to develop a general FPCA method to adapt to such non-Gaussian cases. A Kenalls $tau$ function, which possesses identical eigenfunctions as covariance function, is constructed. The particular formulation of Kendalls $tau$ function makes it less insensitive to data distribution. We further apply it to the estimation of FPCA and study the corresponding asymptotic consistency. Moreover, the effectiveness of the proposed method is demonstrated through a comprehensive simulation study and an application to the physical activity data collected by a wearable accelerometer monitor.