No Arabic abstract
This paper presents a new approach to a robust Gaussian process (GP) regression. Most existing approaches replace an outlier-prone Gaussian likelihood with a non-Gaussian likelihood induced from a heavy tail distribution, such as the Laplace distribution and Student-t distribution. However, the use of a non-Gaussian likelihood would incur the need for a computationally expensive Bayesian approximate computation in the posterior inferences. The proposed approach models an outlier as a noisy and biased observation of an unknown regression function, and accordingly, the likelihood contains bias terms to explain the degree of deviations from the regression function. We entail how the biases can be estimated accurately with other hyperparameters by a regularized maximum likelihood estimation. Conditioned on the bias estimates, the robust GP regression can be reduced to a standard GP regression problem with analytical forms of the predictive mean and variance estimates. Therefore, the proposed approach is simple and very computationally attractive. It also gives a very robust and accurate GP estimate for many tested scenarios. For the numerical evaluation, we perform a comprehensive simulation study to evaluate the proposed approach with the comparison to the existing robust GP approaches under various simulated scenarios of different outlier proportions and different noise levels. The approach is applied to data from two measurement systems, where the predictors are based on robust environmental parameter measurements and the response variables utilize more complex chemical sensing methods that contain a certain percentage of outliers. The utility of the measurement systems and value of the environmental data are improved through the computationally efficient GP regression and bias model.
The Gaussian process (GP) regression can be severely biased when the data are contaminated by outliers. This paper presents a new robust GP regression algorithm that iteratively trims the most extreme data points. While the new algorithm retains the attractive properties of the standard GP as a nonparametric and flexible regression method, it can greatly improve the model accuracy for contaminated data even in the presence of extreme or abundant outliers. It is also easier to implement compared with previous robust GP variants that rely on approximate inference. Applied to a wide range of experiments with different contamination levels, the proposed method significantly outperforms the standard GP and the popular robust GP variant with the Student-t likelihood in most test cases. In addition, as a practical example in the astrophysical study, we show that this method can precisely determine the main-sequence ridge line in the color-magnitude diagram of star clusters.
Gaussian process regression (GPR) model is a popular nonparametric regression model. In GPR, features of the regression function such as varying degrees of smoothness and periodicities are modeled through combining various covarinace kernels, which are supposed to model certain effects. The covariance kernels have unknown parameters which are estimated by the EM-algorithm or Markov Chain Monte Carlo. The estimated parameters are keys to the inference of the features of the regression functions, but identifiability of these parameters has not been investigated. In this paper, we prove identifiability of covariance kernel parameters in two radial basis mixed kernel GPR and radial basis and periodic mixed kernel GPR. We also provide some examples about non-identifiable cases in such mixed kernel GPRs.
We apply Gaussian process (GP) regression, which provides a powerful non-parametric probabilistic method of relating inputs to outputs, to survival data consisting of time-to-event and covariate measurements. In this context, the covariates are regarded as the `inputs and the event times are the `outputs. This allows for highly flexible inference of non-linear relationships between covariates and event times. Many existing methods, such as the ubiquitous Cox proportional hazards model, focus primarily on the hazard rate which is typically assumed to take some parametric or semi-parametric form. Our proposed model belongs to the class of accelerated failure time models where we focus on directly characterising the relationship between covariates and event times without any explicit assumptions on what form the hazard rates take. It is straightforward to include various types and combinations of censored and truncated observations. We apply our approach to both simulated and experimental data. We then apply multiple output GP regression, which can handle multiple potentially correlated outputs for each input, to competing risks survival data where multiple event types can occur. By tuning one of the model parameters we can control the extent to which the multiple outputs (the time-to-event for each risk) are dependent thus allowing the specification of correlated risks. Simulation studies suggest that in some cases assuming dependence can lead to more accurate predictions.
This paper presents a Gaussian process (GP) model for estimating piecewise continuous regression functions. In scientific and engineering applications of regression analysis, the underlying regression functions are piecewise continuous in that data follow different continuous regression models for different regions of the data with possible discontinuities between the regions. However, many conventional GP regression approaches are not designed for piecewise regression analysis. We propose a new GP modeling approach for estimating an unknown piecewise continuous regression function. The new GP model seeks for a local GP estimate of an unknown regression function at each test location, using local data neighboring to the test location. To accommodate the possibilities of the local data from different regions, the local data is partitioned into two sides by a local linear boundary, and only the local data belonging to the same side as the test location is used for the regression estimate. This local split works very well when the input regions are bounded by smooth boundaries, so the local linear approximation of the smooth boundaries works well. We estimate the local linear boundary jointly with the other hyperparameters of the GP model, using the maximum likelihood approach. Its computation time is as low as the local GPs time. The superior numerical performance of the proposed approach over the conventional GP modeling approaches is shown using various simulated piecewise regression functions.
Monge-Kantorovich distances, otherwise known as Wasserstein distances, have received a growing attention in statistics and machine learning as a powerful discrepancy measure for probability distributions. In this paper, we focus on forecasting a Gaussian process indexed by probability distributions. For this, we provide a family of positive definite kernels built using transportation based distances. We provide a probabilistic understanding of these kernels and characterize the corresponding stochastic processes. We prove that the Gaussian processes indexed by distributions corresponding to these kernels can be efficiently forecast, opening new perspectives in Gaussian process modeling.