Model uncertainty quantification is an essential component of effective data assimilation. Model errors associated with sub-grid scale processes are often represented through stochastic parameterizations of the unresolved process. Many existing stochastic parameterization schemes are only applicable when knowledge of the true sub-grid scale process or full observations of the coarse scale process are available, which is typically not the case in real applications. We present a methodology for estimating the statistics of sub-grid scale processes for the more realistic case in which only partial observations of the coarse scale process are available. Model error realizations are estimated over a training period by minimizing their conditional sum of squared deviations given some informative covariates (e.g., the state of the system), constrained by available observations and assuming that the observation errors are smaller than the model errors. From these realizations, a conditional probability distribution of additive model errors given these covariates is obtained, allowing for complex non-Gaussian error structures. Random draws from this density are then used in actual ensemble data assimilation experiments. We demonstrate the efficacy of the approach through numerical experiments with the multi-scale Lorenz 96 system using both small and large time scale separations between slow (coarse scale) and fast (fine scale) variables. The resulting error estimates and forecasts obtained with this new method are superior to those from two existing methods.
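As an illustration of the test bed named above, the following is a minimal sketch of the tendencies of the two-scale Lorenz 96 system in its standard textbook form. The parameter names and default values (F, h, b, c) are assumptions chosen for illustration and are not taken from the abstract. The returned coupling term is the sub-grid contribution that a coarse-scale-only model omits, which is one way to picture the additive model error whose statistics the abstract proposes to estimate.

```python
def l96_two_scale_tendency(X, Y, F=10.0, h=1.0, b=10.0, c=10.0):
    """Tendencies of the two-scale Lorenz 96 system (illustrative sketch).

    X: list of K slow (coarse scale) variables.
    Y: flat list of K*J fast (fine scale) variables; entries k*J .. (k+1)*J - 1
       are coupled to X[k]. Both X and Y are treated as cyclic.
    F, h, b, c are assumed illustrative values; c sets the time scale separation.
    """
    K = len(X)
    n = len(Y)
    J = n // K

    # Sub-grid forcing felt by each slow variable: -(h*c/b) * sum of its fast variables.
    coupling = [-(h * c / b) * sum(Y[k * J:(k + 1) * J]) for k in range(K)]

    # Slow variables: advection, damping, constant forcing, plus sub-grid coupling.
    dX = [
        -X[k - 1] * (X[k - 2] - X[(k + 1) % K]) - X[k] + F + coupling[k]
        for k in range(K)
    ]

    # Fast variables: faster advection and damping, forced by their slow variable.
    dY = [
        -c * b * Y[(i + 1) % n] * (Y[(i + 2) % n] - Y[i - 1])
        - c * Y[i]
        + (h * c / b) * X[i // J]
        for i in range(n)
    ]
    return dX, dY, coupling


if __name__ == "__main__":
    X = [10.0, 10.0, 10.0, 10.0]
    Y = [1.0] * 8
    dX, dY, coupling = l96_two_scale_tendency(X, Y)
    print("sub-grid term per slow variable:", coupling)
```

A coarse model that integrates only the slow equations without the coupling term would be missing exactly this quantity at each step, which is why its distribution conditional on the slow state is a natural target for a stochastic parameterization.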
Relative error approaches are of greater interest than absolute error approaches, such as least squares and least absolute deviation, when scale invariance of the output variable is required, for example when analyzing stock and survival data. An h-relative error estimation method based on the h-likelihood is developed to avoid heavy and intractable integration in a multiplicative regression model with random effects. Statistical properties of the estimators of the parameters and random effects in the model are studied. To estimate the parameters, we propose an h-relative error computation procedure. Numerical studies, including simulations and real data examples, show that the proposed method performs well.
The mixed-logit model is a flexible tool in transportation choice analysis, which provides valuable insights into inter- and intra-individual behavioural heterogeneity. However, applications of mixed-logit models are limited by the high computational and data requirements for model estimation. When estimating on small samples, the Bayesian estimation approach becomes vulnerable to over- and under-fitting. This is problematic for investigating the behaviour of specific population sub-groups or market segments with low data availability. Similar challenges arise when transferring an existing model to a new location or time period, e.g., when estimating post-pandemic travel behaviour. We propose an Early Stopping Bayesian Data Assimilation (ESBDA) simulator for estimating mixed-logit models, which combines a Bayesian statistical approach with machine learning methodologies. The aim is to improve the transferability of mixed-logit models and to enable the estimation of robust choice models with low data availability. This approach can provide new insights into choice behaviour where the traditional estimation of mixed-logit models was not possible due to low data availability, and open up new opportunities for investment and planning decision support. The ESBDA estimator is benchmarked against the Direct Application approach, a basic Bayesian simulator with random starting parameter values, and a Bayesian Data Assimilation (BDA) simulator without early stopping. The ESBDA approach is found to effectively overcome under- and over-fitting and non-convergence issues in simulation. Its resulting models clearly outperform those of the reference simulators in predictive accuracy. Furthermore, models estimated with ESBDA tend to be more robust, with significant parameters whose signs and values are consistent with behavioural theory, even when estimated on small samples.
A product relative error estimation method for the single-index regression model is proposed as an alternative to absolute error methods, such as least squares estimation and least absolute deviation estimation. It is scale invariant with respect to the outcome and the covariates in the model. Regression coefficients are estimated via a two-stage procedure, and their statistical properties, such as consistency and normality, are studied. Numerical studies, including simulations and a body fat example, show that the proposed method performs well.
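To make the idea of a product relative error criterion concrete, here is a minimal sketch under simplifying assumptions. It uses one common form of the criterion, the sum over observations of (y - m)^2 / (y m) = y/m + m/y - 2, where m is the fitted value. It also assumes a plain log-linear mean m = exp(a + b x) rather than a single-index model, and it uses a damped Newton iteration instead of the two-stage procedure mentioned in the abstract. All function names are hypothetical.

```python
import math


def lpre_loss(a, b, xs, ys):
    """Product relative error criterion for the assumed mean m = exp(a + b*x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        m = math.exp(a + b * x)
        total += y / m + m / y - 2.0  # equals (y - m)**2 / (y * m)
    return total


def fit_lpre(xs, ys, iters=50, tol=1e-12):
    """Minimize lpre_loss over (a, b) with a damped Newton iteration (sketch)."""
    n = len(xs)
    # Starting point: ordinary least squares on log(y).
    logs = [math.log(y) for y in ys]
    xbar, lbar = sum(xs) / n, sum(logs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sum((x - xbar) * (l - lbar) for x, l in zip(xs, logs)) / sxx
    a = lbar - b * xbar

    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            m = math.exp(a + b * x)
            u = m / y - y / m  # first derivative w.r.t. the linear predictor
            w = m / y + y / m  # second derivative, always positive (convex loss)
            g0 += u
            g1 += u * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det  # Newton direction: inverse Hessian times gradient
        db = (h00 * g1 - h01 * g0) / det
        step, f0 = 1.0, lpre_loss(a, b, xs, ys)
        # Halve the step until the loss no longer increases.
        while lpre_loss(a - step * da, b - step * db, xs, ys) > f0 and step > 1e-8:
            step *= 0.5
        a -= step * da
        b -= step * db
        if abs(da) + abs(db) < tol:
            break
    return a, b


if __name__ == "__main__":
    xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
    noise = [1.1, 0.9, 1.2, 0.85, 1.05, 0.95]  # made-up multiplicative errors
    ys = [math.exp(0.5 + 1.5 * x) * e for x, e in zip(xs, noise)]
    print(fit_lpre(xs, ys))
    print(fit_lpre(xs, [100.0 * y for y in ys]))  # rescaled outcome
```

Because the criterion depends on y and m only through their ratio, rescaling the outcome by a constant shifts the fitted intercept by the log of that constant and leaves the slope unchanged, which is the scale invariance for the outcome that the abstract refers to.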
Generalized Gaussian processes (GGPs) are highly flexible models that combine latent GPs with potentially non-Gaussian likelihoods from the exponential family. GGPs can be used in a variety of settings, including GP classification, nonparametric count regression, modeling non-Gaussian spatial data, and analyzing point patterns. However, inference for GGPs can be analytically intractable, and large datasets pose computational challenges due to the inversion of the GP covariance matrix. We propose a Vecchia-Laplace approximation for GGPs, which combines a Laplace approximation to the non-Gaussian likelihood with a computationally efficient Vecchia approximation to the GP, resulting in a simple, general, scalable, and accurate methodology. We provide numerical studies and comparisons on simulated and real spatial data. Our methods are implemented in a freely available R package.
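The Laplace step described above can be sketched for one concrete case. The example below assumes a Poisson likelihood with a log link and a dense GP covariance matrix K, and it finds the posterior mode of the latent vector by a damped Newton iteration. It deliberately omits the Vecchia approximation, which is what would make the method scalable, so it illustrates only the Laplace part. The function name, kernel, and data are hypothetical and do not reflect the API of the R package mentioned in the abstract.

```python
import numpy as np


def laplace_poisson(K, y, max_iter=100, tol=1e-10):
    """Laplace approximation for y_i ~ Poisson(exp(f_i)), f ~ N(0, K) (dense sketch).

    Returns the posterior mode of f and the approximate posterior covariance
    (K^{-1} + W)^{-1}, where W = diag(exp(f)) evaluated at the mode.
    """
    n = len(y)
    eye = np.eye(n)
    a = np.zeros(n)  # parametrize f = K @ a so that K is never inverted
    f = K @ a

    def objective(a_vec, f_vec):
        # Log posterior up to a constant: Poisson log-likelihood plus Gaussian prior term.
        return np.sum(y * f_vec - np.exp(f_vec)) - 0.5 * a_vec @ f_vec

    for _ in range(max_iter):
        W = np.exp(f)   # negative second derivative of the log-likelihood
        grad = y - W    # first derivative of the log-likelihood
        # Full Newton step expressed in terms of a: (I + W K) a_new = W f + grad.
        a_full = np.linalg.solve(eye + W[:, None] * K, W * f + grad)
        step, obj0 = 1.0, objective(a, f)
        while step > 1e-8:
            a_try = a + step * (a_full - a)
            f_try = K @ a_try
            if objective(a_try, f_try) >= obj0 - 1e-12:
                break
            step *= 0.5  # damp the step if the objective decreased
        change = np.max(np.abs(f_try - f))
        a, f = a_try, f_try
        if change < tol:
            break

    W = np.exp(f)
    cov = np.linalg.solve(eye + K * W[None, :], K)  # (I + K W)^{-1} K
    return f, cov


if __name__ == "__main__":
    x = np.linspace(0.0, 1.0, 8)
    # Squared-exponential kernel with an assumed length scale of 0.3, plus jitter.
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.3) ** 2) + 1e-8 * np.eye(8)
    y = np.array([0, 1, 3, 2, 5, 4, 1, 0], dtype=float)
    mode, cov = laplace_poisson(K, y)
    print(np.round(np.exp(mode), 2))  # fitted Poisson intensities
```

At the mode the latent vector satisfies f = K (y - exp(f)), which gives a simple check that the iteration has converged. The dense solves cost on the order of n^3 operations, which is the bottleneck a sparse approximation to the GP is meant to remove.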
Conditional density estimation (density regression) estimates the distribution of a response variable y conditional on covariates x. Utilizing a partition model framework, a conditional density estimation method is proposed using logistic Gaussian processes. The partition is created using a Voronoi tessellation and is learned from the data using a reversible jump Markov chain Monte Carlo algorithm. The Markov chain Monte Carlo algorithm is made possible through a Laplace approximation on the latent variables of the logistic Gaussian process model. This approximation marginalizes the parameters in each partition element, allowing an efficient search of the posterior distribution of the tessellation. The method has desirable consistency properties. In simulation and applications, the model successfully estimates the partition structure and conditional distribution of y.
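The partition idea can be illustrated with a small sketch. A Voronoi tessellation assigns each covariate vector x to its nearest center, and the responses y that fall in the same cell are then modelled together. The code below assumes the centers are fixed and known, whereas the abstract learns them by reversible jump Markov chain Monte Carlo, and it only groups the responses per cell, whereas the abstract fits a logistic Gaussian process density in each cell. The function names and data are hypothetical.

```python
def voronoi_labels(points, centers):
    """Index of the nearest center (squared Euclidean distance) for each point."""
    labels = []
    for p in points:
        d2 = [sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centers]
        labels.append(d2.index(min(d2)))  # ties go to the lowest index
    return labels


def group_responses(points, ys, centers):
    """Collect the responses y that fall in each Voronoi cell."""
    cells = {i: [] for i in range(len(centers))}
    for label, y in zip(voronoi_labels(points, centers), ys):
        cells[label].append(y)
    return cells


if __name__ == "__main__":
    centers = [(0.0, 0.0), (1.0, 1.0)]  # assumed, fixed cell centers
    points = [(0.1, 0.2), (0.9, 0.8), (0.4, 0.4), (0.6, 0.7)]
    ys = [1.2, 3.4, 0.8, 2.9]
    print(voronoi_labels(points, centers))
    print(group_responses(points, ys, centers))
```

Within each cell the conditional density of y is treated as not depending on x, so the estimate of p(y | x) for a new x is the density fitted in the cell that contains x.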