This work proposes a nonparametric method to compare the underlying mean functions of two noisy datasets. The motivation stems from an application of comparing wind turbine power curves. Comparing wind turbine data presents new problems, namely the need to identify the regions of difference in the input space and to quantify the extent of difference that is statistically significant. Our proposed method, referred to as funGP, estimates the underlying functions for different data samples using Gaussian process models. We build a confidence band using the probability law of the estimated function differences under the null hypothesis. The confidence band is then used both for the hypothesis test and for identifying the regions of difference. This identification of difference regions is a distinct feature, as existing methods tend to conduct only an overall hypothesis test of whether two functions differ. Understanding the difference regions can lead to further practical insights and help devise better control and maintenance strategies for wind turbines. The merit of funGP is demonstrated using three simulation studies and four real wind turbine datasets.
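As a hedged illustration of this construction (not the paper's exact recipe), the sketch below fits two independent Gaussian process regressions with scikit-learn, forms the pointwise difference of posterior means, and flags input regions where a band around zero is violated. The kernel, the synthetic data, and the pointwise (rather than simultaneous) 95% band are all assumptions made here.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 80).reshape(-1, 1)
x2 = rng.uniform(0, 10, 80).reshape(-1, 1)

def truth(x):
    return np.sin(x).ravel()

y1 = truth(x1) + 0.1 * rng.standard_normal(80)
y2 = truth(x2) + 0.3 * (x2.ravel() > 5) + 0.1 * rng.standard_normal(80)  # differs on (5, 10]

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
gp1 = GaussianProcessRegressor(kernel=kernel).fit(x1, y1)
gp2 = GaussianProcessRegressor(kernel=kernel).fit(x2, y2)

grid = np.linspace(0, 10, 200).reshape(-1, 1)
m1, s1 = gp1.predict(grid, return_std=True)
m2, s2 = gp2.predict(grid, return_std=True)

diff = m1 - m2
band = 1.96 * np.sqrt(s1**2 + s2**2)  # pointwise 95% band under H0: f1 = f2
print("flagged difference region:", grid[np.abs(diff) > band].ravel())
```

On this synthetic example the flagged region should roughly recover the interval where the second sample was shifted.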
Infrastructure recovery time estimation is critical to disaster management and planning. Inspired by recent resilience planning initiatives, we consider a situation where experts are asked to estimate the time for different infrastructure systems to recover to certain functionality levels after a scenario hazard event. We propose a methodological framework that uses expert-elicited data to estimate the expected recovery time curve of a particular infrastructure system. The framework uses Gaussian process regression (GPR) to capture the experts' estimation uncertainty and to satisfy known physical constraints of recovery processes. It is designed to balance the data collection cost of expert elicitation against the prediction accuracy of GPR. We evaluate the framework on realistically simulated expert-elicited data for two case study events: the 1995 Great Hanshin-Awaji Earthquake and the 2011 Great East Japan Earthquake.
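The sketch below is one plausible reading of that pipeline, not the paper's framework: GPR on synthetic expert-elicited (time, functionality) pairs, with the physical constraint that recovery never decreases imposed crudely by an isotonic projection of the posterior mean. The elicitation format, kernel, and constraint handling are all assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)
# Hypothetical elicitation: 5 experts each give the weeks needed for one
# infrastructure system to reach 25%, 50%, 75%, and 100% functionality.
levels = np.tile([0.25, 0.50, 0.75, 1.00], 5)
weeks = np.array([2.0, 6.0, 12.0, 30.0]) * rng.lognormal(0.0, 0.3, (5, 4))
t, y = weeks.ravel().reshape(-1, 1), levels

gp = GaussianProcessRegressor(
    kernel=Matern(length_scale=10.0, nu=2.5) + WhiteKernel(noise_level=0.01)
).fit(t, y)

grid = np.linspace(0, 60, 200)
mean = gp.predict(grid.reshape(-1, 1))
# Project onto non-decreasing curves bounded in [0, 1]: a simple stand-in
# for the paper's physical-constraint handling.
recovery_curve = IsotonicRegression(y_min=0.0, y_max=1.0).fit_transform(grid, mean)
```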
Process data, temporally ordered categorical observations, are of recent interest due to their increasing abundance and the desire to extract useful information from them. A process is a collection of time-stamped events of different types, recording how an individual behaves in a given time period. Process data are too complex, in terms of size and irregularity, for classical psychometric models to be applicable, at least directly, so it is desirable to develop new ways of modeling and analysis. We introduce herein a latent theme dictionary model (LTDM) for processes that identifies co-occurrent event patterns and individuals with similar behavioral patterns. Theoretical properties are established under certain regularity conditions for likelihood-based estimation and inference. A nonparametric Bayes LTDM algorithm using Markov chain Monte Carlo is proposed for computation. Simulation studies show that the proposed approach performs well in a range of situations. The proposed method is applied to an item in the 2012 Programme for International Student Assessment, with interpretable findings.
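To make the data structure concrete, here is a toy illustration of process data as time-stamped event sequences, with adjacent-pair counts as a crude stand-in for the co-occurrent patterns an LTDM dictionary would learn. The event names are invented, and nothing below implements the actual model or its MCMC computation.

```python
from collections import Counter

# Each respondent's process: a list of (timestamp, event type) pairs.
processes = {
    "resp_1": [(0.0, "start"), (2.1, "click_A"), (3.4, "click_B"), (9.0, "submit")],
    "resp_2": [(0.0, "start"), (1.2, "click_B"), (1.9, "click_A"), (7.5, "submit")],
    "resp_3": [(0.0, "start"), (4.0, "click_A"), (5.5, "click_B"), (8.8, "submit")],
}

bigrams = Counter()
for events in processes.values():
    types = [e for _, e in sorted(events)]  # order events by timestamp
    bigrams.update(zip(types, types[1:]))   # count adjacent event pairs

# Frequent pairs hint at the co-occurrent patterns a dictionary model targets.
print(bigrams.most_common(3))
```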
Functional data are defined as realizations of random functions (mostly smooth) varying over a continuum, usually collected with measurement errors on discretized grids. To accurately smooth noisy functional observations and deal with high-dimensional observation grids, we propose a novel Bayesian method based on a Bayesian hierarchical model with a Gaussian-Wishart process prior and basis function representations. We first derive an induced model for the basis-function coefficients of the functional data, and then use this model to conduct posterior inference through Markov chain Monte Carlo. Compared to standard Bayesian inference, which suffers from a serious computational burden and instability when analyzing high-dimensional functional data, our method greatly improves computational scalability and stability, while retaining the advantage of simultaneously smoothing raw observations and estimating the mean and covariance functions nonparametrically. In addition, our method naturally handles functional data observed on random or non-common grids. Simulation and real-data studies demonstrate that our method produces results similar to those of standard Bayesian inference on low-dimensional common grids, while efficiently smoothing and estimating functional data on random, high-dimensional observation grids where standard Bayesian inference fails. In conclusion, our method can efficiently smooth and estimate high-dimensional functional data, offering one way to mitigate the curse of dimensionality in Bayesian functional data analysis with Gaussian-Wishart processes.
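A minimal sketch of the basis-representation step described above, assuming a small Fourier basis chosen here purely for simplicity: each noisy curve on a 500-point grid is reduced to a handful of coefficients by least squares, which is the low-dimensional object the Gaussian-Wishart prior and MCMC would then act on (that inference step is not reproduced).

```python
import numpy as np

rng = np.random.default_rng(2)
grid = np.linspace(0, 1, 500)  # high-dimensional observation grid
# Assumed basis: intercept plus 3 sine/cosine pairs (7 functions total).
B = np.column_stack(
    [np.ones_like(grid)]
    + [np.sin(2 * np.pi * k * grid) for k in range(1, 4)]
    + [np.cos(2 * np.pi * k * grid) for k in range(1, 4)]
)

# 30 noisy realizations of a smooth random function.
curves = np.sin(2 * np.pi * grid) + 0.2 * rng.standard_normal((30, grid.size))

coefs = np.linalg.lstsq(B, curves.T, rcond=None)[0]  # 7 coefficients per curve
smoothed = (B @ coefs).T                             # denoised curves on the grid
```

Working with the 7-dimensional coefficients instead of the 500-dimensional grid is what buys the computational scalability the abstract claims.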
Copulas provide a modular parameterization of multivariate distributions that decouples the modeling of marginals from the dependencies between them. The Gaussian Mixture Copula Model (GMCM) is a highly flexible copula that can capture many kinds of multi-modal dependencies, as well as asymmetric and tail dependencies. It has been effectively used for clustering non-Gaussian data and in Reproducibility Analysis, a meta-analysis method designed to verify the reliability and consistency of multiple high-throughput experiments. Parameter estimation for GMCM is challenging due to its intractable likelihood. The best previous methods maximize a proxy-likelihood through a Pseudo Expectation Maximization (PEM) algorithm, with no guarantee of convergence, let alone convergence to the correct parameters. In this paper, we use Automatic Differentiation (AD) tools to develop a method, called AD-GMCM, that maximizes the exact GMCM likelihood. In our simulation studies and experiments with real data, AD-GMCM finds more accurate parameter estimates than PEM and yields better performance in clustering and Reproducibility Analysis. We discuss the advantages of an AD-based approach for addressing problems related to the monotonic increase of the likelihood and parameter identifiability in GMCM. We also analyze, for GMCM, two well-known cases of maximum-likelihood degeneracy in Gaussian mixture models (GMMs) that can lead to spurious clustering solutions. Our analysis shows that, unlike GMM, GMCM is not affected in one of the cases.
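The key pattern here is feeding an exact (if messy) log-likelihood to an automatic-differentiation optimizer instead of a hand-derived EM proxy. As a hedged illustration of that pattern only, the sketch below runs AD-based maximum likelihood for a plain one-dimensional two-component GMM in PyTorch; the GMCM likelihood itself (which involves GMM marginal quantiles) is not shown.

```python
import torch

torch.manual_seed(0)
data = torch.cat([torch.randn(200) - 2.0, torch.randn(200) + 2.0])

# Unconstrained parameterization so plain gradient steps stay valid.
mu = torch.tensor([-1.0, 1.0], requires_grad=True)
log_sigma = torch.zeros(2, requires_grad=True)
logit_pi = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma, logit_pi], lr=0.05)

for _ in range(500):
    opt.zero_grad()
    log_pi = torch.log_softmax(logit_pi, dim=0)
    comp = torch.distributions.Normal(mu, log_sigma.exp())
    # Exact mixture log-likelihood; AD supplies exact gradients of it.
    log_lik = torch.logsumexp(log_pi + comp.log_prob(data.unsqueeze(1)), dim=1).sum()
    (-log_lik).backward()
    opt.step()

print(mu.detach(), log_sigma.exp().detach())
```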
Convolutional dictionary learning (CDL), the problem of estimating shift-invariant templates from data, is typically conducted without a prior or structure on the templates. In data-scarce or low signal-to-noise ratio (SNR) regimes, which have received little attention from the community, learned templates overfit the data and lack smoothness, which can degrade the predictive performance of downstream tasks. To address this limitation, we propose GPCDL, a convolutional dictionary learning framework that enforces priors on templates using Gaussian processes (GPs). Focusing on smoothness, we show theoretically that imposing a GP prior is equivalent to Wiener filtering the learned templates, thereby suppressing high-frequency components and promoting smoothness. The algorithm is a simple extension of classical iteratively reweighted least squares, which gives the flexibility to experiment with different smoothness assumptions. Through simulation, we show that GPCDL learns smooth dictionaries with better accuracy than the unregularized alternative across a range of SNRs. Through an application to neural spiking data from rats, we show that GPCDL yields a more accurate and visually interpretable smooth dictionary, leading to superior predictive performance compared to unregularized CDL, as well as to parametric alternatives.
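The stated equivalence between a GP prior and Wiener filtering can be illustrated directly. A minimal sketch, assuming an invented GP prior spectral density and noise level applied to one noisy template: the frequency-domain gain S/(S + sigma^2) shrinks high frequencies, which is exactly the smoothing effect described above (this is generic Wiener filtering, not the GPCDL algorithm itself).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 128
t = np.arange(n)
template = np.exp(-0.5 * ((t - 64) / 8.0) ** 2)  # smooth ground-truth template
noisy = template + 0.1 * rng.standard_normal(n)  # stand-in for an unregularized estimate

freqs = np.fft.rfftfreq(n)
prior_psd = np.exp(-((freqs / 0.05) ** 2))  # assumed spectral density of the GP prior
noise_psd = 0.01                            # assumed noise level
gain = prior_psd / (prior_psd + noise_psd)  # Wiener gain: S / (S + sigma^2)

smoothed = np.fft.irfft(gain * np.fft.rfft(noisy), n=n)
```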