In regression models, predictor variables with inherent ordering, such as tumor staging and ECOG performance status, are common in medical settings. Statistically, it may be difficult to determine the functional form of an ordinal predictor variable. Often, such a variable is dichotomized based on whether it is above or below a certain cutoff. Other methods conveniently treat the ordinal predictor as a continuous variable and assume a linear relationship with the outcome. However, arbitrarily choosing a method may lead to inaccurate inference and treatment. In this paper, we propose a Bayesian mixture model to simultaneously assess the appropriate form of the predictor in regression models by considering the presence of a changepoint through the lens of a threshold detection problem. Because the mixture model framework considers both dichotomous and linear forms for the variable, the estimate is a weighted average of linear and binary parameterizations. This method is applicable to continuous, binary, and survival outcomes, and is easily amenable to penalized regression. We evaluate the proposed method using simulation studies and apply it to two real datasets. We provide JAGS code for easy implementation.
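The idea of an estimate that is a weighted average of linear and binary parameterizations can be illustrated with a minimal Python sketch. This is not the authors' JAGS model: the function names, the fixed mixture weight, and the coefficient values are all hypothetical, and in the Bayesian model these quantities would be estimated from data rather than supplied by hand.

```python
def linear_effect(x, beta):
    """Effect of an ordinal predictor treated as continuous (linear form)."""
    return beta * x


def dichotomous_effect(x, cutoff, delta):
    """Effect of the same predictor dichotomized at a cutoff (binary form)."""
    return delta if x > cutoff else 0.0


def mixture_effect(x, weight_linear, beta, cutoff, delta):
    """Weighted average of the linear and dichotomous parameterizations.

    weight_linear is a hypothetical mixture weight in [0, 1]; a value near 1
    favors the linear form, a value near 0 favors the dichotomous form.
    """
    return (weight_linear * linear_effect(x, beta)
            + (1.0 - weight_linear) * dichotomous_effect(x, cutoff, delta))


# Illustrative values only: a 4-level ordinal predictor, cutoff between 2 and 3.
effects = [mixture_effect(x, 0.25, 0.5, 2, 2.0) for x in (1, 2, 3, 4)]
```

With a weight of 0.25 on the linear form, the combined effect still rises with each level but shows most of its jump at the cutoff, which is the kind of intermediate shape that neither pure parameterization can represent.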
In some contexts, mixture models can fit certain variables well at the expense of others in ways beyond the analyst's control. For example, when the data include some variables with non-trivial amounts of missing values, the mixture model may fit the marginal distributions of the nearly and fully complete variables at the expense of the variables with high fractions of missing data. Motivated by this setting, we present a mixture model for mixed ordinal and nominal data that splits variables into two groups, focus variables and remainder variables. The model allows the analyst to specify a rich sub-model for the focus variables and a simpler sub-model for remainder variables, yet still capture associations among the variables. Using simulations, we illustrate advantages and limitations of focused clustering compared to mixture models that do not distinguish variables. We apply the model to handle missing values in an analysis of the 2012 American National Election Study, estimating relationships among voting behavior, ideology, and political party affiliation.
While there have been many recent developments in Bayesian model selection and variable selection for high-dimensional linear models, little work in the literature addresses the presence of a change point, unlike in the frequentist counterpart. We consider a hierarchical Bayesian linear model where the active set of covariates that affects the observations through a mean model can vary between different time segments. Such structure may arise in the social and economic sciences, for example sudden changes in house prices driven by external economic factors, or changes in crime rates driven by social and built-environment factors. Using an appropriate adaptive prior, we outline the development of a hierarchical Bayesian methodology that can select the true change point as well as the true covariates with high probability. We provide the first detailed theoretical analysis of posterior consistency with or without covariates, under suitable conditions. Gibbs sampling techniques provide an efficient computational strategy. We also present a small-sample simulation study and an application to crime forecasting.
We propose a new method for changepoint estimation in partially-observed, high-dimensional time series that undergo a simultaneous change in mean in a sparse subset of coordinates. Our first methodological contribution is to introduce a MissCUSUM transformation (a generalisation of the popular Cumulative Sum statistics), that captures the interaction between the signal strength and the level of missingness in each coordinate. In order to borrow strength across the coordinates, we propose to project these MissCUSUM statistics along a direction found as the solution to a penalised optimisation problem tailored to the specific sparsity structure. The changepoint can then be estimated as the location of the peak of the absolute value of the projected univariate series. In a model that allows different missingness probabilities in different component series, we identify that the key interaction between the missingness and the signal is a weighted sum of squares of the signal change in each coordinate, with weights given by the observation probabilities. More specifically, we prove that the angle between the estimated and oracle projection directions, as well as the changepoint location error, are controlled with high probability by the sum of two terms, both involving this weighted sum of squares, and representing the error incurred due to noise and the error due to missingness respectively. A lower bound confirms that our changepoint estimator, which we call MissInspect, is optimal up to a logarithmic factor. The striking effectiveness of the MissInspect methodology is further demonstrated both on simulated data, and on an oceanographic data set covering the Neogene period.
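For orientation, the classical CUSUM transformation that MissCUSUM generalises can be sketched in a few lines of Python. This is only the standard fully observed, univariate version: it does not handle missingness and it omits the sparse projection step, both of which are central to the method described above. The function names are illustrative assumptions.

```python
import math


def cusum(series):
    """Classical CUSUM statistics T_1, ..., T_{n-1} for a change in mean.

    T_t = sqrt(t * (n - t) / n) * (mean of x[t:] - mean of x[:t]).
    """
    n = len(series)
    total = sum(series)
    left_sum = 0.0
    stats = []
    for t in range(1, n):
        left_sum += series[t - 1]
        mean_left = left_sum / t
        mean_right = (total - left_sum) / (n - t)
        stats.append(math.sqrt(t * (n - t) / n) * (mean_right - mean_left))
    return stats


def estimate_changepoint(series):
    """Return the t in 1..n-1 that maximises |T_t| (peak of the absolute CUSUM)."""
    stats = cusum(series)
    best_index = max(range(len(stats)), key=lambda i: abs(stats[i]))
    return best_index + 1  # stats[0] corresponds to t = 1


# Toy series whose mean shifts after the fourth observation.
toy_series = [0, 0, 0, 0, 5, 5, 5, 5, 5, 5]
estimated = estimate_changepoint(toy_series)
```

The estimate is the number of observations in the pre-change segment, mirroring the description of the changepoint as the location of the peak of the absolute value of a univariate series.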
The use of a finite mixture of normal distributions in model-based clustering makes it possible to capture non-Gaussian data clusters. However, identifying the clusters from the normal components is challenging and is in general achieved either by imposing constraints on the model or by using post-processing procedures. Within the Bayesian framework, we propose a different approach based on sparse finite mixtures to achieve identifiability. We specify a hierarchical prior whose hyperparameters are carefully selected to reflect the cluster structure aimed at. In addition, this prior allows the model to be estimated using standard MCMC sampling methods. In combination with a post-processing approach that resolves the label switching issue and results in an identified model, our approach makes it possible to simultaneously (1) determine the number of clusters, (2) flexibly approximate the cluster distributions in a semi-parametric way using finite mixtures of normals, and (3) identify cluster-specific parameters and classify observations. The proposed approach is illustrated in two simulation studies and on benchmark data sets.
Diffusion tensor imaging (DTI) is a popular magnetic resonance imaging technique used to characterize microstructural changes in the brain. DTI studies quantify the diffusion of water molecules in a voxel using an estimated 3x3 symmetric positive definite diffusion tensor matrix. Statistical analysis of DTI data is challenging because the data are positive definite matrices. Matrix-variate information is often summarized by a univariate quantity, such as the fractional anisotropy (FA), leading to a loss of information. Furthermore, DTI analyses often ignore the spatial association of neighboring voxels, which can lead to imprecise estimates. Although the spatial modeling literature is abundant, modeling spatially dependent positive definite matrices is challenging. To mitigate these issues, we propose a matrix-variate Bayesian semiparametric mixture model, where the positive definite matrices are distributed as a mixture of inverse Wishart distributions with the spatial dependence captured by a Markov model for the mixture component labels. Conjugacy and the double Metropolis-Hastings algorithm result in fast and elegant Bayesian computing. Our simulation study shows that the proposed method is more powerful than non-spatial methods. We also apply the proposed method to investigate the effect of cocaine use on brain structure. The contribution of our work is to provide a novel statistical inference tool for DTI analysis by extending spatial statistics to matrix-variate data.
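The loss of information from summarizing a diffusion tensor by its fractional anisotropy can be seen in a small Python sketch. It uses the standard FA definition in terms of the three tensor eigenvalues; the function name and the example eigenvalues are illustrative assumptions, and the sketch is unrelated to the proposed mixture model itself.

```python
import math


def fractional_anisotropy(eigenvalues):
    """Standard FA of a 3x3 symmetric positive definite tensor.

    FA = sqrt(3/2) * sqrt(sum((l_i - mean)^2)) / sqrt(sum(l_i^2)),
    where l_1, l_2, l_3 are the tensor's eigenvalues.
    """
    l1, l2, l3 = eigenvalues
    mean = (l1 + l2 + l3) / 3.0
    spread = (l1 - mean) ** 2 + (l2 - mean) ** 2 + (l3 - mean) ** 2
    norm_sq = l1 ** 2 + l2 ** 2 + l3 ** 2
    if norm_sq == 0:
        return 0.0  # degenerate zero tensor; FA is undefined, return 0 by convention here
    return math.sqrt(1.5 * spread / norm_sq)


# Two tensors with different overall diffusivity but the same shape.
fa_small = fractional_anisotropy((1.0, 0.5, 0.5))
fa_large = fractional_anisotropy((2.0, 1.0, 1.0))
```

Because FA is invariant to rescaling the eigenvalues, and also ignores the orientation of the eigenvectors, tensors that differ in magnitude or direction can share the same FA value.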