We propose a latent topic model with a Markovian transition for process data, which consist of time-stamped events recorded in a log file. Such data are becoming more widely available in computer-based educational assessment with complex problem-solving items. The proposed model can be viewed as an extension of the hierarchical Bayesian topic model with a hidden Markov structure to accommodate the underlying evolution of an examinee's latent state. Using topic transition probabilities along with response times enables us to capture examinees' learning trajectories, making clustering/classification more efficient. A forward-backward variational expectation-maximization (FB-VEM) algorithm is developed to tackle the challenging computational problem. Useful theoretical properties are established under certain asymptotic regimes. The proposed method is applied to a complex problem-solving item in the 2012 Programme for International Student Assessment (PISA 2012).
Motivated by modeling and analysis of mass-spectrometry data, a semi- and nonparametric model is proposed that consists of a linear parametric component for individual location and scale and a nonparametric regression function for the common shape. A multi-step approach is developed that simultaneously estimates the parametric components and the nonparametric function. Under certain regularity conditions, it is shown that the resulting estimators are consistent and asymptotically normal for the parametric part and achieve the optimal rate of convergence for the nonparametric part when the bandwidth is suitably chosen. Simulation results are presented to demonstrate the effectiveness and finite-sample performance of the method. The method is also applied to a SELDI-TOF mass spectrometry data set from a study of liver cancer patients.
Process data, temporally ordered categorical observations, are of recent interest due to their increasing abundance and the desire to extract useful information from them. A process is a collection of time-stamped events of different types, recording how an individual behaves in a given time period. Process data are too complex in terms of size and irregularity for the classical psychometric models to be applicable, at least directly; consequently, it is desirable to develop new ways for modeling and analysis. We introduce herein a latent theme dictionary model (LTDM) for processes that identifies co-occurrent event patterns and individuals with similar behavioral patterns. Theoretical properties are established under certain regularity conditions for the likelihood-based estimation and inference. A non-parametric Bayes LTDM algorithm using the Markov chain Monte Carlo method is proposed for computation. Simulation studies show that the proposed approach performs well in a range of situations. The proposed method is applied to an item in the 2012 Programme for International Student Assessment with interpretable findings.
Measuring veracity or reliability of noisy data is of utmost importance, especially in scenarios where the information is gathered through automated systems. In a recent paper, Chakraborty et al. (2019) introduced a veracity scoring technique for geostatistical data. The authors used high-quality "reference" data to measure the veracity of the varying-quality observations and incorporated the veracity scores in their analysis of mobile-sensor generated noisy weather data to generate efficient predictions of the ambient temperature process. In this paper, we consider the scenario when no reference data is available and hence the veracity scores (referred to as VS) are defined based on "local" summaries of the observations. We develop a VS-based estimation method for parameters of a spatial regression model. Under a non-stationary noise structure and fairly general assumptions on the underlying spatial process, we show that the VS-based estimators of the regression parameters are consistent. Moreover, we establish the advantage of the VS-based estimators as compared to the ordinary least squares (OLS) estimator by analyzing their asymptotic mean squared errors. We illustrate the merits of the VS-based technique through simulations and apply the methodology to a real data set on mass percentages of ash in coal seams in Pennsylvania.
Statistical models with latent structure have a history going back to the 1950s and have seen widespread use in the social sciences and, more recently, in computational biology and in machine learning. Here we study the basic latent class model proposed originally by the sociologist Paul F. Lazarsfeld for categorical variables, and we explain its geometric structure. We draw parallels between the statistical and geometric properties of latent class models and we illustrate geometrically the causes of many problems associated with maximum likelihood estimation and related statistical inference. In particular, we focus on issues of non-identifiability and determination of the model dimension, of maximization of the likelihood function and on the effect of symmetric data. We illustrate these phenomena with a variety of synthetic and real-life tables, of different dimension and complexity. Much of the motivation for this work stems from the 100 Swiss Francs problem, which we introduce and describe in detail.
Gaussian latent tree models, or more generally, Gaussian latent forest models have Fisher-information matrices that become singular along interesting submodels, namely, models that correspond to subforests. For these singularities, we compute the real log-canonical thresholds (also known as stochastic complexities or learning coefficients) that quantify the large-sample behavior of the marginal likelihood in Bayesian inference. This provides the information needed for a recently introduced generalization of the Bayesian information criterion. Our mathematical developments treat the general setting of Laplace integrals whose phase functions are sums of squared differences between monomials and constants. We clarify how in this case real log-canonical thresholds can be computed using polyhedral geometry, and we show how to apply the general theory to the Laplace integrals associated with Gaussian latent tree and forest models. In simulations and a data example, we demonstrate how the mathematical knowledge can be applied in model selection.
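As a rough illustration of why real log-canonical thresholds matter for model selection, the standard large-sample expansion of the log marginal likelihood from singular learning theory can be sketched as follows. The notation is an assumption introduced here and does not come from the abstract: n is the sample size, \ell_n(\hat\theta) the maximized log-likelihood, \lambda the real log-canonical threshold (learning coefficient), and m its multiplicity.

```latex
% Sketch only; symbols are illustrative assumptions, not the paper's notation.
% Z_n denotes the marginal likelihood of a sample of size n.
\[
  \log Z_n \;=\; \ell_n(\hat\theta) \;-\; \lambda \log n \;+\; (m-1)\,\log\log n \;+\; O_p(1).
\]
% For a regular model with d parameters, \lambda = d/2 and m = 1,
% so the penalty reduces to the usual BIC term (d/2) \log n.
```

In the regular case the expansion recovers the ordinary Bayesian information criterion; along singular submodels the threshold \lambda can be smaller than d/2, which is what a generalized criterion has to account for.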