Cognitive Diagnosis Models (CDMs) are a special family of discrete latent variable models that are widely used in modern educational, psychological, social and biological sciences. A key component of CDMs is a binary $Q$-matrix characterizing the dependence structure between the items and the latent attributes. In many applications, researchers also assume certain hierarchical structures among the latent attributes to characterize their dependence. In most CDM applications, the attribute-attribute hierarchical structures, the item-attribute $Q$-matrix, the item-level diagnostic model, and the number of latent attributes need to be fully or partially pre-specified; such specifications, however, may be subjective and misspecified, as noted by many recent studies. This paper considers the problem of jointly learning these latent and hierarchical structures in CDMs from observed data with minimal model assumptions. Specifically, a penalized likelihood approach is proposed to select the number of attributes and estimate the latent and hierarchical structures simultaneously. An efficient expectation-maximization (EM) algorithm and a latent structure recovery algorithm are developed, and statistical consistency theory is established under mild conditions. The good performance of the proposed method is illustrated by simulation studies and a real data application in educational assessment.
Cognitive Diagnosis Models (CDMs) are a special family of discrete latent variable models widely used in educational, psychological and social sciences. In many applications of CDMs, researchers assume certain hierarchical structures among the latent attributes to characterize their dependence structure. Specifically, a directed acyclic graph is used to specify hierarchical constraints on the allowable configurations of the discrete latent attributes. In this paper, we consider the important yet unaddressed problem of testing the existence of latent hierarchical structures in CDMs. We first introduce the concept of testability of hierarchical structures in CDMs and present sufficient conditions. Then we study the asymptotic behavior of the likelihood ratio test (LRT) statistic, which is widely used for testing nested models. Due to the irregularity of the problem, the asymptotic distribution of the LRT statistic becomes nonstandard and tends to provide unsatisfactory finite-sample performance under practical conditions. We provide statistical insights on such failures and propose using the parametric bootstrap to perform the test. We also demonstrate the effectiveness and superiority of the parametric bootstrap for testing latent hierarchies over the nonparametric bootstrap and the naive chi-squared test through comprehensive simulations and an educational assessment dataset.
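As a minimal sketch of how a directed acyclic graph restricts the allowable attribute configurations, the following Python enumerates the binary attribute profiles that respect a set of prerequisite relations. The function name `allowed_profiles`, the edge encoding `(a, b)` (attribute `a` must be mastered before attribute `b`), and the three-attribute linear hierarchy are illustrative assumptions and are not taken from the paper.

```python
from itertools import product


def allowed_profiles(num_attributes, prerequisites):
    """Return all binary profiles consistent with the prerequisite edges.

    prerequisites: iterable of (a, b) pairs, read as "a is a prerequisite of b".
    A profile is allowed when mastering b implies mastering a for every edge.
    """
    profiles = []
    for alpha in product((0, 1), repeat=num_attributes):
        if all(alpha[a] >= alpha[b] for a, b in prerequisites):
            profiles.append(alpha)
    return profiles


# Hypothetical linear hierarchy: attribute 0 -> attribute 1 -> attribute 2.
linear = allowed_profiles(3, [(0, 1), (1, 2)])
# No hierarchy: every one of the 2^3 profiles is allowed.
unrestricted = allowed_profiles(3, [])
print(len(linear), len(unrestricted))
```

Under this encoding, testing for a hierarchy amounts to asking whether the profiles excluded by the graph carry zero population proportion, which places the null hypothesis on the boundary of the parameter space.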
Gaussian graphical models (GGMs) are widely used to estimate network structures in applications ranging from biology to finance. In practice, data are often corrupted by latent confounders, which biases inference of the underlying true graphical structure. In this paper, we compare and contrast two strategies for inference in graphical models with latent confounders: Gaussian graphical models with latent variables (LVGGM) and PCA-based removal of confounding (PCA+GGM). While these two approaches have similar goals, they are motivated by different assumptions about confounding. We explore the connection between the two approaches and propose a new method that combines their strengths. We prove the consistency and convergence rate of the PCA-based method and use these results to provide guidance about when to use each method. We demonstrate the effectiveness of our methodology in simulations and in two real-world applications.
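The latent-variable view can be made concrete with a standard Gaussian identity, given here as a sketch in assumed notation rather than the paper's own: $O$ indexes the observed variables, $H$ indexes the latent confounders, $\Theta$ is the joint precision matrix, and $\Sigma_{OO}$ is the marginal covariance of the observed variables.

```latex
\Sigma_{OO}^{-1}
  = \Theta_{OO} - \Theta_{OH}\,\Theta_{HH}^{-1}\,\Theta_{HO}
  = S - L,
\qquad
S = \Theta_{OO},
\quad
L = \Theta_{OH}\,\Theta_{HH}^{-1}\,\Theta_{HO},
\quad
\operatorname{rank}(L) \le |H| .
```

The matrix $S$ encodes the conditional dependencies among the observed variables given the latent ones, while $L$ is positive semidefinite with rank at most the number of latent variables. Marginalizing over confounders therefore turns a graph that is sparse in $S$ into a generally dense observed precision matrix.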
Cognitive diagnosis models (CDMs) are useful statistical tools that provide rich information relevant for intervention and learning. Markov chain Monte Carlo (MCMC) algorithms are a popular approach to estimation and inference for CDMs and are widely used in practice. However, when the number of attributes, $K$, is large, existing MCMC algorithms may become time-consuming, because $O(2^K)$ calculations are usually needed during MCMC sampling to obtain the conditional distribution of each attribute profile. To overcome this computational issue, motivated by Culpepper and Hudson (2018), we propose a computationally efficient sequential Gibbs sampling method, which needs only $O(K)$ calculations to sample each attribute profile. We use simulation and real data examples to show the good finite-sample performance of the proposed sequential Gibbs sampling method and its advantages over existing methods.
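The gap between $O(2^K)$ and $O(K)$ can be illustrated with a generic coordinate-wise Gibbs update. This is a sketch only and not the authors' sampler: it assumes a DINA item model with known slip and guess parameters and an independent Bernoulli prior on each attribute, and all function names are hypothetical. Each function also returns how many profile likelihoods it evaluated.

```python
import random
from itertools import product


def dina_likelihood(alpha, y, Q, slip, guess):
    """P(y | alpha) for one respondent under an assumed DINA model."""
    lik = 1.0
    for y_j, q_j, s_j, g_j in zip(y, Q, slip, guess):
        has_all = all(a >= q for a, q in zip(alpha, q_j))  # all required attributes held
        p_correct = 1.0 - s_j if has_all else g_j
        lik *= p_correct if y_j == 1 else 1.0 - p_correct
    return lik


def prior_prob(alpha, prior):
    """Independent Bernoulli prior (an assumption made for this sketch)."""
    out = 1.0
    for a, p in zip(alpha, prior):
        out *= p if a == 1 else 1.0 - p
    return out


def sample_profile_full(y, Q, slip, guess, prior, rng):
    """Exact posterior draw by enumerating all 2^K profiles."""
    profiles = list(product((0, 1), repeat=len(prior)))
    weights = [prior_prob(a, prior) * dina_likelihood(a, y, Q, slip, guess)
               for a in profiles]
    return rng.choices(profiles, weights=weights, k=1)[0], len(profiles)


def gibbs_sweep(alpha, y, Q, slip, guess, prior, rng):
    """One sweep over attributes: two likelihood evaluations per attribute."""
    alpha = list(alpha)
    evaluations = 0
    for k in range(len(alpha)):
        weights = []
        for value in (0, 1):
            alpha[k] = value
            p_k = prior[k] if value == 1 else 1.0 - prior[k]
            weights.append(p_k * dina_likelihood(alpha, y, Q, slip, guess))
            evaluations += 1
        alpha[k] = rng.choices((0, 1), weights=weights, k=1)[0]
    return tuple(alpha), evaluations


# Hypothetical test with K = 4 attributes and J = 5 items.
Q = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 1, 0, 0]]
slip, guess, prior = [0.1] * 5, [0.2] * 5, [0.5] * 4
y = [1, 1, 0, 0, 1]
rng = random.Random(0)
_, n_full = sample_profile_full(y, Q, slip, guess, prior, rng)
_, n_seq = gibbs_sweep((0, 0, 0, 0), y, Q, slip, guess, prior, rng)
print(n_full, n_seq)
```

A single sweep is not an exact draw from the posterior; it is one step of a Markov chain that leaves the posterior invariant, so the saving in per-step cost is traded against the need for the chain to mix.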
There has been growing interest in recent years in Q-matrix based cognitive diagnosis models. Parameter estimation and respondent classification under these models may suffer due to identifiability issues. Non-identifiability can be described by a partition separating attribute profiles into groups of those with identical likelihoods. Marginal identifiability concerns the identifiability of individual attributes. Maximum likelihood estimation of the proportion of respondents within each equivalence class is consistent, making possible a new measure of assessment quality reporting the proportion of respondents for whom each individual attribute is marginally identifiable. Arising from this is a new posterior-based classification method adjusting for non-identifiability.
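The partition into equivalence classes can be sketched for one concrete case. The example assumes a DINA model, under which the likelihood depends on an attribute profile only through its vector of ideal responses, so profiles sharing that vector have identical likelihoods. The function names and the two-item Q-matrix are hypothetical, and reading "marginally identifiable" as "all profiles in the class agree on that attribute" is an interpretation of the abstract rather than its formal definition.

```python
from collections import defaultdict
from itertools import product


def equivalence_classes(Q):
    """Group attribute profiles by their ideal-response vector (DINA assumption)."""
    num_attributes = len(Q[0])
    classes = defaultdict(list)
    for alpha in product((0, 1), repeat=num_attributes):
        ideal = tuple(int(all(a >= q for a, q in zip(alpha, q_j))) for q_j in Q)
        classes[ideal].append(alpha)
    return list(classes.values())


def identifiable_attributes(profiles):
    """Attributes on which every profile in one equivalence class agrees."""
    num_attributes = len(profiles[0])
    return [k for k in range(num_attributes)
            if len({alpha[k] for alpha in profiles}) == 1]


# Hypothetical Q-matrix with no item that measures attribute 1 on its own.
Q = [[1, 0],
     [1, 1]]
for group in equivalence_classes(Q):
    print(group, identifiable_attributes(group))
```

In this example the profiles `(0, 0)` and `(0, 1)` fall into the same class, so attribute 1 cannot be determined for respondents in that class, while attribute 0 can be determined for everyone.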
Learning the latent network structure from large-scale multivariate point process data is an important task in a wide range of scientific and business applications. For instance, we might wish to estimate the neuronal functional connectivity network based on spiking times recorded from a collection of neurons. To characterize the complex processes underlying the observed data, we propose a new and flexible class of nonstationary Hawkes processes that allow both excitatory and inhibitory effects. We estimate the latent network structure using an efficient sparse least squares estimation approach. Using a thinning representation, we establish concentration inequalities for the first- and second-order statistics of the proposed Hawkes process. These theoretical results enable us to establish a non-asymptotic error bound and the selection consistency of the estimated parameters. Furthermore, we describe a statistic based on the least squares loss for testing whether the background intensity is constant in time. We demonstrate the efficacy of our proposed method through simulation studies and an application to a neuron spike train data set.
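To give a feel for what thinning means for a Hawkes process, here is a minimal simulation sketch using Ogata-style thinning. It covers only a much simpler case than the one in the abstract: a single stationary process with constant background rate `mu` and a purely excitatory exponential kernel `alpha * exp(-beta * t)`. The function names and parameter values are assumptions for illustration, and the sketch requires `mu > 0`, `alpha >= 0`, and `beta > 0`.

```python
import math
import random


def intensity(t, events, mu, alpha, beta):
    """Conditional intensity at time t, counting events at or before t."""
    return mu + sum(alpha * math.exp(-beta * (t - s)) for s in events if s <= t)


def simulate_hawkes(mu, alpha, beta, horizon, seed=0):
    """Simulate event times on (0, horizon] by thinning a dominating Poisson process."""
    rng = random.Random(seed)
    t, events = 0.0, []
    while True:
        # Between events the intensity only decays, so its current value
        # is a valid upper bound until the next accepted event.
        upper = intensity(t, events, mu, alpha, beta)
        t += rng.expovariate(upper)
        if t > horizon:
            return events
        # Keep the candidate with probability intensity(t) / upper.
        if rng.random() * upper <= intensity(t, events, mu, alpha, beta):
            events.append(t)


# Hypothetical parameters; alpha / beta < 1 keeps the process stable.
spikes = simulate_hawkes(mu=1.0, alpha=0.5, beta=1.5, horizon=20.0, seed=1)
print(len(spikes))
```

Handling inhibitory effects or a time-varying background, as in the proposed model class, would need a different upper bound and a non-negativity constraint on the intensity, which this sketch does not attempt.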