Computer-based interactive items have become prevalent in recent educational assessments. In such items, the entire human-computer interactive process, known as the response process, is recorded in a log file. This paper aims to extract useful information from response processes. In particular, we consider an exploratory latent variable analysis for process data. Latent variables are extracted through a multidimensional scaling framework and are shown empirically to contain more information than classic binary responses in terms of out-of-sample prediction of many variables.
Accurate assessment of a student's ability is the key task of a test. Assessments based on final responses are the standard. As the infrastructure advances, substantially more information is observed. One such instance is the process data collected by computer-based interactive items, which contain a student's detailed interactive processes. In this paper, we show both theoretically and empirically that appropriately including such information in the assessment will substantially improve relevant assessment precision. The precision is measured empirically by out-of-sample test reliability.
The analysis of high dimensional survival data is challenging, primarily due to the problem of overfitting which occurs when spurious relationships are inferred from data that subsequently fail to exist in test data. Here we propose a novel method of extracting a low dimensional representation of covariates in survival data by combining the popular Gaussian Process Latent Variable Model (GPLVM) with a Weibull Proportional Hazards Model (WPHM). The combined model offers a flexible non-linear probabilistic method of detecting and extracting any intrinsic low dimensional structure from high dimensional data. By reducing the covariate dimension we aim to diminish the risk of overfitting and increase the robustness and accuracy with which we infer relationships between covariates and survival outcomes. In addition, we can simultaneously combine information from multiple data sources by expressing multiple datasets in terms of the same low dimensional space. We present results from several simulation studies that illustrate a reduction in overfitting and an increase in predictive performance, as well as successful detection of intrinsic dimensionality. We provide evidence that it is advantageous to combine dimensionality reduction with survival outcomes rather than performing unsupervised dimensionality reduction on its own. Finally, we use our model to analyse experimental gene expression data and detect and extract a low dimensional representation that allows us to distinguish high and low risk groups with superior accuracy compared to doing regression on the original high dimensional data.
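For orientation, the standard Weibull proportional hazards form can be written out as below. This is a sketch rather than the authors' exact parameterisation: the symbols \lambda (scale), \nu (shape), \beta (regression coefficients) and z (low dimensional latent coordinates standing in for the original covariates) are assumptions introduced here for illustration.

```latex
\begin{align}
h(t \mid z) &= \lambda \nu \, t^{\nu - 1} \exp\!\left(\beta^{\top} z\right), \\
S(t \mid z) &= \exp\!\left(-\lambda \, t^{\nu} \exp\!\left(\beta^{\top} z\right)\right).
\end{align}
```

Here h is the hazard and S the survival function; setting \nu = 1 recovers the exponential proportional hazards model.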
Multidimensional Scaling (MDS) is a classical technique for embedding data in low dimensions, still in widespread use today. Originally introduced in the 1950s, MDS was not designed with high-dimensional data in mind; while it remains popular with data analysis practitioners, no doubt it should be adapted to the high-dimensional data regime. In this paper we study MDS in a modern setting, specifically high dimensions and ambient measurement noise. We show that, as the ambient noise level increases, MDS suffers a sharp breakdown that depends on the data dimension and noise level, and derive an explicit formula for this breakdown point in the case of white noise. We then introduce MDS+, an extremely simple variant of MDS, which applies a carefully derived shrinkage nonlinearity to the eigenvalues of the MDS similarity matrix. Under a loss function measuring the embedding quality, MDS+ is the unique asymptotically optimal shrinkage function. We prove that MDS+ offers improved embedding, sometimes significantly so, compared with classical MDS. Furthermore, MDS+ does not require external estimates of the embedding dimension (a famous difficulty in classical MDS), as it calculates the optimal dimension into which the data should be embedded.
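As a point of reference, classical MDS itself can be sketched in a few lines of NumPy. The function name `classical_mds` and its arguments are illustrative choices, not taken from the paper, and the sketch covers only the classical procedure: the MDS+ shrinkage nonlinearity is not given in the abstract and is not reproduced here. The comment marks the step where an eigenvalue shrinkage would be applied.

```python
import numpy as np

def classical_mds(D, k):
    """Embed n points in k dimensions from an n x n matrix of pairwise distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # MDS similarity (Gram) matrix
    eigvals, eigvecs = np.linalg.eigh(B)     # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:k]      # indices of the k largest eigenvalues
    # A shrinkage variant would transform eigvals[idx] at this point;
    # classical MDS only clips negative values to zero.
    top_vals = np.clip(eigvals[idx], 0.0, None)
    return eigvecs[:, idx] * np.sqrt(top_vals)

# Noise-free usage: distances between 2-D points are recovered exactly
# (up to rotation and translation of the embedding).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D, 2)
```

In the noise-free case the embedding reproduces the input distances; the breakdown discussed above concerns what happens to the eigenvalues of `B` once measurement noise is added.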
The Coronavirus Disease 2019 (COVID-19) pandemic has caused a tremendous number of deaths and has had a devastating impact on economic development all over the world. Thus, it is paramount to control its further transmission, which requires understanding the mechanism of its transmission process and evaluating the effect of different control strategies. To address these issues, we describe the transmission of COVID-19 as an explosive Markov process with four parameters. The state transitions of the proposed Markov process clearly reveal the explosive growth and complex heterogeneity of COVID-19. Based on this, we further propose a simulation approach with heterogeneous infections. Experiments show that our approach can closely track the real transmission process of COVID-19, disclose its transmission mechanism, and forecast the transmission under different non-drug intervention strategies. More importantly, our approach can help develop effective strategies for controlling COVID-19 and compare their control effects across different countries and cities.
We introduce a multivariate Hawkes process with constraints on its conditional density. It is a multivariate point process with conditional intensity similar to that of a multivariate Hawkes process but certain events are forbidden with respect to boundary conditions on a multidimensional constraint variable, whose evolution is driven by the point process. We study this process in the special case where the fertility function is exponential so that the process is entirely described by an underlying Markov chain, which includes the constraint variable. Some conditions on the parameters are established to ensure the ergodicity of the chain. Moreover, scaling limits are derived for the integrated point process. This study is primarily motivated by the stochastic modelling of a limit order book for high frequency financial data analysis.
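To make the construction concrete, the following is a minimal simulation sketch, assuming a two-type process with exponential kernels and a one-dimensional constraint variable `q` (for instance a queue size) that must stay non-negative. The function name, the parameters `mu`, `alpha`, `beta`, and the specific constraint are hypothetical illustrations, not the paper's model; the ergodicity conditions and scaling limits are not addressed. With exponential kernels the excitation vector together with `q` forms the Markov state, which is what the thinning loop updates.

```python
import numpy as np

def simulate_constrained_hawkes(mu, alpha, beta, horizon, seed=0):
    """Thinning simulation of a 2-type Hawkes process with exponential kernels.

    Type 0 events raise q by 1, type 1 events lower it by 1.
    Type 1 events are forbidden while q == 0 (illustrative constraint).
    Assumes mu[0] > 0 so the total intensity is always positive.
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)        # baseline intensities, shape (2,)
    alpha = np.asarray(alpha, dtype=float)  # alpha[i, j]: jump of intensity i after a type-j event
    excite = np.zeros(2)                    # self-exciting part of the intensity
    t, q = 0.0, 0
    events = []                             # list of (time, type, q after the event)
    while True:
        bound = float(np.sum(mu + excite))  # valid upper bound: intensity only decays until the next event
        w = rng.exponential(1.0 / bound)
        t += w
        if t > horizon:
            break
        excite *= np.exp(-beta * w)         # exponential decay between candidate times
        lam = mu + excite
        if q == 0:
            lam[1] = 0.0                    # forbid the event that would make q negative
        if rng.uniform() * bound <= lam.sum():   # accept the candidate point
            k = int(rng.choice(2, p=lam / lam.sum()))
            excite += alpha[:, k]
            q += 1 if k == 0 else -1
            events.append((t, k, q))
    return events

events = simulate_constrained_hawkes(
    mu=[1.0, 1.0], alpha=[[0.5, 0.2], [0.2, 0.5]], beta=2.0, horizon=50.0
)
```

The returned list records each accepted event with the value of `q` immediately after it, so the constraint can be checked directly on the output.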