Often, government agencies and survey organizations know the population counts or percentages for some of the variables in a survey. These may be available from auxiliary sources, for example, administrative databases or other high-quality surveys. We present and illustrate a model-based framework for leveraging such auxiliary marginal information when handling unit and item nonresponse. We show how one can use the margins to specify different missingness mechanisms for each type of nonresponse. We use the framework to impute missing voter turnout values in a subset of data from the U.S. Current Population Survey (CPS). In doing so, we examine the sensitivity of results to different assumptions about the unit and item nonresponse.
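As a rough, hypothetical illustration of how a known margin can pin down assumptions about nonrespondents (made-up numbers, and not the authors' model-based framework), the sketch below solves for the turnout rate among nonrespondents that is implied when the observed and imputed cases together must reproduce an auxiliary population margin.

```python
# Hypothetical numbers: a known population margin identifies the turnout
# rate among nonrespondents once the respondent rate is observed.
known_margin = 0.61        # population turnout from an auxiliary source (made up)
n_resp, n_nonresp = 800, 200
turnout_resp = 0.72        # observed turnout among respondents (made up)

# Require observed + imputed cases to reproduce the auxiliary margin:
turnout_nonresp = (known_margin * (n_resp + n_nonresp)
                   - turnout_resp * n_resp) / n_nonresp
print(turnout_nonresp)     # approximately 0.17 -- much lower turnout among
                           # nonrespondents, pointing to nonignorable nonresponse
```

Here the margin alone forces a very different turnout rate among nonrespondents, which is the kind of missingness assumption the framework makes explicit and lets one vary.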
Heywood cases are known from the linear factor analysis literature as variables with communalities larger than 1.00, and in present-day factor models the problem also shows up as negative residual variances. For binary data, ordinal factor models can be applied with either the delta parameterization or the theta parameterization. The former is more common than the latter and can yield Heywood cases when limited-information estimation is used. The same problem shows up as nonconvergence in theta-parameterized factor models and as extremely large discriminations in item response theory (IRT) models. In this study, we explain why the same problem appears in different forms depending on the method of analysis. We first discuss the issue using equations and then illustrate our conclusions with a small simulation study in which all three methods, delta- and theta-parameterized ordinal factor models (with estimation based on polychoric correlations) and an IRT model (with full-information estimation), are used to analyze the same data sets. We also compare the performance of the WLS, WLSMV, and ULS estimators for the ordinal factor models. Finally, we analyze real data with the same three approaches. The results of the simulation study and the analysis of the real data confirm the theoretical conclusions.
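A standard identity for the unidimensional normal-ogive case (notation here is illustrative) makes the connection explicit: with standardized loading lambda_j under the delta parameterization, the residual variance and the IRT discrimination are

```latex
% Delta-parameterized loading vs. residual variance and IRT discrimination
% (unidimensional normal-ogive case; illustrative notation).
\[
  \theta_j = 1 - \lambda_j^{2},
  \qquad
  a_j = \frac{\lambda_j}{\sqrt{1 - \lambda_j^{2}}} .
\]
```

As $\lambda_j \to 1$ the residual variance $\theta_j \to 0$ and the discrimination $a_j \to \infty$; a loading larger than 1 (a Heywood case) corresponds to a negative $\theta_j$. This is why the same pathology surfaces as a negative residual variance, nonconvergence, or an extreme discrimination, depending on the parameterization and estimation method.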
The joint modeling of mean and dispersion (JMMD) provides an efficient way to obtain useful models for the mean and the dispersion, especially in robust design experiments. However, few works in the JMMD literature are dedicated to variable selection, and this remains a challenge. In this article, we propose a procedure for selecting variables in JMMD based on hypothesis testing and the quality of the model fit. In each iteration of the selection process, a goodness-of-fit criterion is used as a filter for choosing the terms to be evaluated by a hypothesis test. Three criteria for checking the quality of the model fit are considered: the extended Akaike information criterion, the corrected Akaike information criterion, and a criterion specific to the JMMD that we propose, a type of extended adjusted coefficient of determination. Simulation studies are carried out to verify the efficiency of the variable selection procedure; in all situations considered, it proves effective. The procedure is also applied to a real example from an industrial experiment.
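Because the selection procedure presupposes a joint mean-dispersion fit, a minimal sketch of that kind of fit may help: the code below alternates a weighted mean model and a gamma dispersion model on simulated data using statsmodels. It is a generic alternating fit, not the authors' procedure, criteria, or test.

```python
import numpy as np
import statsmodels.api as sm

# Generic alternating mean/dispersion fit on simulated data (a sketch of the
# kind of JMMD fit a selection procedure operates on; not the paper's method).
rng = np.random.default_rng(0)
n = 200
X = sm.add_constant(rng.normal(size=(n, 2)))     # intercept + two covariates
mu = X @ np.array([1.0, 2.0, -1.0])              # true mean coefficients
log_phi = X @ np.array([0.0, 0.5, 0.0])          # only the first covariate drives dispersion
y = rng.normal(mu, np.exp(log_phi / 2))

w = np.ones(n)
for _ in range(10):
    mean_fit = sm.WLS(y, X, weights=w).fit()     # mean submodel
    d = mean_fit.resid ** 2                      # squared residuals as dispersion response
    disp_fit = sm.GLM(d, X,
                      family=sm.families.Gamma(sm.families.links.Log())).fit()
    w = 1.0 / disp_fit.fittedvalues              # reweight the mean submodel
print(mean_fit.params)   # should be near (1, 2, -1)
print(disp_fit.params)   # slope should be clearly nonzero only for the first covariate
```

A selection procedure would wrap such a fit in a loop, using a fit criterion to filter candidate terms and a hypothesis test to confirm each addition.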
This paper gives a method for computing distributions associated with patterns in the state sequence of a hidden Markov model, conditional on observing all or part of the observation sequence. Probabilities are computed for very general classes of patterns (competing patterns and generalized later patterns), so the theory includes as special cases results for a large class of widely applicable problems. The unobserved state sequence is assumed to be Markovian with a general order of dependence. An auxiliary Markov chain is associated with the state sequence and is used to simplify the computations. Two examples illustrate the use of the methodology. The first mainly illustrates the basic steps in applying the theory; the second is a more detailed application to DNA sequences and shows that the methods can be adapted to incorporate restrictions based on biological knowledge.
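As a toy illustration of the auxiliary-chain idea (a made-up two-state HMM and the simple pattern "hidden state 1 twice in a row", not the paper's general competing or later patterns), the sketch below augments each hidden state with a small pattern automaton and runs a forward pass to obtain the conditional probability that the pattern occurs in the state sequence given the observations.

```python
import numpy as np

# Toy HMM (made-up parameters) and the pattern "hidden state 1 twice in a row".
A = np.array([[0.7, 0.3], [0.4, 0.6]])   # hidden-state transition matrix
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # emission probabilities
pi = np.array([0.5, 0.5])                # initial state distribution
obs = [0, 1, 1, 0, 1]                    # observed sequence

# Auxiliary pattern automaton: 0 = no progress, 1 = just saw state 1,
# 2 = pattern (1, 1) completed (absorbing).
def step(a, s):
    if a == 2:
        return 2
    if s == 1:
        return 2 if a == 1 else 1
    return 0

# Forward pass over the augmented chain (hidden state, automaton state).
alpha = np.zeros((2, 3))
for s in range(2):
    alpha[s, step(0, s)] += pi[s] * B[s, obs[0]]
for o in obs[1:]:
    new = np.zeros((2, 3))
    for s in range(2):
        for a in range(3):
            for s2 in range(2):
                new[s2, step(a, s2)] += alpha[s, a] * A[s, s2] * B[s2, o]
    alpha = new

# P(pattern occurs in the state sequence | observations)
print(alpha[:, 2].sum() / alpha.sum())
```

The same bookkeeping generalizes to richer pattern classes and higher-order state dependence by enlarging the automaton and the augmented state space.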
The increasing prevalence of rich data sources and the availability of electronic medical record databases and electronic registries open tremendous opportunities for enhancing medical research. For example, controlled trials are ubiquitously used to investigate the effect of a medical treatment, perhaps dependent on a set of patient covariates, and traditional approaches have relied primarily on randomized patient sampling and allocation to the treatment and control groups. However, when covariate data for a large cohort of patients have already been collected and are available in a database, one can potentially design a treatment/control sample and allocation that provides far better estimates of the covariate-dependent effects of the treatment. In this paper, we develop a new approach that uses optimal design of experiments (DOE) concepts to accomplish this objective. The approach selects the patients for the treatment and control samples upfront, based on their covariate values, in a manner that optimizes the information content of the data. For the optimal sample selection, we develop simple guidelines and an optimization algorithm that provide solutions substantially better than random sampling. Moreover, our approach causes no sampling bias in the estimated effects, for the same reason that DOE principles do not bias estimated effects. We test our method with a simulation study based on a testbed data set containing information on the effect of statins on low-density lipoprotein (LDL) cholesterol.
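To give a flavor of covariate-based sample selection from a database (a generic greedy D-optimality heuristic on simulated covariates, not the authors' guidelines or algorithm), the sketch below picks the candidate patients whose covariate vectors maximize the determinant of the information matrix X'X; treatment/control allocation would then be handled separately.

```python
import numpy as np

# Greedy D-optimality heuristic on simulated covariates (a generic sketch,
# not the paper's optimization algorithm).
rng = np.random.default_rng(1)
covariates = rng.normal(size=(500, 3))                 # 500 candidate patients
X_all = np.hstack([np.ones((500, 1)), covariates])     # add an intercept column

n_select, selected = 20, []
M = 1e-6 * np.eye(X_all.shape[1])                      # small ridge keeps early determinants finite
for _ in range(n_select):
    gains = [np.linalg.det(M + np.outer(x, x)) if i not in selected else -np.inf
             for i, x in enumerate(X_all)]
    best = int(np.argmax(gains))
    selected.append(best)
    M += np.outer(X_all[best], X_all[best])
print(sorted(selected))                                # indices of the chosen study sample
```

The greedy exchange favors patients with extreme, mutually complementary covariate values, which is what drives the gain over purely random sampling.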
A population-averaged additive subdistribution hazard model is proposed for analyzing correlated failure time data subject to competing risks; it assesses the marginal effects of covariates on the cumulative incidence function. This approach extends the population-averaged additive hazard model by accommodating potentially dependent censoring due to competing events other than the event of interest. Assuming an independent working correlation structure, an estimating-equations approach is used to estimate the regression coefficients, and a sandwich variance estimator is proposed. The sandwich variance estimator accounts for the correlations both between failure times and between censoring times, and is robust to misspecification of the unknown dependence structure within each cluster. We further develop goodness-of-fit tests to assess the adequacy of the additive structure of the subdistribution hazard for each covariate, as well as for the overall model. Simulation studies are carried out to investigate the performance of the proposed methods in finite samples, and we illustrate the methods by analyzing data from the STrategies to Reduce Injuries and Develop confidence in Elders (STRIDE) study.
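For context, under an additive specification of the subdistribution hazard with time-fixed covariates (illustrative notation; the paper's population-averaged, clustered formulation adds further structure), the link to the cumulative incidence function is

```latex
% Additive subdistribution hazard and the implied cumulative incidence
% function (illustrative notation; time-fixed covariates assumed).
\[
  \lambda_1^{*}(t \mid Z) \;=\; \lambda_0(t) + \beta^{\top} Z
  \quad\Longrightarrow\quad
  F_1(t \mid Z) \;=\; 1 - \exp\!\bigl\{ -\Lambda_0(t) - t\,\beta^{\top} Z \bigr\},
  \qquad \Lambda_0(t) = \int_0^t \lambda_0(u)\,du ,
\]
```

so each covariate shifts the cumulative incidence through an additive contribution to the subdistribution hazard, which is the additive structure the goodness-of-fit tests examine.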