No Arabic abstract
Partially observed cured data occur in the analysis of spontaneous abortion (SAB) in observational studies in pregnancy. In contrast to the traditional cured data, such data has an observable `cured portion as women who do not abort spontaneously. The data is also subject to left truncate in addition to right-censoring because women may enter or withdraw from a study any time during their pregnancy. Left truncation in particular causes unique bias in the presence of a cured portion. In this paper, we study a cure rate model and develop a conditional nonparametric maximum likelihood approach. To tackle the computational challenge we adopt an EM algorithm making use of ghost copies of the data, and a closed form variance estimator is derived. Under suitable assumptions, we prove the consistency of the resulting estimator involving an unbounded cumulative baseline hazard function, as well as the asymptotic normality. Simulation results are carried out to evaluate the finite sample performance. We present the analysis of the motivating SAB study to illustrate the power of our model addressing both occurrence and timing of SAB, as compared to existing approaches in practice.
Nonparametric empirical Bayes methods provide a flexible and attractive approach to high-dimensional data analysis. One particularly elegant empirical Bayes methodology, involving the Kiefer-Wolfowitz nonparametric maximum likelihood estimator (NPMLE) for mixture models, has been known for decades. However, implementation and theoretical analysis of the Kiefer-Wolfowitz NPMLE are notoriously difficult. A fast algorithm was recently proposed that makes NPMLE-based procedures feasible for use in large-scale problems, but the algorithm calculates only an approximation to the NPMLE. In this paper we make two contributions. First, we provide upper bounds on the convergence rate of the approximate NPMLEs statistical error, which have the same order as the best known bounds for the true NPMLE. This suggests that the approximate NPMLE is just as effective as the true NPMLE for statistical applications. Second, we illustrate the promise of NPMLE procedures in a high-dimensional binary classification problem. We propose a new procedure and show that it vastly outperforms existing methods in experiments with simulated data. In real data analyses involving cancer survival and gene expression data, we show that it is very competitive with several recently proposed methods for regularized linear discriminant analysis, another popular approach to high-dimensional classification.
Nonparametric maximum likelihood (NPML) for mixture models is a technique for estimating mixing distributions that has a long and rich history in statistics going back to the 1950s, and is closely related to empirical Bayes methods. Historically, NPML-based methods have been considered to be relatively impractical because of computational and theoretical obstacles. However, recent work focusing on approximate NPML methods suggests that these methods may have great promise for a variety of modern applications. Building on this recent work, a class of flexible, scalable, and easy to implement approximate NPML methods is studied for problems with multivariate mixing distributions. Concrete guidance on implementing these methods is provided, with theoretical and empirical support; topics covered include identifying the support set of the mixing distribution, and comparing algorithms (across a variety of metrics) for solving the simple convex optimization problem at the core of the approximate NPML problem. Additionally, three diverse real data applications are studied to illustrate the methods performance: (i) A baseball data analysis (a classical example for empirical Bayes methods), (ii) high-dimensional microarray classification, and (iii) online prediction of blood-glucose density for diabetes patients. Among other things, the empirical results demonstrate the relative effectiveness of using multivariate (as opposed to univariate) mixing distributions for NPML-based approaches.
This work was motivated by observational studies in pregnancy with spontaneous abortion (SAB) as outcome. Clearly some women experience the SAB event but the rest do not. In addition, the data are left truncated due to the way pregnant women are recruited into these studies. For those women who do experience SAB, their exact event times are sometimes unknown. Finally, a small percentage of the women are lost to follow-up during their pregnancy. All these give rise to data that are left truncated, partly interval and right-censored, and with a clearly defined cured portion. We consider the non-mixture Cox regression cure rate model and adopt the semiparametric spline-based sieve maximum likelihood approach to analyze such data. Using modern empirical process theory we show that both the parametric and the nonparametric parts of the sieve estimator are consistent, and we establish the asymptotic normality for both parts. Simulation studies are conducted to establish the finite sample performance. Finally, we apply our method to a database of observational studies on spontaneous abortion.
Suppose an online platform wants to compare a treatment and control policy, e.g., two different matching algorithms in a ridesharing system, or two different inventory management algorithms in an online retail site. Standard randomized controlled trials are typically not feasible, since the goal is to estimate policy performance on the entire system. Instead, the typical current practice involves dynamically alternating between the two policies for fixed lengths of time, and comparing the average performance of each over the intervals in which they were run as an estimate of the treatment effect. However, this approach suffers from *temporal interference*: one algorithm alters the state of the system as seen by the second algorithm, biasing estimates of the treatment effect. Further, the simple non-adaptive nature of such designs implies they are not sample efficient. We develop a benchmark theoretical model in which to study optimal experimental design for this setting. We view testing the two policies as the problem of estimating the steady state difference in reward between two unknown Markov chains (i.e., policies). We assume estimation of the steady state reward for each chain proceeds via nonparametric maximum likelihood, and search for consistent (i.e., asymptotically unbiased) experimental designs that are efficient (i.e., asymptotically minimum variance). Characterizing such designs is equivalent to a Markov decision problem with a minimum variance objective; such problems generally do not admit tractable solutions. Remarkably, in our setting, using a novel application of classical martingale analysis of Markov chains via Poissons equation, we characterize efficient designs via a succinct convex optimization problem. We use this characterization to propose a consistent, efficient online experimental design that adaptively samples the two Markov chains.
Fiducial Inference, introduced by Fisher in the 1930s, has a long history, which at times aroused passionate disagreements. However, its application has been largely confined to relatively simple parametric problems. In this paper, we present what might be the first time fiducial inference, as generalized by Hannig et al. (2016), is systematically applied to estimation of a nonparametric survival function under right censoring. We find that the resulting fiducial distribution gives rise to surprisingly good statistical procedures applicable to both one sample and two sample problems. In particular, we use the fiducial distribution of a survival function to construct pointwise and curvewise confidence intervals for the survival function, and propose tests based on the curvewise confidence interval. We establish a functional Bernstein-von Mises theorem, and perform thorough simulation studies in scenarios with different levels of censoring. The proposed fiducial based confidence intervals maintain coverage in situations where asymptotic methods often have substantial coverage problems. Furthermore, the average length of the proposed confidence intervals is often shorter than the length of competing methods that maintain coverage. Finally, the proposed fiducial test is more powerful than various types of log-rank tests and sup log-rank tests in some scenarios. We illustrate the proposed fiducial test comparing chemotherapy against chemotherapy combined with radiotherapy using data from the treatment of locally unresectable gastric cancer.