A new strategy is introduced for estimating population size and networked population characteristics. Sample selection is based on a multi-wave snowball sampling design. A generalized stochastic block model is posited for the population's network graph. Inference is based on a Bayesian data augmentation procedure. Applications are provided to an empirical population and to simulated populations. The results demonstrate that statistically efficient estimates of the size and distribution of the population can be achieved.
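As a rough illustration of the sampling setup described above, the following sketch simulates a plain (not generalized) stochastic block model and draws a multi-wave snowball sample from it. It does not implement the Bayesian data augmentation procedure. The function names, the Bernoulli initial sample, and all parameter values are assumptions made for the example and are not taken from the abstract.

```python
import random


def simulate_sbm(block_sizes, link_probs, rng):
    """Simulate an undirected stochastic block model graph.

    Returns (blocks, adj): the block label of each node and a list of
    adjacency sets. link_probs[a][b] is the assumed link probability
    between a node in block a and a node in block b.
    """
    blocks = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(blocks)
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < link_probs[blocks[i]][blocks[j]]:
                adj[i].add(j)
                adj[j].add(i)
    return blocks, adj


def snowball_sample(adj, seed_prob, n_waves, rng):
    """Draw a Bernoulli initial sample (an assumption of this sketch),
    then trace every link out of the most recent wave, n_waves times.

    Returns a list of disjoint sets: waves[0] is the initial sample and
    waves[k] holds the nodes first reached in wave k.
    """
    wave0 = {i for i in range(len(adj)) if rng.random() < seed_prob}
    waves = [wave0]
    seen = set(wave0)
    for _ in range(n_waves):
        new = set()
        for i in waves[-1]:
            new |= adj[i] - seen  # only nodes not reached in earlier waves
        waves.append(new)
        seen |= new
    return waves


rng = random.Random(0)
blocks, adj = simulate_sbm([30, 20], [[0.10, 0.02], [0.02, 0.15]], rng)
waves = snowball_sample(adj, seed_prob=0.1, n_waves=2, rng=rng)
```

The unsampled part of the graph (nodes and links never reached by any wave) is what a data augmentation procedure would have to impute; the sketch only produces the observed waves.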
We present a new design and inference method for estimating the size of a hidden population best reached through a link-tracing design. The strategy involves the Rao-Blackwell Theorem applied to a sufficient statistic markedly different from the usual one that arises in sampling from a finite population. An empirical application is described. The result demonstrates that the strategy can efficiently incorporate adaptively selected members of the sample into the inference procedure.
A new approach to estimating population size based on a stratified link-tracing sampling design is presented. The method extends the Frank and Snijders (1994) approach by allowing for heterogeneity in the initial sample selection procedure. Rao-Blackwell estimators and corresponding resampling approximations similar to those detailed in Vincent and Thompson (2017) are explored. An empirical application is provided for a hard-to-reach networked population. The results demonstrate that the approach has much potential for application to such populations. Supplementary materials for this article are available online.
Probability proportional to size (PPS) sampling schemes with a target sample size aim to produce a sample comprising a specified number $n$ of items while ensuring that each item in the population appears in the sample with a probability proportional to its specified weight (also called its size). These two objectives, however, cannot always be achieved simultaneously. Existing PPS schemes prioritize control of the sample size, violating the PPS property if necessary. We provide a new PPS scheme that allows a different trade-off: our method enforces the PPS property at all times while ensuring that the sample size never exceeds the target value $n$. The sample size is exactly equal to $n$ if possible, and otherwise has maximal expected value and minimal variance. Thus we bound the sample size, thereby avoiding storage overflows and helping to control the time required for analytics over the sample, while allowing the user complete control over the sample contents. The method is both simple to implement and efficient, being a one-pass streaming algorithm with an amortized processing time of $O(1)$ per item.
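The tension between the two objectives can be made concrete with a small calculation. Under strict PPS the inclusion probabilities must have the form $\pi_i = c\,w_i$; since no probability may exceed 1 and the expected sample size $c \sum_i w_i$ cannot exceed $n$, the largest admissible constant is $c = \min(n / \sum_i w_i,\ 1 / \max_i w_i)$. The sketch below computes these probabilities only. It is not the one-pass streaming algorithm from the abstract, and the function name is hypothetical.

```python
def pps_inclusion_probabilities(weights, n):
    """Inclusion probabilities proportional to weight, with expected
    sample size as large as possible but at most n.

    Illustrative helper only: it computes target probabilities and does
    not draw a sample or process a stream.
    """
    if n <= 0 or not weights or any(w <= 0 for w in weights):
        raise ValueError("need n > 0 and strictly positive weights")
    total = sum(weights)
    # c is capped by the size target (n / total) and by the requirement
    # that the heaviest item's probability stays at or below 1.
    c = min(n / total, 1.0 / max(weights))
    return [c * w for w in weights]


# Target size reachable: the expected sample size equals n = 2.
feasible = pps_inclusion_probabilities([1, 1, 2], n=2)

# One dominant weight: strict PPS forces the expected size below n = 2.
infeasible = pps_inclusion_probabilities([1, 1, 8], n=2)
```

In the second call the heaviest item alone caps the constant, so the expected sample size is `sum(infeasible)`, which is smaller than the target; this is the case in which a scheme must give up either the exact size or the PPS property.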
We consider the problem of estimating the rate of defects (mean number of defects per item), given the counts of defects detected by two independent imperfect inspectors on one sample of items. In contrast with the setting for the well-known method of Capture-Recapture, we \textit{do not} have information regarding the number of defects jointly detected by \textit{both} inspectors. We solve this problem by constructing two types of estimators: a simple moment-type estimator and a more complicated maximum-likelihood estimator. The performance of these estimators is studied analytically and by means of simulations. It is shown that the maximum-likelihood estimator is superior to the moment-type estimator. A systematic comparison with the Capture-Recapture method is also made.
Estimation of population size using incomplete lists (also called the capture-recapture problem) has a long history across many biological and social sciences. For example, human rights and other groups often construct partial and overlapping lists of victims of armed conflicts, with the hope of using this information to estimate the total number of victims. Earlier statistical methods for this setup either use potentially restrictive parametric assumptions, or else rely on typically suboptimal plug-in-type nonparametric estimators; however, both approaches can lead to substantial bias, the former via model misspecification and the latter via smoothing. Under an identifying assumption that two lists are conditionally independent given measured covariate information, we make several contributions. First, we derive the nonparametric efficiency bound for estimating the capture probability, which indicates the best possible performance of any estimator, and sheds light on the statistical limits of capture-recapture methods. Then we present a new estimator, and study its finite-sample properties, showing that it has a double robustness property new to capture-recapture, and that it is near-optimal in a non-asymptotic sense, under relatively mild nonparametric conditions. Next, we give a method for constructing confidence intervals for total population size from generic capture probability estimators, and prove non-asymptotic near-validity. Finally, we study our methods in simulations, and apply them to estimate the number of killings and disappearances attributable to different groups in Peru during its internal armed conflict between 1980 and 2000.
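For orientation, the classical two-list calculation that this setting generalizes can be sketched as follows. It uses the Lincoln-Petersen estimator, which assumes the two lists are marginally independent with no covariates; it is not the doubly robust estimator of the abstract, and the function name and toy lists are made up for the example.

```python
def lincoln_petersen(list_a, list_b):
    """Classical two-list population size estimate.

    Assumes the two lists capture individuals independently. Returns
    (n_hat, capture_prob): the estimated population size and the implied
    probability of appearing on at least one list.
    """
    a, b = set(list_a), set(list_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("estimator is undefined when the lists do not overlap")
    n_hat = len(a) * len(b) / overlap
    observed = len(a | b)  # individuals seen on at least one list
    capture_prob = observed / n_hat
    return n_hat, capture_prob


# Toy identifiers: 60 on the first list, 50 on the second, 20 on both.
n_hat, capture_prob = lincoln_petersen(range(0, 60), range(40, 90))
```

With conditional rather than marginal independence, the same ratio would have to hold within each covariate level, which is where the nonparametric estimation problem studied in the abstract arises.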