Introduction: The tau statistic is a recent second-order correlation function that can assess the magnitude and range of global spatiotemporal clustering from epidemiological data containing geolocations of individual cases and, usually, disease onset times. This is the first review of its use, and of the aspects of its computation and presentation that could affect the inferences drawn and bias estimates of the statistic. Methods: Using Google Scholar, we searched for papers or preprints that cited the papers that first defined/reformed the statistic. We tabulated their key characteristics to understand the statistic's development since 2012. Results: Only half of the 16 studies found were considered to be using true tau statistics, but their inclusion in the review still provided important insights into their analysis motivations. All papers that used graphical hypothesis testing and parameter estimation used incorrect methods. There is a lack of clarity over how to choose the time-relatedness interval used to relate cases and the distance band set, both of which are required to calculate the statistic. Some studies demonstrated nuanced applications of the tau statistic in settings with unusual data or time relation variables, which enriched understanding of its possibilities. We noticed a gap in the estimators available to account for variable person-time at risk. Discussion: Our review comprehensively covers current uses of the tau statistic for descriptive analysis, graphical hypothesis testing, and parameter estimation of spatiotemporal clustering. We also define a new estimator of the tau statistic for disease rates. There are still open questions about the implementation of the tau statistic, which we hope this review inspires others to research.
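To make the roles of the time-relatedness interval and the distance band concrete, here is a minimal sketch of an odds-ratio form of the statistic. The abstract does not give an estimator, so everything here is an assumption for illustration: the function name `tau_odds`, the rule that two cases are related when their onset times differ by at most `t_max`, and the half-open band `[d_lo, d_hi)`. It compares the odds of a related pair within the band to the odds over all distances.

```python
import math
from itertools import combinations

def tau_odds(cases, d_lo, d_hi, t_max):
    """Illustrative odds-ratio tau for the distance band [d_lo, d_hi).

    cases: list of (x, y, onset_time) tuples (hypothetical input layout).
    Two cases are treated as time-related when |t_i - t_j| <= t_max
    (an assumed relatedness rule, not one taken from the abstract).
    """
    rel_band = unrel_band = rel_all = unrel_all = 0
    for (x1, y1, t1), (x2, y2, t2) in combinations(cases, 2):
        related = abs(t1 - t2) <= t_max
        in_band = d_lo <= math.hypot(x1 - x2, y1 - y2) < d_hi
        if related:
            rel_all += 1
            rel_band += in_band
        else:
            unrel_all += 1
            unrel_band += in_band
    # The ratio is undefined when any denominator is empty.
    if unrel_band == 0 or unrel_all == 0 or rel_all == 0:
        return float("nan")
    # Odds of relatedness inside the band over odds at any distance.
    return (rel_band / unrel_band) / (rel_all / unrel_all)

# Toy data: two tight, time-related pairs plus unrelated neighbours.
toy_cases = [(0, 0, 0), (1, 0, 1), (10, 0, 20),
             (11, 0, 21), (0, 1, 40), (10, 1, 60)]
tau_near = tau_odds(toy_cases, 0.0, 2.0, t_max=2)
```

In this sketch a value above 1 indicates that time-related pairs are over-represented at short distances. Changing `t_max` or the band edges changes which pairs are counted, which is why both choices matter for the resulting estimate.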
The tau statistic $\tau$ uses geolocation and, usually, symptom onset time to assess global spatiotemporal clustering from epidemiological data. We test different factors that could affect graphical hypothesis tests of clustering or bias clustering range estimates based on the statistic, by comparison with a baseline analysis of an open access measles dataset. From re-analysing these data we find that the spatial bootstrap sampling method used to construct the confidence interval (CI) for the tau estimate, and the CI type, can bias clustering range estimates. We suggest that the bias-corrected and accelerated (BCa) CI is essential for asymmetric sample bootstrap distributions of tau estimates. We also find evidence against no spatiotemporal clustering, $p$-value $\in [0, 0.014]$ (global envelope test). We develop a tau-specific modification of the Loh & Stein spatial bootstrap sampling method, which gives more precise bootstrapped tau estimates and a 20% higher estimated clustering endpoint than previously published (36.0 m; 95% BCa CI (14.9, 46.6), vs 30 m), equivalent to a 44% increase in the clustering area of elevated disease odds. What appears to be a modest radial bias in the range estimate is more than doubled on the areal scale, to which public health resources are proportional. This difference could have important consequences for control. Correct practice of hypothesis testing of no clustering and clustering range estimation of the tau statistic are illustrated in the Graphical abstract. We advocate proper implementation of this useful statistic, ultimately to reduce inaccuracies in control policy decisions made during disease clustering analysis.
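The step from a 20% radial difference to a 44% areal difference can be checked with a short calculation. This sketch assumes the clustering area is a disc whose radius is the estimated clustering endpoint, so area scales with the square of the radius. The function name `areal_increase` is hypothetical.

```python
def areal_increase(r_old, r_new):
    """Relative increase in disc area when the radius grows from r_old to r_new.

    Assumes a circular clustering area, so area is proportional to radius squared.
    """
    return (r_new / r_old) ** 2 - 1

# Endpoints quoted in the abstract: 30 m (previously published) vs 36.0 m.
radial = 36.0 / 30.0 - 1             # relative increase in radius
areal = areal_increase(30.0, 36.0)   # relative increase in area
```

Here `radial` is about 0.20 and `areal` is about 0.44, since 1.2 squared is 1.44.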
Inference on vertex-aligned graphs is of wide theoretical and practical importance. There are, however, few flexible and tractable statistical models for correlated graphs, and even fewer comprehensive approaches to parametric inference on data arising from such graphs. In this paper, we consider the correlated Bernoulli random graph model (allowing different Bernoulli coefficients and edge correlations for different pairs of vertices), and we introduce a new variance-reducing technique -- called \emph{balancing} -- that can refine estimators for model parameters. Specifically, we construct a disagreement statistic and show that it is complete and sufficient; balancing can be interpreted as Rao-Blackwellization with this disagreement statistic. We show that for unbiased estimators of functions of model parameters, balancing generates uniformly minimum variance unbiased estimators (UMVUEs). However, even when unbiased estimators for model parameters do \emph{not} exist -- which, as we prove, is the case with both the heterogeneity correlation and the total correlation parameters -- balancing is still useful, and lowers mean squared error. In particular, we demonstrate how balancing can improve the efficiency of the alignment strength estimator for the total correlation, a parameter that plays a critical role in graph matchability and graph matching runtime complexity.
We propose a hierarchical Bayesian model to estimate the proportional contribution of source populations to a newly founded colony. Samples are derived from the first-generation offspring in the colony, but mating may occur preferentially among migrants from the same source population. Genotypes of the newly founded colony and source populations are used to estimate the mixture proportions, and the mixture proportions are related to environmental and demographic factors that might affect the colonizing process. We estimate an assortative mating coefficient, mixture proportions, and regression relationships between environmental factors and the mixture proportions in a single hierarchical model. The first-stage likelihood for genotypes in the newly founded colony is a mixture multinomial distribution reflecting the colonizing process. The environmental and demographic data are incorporated into the model through a hierarchical prior structure. A simulation study is conducted to investigate the performance of the model under different levels of population divergence and different numbers of genetic markers included in the analysis. We use Markov chain Monte Carlo (MCMC) simulation to conduct inference for the posterior distributions of model parameters. We apply the model to a data set derived from grey seals in the Orkney Islands, Scotland. We compare our model with a similar model previously used to analyze these data. The results from both the simulation and the application to real data indicate that our model provides better estimates for the covariate effects.
Particle physics experiments such as those run in the Large Hadron Collider result in huge quantities of data, which are boiled down to a few numbers from which it is hoped that a signal will be detected. We discuss a simple probability model for this and derive frequentist and noninformative Bayesian procedures for inference about the signal. Both are highly accurate in realistic cases, with the frequentist procedure having the edge for interval estimation, and the Bayesian procedure yielding slightly better point estimates. We also argue that the significance, or $p$-value, function based on the modified likelihood root provides a comprehensive presentation of the information in the data and should be used for inference.
How should social scientists understand and communicate the uncertainty of statistically estimated causal effects? It is well known that the conventional significance-vs.-insignificance approach is associated with misunderstandings and misuses. Behavioral research suggests people understand uncertainty more appropriately on a numerical, continuous scale than on a verbal, discrete scale. Motivated by this background, I propose presenting the probabilities of different effect sizes. Probability is an intuitive continuous measure of uncertainty. It allows researchers to better understand and communicate the uncertainty of statistically estimated effects. In addition, unlike the conventional approaches, my approach needs no decision threshold for an uncertainty measure or an effect size, allowing researchers to be agnostic about a decision threshold such as p < 5% and its justification. I apply my approach to a previous social scientific study, showing that it enables richer inference than the significance-vs.-insignificance approach taken by the original study. The accompanying R package makes my approach easy to implement.
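As a rough illustration of presenting probabilities of different effect sizes, the sketch below computes the probability that an effect exceeds each of several candidate sizes. The abstract does not say how these probabilities are obtained, so this assumes a normal approximation centred on a point estimate with a given standard error. It does not reproduce the accompanying R package. The function name and the numbers are hypothetical.

```python
import math

def prob_effect_exceeds(estimate, std_error, threshold):
    """P(effect > threshold) when effect ~ Normal(estimate, std_error**2).

    The normal approximation is an assumption made for this sketch only.
    """
    z = (estimate - threshold) / std_error
    # Standard normal CDF evaluated at z, written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical estimate and standard error, evaluated over a grid of effect sizes.
estimate, std_error = 1.0, 0.5
thresholds = [0.0, 0.5, 1.0, 1.5]
probabilities = [prob_effect_exceeds(estimate, std_error, t) for t in thresholds]
```

Reporting the whole list, rather than a single significant-or-not verdict, shows how the probability falls as the effect size of interest grows.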