
Efficient design of geographically-defined clusters with spatial autocorrelation

Posted by: Samuel Watson
Publication date: 2020
Research field: Mathematical Statistics
Paper language: English
Author: Samuel I. Watson





Clusters form the basis of a number of research study designs, including survey and experimental studies. Cluster-based designs can be less costly but also less efficient than individual-based designs due to correlation between individuals within the same cluster. Their design typically relies on \textit{ad hoc} choices of correlation parameters, and is insensitive to variations in cluster design. This article examines how to efficiently design clusters where they are geographically defined by demarcating areas incorporating individuals and households or other units. Using geostatistical models for spatial autocorrelation, we generate approximations to the within-cluster average covariance in order to estimate the effective sample size given particular cluster design parameters. We show how the number of enumerated locations, cluster area, proportion sampled, and sampling method affect the efficiency of the design, and consider the optimization problem of choosing the most efficient design subject to budgetary constraints. We also consider how the parameters from these approximations can be interpreted simply in terms of `real-world' quantities and used in design analysis.
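
To make the effective-sample-size idea concrete, here is a minimal sketch (not the paper's exact approximation): it draws sampling locations uniformly in a circular cluster, averages pairwise correlations under an assumed exponential covariance function, and converts the average within-cluster correlation into an effective sample size via the usual design-effect formula. The cluster radius, correlation range, and sample sizes are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)

    def avg_within_cluster_correlation(radius_km, range_km, m, n_draws=2000):
        """Monte Carlo estimate of the average pairwise correlation among m
        locations sampled uniformly in a disc of the given radius, assuming an
        exponential correlation function rho(d) = exp(-d / range_km)."""
        total = 0.0
        for _ in range(n_draws):
            # uniform points in a disc via polar sampling
            r = radius_km * np.sqrt(rng.random(m))
            theta = 2 * np.pi * rng.random(m)
            xy = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
            d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
            rho = np.exp(-d / range_km)
            # average over off-diagonal pairs only
            total += (rho.sum() - m) / (m * (m - 1))
        return total / n_draws

    def effective_sample_size(n_clusters, m, rho_bar):
        """Standard design-effect formula for an (approximately) exchangeable
        within-cluster correlation rho_bar:
        n_eff = n_clusters * m / (1 + (m - 1) * rho_bar)."""
        return n_clusters * m / (1 + (m - 1) * rho_bar)

    # Illustrative design: 20 clusters of radius 0.5 km, 25 individuals each,
    # spatial correlation range 1 km.
    rho_bar = avg_within_cluster_correlation(radius_km=0.5, range_km=1.0, m=25)
    print(f"average within-cluster correlation ~ {rho_bar:.3f}")
    print(f"effective sample size ~ {effective_sample_size(20, 25, rho_bar):.1f} "
          f"out of {20 * 25} individuals")

In this simplified setting, enlarging the cluster area relative to the correlation range lowers the average within-cluster correlation and so raises the effective sample size per individual, which is the kind of trade-off the budget-constrained optimization over cluster design parameters works with.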




Read also

Physical or geographic location proves to be an important feature in many data science models, because many diverse natural and social phenomena have a spatial component. Spatial autocorrelation measures the extent to which locally adjacent observations of the same phenomenon are correlated. Although statistics like Moran's $I$ and Geary's $C$ are widely used to measure spatial autocorrelation, they are slow: all popular methods run in $\Omega(n^2)$ time, rendering them unusable for large data sets, or long time-courses with moderate numbers of points. We propose a new $S_A$ statistic based on the notion that the variance observed when merging pairs of nearby clusters should increase slowly for spatially autocorrelated variables. We give a linear-time algorithm to calculate $S_A$ for a variable with an input agglomeration order (available at https://github.com/aamgalan/spatial_autocorrelation). For a typical dataset of $n \approx 63{,}000$ points, our $S_A$ autocorrelation measure can be computed in 1 second, versus 2 hours or more for Moran's $I$ and Geary's $C$. Through simulation studies, we demonstrate that $S_A$ identifies spatial correlations in variables generated with a spatially dependent model half an order of magnitude earlier than either Moran's $I$ or Geary's $C$. Finally, we prove several theoretical properties of $S_A$: namely, that it behaves as a true correlation statistic, and is invariant under addition or multiplication by a constant.
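
For context, a minimal and deliberately naive Moran's $I$ computation is sketched below; the double loop over all location pairs is what makes the classical statistics quadratic in $n$ and motivates a linear-time alternative such as $S_A$. The inverse-distance weights and toy coordinates are illustrative choices, not those of the paper.

    import numpy as np

    def morans_I(values, coords):
        """Naive Moran's I with inverse-distance spatial weights.
        Runs in O(n^2) time, which is the bottleneck S_A is designed to avoid."""
        x = np.asarray(values, dtype=float)
        n = len(x)
        z = x - x.mean()
        num = 0.0
        w_sum = 0.0
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = np.linalg.norm(coords[i] - coords[j])
                w = 1.0 / d  # illustrative inverse-distance weight
                num += w * z[i] * z[j]
                w_sum += w
        return (n / w_sum) * num / (z @ z)

    # Toy example: a smooth spatial gradient should give positive autocorrelation.
    rng = np.random.default_rng(0)
    coords = rng.random((200, 2))
    values = coords[:, 0] + 0.1 * rng.standard_normal(200)
    print(f"Moran's I = {morans_I(values, coords):.3f}")
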
We develop a new robust geographically weighted regression method in the presence of outliers. We embed the standard geographically weighted regression in a robust objective function based on $\gamma$-divergence. A novel feature of the proposed approach is that the two tuning parameters that control robustness and spatial smoothness are automatically tuned in a data-dependent manner. Further, the proposed method can produce robust standard error estimates of the robust estimator and gives us a reasonable quantity for local outlier detection. We demonstrate that the proposed method is superior to the existing robust version of geographically weighted regression through simulation and data analysis.
The performance of Markov chain Monte Carlo calculations is determined by both the ensemble variance of the Monte Carlo estimator and the autocorrelation of the Markov process. In order to study autocorrelation, binning analysis is commonly used, where the autocorrelation is estimated from results grouped into bins of logarithmically increasing sizes. In this paper, we show that binning analysis comes with a bias that can be eliminated by combining bin sizes. We then show that binning analysis can be performed on-the-fly with linear overhead in time and logarithmic overhead in memory with respect to the sample size. We then show that binning analysis contains information not only about the integrated effect of autocorrelation, but can also be used to estimate the spectrum of autocorrelation lengths, yielding the height of phase space barriers in the system. Finally, we revisit the Ising model and apply the proposed method to recover its autocorrelation spectra.
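
As a reference point for the binning idea discussed above, the sketch below implements textbook binning analysis (not the bias-corrected or on-the-fly variant proposed in the paper): samples from a correlated chain are grouped into bins of doubling size, and the estimated error of the mean grows with bin size until the bins exceed the autocorrelation time, at which point it plateaus. The AR(1) test chain is an illustrative stand-in for Markov chain Monte Carlo output.

    import numpy as np

    def binning_analysis(samples, max_levels=12):
        """Textbook binning analysis: repeatedly average neighbouring samples
        and report the naive standard error of the mean at each bin size.
        The error estimate plateaus once bins are longer than the
        autocorrelation time of the chain."""
        x = np.asarray(samples, dtype=float)
        errors = []
        for level in range(max_levels):
            n = len(x)
            if n < 32:            # too few bins for a stable variance estimate
                break
            errors.append((2 ** level, np.sqrt(x.var(ddof=1) / n)))
            x = 0.5 * (x[: n // 2 * 2 : 2] + x[1 : n // 2 * 2 : 2])  # merge pairs
        return errors

    # Illustrative correlated chain: AR(1) with coefficient 0.9.
    rng = np.random.default_rng(0)
    n, phi = 2 ** 16, 0.9
    chain = np.empty(n)
    chain[0] = rng.standard_normal()
    for t in range(1, n):
        chain[t] = phi * chain[t - 1] + rng.standard_normal()

    for bin_size, err in binning_analysis(chain):
        print(f"bin size {bin_size:5d}: estimated error of mean {err:.4f}")
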
Estimation of autocorrelations and spectral densities is of fundamental importance in many fields of science, from identifying pulsar signals in astronomy to measuring heart beats in medicine. In circumstances where one is interested in specific autocorrelation functions that do not fit into any simple families of models, such as auto-regressive moving average (ARMA), estimating model parameters is generally approached in one of two ways: by fitting the model autocorrelation function to a non-parametric autocorrelation estimate via regression analysis, or by fitting the model autocorrelation function directly to the data via maximum likelihood. Prior literature suggests that variogram regression yields parameter estimates of comparable quality to maximum likelihood. In this letter we demonstrate that, as sample size increases, the accuracy of the maximum-likelihood estimates (MLE) ultimately improves by orders of magnitude beyond that of variogram regression. For relatively continuous and Gaussian processes, this improvement can occur for sample sizes of less than 100. Moreover, even where the accuracy of these methods is comparable, the MLE remains almost universally better and, more critically, variogram regression does not provide reliable confidence intervals. Inaccurate regression parameter estimates are typically accompanied by underestimated standard errors, whereas likelihood provides reliable confidence intervals.
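
To illustrate the maximum-likelihood side of that comparison, the sketch below fits the correlation length of an assumed exponential autocorrelation model by maximizing the exact Gaussian log-likelihood with scipy; the simulated series, the model family, and the optimizer settings are illustrative choices rather than those of the letter.

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)

    # Simulate a Gaussian process on a regular time grid with exponential
    # autocorrelation rho(h) = exp(-|h| / tau_true).
    n, tau_true = 200, 5.0
    t = np.arange(n)
    lags = np.abs(t[:, None] - t[None, :])
    cov_true = np.exp(-lags / tau_true)
    y = rng.multivariate_normal(np.zeros(n), cov_true)

    def neg_log_likelihood(log_tau):
        """Exact Gaussian negative log-likelihood of the series under an
        exponential autocorrelation model with correlation length exp(log_tau)."""
        cov = np.exp(-lags / np.exp(log_tau)) + 1e-8 * np.eye(n)
        return -multivariate_normal(mean=np.zeros(n), cov=cov).logpdf(y)

    res = minimize_scalar(neg_log_likelihood, bounds=(-2.0, 4.0), method="bounded")
    print(f"true correlation length  : {tau_true:.2f}")
    print(f"maximum-likelihood fit   : {np.exp(res.x):.2f}")
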
Although a number of studies have developed fast geographically weighted regression (GWR) algorithms for large samples, none of them has achieved linear-time estimation, which is considered a requisite for big data analysis in machine learning, geostatistics, and related domains. Against this backdrop, this study proposes a scalable GWR (ScaGWR) for large datasets. The key improvement is the calibration of the model through a pre-compression of the matrices and vectors whose size depends on the sample size, prior to the leave-one-out cross-validation, which is the heaviest computational step in conventional GWR. This pre-compression allows us to run the proposed GWR extension so that its computation time increases linearly with the sample size. With this improvement, the ScaGWR can be calibrated with one million observations without parallelization. Moreover, the ScaGWR estimator can be regarded as an empirical Bayesian estimator that is more stable than the conventional GWR estimator. We compare the ScaGWR with the conventional GWR in terms of estimation accuracy and computational efficiency using a Monte Carlo simulation. Then, we apply these methods to a US income analysis. The code for ScaGWR is available in the R package scgwr. The code is also embedded into C++ and implemented in another R package, GWmodel.
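
As a point of reference for the computational bottleneck described above, here is a minimal conventional GWR calibration with leave-one-out cross-validation over a Gaussian kernel bandwidth. It is the step whose cost ScaGWR's pre-compression avoids, not an implementation of ScaGWR itself, and the simulated data and candidate bandwidths are illustrative.

    import numpy as np

    def gwr_loocv_score(X, y, coords, bandwidth):
        """Leave-one-out CV score for conventional GWR with a Gaussian kernel.
        Each held-out point requires its own weighted least-squares fit, which
        is the quadratic-in-sample-size step that ScaGWR pre-compresses away."""
        n = len(y)
        sse = 0.0
        for i in range(n):
            d = np.linalg.norm(coords - coords[i], axis=1)
            w = np.exp(-0.5 * (d / bandwidth) ** 2)
            w[i] = 0.0                      # leave observation i out
            Xw = X * w[:, None]
            beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)
            sse += (y[i] - X[i] @ beta) ** 2
        return sse

    # Illustrative data: a spatially varying slope over 500 locations.
    rng = np.random.default_rng(0)
    n = 500
    coords = rng.random((n, 2))
    X = np.column_stack([np.ones(n), rng.standard_normal(n)])
    slope = 1.0 + 2.0 * coords[:, 0]        # coefficient drifts west to east
    y = X[:, 0] + slope * X[:, 1] + 0.2 * rng.standard_normal(n)

    for bw in (0.05, 0.1, 0.2, 0.4):
        print(f"bandwidth {bw:.2f}: LOOCV SSE = {gwr_loocv_score(X, y, coords, bw):.1f}")
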