A race-DC in Big Data


Added by Jun Lu

Publication date 2019

fields Mathematical Statistics

Research language English

Authors Lu Lin - Jun Lu

Methodology


Abstract

The divide-and-combine (DC) strategy has been widely used in the area of big data. Bias correction is crucial in the DC procedure for validly aggregating locally biased estimators, especially when the number of batches of data is large. This paper establishes a race-DC through a residual-adjustment composition estimate (race). The race-DC applies to various types of biased estimators, which include but are not limited to the Lasso estimator, the Ridge estimator and the principal component estimator in linear regression, and the least squares estimator in nonlinear regression. The resulting global estimator is strictly unbiased under the linear model, is acceleratingly bias-reduced in the nonlinear model, and can achieve the theoretical optimality when the number of batches of data is large. Moreover, the race-DC is computationally simple because it is a least squares estimator in a pro forma linear regression. Detailed simulation studies demonstrate that the resulting global estimator is significantly bias-corrected, and that its behavior is comparable with the oracle estimation and is much better than that of the competitors.
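A short calculation may help show why bias correction matters more as the number of batches grows. The notation here ($K$ independent batches of size $m$, total sample size $n = Km$, a common local bias $b_m$, noise variance $\sigma^2/m$) is introduced only for illustration and is not taken from the paper; the combine rule shown is plain averaging, not the race-DC itself.

```latex
\begin{align*}
\hat\theta_k &= \theta + b_m + \varepsilon_k,
  \qquad E[\varepsilon_k] = 0,\quad \operatorname{Var}(\varepsilon_k) = \frac{\sigma^2}{m},
  \qquad k = 1,\dots,K, \\
\bar\theta &= \frac{1}{K}\sum_{k=1}^{K}\hat\theta_k, \\
E[\bar\theta] - \theta &= b_m,
  \qquad \operatorname{Var}(\bar\theta) = \frac{\sigma^2}{Km} = \frac{\sigma^2}{n}, \\
\operatorname{MSE}(\bar\theta) &= b_m^2 + \frac{\sigma^2}{n}.
\end{align*}
```

Under these assumptions, averaging drives the variance down to the full-data rate $\sigma^2/n$ but leaves the bias $b_m$ untouched. For fixed $n$, more batches means a smaller $m$, and local bias typically grows as $m$ shrinks, so the squared-bias term eventually dominates the error of the averaged estimator.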


Optimal Subsampling Algorithms for Big Data Regressions

95 - Mingyao Ai, Jun Yu, Huiming Zhang 2018

To quickly approximate maximum likelihood estimators with massive data, this paper studies the Optimal Subsampling Method under the A-optimality Criterion (OSMAC) for generalized linear models. The consistency and asymptotic normality of the estimator from a general subsampling algorithm are established, and optimal subsampling probabilities under the A- and L-optimality criteria are derived. Furthermore, using Frobenius norm matrix concentration inequalities, finite sample properties of the subsample estimator based on optimal subsampling probabilities are also derived. Since the optimal subsampling probabilities depend on the full data estimate, an adaptive two-step algorithm is developed. Asymptotic normality and optimality of the estimator from this adaptive algorithm are established. The proposed methods are illustrated and evaluated through numerical experiments on simulated and real datasets.
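The two-step idea described above can be sketched for logistic regression as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the sampling score `|y_i - p_i| * ||x_i||` is assumed here as an L-optimality-style rule, the exact probabilities derived in the paper are not given on this page, and the function names (`weighted_logistic_mle`, `two_step_subsample`, `simulate`) are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def weighted_logistic_mle(X, y, w, iters=25):
    """Maximize the weighted logistic log-likelihood by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def two_step_subsample(X, y, r0, r, rng):
    """Pilot estimate from a uniform subsample, then a reweighted second stage."""
    n = X.shape[0]
    # Step 1: a uniform pilot subsample stands in for the full-data estimate.
    idx0 = rng.choice(n, size=r0, replace=True)
    beta_pilot = weighted_logistic_mle(X[idx0], y[idx0], np.ones(r0))
    # Step 2: data-dependent sampling probabilities (assumed score, see above).
    score = np.abs(y - sigmoid(X @ beta_pilot)) * np.linalg.norm(X, axis=1)
    pi = score / score.sum()
    idx1 = rng.choice(n, size=r, replace=True, p=pi)
    # Pool both subsamples; each draw is weighted by 1 / (n * its probability),
    # which equals 1 for the uniform pilot draws.
    idx = np.concatenate([idx0, idx1])
    w = np.concatenate([np.ones(r0), 1.0 / (n * pi[idx1])])
    return weighted_logistic_mle(X[idx], y[idx], w)


def simulate(seed=0, n=100_000, r0=500, r=2000):
    """Generate synthetic logistic data and run the two-step procedure."""
    rng = np.random.default_rng(seed)
    beta_true = np.array([0.5, -0.5, 0.5])
    X = rng.normal(size=(n, 3))
    y = (rng.random(n) < sigmoid(X @ beta_true)).astype(float)
    beta_hat = two_step_subsample(X, y, r0, r, rng)
    return beta_true, beta_hat
```

Calling `simulate()` fits the model on 2,500 sampled rows out of 100,000. The inverse-probability weights are what keep the estimate targeting the full-data maximum likelihood estimator even though the second-stage rows are drawn non-uniformly.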

Methodology Statistics Theory Computation

Experimental Design Issues in Big Data. The Question of Bias

65 - Elena Pesce, Eva Riccomagno, Henry P. Wynn 2017

Data can be collected in scientific studies via a controlled experiment or passive observation. Big data is often collected in a passive way, e.g. from social media. In studies of causation great efforts are made to guard against bias and hidden confounders or feedback which can destroy the identification of causation by corrupting or omitting counterfactuals (controls). Various solutions of these problems are discussed, including randomization.

Methodology

New reconstruction and data processing methods for regression and interpolation analysis of multidimensional big data

88 - Yuri K. Shestopaloff, Alexander Y. Shestopaloff 2017

The problems of computational data processing involving regression, interpolation, reconstruction and imputation for multidimensional big datasets are becoming more important, because of the availability of data and their widespread use in business, technological, scientific and other applications. The existing methods often have limitations, which either prevent many data processing tasks or make them difficult to accomplish. The problems usually relate to algorithm accuracy, applicability, performance (computational and algorithmic), demands for computational resources, both in terms of power and memory, and difficulty working with high dimensions. Here, we propose a new concept and introduce two methods, which use local area predictors (input data) for finding outcomes. One method uses a gradient-based approach, while the second one employs an introduced family of smooth approximating functions. The new methods are free from many drawbacks of existing approaches. They are practical, have a very wide range of applicability, provide high accuracy and excellent computational performance, are fit for parallel computing, and are very well suited for processing high-dimensional big data. The methods also provide multidimensional outcomes, when needed. We present numerical examples of up to one hundred dimensions, and report in detail the performance characteristics and various properties of the new methods.

Methodology

Vecchia-Laplace approximations of generalized Gaussian processes for big non-Gaussian spatial data

212 - Daniel Zilber, Matthias Katzfuss 2019

Generalized Gaussian processes (GGPs) are highly flexible models that combine latent GPs with potentially non-Gaussian likelihoods from the exponential family. GGPs can be used in a variety of settings, including GP classification, nonparametric count regression, modeling non-Gaussian spatial data, and analyzing point patterns. However, inference for GGPs can be analytically intractable, and large datasets pose computational challenges due to the inversion of the GP covariance matrix. We propose a Vecchia-Laplace approximation for GGPs, which combines a Laplace approximation to the non-Gaussian likelihood with a computationally efficient Vecchia approximation to the GP, resulting in a simple, general, scalable, and accurate methodology. We provide numerical studies and comparisons on simulated and real spatial data. Our methods are implemented in a freely available R package.

Methodology Computation

A Global Bias-Correction DC Method for Biased Estimation under Memory Constraint

54 - Lu Lin, Feng Li 2019

This paper establishes a global bias-correction divide-and-conquer (GBC-DC) rule for biased estimation under memory constraints. To introduce the new estimator, a closed representation of the local estimators obtained from the data in each batch is adopted, with the aim of formulating a pro forma linear regression between the local estimators and the true parameter of interest. The least squares method is then used within this framework to composite a global estimator of the parameter. Thus, the main advantage over the classical DC method is that the new GBC-DC method can absorb the information hidden in the statistical structure and the variables in each batch of data. Consequently, the resulting global estimator is strictly unbiased even if the local estimator has a non-negligible bias. Moreover, the global estimator is consistent, and can even achieve root-$n$ consistency, without any constraint on the number of batches. Another attractive feature of the new method is that it is computationally simple and efficient, requiring neither an iterative algorithm nor local bias-correction. Specifically, the proposed GBC-DC method applies to various biased estimations such as shrinkage-type estimation and nonparametric regression estimation. Detailed simulation studies demonstrate that the proposed GBC-DC approach is significantly bias-corrected, and that its behavior is comparable with the full data estimation and is much better than that of the competitors.
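For a concrete feel of the "pro forma linear regression" idea, the sketch below works it out for ridge regression. It is one possible reading, not the authors' algorithm. Under a linear model, the ridge estimate from batch $k$ satisfies $E[\hat\beta_k \mid X_k] = A_k\beta$ with $A_k = (X_k^\top X_k + \lambda I)^{-1} X_k^\top X_k$, so the stacked local estimates can be regressed on the stacked $A_k$ by least squares. The function names, the penalty `lam`, and the simulation settings are hypothetical.

```python
import numpy as np


def local_ridge(X, y, lam):
    """Ridge estimate for one batch, plus A with E[beta_hat | X] = A @ beta."""
    p = X.shape[1]
    gram = X.T @ X
    reg = gram + lam * np.eye(p)
    beta_hat = np.linalg.solve(reg, X.T @ y)
    A = np.linalg.solve(reg, gram)
    return beta_hat, A


def combine_by_average(betas):
    """Classical DC combine: plain average of the local estimates."""
    return np.mean(betas, axis=0)


def combine_by_regression(betas, As):
    """Least squares of the stacked local estimates on the stacked A matrices."""
    lhs = sum(A.T @ A for A in As)
    rhs = sum(A.T @ b for A, b in zip(As, betas))
    return np.linalg.solve(lhs, rhs)


def simulate(seed=0, n_batches=200, batch_size=100, lam=50.0):
    """Compare the two combine rules on synthetic linear-model data."""
    rng = np.random.default_rng(seed)
    beta = np.array([2.0, -1.0, 0.5])
    betas, As = [], []
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, beta.size))
        y = X @ beta + rng.normal(size=batch_size)
        b, A = local_ridge(X, y, lam)
        # Only a p-vector and a p-by-p matrix are kept per batch.
        betas.append(b)
        As.append(A)
    return beta, combine_by_average(betas), combine_by_regression(betas, As)
```

With a heavy penalty relative to the batch size, the averaged estimate returned by `simulate()` stays shrunk toward zero no matter how many batches are added, while the regression-based combine recovers the true coefficients closely. Each batch contributes only its local estimate and a small matrix, which fits the memory-constrained setting the abstract describes.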

Methodology Statistics Theory