No Arabic abstract
This article introduces subbagging (subsample aggregating) estimation approaches for big data analysis with memory constraints of computers. Specifically, for the whole dataset with size $N$, $m_N$ subsamples are randomly drawn, and each subsample with a subsample size $k_Nll N$ to meet the memory constraint is sampled uniformly without replacement. Aggregating the estimators of $m_N$ subsamples can lead to subbagging estimation. To analyze the theoretical properties of the subbagging estimator, we adapt the incomplete $U$-statistics theory with an infinite order kernel to allow overlapping drawn subsamples in the sampling procedure. Utilizing this novel theoretical framework, we demonstrate that via a proper hyperparameter selection of $k_N$ and $m_N$, the subbagging estimator can achieve $sqrt{N}$-consistency and asymptotic normality under the condition $(k_Nm_N)/Nto alpha in (0,infty]$. Compared to the full sample estimator, we theoretically show that the $sqrt{N}$-consistent subbagging estimator has an inflation rate of $1/alpha$ in its asymptotic variance. Simulation experiments are presented to demonstrate the finite sample performances. An American airline dataset is analyzed to illustrate that the subbagging estimate is numerically close to the full sample estimate, and can be computationally fast under the memory constraint.
This paper considers fixed effects estimation and inference in linear and nonlinear panel data models with random coefficients and endogenous regressors. The quantities of interest -- means, variances, and other moments of the random coefficients -- are estimated by cross sectional sample moments of GMM estimators applied separately to the time series of each individual. To deal with the incidental parameter problem introduced by the noise of the within-individual estimators in short panels, we develop bias corrections. These corrections are based on higher-order asymptotic expansions of the GMM estimators and produce improved point and interval estimates in moderately long panels. Under asymptotic sequences where the cross sectional and time series dimensions of the panel pass to infinity at the same rate, the uncorrected estimator has an asymptotic bias of the same order as the asymptotic variance. The bias corrections remove the bias without increasing variance. An empirical example on cigarette demand based on Becker, Grossman and Murphy (1994) shows significant heterogeneity in the price effect across U.S. states.
This paper considers inference on fixed effects in a linear regression model estimated from network data. An important special case of our setup is the two-way regression model. This is a workhorse technique in the analysis of matched data sets, such as employer-employee or student-teacher panel data. We formalize how the structure of the network affects the accuracy with which the fixed effects can be estimated. This allows us to derive sufficient conditions on the network for consistent estimation and asymptotically-valid inference to be possible. Estimation of moments is also considered. We allow for general networks and our setup covers both the dense and sparse case. We provide numerical results for the estimation of teacher value-added models and regressions with occupational dummies.
In this study, we develop a novel estimation method of the quantile treatment effects (QTE) under the rank invariance and rank stationarity assumptions. Ishihara (2020) explores identification of the nonseparable panel data model under these assumptions and propose a parametric estimation based on the minimum distance method. However, the minimum distance estimation using this process is computationally demanding when the dimensionality of covariates is large. To overcome this problem, we propose a two-step estimation method based on the quantile regression and minimum distance method. We then show consistency and asymptotic normality of our estimator. Monte Carlo studies indicate that our estimator performs well in finite samples. Last, we present two empirical illustrations, to estimate the distributional effects of insurance provision on household production and of TV watching on child cognitive development.
Factor structures or interactive effects are convenient devices to incorporate latent variables in panel data models. We consider fixed effect estimation of nonlinear panel single-index models with factor structures in the unobservables, which include logit, probit, ordered probit and Poisson specifications. We establish that fixed effect estimators of model parameters and average partial effects have normal distributions when the two dimensions of the panel grow large, but might suffer of incidental parameter bias. We show how models with factor structures can also be applied to capture important features of network data such as reciprocity, degree heterogeneity, homophily in latent variables and clustering. We illustrate this applicability with an empirical example to the estimation of a gravity equation of international trade between countries using a Poisson model with multiple factors.
We develop new semiparametric methods for estimating treatment effects. We focus on a setting where the outcome distributions may be thick tailed, where treatment effects are small, where sample sizes are large and where assignment is completely random. This setting is of particular interest in recent experimentation in tech companies. We propose using parametric models for the treatment effects, as opposed to parametric models for the full outcome distributions. This leads to semiparametric models for the outcome distributions. We derive the semiparametric efficiency bound for this setting, and propose efficient estimators. In the case with a constant treatment effect one of the proposed estimators has an interesting interpretation as a weighted average of quantile treatment effects, with the weights proportional to (minus) the second derivative of the log of the density of the potential outcomes. Our analysis also results in an extension of Hubers model and trimmed mean to include asymmetry and a simplified condition on linear combinations of order statistics, which may be of independent interest.