Many ecological studies and conservation policies are based on field observations of species, which can be affected by systematic variability introduced by the observation process. A recently introduced causal modeling technique called half-sibling regression can detect and correct for systematic errors in measurements of multiple independent random variables. However, it will remove intrinsic variability if the variables are dependent, and therefore does not apply to many situations, including modeling of species counts that are controlled by common causes. We present a technique called three-quarter sibling regression to partially overcome this limitation. It can filter the effect of systematic noise when the latent variables have observed common causes. We provide theoretical justification of this approach, demonstrate its effectiveness on synthetic data, and show that it reduces systematic detection variability due to moon brightness in moth surveys.
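As a rough illustration of the baseline idea, half-sibling regression estimates the latent quantity as the observation minus its conditional expectation given other measurements that share the same systematic noise. The sketch below is not the three-quarter sibling method of the abstract. It assumes a linear regression for the conditional expectation and uses synthetic data; the function name `half_sibling_correct` and every variable are hypothetical.

```python
import numpy as np

def half_sibling_correct(y, X):
    """Return y minus a linear least-squares estimate of E[y | X].

    Assumption: a linear model with intercept is adequate for E[y | X].
    """
    A = np.column_stack([np.ones(len(y)), X])      # design matrix with intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # ordinary least squares fit
    return y - A @ coef                            # residual = corrected estimate

# Synthetic example (all quantities are made up for illustration).
rng = np.random.default_rng(0)
n = 2000
noise = rng.normal(size=n)                 # shared systematic noise
q = rng.normal(size=n)                     # latent quantity of interest, independent of X
y = q + 2.0 * noise                        # observation contaminated by the noise
# Five other measurements affected by the same noise but independent of q.
X = np.column_stack([noise + 0.1 * rng.normal(size=n) for _ in range(5)])

q_hat = half_sibling_correct(y, X)
corr_before = np.corrcoef(y, q)[0, 1]      # agreement of raw observation with q
corr_after = np.corrcoef(q_hat, q)[0, 1]   # agreement after correction
```

Because the regression includes an intercept, the corrected values recover the latent quantity only up to its mean. If `q` also depended on `X` through a common cause, this subtraction would remove part of the signal, which is the limitation the abstract describes.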
Field observations form the basis of many scientific studies, especially in the ecological and social sciences. Despite efforts to conduct such surveys in a standardized way, observations can be prone to systematic measurement errors. Removing the systematic variability introduced by the observation process, where possible, can greatly increase the value of these data. Existing non-parametric techniques for correcting such errors assume linear additive noise models, which leads to biased estimates when they are applied to generalized linear models (GLMs). We present an approach based on residual functions to address this limitation. We then demonstrate its effectiveness on synthetic data and show that it reduces systematic detection variability in moth surveys.
Sliced inverse regression is one of the most popular sufficient dimension reduction methods. It was originally designed for independent and identically distributed data and was recently extended to serially and spatially dependent data. In this work we extend it to spatially dependent data in which the response may also depend on neighbouring covariates when the observations are taken on a grid-like structure, as is often the case in econometric spatial regression applications. We suggest guidelines for deciding on the dimension of the subspace of interest and on which spatial lag might be of interest when modeling the response. These guidelines are supported by a simulation study.
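For orientation, the classical sliced inverse regression estimator for independent and identically distributed data can be sketched as follows. This is only the standard version, not the spatial extension proposed in the abstract; the function name `sir_directions`, the number of slices, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=1):
    """Classical SIR: estimate n_dirs directions of the dimension reduction subspace."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    # Inverse square root of the covariance, used to standardize the covariates.
    evals, evecs = np.linalg.eigh(cov)
    cov_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ cov_inv_sqrt
    # Slice the observations by the ordered response and average Z within slices.
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)       # weighted covariance of slice means
    # Leading eigenvectors of M, mapped back to the original covariate scale.
    _, v = np.linalg.eigh(M)                       # eigenvalues in ascending order
    top = v[:, ::-1][:, :n_dirs]
    B = cov_inv_sqrt @ top
    return B / np.linalg.norm(B, axis=0)

# Synthetic single-index example (made up for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
beta = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
index = X @ beta
y = index + 0.5 * index ** 3 + 0.2 * rng.normal(size=2000)

B = sir_directions(X, y)
cosine = abs(B[:, 0] @ beta)   # closeness of the estimated direction to beta (sign is arbitrary)
```

The estimated direction is identified only up to sign and scale, so the comparison uses the absolute cosine with the true direction.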
This paper investigates the problem of making inference about a parametric model for the regression of an outcome variable $Y$ on covariates $(V,L)$ when data are fused from two separate sources, one of which contains information only on $(V, Y)$ while the other contains information only on covariates. This data fusion setting may be viewed as an extreme form of missing data in which the probability of observing complete data $(V,L,Y)$ on any given subject is zero. We have developed a large class of semiparametric estimators, which includes doubly robust (DR) estimators, of the regression coefficients in fused data. The proposed method is DR in that it is consistent and asymptotically normal if, in addition to the model of interest, we correctly specify a model for either the data source process under an ignorability assumption, or the distribution of unobserved covariates. We evaluate the performance of our various estimators via an extensive simulation study, and apply the proposed methods to investigate the relationship between net asset value and total expenditure among U.S. households in 1998, while controlling for potential confounders including income and other demographic variables.
Correlated data are ubiquitous in today's data-driven society. A fundamental task in analyzing these data is to understand, characterize and utilize the correlations in them in order to conduct valid inference. Yet explicit regression analysis of correlations has so far been limited to longitudinal data, a special form of correlated data, while implicit analysis via mixed-effects models lacks generality as a full inferential tool. This paper proposes a novel regression approach for modelling the correlation structure, leveraging a new generalized z-transformation. This transformation maps correlation matrices, which are constrained to be positive definite, to vectors with unrestricted support, and it is order-invariant. Building on these two properties, we develop a regression model to relate the transformed parameters to any covariates. We show that, coupled with a mean and a variance regression model, the use of maximum likelihood leads to asymptotically normal parameter estimates and, crucially, enables statistical inference for all the parameters. The performance of our framework is demonstrated in extensive simulations. More importantly, we illustrate the use of our model with the analysis of the classroom data, a highly unbalanced multilevel clustered data set with within-class and within-school correlations, and the analysis of the malaria immune response data in Benin, a longitudinal data set with time-dependent covariates in addition to time. Our analyses reveal new insights.
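One known transformation with the two stated properties takes the off-diagonal elements of the matrix logarithm of the correlation matrix; for a 2 x 2 matrix it reduces to Fisher's z-transformation. Whether this is exactly the transformation used in the paper is an assumption here, and the helper name `corr_to_vector` is hypothetical. The sketch covers only the forward map, not its inverse.

```python
import numpy as np

def corr_to_vector(C):
    """Map a positive definite correlation matrix to an unconstrained vector.

    Assumption: the map is the strictly lower-triangular part of log(C).
    """
    evals, evecs = np.linalg.eigh(C)                 # C is symmetric positive definite
    log_C = (evecs * np.log(evals)) @ evecs.T        # matrix logarithm via eigendecomposition
    rows, cols = np.tril_indices(C.shape[0], k=-1)   # strictly lower triangle
    return log_C[rows, cols]

# 2 x 2 case: the single element equals Fisher's z, arctanh(rho).
gamma_2x2 = corr_to_vector(np.array([[1.0, 0.5],
                                     [0.5, 1.0]]))

# 3 x 3 case: three unconstrained real numbers.
C3 = np.array([[1.0, 0.8, 0.3],
               [0.8, 1.0, -0.2],
               [0.3, -0.2, 1.0]])
gamma_3x3 = corr_to_vector(C3)
```

Under this map, reordering the variables only permutes the elements of the resulting vector, which is one way to read the order-invariance mentioned above.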
In this study, we develop a novel method for estimating quantile treatment effects (QTE) under the rank invariance and rank stationarity assumptions. Ishihara (2020) explores identification of the nonseparable panel data model under these assumptions and proposes a parametric estimator based on the minimum distance method. However, this minimum distance estimation is computationally demanding when the dimensionality of the covariates is large. To overcome this problem, we propose a two-step estimation method based on quantile regression and the minimum distance method. We then show consistency and asymptotic normality of our estimator. Monte Carlo studies indicate that our estimator performs well in finite samples. Finally, we present two empirical illustrations that estimate the distributional effects of insurance provision on household production and of TV watching on child cognitive development.