Environmental variability often has substantial impacts on natural populations and communities through its effects on the performance of individuals. Because organisms' responses to environmental conditions are often nonlinear (e.g., decreasing performance on both sides of an optimal temperature), the mean response is often different from the response in the mean environment. Ye et al. (2020) proposed testing for the presence of such variance effects on individual or population growth rates by estimating the Jensen Effect, the difference in average growth rates under varying versus fixed environments, in functional single index models for environmental effects on growth. In this paper, we extend this analysis to the effect of environmental variance on reproduction and survival, which have count and binary outcomes. In the standard generalized linear models used to analyze such data, the direction of the Jensen Effect is tacitly assumed a priori by the model's link function. Here we extend the methods of Ye et al. (2020) using a generalized single index model to test whether this assumed direction is contradicted by the data. We show that our test has reasonable power under mild alternatives, but requires sample sizes that are larger than are often available. We demonstrate our methods on a long-term time series of plant ground cover on the Idaho steppe.
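The Jensen Effect described above can be illustrated numerically. The following is a minimal sketch, not the estimator used in the paper: the response curves, the temperature distribution, and all names (`jensen_effect`, `unimodal`, `log_link_mean`) are assumptions chosen for illustration. It computes the mean response in a varying environment minus the response at the mean environment, once for a unimodal performance curve centered on the mean temperature and once for the exponential mean function implied by a log link.

```python
import numpy as np


def jensen_effect(response, env):
    """Mean response under the varying environment minus the response at the mean environment."""
    env = np.asarray(env, dtype=float)
    return float(np.mean(response(env)) - response(np.mean(env)))


def unimodal(x, optimum=20.0, width=5.0):
    # Hypothetical performance curve that declines on both sides of an optimum.
    return np.exp(-((x - optimum) ** 2) / (2.0 * width ** 2))


def log_link_mean(x, slope=0.1):
    # Mean function implied by a log link with a linear predictor (illustrative slope).
    return np.exp(slope * x)


# Simulated temperatures varying around the optimum (assumed values).
rng = np.random.default_rng(0)
temps = rng.normal(loc=20.0, scale=3.0, size=10_000)

effect_unimodal = jensen_effect(unimodal, temps)       # negative: variability lowers mean performance here
effect_log_link = jensen_effect(log_link_mean, temps)  # positive: exp is convex, so the sign is fixed by the link

print(effect_unimodal, effect_log_link)
```

Because the exponential function is convex everywhere, the log-link effect is nonnegative for any environment distribution, which is the sense in which a fixed link function assumes the direction of the Jensen Effect in advance.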
Under measurement constraints, responses are expensive to measure and initially unavailable for most records in the dataset, but the covariates are available for the entire dataset. Our goal is to sample a relatively small portion of the dataset on which the expensive responses will be measured, such that the resulting sampling estimator is statistically efficient. Measurement constraints require that the sampling probabilities depend only on a very small set of the responses. A sampling procedure that uses responses on at most a small pilot sample will be called response-free. We propose a response-free sampling procedure (OSUMC) for generalized linear models (GLMs). Under the A-optimality criterion, i.e., minimizing the trace of the asymptotic variance, the resulting estimator is statistically efficient within a class of sampling estimators. We establish the unconditional asymptotic distribution of a general class of response-free sampling estimators. This result is novel compared with the existing conditional results obtained by conditioning on both covariates and responses. Under our unconditional framework, the subsamples are no longer independent, and new martingale techniques are developed for our asymptotic theory. We further derive the A-optimal response-free sampling distribution. Since this distribution depends on population-level quantities, we propose the Optimal Sampling Under Measurement Constraints (OSUMC) algorithm to approximate the theoretical optimal sampling. Finally, we conduct an extensive empirical study to demonstrate the advantages of the OSUMC algorithm over existing methods from both statistical and computational perspectives.
We propose and analyze a generalized splitting method to sample approximately from a distribution conditional on the occurrence of a rare event. This has important applications in a variety of contexts in operations research, engineering, and computational statistics. The method uses independent trials starting from a single particle. We exploit this independence to obtain asymptotic and non-asymptotic bounds on the total variation error of the sampler. Our main finding is that the approximation error depends crucially on the relative variability of the number of points produced by the splitting algorithm in one run, and that this relative variability can be readily estimated via simulation. We illustrate the relevance of the proposed method on an application in which one needs to sample (approximately) from an intractable posterior density in Bayesian inference.
With the availability of more non-Euclidean data objects, statisticians are faced with the task of developing appropriate statistical methods. For regression models in which the predictors lie in $\mathbb{R}^p$ and the response variables are situated in a metric space, conditional Fréchet means can be used to define the Fréchet regression function. Global and local Fréchet methods have recently been developed for modeling and estimating this regression function as extensions of multiple and local linear regression, respectively. This paper expands on these methodologies by proposing the Fréchet Single Index (FSI) model and utilizing local Fréchet regression along with $M$-estimation to estimate both the index and the underlying regression function. The method is illustrated by simulations for response objects on the surface of the unit sphere and through an analysis of human mortality data in which lifetable data are represented by distributions of age-of-death, viewed as elements of the Wasserstein space of distributions.
Determining the number G of components in a finite mixture distribution is an important and difficult inference issue. This is a most important question, because statistical inference about the resulting model is highly sensitive to the value of G. Selecting an erroneous value of G may produce a poor density estimate. This is also a most difficult question from a theoretical perspective as it relates to unidentifiability issues of the mixture model. This is further a most relevant question from a practical viewpoint since the meaning of the number of components G is strongly related to the modelling purpose of a mixture distribution. We distinguish in this chapter between selecting G as a density estimation problem in Section 2 and selecting G in a model-based clustering framework in Section 3. Both sections discuss frequentist as well as Bayesian approaches. We present here some of the Bayesian solutions to the different interpretations of picking the right number of components in a mixture, before concluding on the ill-posed nature of the question.
In prediction problems, it is common to model the data-generating process and then use a model-based procedure, such as a Bayesian predictive distribution, to quantify uncertainty about the next observation. However, if the posited model is misspecified, then its predictions may not be calibrated -- that is, the predictive distribution's quantiles may not be nominal frequentist prediction upper limits, even asymptotically. Rather than abandoning the comfort of a model-based formulation for a more complicated non-model-based approach, here we propose a strategy in which the data itself helps determine if the assumed model-based solution should be adjusted to account for model misspecification. This is achieved through a generalized Bayes formulation where a learning rate parameter is tuned, via the proposed generalized predictive calibration (GPrC) algorithm, to make the predictive distribution calibrated, even under model misspecification. Extensive numerical experiments are presented, under a variety of settings, demonstrating the proposed GPrC algorithm's validity, efficiency, and robustness.
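To make the role of the learning rate concrete, the sketch below shows a generalized Bayes (power-likelihood) update for a normal model with known variance and a conjugate normal prior. It is an illustration under stated assumptions, not the GPrC algorithm: the model, the prior settings, and the names `power_posterior_predictive` and `upper_limit` are hypothetical, and the learning rate `eta` is supplied by hand rather than tuned for calibration.

```python
import math


def power_posterior_predictive(y, eta, sigma=1.0, prior_mean=0.0, prior_sd=10.0):
    """Predictive mean and sd when the normal likelihood is raised to the power eta.

    Assumes y_i ~ N(theta, sigma^2) with sigma known and theta ~ N(prior_mean, prior_sd^2).
    """
    n = len(y)
    ybar = sum(y) / n
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = eta * n / sigma ** 2          # tempering scales the data precision by eta
    post_prec = prior_prec + data_prec
    post_mean = (prior_prec * prior_mean + data_prec * ybar) / post_prec
    pred_sd = math.sqrt(sigma ** 2 + 1.0 / post_prec)
    return post_mean, pred_sd


def upper_limit(y, eta, z=1.6448536269514722, **kwargs):
    """Upper prediction limit; the default z is the 0.95 standard normal quantile."""
    mean, sd = power_posterior_predictive(y, eta, **kwargs)
    return mean + z * sd


data = [0.3, -1.2, 0.8, 2.1, -0.4, 1.5, 0.0, -0.9]
_, sd_standard = power_posterior_predictive(data, eta=1.0)   # ordinary Bayes
_, sd_tempered = power_posterior_predictive(data, eta=0.1)   # down-weighted likelihood

print(sd_standard, sd_tempered)
```

Setting `eta=1.0` recovers the ordinary Bayesian predictive distribution, while smaller values down-weight the likelihood and widen the predictive distribution, which is the kind of adjustment a calibration procedure could exploit.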