Physical processes that manifest as tangential vector fields on a sphere are common in geophysical and environmental sciences. These naturally occurring vector fields are often subject to physical constraints, such as being curl-free or divergence-free. We construct a new class of parametric models for cross-covariance functions of curl-free and divergence-free vector fields that are tangential to the unit sphere. These models are constructed by applying the surface gradient or the surface curl operator to scalar random potential fields defined on the unit sphere. We propose a likelihood-based estimation procedure for the model parameters and show that fast computation is possible even for large data sets when the observations are on a regular latitude-longitude grid. Characteristics and utility of the proposed methodology are illustrated through simulation studies and by applying it to an ocean surface wind velocity data set collected through satellite-based scatterometry remote sensing. We also compare the performance of the proposed model with a class of bivariate Matérn models in terms of estimation and prediction, and demonstrate that the proposed model is superior in capturing certain physical characteristics of the wind fields.
The joint modeling of mean and dispersion (JMMD) provides an efficient method to obtain useful models for the mean and dispersion, especially in problems of robust design experiments. However, few works in the JMMD literature are dedicated to variable selection, and this topic remains a challenge. In this article, we propose a procedure for selecting variables in JMMD, based on hypothesis testing and the quality of the model fit. In each iteration of the selection process, a goodness-of-fit criterion is used as a filter for choosing the terms that will be evaluated by a hypothesis test. Three criteria were considered for checking the quality of the model fit in our variable selection procedure: the extended Akaike information criterion, the corrected Akaike information criterion, and a criterion specific to JMMD that we propose, a type of extended adjusted coefficient of determination. Simulation studies were carried out to verify the efficiency of our variable selection procedure. In all situations considered, the proposed procedure proved to be effective. The variable selection process was applied to a real example from an industrial experiment.
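As a point of reference for the corrected Akaike information criterion mentioned above, the following is a minimal sketch of the standard small-sample correction, assuming a model with log-likelihood `loglik`, `k` estimated parameters, and `n` observations. The function names are hypothetical, and the extended criterion and the JMMD-specific criterion described in the abstract are not reproduced here, since their exact forms are not given.

```python
def aic(loglik, k):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2.0 * k - 2.0 * loglik


def aicc(loglik, k, n):
    """Standard small-sample corrected AIC.

    Assumption: this is the textbook correction AIC + 2k(k+1)/(n-k-1);
    the article may use a variant adapted to JMMD.
    """
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return aic(loglik, k) + 2.0 * k * (k + 1) / (n - k - 1)
```

A lower value indicates a better trade-off between fit and complexity, so a filter step of the kind described would retain the candidate terms whose inclusion lowers the criterion.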
Support vector machine (SVM) is one of the most popular classification algorithms in the machine learning literature. We demonstrate that SVM can be used to balance covariates and estimate average causal effects under the unconfoundedness assumption. Specifically, we adapt the SVM classifier as a kernel-based weighting procedure that minimizes the maximum mean discrepancy between the treatment and control groups while simultaneously maximizing effective sample size. We also show that SVM is a continuous relaxation of the quadratic integer program for computing the largest balanced subset, establishing its direct relation to the cardinality matching method. Another important feature of SVM is that the regularization parameter controls the trade-off between covariate balance and effective sample size. As a result, the existing SVM path algorithm can be used to compute the balance-sample size frontier. We characterize the bias of causal effect estimation arising from this trade-off, connecting the proposed SVM procedure to the existing kernel balancing methods. Finally, we conduct simulation and empirical studies to evaluate the performance of the proposed methodology and find that SVM is competitive with the state-of-the-art covariate balancing methods.
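The two quantities traded off above can be sketched numerically. The following is an illustration only, not the SVM weighting procedure itself: it assumes a Gaussian kernel with a hypothetical bandwidth parameter `gamma`, a binary treatment indicator `treat`, and nonnegative unit weights `w`, and it uses the Kish formula as one common definition of effective sample size (the paper's definition may differ).

```python
import numpy as np


def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian kernel matrix with entries exp(-gamma * ||x - y||^2)."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def weighted_mmd2(X, treat, w, gamma=1.0):
    """Squared maximum mean discrepancy between the weighted treated
    and weighted control covariate distributions (V-statistic form)."""
    t = treat.astype(bool)
    wt = w[t] / w[t].sum()      # normalized weights, treated group
    wc = w[~t] / w[~t].sum()    # normalized weights, control group
    k_tt = rbf_kernel(X[t], X[t], gamma)
    k_cc = rbf_kernel(X[~t], X[~t], gamma)
    k_tc = rbf_kernel(X[t], X[~t], gamma)
    return float(wt @ k_tt @ wt + wc @ k_cc @ wc - 2.0 * wt @ k_tc @ wc)


def effective_sample_size(w):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    return float(w.sum() ** 2 / (w**2).sum())
```

Concentrating weight on a few well-matched units can lower `weighted_mmd2` while also lowering `effective_sample_size`, which is the trade-off the regularization parameter is said to control.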
In this paper, we propose a novel approach to modeling nonstationary spatial fields. The proposed method works by expanding the geographic plane over which these processes evolve into higher-dimensional spaces, transforming and clarifying complex patterns in the physical plane. By combining aspects of multi-dimensional scaling, group lasso, and latent variable models, a dimensionally sparse projection is found in which the originally nonstationary field exhibits stationarity. Following a comparison with existing methods in a simulated environment, dimension expansion is studied on a classic test-bed data set historically used to study nonstationary models. We then explore the use of dimension expansion in modeling air pollution in the United Kingdom, a process known to be strongly influenced by rural/urban effects, among others, which gives rise to a nonstationary field.
Dialect variation is of considerable interest in linguistics and other social sciences. However, traditionally it has been studied using proxies (transcriptions) rather than acoustic recordings directly. We introduce novel statistical techniques to analyse geolocalised speech recordings and to explore the spatial variation of pronunciations continuously over the region of interest, as opposed to traditional isoglosses, which provide a discrete partition of the region. Data of this type require explicit modeling of the variation in the mean and the covariance. Usual Euclidean metrics are not appropriate, and we therefore introduce the concept of $d$-covariance, which allows consistent estimation both in space and at individual locations. We then propose spatial smoothing for these objects which accounts for the possibly non-convex geometry of the domain of interest. We apply the proposed method to data from the spoken part of the British National Corpus, deposited at the British Library, London, and we produce maps of the dialect variation over Great Britain. In addition, the methods allow for acoustic reconstruction across the domain of interest, allowing researchers to listen to the statistical analysis.
We propose a distributed quadratic inference function framework to jointly estimate regression parameters from multiple potentially heterogeneous data sources with correlated vector outcomes. The primary goal of this joint integrative analysis is to estimate covariate effects on all outcomes through a marginal regression model in a statistically and computationally efficient way. We develop a data integration procedure for statistical estimation and inference of regression parameters that is implemented in a fully distributed and parallelized computational scheme. To overcome computational and modeling challenges arising from the high-dimensional likelihood of the correlated vector outcomes, we propose to analyze each data source using Qu, Lindsay and Li's (2000) quadratic inference functions, and then to jointly reestimate parameters from each data source by accounting for correlation between data sources using a combined meta-estimator in a similar spirit to Hansen's (1982) generalised method of moments. We show both theoretically and numerically that the proposed method yields efficiency improvements and is computationally fast. We illustrate the proposed methodology with the joint integrative analysis of the association between smoking and metabolites in a large multi-cohort study and provide an R package for ease of implementation.
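To give a rough sense of what combining per-source estimates involves, the following is a deliberately simplified sketch: a precision-weighted (inverse-variance) combination of estimates computed separately on each data source. It assumes the sources are independent, which the method described above explicitly does not, so it is not the proposed meta-estimator. The function name and inputs (`betas`, `covs`) are hypothetical.

```python
import numpy as np


def combine_estimates(betas, covs):
    """Precision-weighted combination of per-source parameter estimates.

    betas: list of 1-D arrays, one estimate per data source.
    covs:  list of 2-D arrays, the estimated covariance of each estimate.
    Assumption: sources are treated as independent (a simplification).
    Returns the combined estimate and its covariance.
    """
    precisions = [np.linalg.inv(V) for V in covs]
    total_precision = sum(precisions)
    weighted_sum = sum(P @ b for P, b in zip(precisions, betas))
    combined_cov = np.linalg.inv(total_precision)
    return combined_cov @ weighted_sum, combined_cov
```

Each source can compute its own estimate and covariance in parallel, and only these summaries need to be communicated for the final combination step, which is what makes a scheme of this general shape easy to distribute.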