We propose a versatile joint regression framework for count responses. The method is implemented in the R add-on package GJRM and allows for modelling linear and non-linear dependence through the use of several copulae. Moreover, the parameters of the marginal distributions of the count responses and of the copula can be specified as flexible functions of covariates. Motivated by a football application, we also discuss an extension which forces the regression coefficients of the marginal (linear) predictors to be equal via a suitable penalisation. Model fitting is based on a trust region algorithm which simultaneously estimates all the parameters of the joint models. We investigate the proposal's empirical performance in two simulation studies, the first designed for arbitrary count data, the second reflecting football-specific settings. Finally, the method is applied to FIFA World Cup data, showing that it is competitive with the standard approach in terms of predictive performance.
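The way a copula turns two count margins into a joint distribution can be illustrated with a minimal sketch. It assumes Poisson margins and a Farlie-Gumbel-Morgenstern (FGM) copula, both chosen here only for simplicity; the function names are hypothetical and this is not the GJRM implementation. The joint probability of a pair of counts is obtained from rectangle differences of the copula evaluated at the marginal CDFs.

```python
import math


def poisson_cdf(k, mu):
    """P(Y <= k) for Y ~ Poisson(mu); returns 0 for k < 0."""
    if k < 0:
        return 0.0
    term = math.exp(-mu)  # P(Y = 0)
    total = term
    for i in range(1, k + 1):
        term *= mu / i    # P(Y = i) from P(Y = i - 1)
        total += term
    return min(total, 1.0)


def fgm_copula(u, v, theta):
    """FGM copula C(u, v); an illustrative choice, valid for -1 <= theta <= 1."""
    return u * v * (1.0 + theta * (1.0 - u) * (1.0 - v))


def joint_pmf(y1, y2, mu1, mu2, theta):
    """P(Y1 = y1, Y2 = y2) via rectangle differences of the copula."""
    f1_hi, f1_lo = poisson_cdf(y1, mu1), poisson_cdf(y1 - 1, mu1)
    f2_hi, f2_lo = poisson_cdf(y2, mu2), poisson_cdf(y2 - 1, mu2)
    return (fgm_copula(f1_hi, f2_hi, theta)
            - fgm_copula(f1_lo, f2_hi, theta)
            - fgm_copula(f1_hi, f2_lo, theta)
            + fgm_copula(f1_lo, f2_lo, theta))


if __name__ == "__main__":
    # Hypothetical scoring rates for two teams and a mild positive dependence.
    print(joint_pmf(2, 1, mu1=1.5, mu2=1.1, theta=0.5))
```

With `theta = 0` the FGM copula reduces to independence, so the joint probability equals the product of the two Poisson probabilities. In a regression setting, `mu1`, `mu2` and `theta` would each be linked to covariates.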
The relationship between short-term exposure to air pollution and mortality or morbidity has been the subject of much recent research, in which the standard method of analysis uses Poisson linear or additive models. In this paper we use a Bayesian dynamic generalised linear model (DGLM) to estimate this relationship, which allows the standard linear or additive model to be extended in two ways: (i) the long-term trend and temporal correlation present in the health data can be modelled by an autoregressive process rather than a smooth function of calendar time; (ii) the effects of air pollution are allowed to evolve over time. The efficacy of these two extensions is investigated by applying a series of dynamic and non-dynamic models to air pollution and mortality data from Greater London. A Bayesian approach is taken throughout, and a Markov chain Monte Carlo simulation algorithm is presented for inference. An alternative likelihood-based analysis is also presented, in order to allow a direct comparison with the only previous analysis of air pollution and health data using a DGLM.
We develop a new methodology for spatial regression of aggregated outputs on multi-resolution covariates. Such problems often occur with spatial data, for example in crop yield prediction, where the output is spatially aggregated over an area and the covariates may be observed at multiple resolutions. Building upon previous work on aggregated output regression, we propose a regression framework to synthesise the effects of the covariates at different resolutions on the output and provide uncertainty estimation. We show that, for a crop yield prediction problem, our approach is more scalable, via variational inference, than existing multi-resolution regression models. We also show that our framework yields good predictive performance compared to existing multi-resolution crop yield models, whilst also estimating the underlying spatial effects.
In this paper, we construct a hierarchical model for spatial compositional data, which is used to reconstruct past land-cover compositions (in terms of coniferous forest, broadleaved forest, and unforested/open land) for five time periods during the past $6,000$ years over Europe. The model consists of a Gaussian Markov Random Field (GMRF) with Dirichlet observations. A block-updated Markov chain Monte Carlo (MCMC) algorithm, including an adaptive Metropolis-adjusted Langevin step, is used to estimate model parameters. The sparse precision matrix in the GMRF provides computational advantages, leading to a fast MCMC algorithm. Reconstructions are obtained by combining pollen-based estimates of vegetation cover at a limited number of locations with scenarios of past deforestation and output from a dynamic vegetation model. To evaluate uncertainties in the predictions, a novel way of constructing joint confidence regions for the entire composition at each prediction location is proposed. The hierarchical model's ability to reconstruct past land cover is evaluated through cross-validation for all time periods, and by comparing reconstructions for the recent past to a present-day European forest map. The evaluation results are promising and the model is able to capture known structures in past land-cover compositions.
One of the classic concerns in statistics is determining whether two samples come from the same population, i.e. homogeneity testing. In this paper, we propose a homogeneity test in the context of Functional Data Analysis, adopting an idea from multivariate data analysis: the data depth plot (DD-plot). This DD-plot is a generalization of the univariate Q-Q plot (quantile-quantile plot). We propose some statistics based on these DD-plots, and we use bootstrapping techniques to estimate their distributions. We estimate the finite-sample size and power of our test via simulation, obtaining better results than other homogeneity tests proposed in the literature. Finally, we illustrate the procedure on samples of real heterogeneous data and obtain consistent results.
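The idea of a DD-plot-based homogeneity test can be sketched as follows. Everything specific here is an assumption made for illustration rather than the authors' choice: curves are observed on a common grid, depth is a simple integrated (Fraiman-Muniz-type) depth, the statistic is the mean distance of the DD-plot points from the diagonal, and the null distribution is approximated by resampling from the pooled sample. All function names are hypothetical.

```python
import random


def integrated_depth(curve, sample):
    """Integrated depth of `curve` with respect to `sample` (curves on a common grid)."""
    n = len(sample)
    total = 0.0
    for t, value in enumerate(curve):
        # Empirical CDF of the sample at grid point t, evaluated at the curve's value.
        ecdf = sum(1 for other in sample if other[t] <= value) / n
        total += 1.0 - abs(0.5 - ecdf)
    return total / len(curve)


def dd_points(sample_x, sample_y):
    """DD-plot: depth of every pooled curve with respect to each sample."""
    pooled = sample_x + sample_y
    return [(integrated_depth(z, sample_x), integrated_depth(z, sample_y))
            for z in pooled]


def dd_statistic(sample_x, sample_y):
    """Mean absolute distance of the DD-plot points from the diagonal."""
    points = dd_points(sample_x, sample_y)
    return sum(abs(dx - dy) for dx, dy in points) / len(points)


def bootstrap_p_value(sample_x, sample_y, n_boot=200, seed=0):
    """Approximate p-value by resampling both groups from the pooled sample."""
    rng = random.Random(seed)
    observed = dd_statistic(sample_x, sample_y)
    pooled = sample_x + sample_y
    exceed = 0
    for _ in range(n_boot):
        # Under the null both groups come from the same population,
        # so each bootstrap group is drawn with replacement from the pool.
        boot_x = [rng.choice(pooled) for _ in sample_x]
        boot_y = [rng.choice(pooled) for _ in sample_y]
        if dd_statistic(boot_x, boot_y) >= observed:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)
```

Under homogeneity the DD-plot points concentrate near the diagonal, so the statistic is small; large values indicate that the two samples assign different depths to the same curves.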
Under-representation of certain populations, based on gender, race/ethnicity, and age, in data collection for predictive modeling may yield less accurate predictions for the under-represented groups. Recently, this issue of fairness in predictions has attracted significant attention, as data-driven models are increasingly utilized to perform crucial decision-making tasks. Methods to achieve fairness in the machine learning literature typically build a single prediction model subject to some fairness criteria in a manner that encourages fair prediction performance for all groups. These approaches have two major limitations: i) fairness is often achieved by compromising accuracy for some groups; ii) the underlying relationship between dependent and independent variables may not be the same across groups. We propose a Joint Fairness Model (JFM) approach for binary outcomes that estimates group-specific classifiers using a joint modeling objective function that incorporates fairness criteria for prediction. We introduce an Accelerated Smoothing Proximal Gradient Algorithm to solve the convex objective function, and present the key asymptotic properties of the JFM parameter estimates. We examine the efficacy of the JFM approach in achieving prediction performance and parity, in comparison with the Single Fairness Model, group-separate model, and group-ignorant model, through extensive simulations. Finally, we demonstrate the utility of the JFM method in the motivating example to obtain fair risk predictions for under-represented older patients diagnosed with coronavirus disease 2019 (COVID-19).