No Arabic abstract
Under-representation of certain populations, based on gender, race/ethnicity, and age, in data collection for predictive modeling may yield less-accurate predictions for the under-represented groups. Recently, this issue of fairness in predictions has attracted significant attention, as data-driven models are increasingly utilized to perform crucial decision-making tasks. Methods to achieve fairness in the machine learning literature typically build a single prediction model subject to some fairness criteria in a manner that encourages fair prediction performances for all groups. These approaches have two major limitations: i) fairness is often achieved by compromising accuracy for some groups; ii) the underlying relationship between dependent and independent variables may not be the same across groups. We propose a Joint Fairness Model (JFM) approach for binary outcomes that estimates group-specific classifiers using a joint modeling objective function that incorporates fairness criteria for prediction. We introduce an Accelerated Smoothing Proximal Gradient Algorithm to solve the convex objective function, and demonstrate the properties of the proposed JFM estimates. Next, we presented the key asymptotic properties for the JFM parameter estimates. We examined the efficacy of the JFM approach in achieving prediction performances and parities, in comparison with the Single Fairness Model, group-separate model, and group-ignorant model through extensive simulations. Finally, we demonstrated the utility of the JFM method in the motivating example to obtain fair risk predictions for under-represented older patients diagnosed with coronavirus disease 2019 (COVID-19).
Both Bayesian and varying coefficient models are very useful tools in practice as they can be used to model parameter heterogeneity in a generalizable way. Motivated by the need of enhancing Marketing Mix Modeling at Uber, we propose a Bayesian Time Varying Coefficient model, equipped with a hierarchical Bayesian structure. This model is different from other time varying coefficient models in the sense that the coefficients are weighted over a set of local latent variables following certain probabilistic distributions. Stochastic Variational Inference is used to approximate the posteriors of latent variables and dynamic coefficients. The proposed model also helps address many challenges faced by traditional MMM approaches. We used simulations as well as real world marketing datasets to demonstrate our model superior performance in terms of both accuracy and interpretability.
This paper addresses the problem of making statistical inference about a population that can only be identified through classifier predictions. The problem is motivated by scientific studies in which human labels of a population are replaced by a classifier. For downstream analysis of the population based on classifier predictions to be sound, the predictions must generalize equally across experimental conditions. In this paper, we formalize the task of statistical inference using classifier predictions, and propose bootstrap procedures to allow inference with a generalizable classifier. We demonstrate the performance of our methods through extensive simulations and a case study with live cell imaging data.
We propose a versatile joint regression framework for count responses. The method is implemented in the R add-on package GJRM and allows for modelling linear and non-linear dependence through the use of several copulae. Moreover, the parameters of the marginal distributions of the count responses and of the copula can be specified as flexible functions of covariates. Motivated by a football application, we also discuss an extension which forces the regression coefficients of the marginal (linear) predictors to be equal via a suitable penalisation. Model fitting is based on a trust region algorithm which estimates simultaneously all the parameters of the joint models. We investigate the proposals empirical performance in two simulation studies, the first one designed for arbitrary count data, the other one reflecting football-specific settings. Finally, the method is applied to FIFA World Cup data, showing its competitiveness to the standard approach with regard to predictive performance.
We propose an efficient statistical method (denoted as SSR-Tensor) to robustly and quickly detect hot-spots that are sparse and temporal-consistent in a spatial-temporal dataset through the tensor decomposition. Our main idea is first to build an SSR model to decompose the tensor data into a Smooth global trend mean, Sparse local hot-spots, and Residuals. Next, tensor decomposition is utilized as follows: bases are introduced to describe within-dimension correlation, and tensor products are used for between-dimension interaction. Then, a combination of LASSO and fused LASSO is used to estimate the model parameters, where an efficient recursive estimation procedure is developed based on the large-scale convex optimization, where we first transform the general LASSO optimization into regular LASSO optimization and apply FISTA to solve it with the fastest convergence rate. Finally, a CUSUM procedure is applied to detect when and where the hot-spot event occurs. We compare the performance of the proposed method in a numerical simulation study and a real-world case study, which contains a dataset including a collection of three types of crime rates for U.S. mainland states during the year 1965-2014. In both cases, the proposed SSR-Tensor is able to achieve the fast detection and accurate localization of the hot-spots.
This work is motivated by the Obepine French system for SARS-CoV-2 viral load monitoring in wastewater. The objective of this work is to identify, from time-series of noisy measurements, the underlying auto-regressive signals, in a context where the measurements present numerous missing data, censoring and outliers. We propose a method based on an auto-regressive model adapted to censored data with outliers. Inference and prediction are produced via a discretised smoother. This method is both validated on simulations and on real data from Obepine. The proposed method is used to denoise measurements from the quantification of the SARS-CoV-2 E gene in wastewater by RT-qPCR. The resulting smoothed signal shows a good correlation with other epidemiological indicators and an estimate of the whole system noise is produced.