No Arabic abstract
Learning optimal policies from historical data enables the gains from personalization to be realized in a wide variety of applications. The growing policy learning literature focuses on a setting where the treatment assignment policy does not adapt to the data. However, adaptive data collection is becoming more common in practice, from two primary sources: 1) data collected from adaptive experiments that are designed to improve inferential efficiency; 2) data collected from production systems that are adaptively evolving an operational policy to improve performance over time (e.g. contextual bandits). In this paper, we aim to address the challenge of learning the optimal policy with adaptively collected data and provide one of the first theoretical inquiries into this problem. We propose an algorithm based on generalized augmented inverse propensity weighted estimators and establish its finite-sample regret bound. We complement this regret upper bound with a lower bound that characterizes the fundamental difficulty of policy learning with adaptive data. Finally, we demonstrate our algorithms effectiveness using both synthetic data and public benchmark datasets.
From scientific experiments to online A/B testing, the previously observed data often affects how future experiments are performed, which in turn affects which data will be collected. Such adaptivity introduces complex correlations between the data and the collection procedure. In this paper, we prove that when the data collection procedure satisfies natural conditions, then sample means of the data have systematic emph{negative} biases. As an example, consider an adaptive clinical trial where additional data points are more likely to be tested for treatments that show initial promise. Our surprising result implies that the average observed treatment effects would underestimate the true effects of each treatment. We quantitatively analyze the magnitude and behavior of this negative bias in a variety of settings. We also propose a novel debiasing algorithm based on selective inference techniques. In experiments, our method can effectively reduce bias and estimation error.
Bandit algorithms are increasingly used in real-world sequential decision-making problems. Associated with this is an increased desire to be able to use the resulting datasets to answer scientific questions like: Did one type of ad lead to more purchases? In which contexts is a mobile health intervention effective? However, classical statistical approaches fail to provide valid confidence intervals when used with data collected with bandit algorithms. Alternative methods have recently been developed for simple models (e.g., comparison of means). Yet there is a lack of general methods for conducting statistical inference using more complex models on data collected with (contextual) bandit algorithms; for example, current methods cannot be used for valid inference on parameters in a logistic regression model for a binary reward. In this work, we develop theory justifying the use of M-estimators -- which includes estimators based on empirical risk minimization as well as maximum likelihood -- on data collected with adaptive algorithms, including (contextual) bandit algorithms. Specifically, we show that M-estimators, modified with particular adaptive weights, can be used to construct asymptotically valid confidence regions for a variety of inferential targets.
Adaptive collection of data is commonplace in applications throughout science and engineering. From the point of view of statistical inference however, adaptive data collection induces memory and correlation in the samples, and poses significant challenge. We consider the high-dimensional linear regression, where the samples are collected adaptively, and the sample size $n$ can be smaller than $p$, the number of covariates. In this setting, there are two distinct sources of bias: the first due to regularization imposed for consistent estimation, e.g. using the LASSO, and the second due to adaptivity in collecting the samples. We propose online debiasing, a general procedure for estimators such as the LASSO, which addresses both sources of bias. In two concrete contexts $(i)$ time series analysis and $(ii)$ batched data collection, we demonstrate that online debiasing optimally debiases the LASSO estimate when the underlying parameter $theta_0$ has sparsity of order $o(sqrt{n}/log p)$. In this regime, the debiased estimator can be used to compute $p$-values and confidence intervals of optimal size.
Boosting techniques and neural networks are particularly effective machine learning methods for insurance pricing. Often in practice, there are nevertheless endless debates about the choice of the right loss function to be used to train the machine learning model, as well as about the appropriate metric to assess the performances of competing models. Also, the sum of fitted values can depart from the observed totals to a large extent and this often confuses actuarial analysts. The lack of balance inherent to training models by minimizing deviance outside the familiar GLM with canonical link setting has been empirically documented in Wuthrich (2019, 2020) who attributes it to the early stopping rule in gradient descent methods for model fitting. The present paper aims to further study this phenomenon when learning proceeds by minimizing Tweedie deviance. It is shown that minimizing deviance involves a trade-off between the integral of weighted differences of lower partial moments and the bias measured on a specific scale. Autocalibration is then proposed as a remedy. This new method to correct for bias adds an extra local GLM step to the analysis. Theoretically, it is shown that it implements the autocalibration concept in pure premium calculation and ensures that balance also holds on a local scale, not only at portfolio level as with existing bias-correction techniques. The convex order appears to be the natural tool to compare competing models, putting a new light on the diagnostic graphs and associated metrics proposed by Denuit et al. (2019).
In the standard data analysis framework, data is first collected (once for all), and then data analysis is carried out. With the advancement of digital technology, decisionmakers constantly analyze past data and generate new data through the decisions they make. In this paper, we model this as a Markov decision process and show that the dynamic interaction between data generation and data analysis leads to a new type of bias -- reinforcement bias -- that exacerbates the endogeneity problem in standard data analysis. We propose a class of instrument variable (IV)-based reinforcement learning (RL) algorithms to correct for the bias and establish their asymptotic properties by incorporating them into a two-timescale stochastic approximation framework. A key contribution of the paper is the development of new techniques that allow for the analysis of the algorithms in general settings where noises feature time-dependency. We use the techniques to derive sharper results on finite-time trajectory stability bounds: with a polynomial rate, the entire future trajectory of the iterates from the algorithm fall within a ball that is centered at the true parameter and is shrinking at a (different) polynomial rate. We also use the technique to provide formulas for inferences that are rarely done for RL algorithms. These formulas highlight how the strength of the IV and the degree of the noises time dependency affect the inference.