No Arabic abstract
Wireless communication systems operate in complex time-varying environments. Therefore, selecting the optimal configuration parameters in these systems is a challenging problem. For wireless links, emph{rate selection} is used to select the optimal data transmission rate that maximizes the link throughput subject to an application-defined latency constraint. We model rate selection as a stochastic multi-armed bandit (MAB) problem, where a finite set of transmission rates are modeled as independent bandit arms. For this setup, we propose Con-TS, a novel constrained version of the Thompson sampling algorithm, where the latency requirement is modeled by a high-probability linear constraint. We show that for Con-TS, the expected number of constraint violations over T transmission intervals is upper bounded by O(sqrt{KT}), where K is the number of available rates. Further, the expected loss in cumulative throughput compared to the optimal rate selection scheme (i.e., the egret is also upper bounded by O(sqrt{KT log K}). Through numerical simulations, we demonstrate that Con-TS significantly outperforms state-of-the-art bandit schemes for rate selection.
We address multi-armed bandits (MAB) where the objective is to maximize the cumulative reward under a probabilistic linear constraint. For a few real-world instances of this problem, constrained extensions of the well-known Thompson Sampling (TS) heuristic have recently been proposed. However, finite-time analysis of constrained TS is challenging; as a result, only O(sqrt{T}) bounds on the cumulative reward loss (i.e., the regret) are available. In this paper, we describe LinConTS, a TS-based algorithm for bandits that place a linear constraint on the probability of earning a reward in every round. We show that for LinConTS, the regret as well as the cumulative constraint violations are upper bounded by O(log T) for the suboptimal arms. We develop a proof technique that relies on careful analysis of the dual problem and combine it with recent theoretical work on unconstrained TS. Through numerical experiments on two real-world datasets, we demonstrate that LinConTS outperforms an asymptotically optimal upper confidence bound (UCB) scheme in terms of simultaneously minimizing the regret and the violation.
We study the use of policy gradient algorithms to optimize over a class of generalized Thompson sampling policies. Our central insight is to view the posterior parameter sampled by Thompson sampling as a kind of pseudo-action. Policy gradient methods can then be tractably applied to search over a class of sampling policies, which determine a probability distribution over pseudo-actions (i.e., sampled parameters) as a function of observed data. We also propose and compare policy gradient estimators that are specialized to Bayesian bandit problems. Numerical experiments demonstrate that direct policy search on top of Thompson sampling automatically corrects for some of the algorithms known shortcomings and offers meaningful improvements even in long horizon problems where standard Thompson sampling is extremely effective.
Bayesian optimization (BO) is a prominent approach to optimizing expensive-to-evaluate black-box functions. The massive computational capability of edge devices such as mobile phones, coupled with privacy concerns, has led to a surging interest in federated learning (FL) which focuses on collaborative training of deep neural networks (DNNs) via first-order optimization techniques. However, some common machine learning tasks such as hyperparameter tuning of DNNs lack access to gradients and thus require zeroth-order/black-box optimization. This hints at the possibility of extending BO to the FL setting (FBO) for agents to collaborate in these black-box optimization tasks. This paper presents federated Thompson sampling (FTS) which overcomes a number of key challenges of FBO and FL in a principled way: We (a) use random Fourier features to approximate the Gaussian process surrogate model used in BO, which naturally produces the parameters to be exchanged between agents, (b) design FTS based on Thompson sampling, which significantly reduces the number of parameters to be exchanged, and (c) provide a theoretical convergence guarantee that is robust against heterogeneous agents, which is a major challenge in FL and FBO. We empirically demonstrate the effectiveness of FTS in terms of communication efficiency, computational efficiency, and practical performance.
In mobile health (mHealth) smart devices deliver behavioral treatments repeatedly over time to a user with the goal of helping the user adopt and maintain healthy behaviors. Reinforcement learning appears ideal for learning how to optimally make these sequential treatment decisions. However, significant challenges must be overcome before reinforcement learning can be effectively deployed in a mobile healthcare setting. In this work we are concerned with the following challenges: 1) individuals who are in the same context can exhibit differential response to treatments 2) only a limited amount of data is available for learning on any one individual, and 3) non-stationary responses to treatment. To address these challenges we generalize Thompson-Sampling bandit algorithms to develop IntelligentPooling. IntelligentPooling learns personalized treatment policies thus addressing challenge one. To address the second challenge, IntelligentPooling updates each users degree of personalization while making use of available data on other users to speed up learning. Lastly, IntelligentPooling allows responsivity to vary as a function of a users time since beginning treatment, thus addressing challenge three. We show that IntelligentPooling achieves an average of 26% lower regret than state-of-the-art. We demonstrate the promise of this approach and its ability to learn from even a small group of users in a live clinical trial.
Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the state of the art methods. In this paper, we provide a novel regret analysis for Thompson Sampling that simultaneously proves both the optimal problem-dependent bound of $(1+epsilon)sum_i frac{ln T}{Delta_i}+O(frac{N}{epsilon^2})$ and the first near-optimal problem-independent bound of $O(sqrt{NTln T})$ on the expected regret of this algorithm. Our near-optimal problem-independent bound solves a COLT 2012 open problem of Chapelle and Li. The optimal problem-dependent regret bound for this problem was first proven recently by Kaufmann et al. [ALT 2012]. Our novel martingale-based analysis techniques are conceptually simple, easily extend to distributions other than the Beta distribution, and also extend to the more general contextual bandits setting [Manuscript, Agrawal and Goyal, 2012].