We introduce Robust Restless Bandits, a challenging generalization of restless multi-armed bandits (RMABs). RMABs have been widely studied for intervention planning with limited resources. However, most works make the unrealistic assumption that the transition dynamics are known perfectly, restricting the applicability of existing methods to real-world scenarios. To make RMABs more useful in settings with uncertain dynamics: (i) We introduce the Robust RMAB problem and develop solutions for a minimax regret objective when transitions are given by interval uncertainties; (ii) We develop a double oracle algorithm for solving Robust RMABs and demonstrate its effectiveness on three experimental domains; (iii) To enable our double oracle approach, we introduce RMABPPO, a novel deep reinforcement learning algorithm for solving RMABs. RMABPPO hinges on learning an auxiliary $\lambda$-network that allows each arm's learning to decouple, greatly reducing the sample complexity required for training; (iv) Under minimax regret, the adversary in the double oracle approach is notoriously difficult to implement due to non-stationarity. To address this, we formulate the adversary oracle as a multi-agent reinforcement learning problem and solve it with a multi-agent extension of RMABPPO, which may be of independent interest as the first known algorithm for this setting. Code is available at https://github.com/killian-34/RobustRMAB.
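As a rough illustration of the per-arm decoupling that a Lagrange multiplier enables (the principle RMABPPO builds on), the sketch below solves a synthetic binary-action RMAB with tabular value iteration on $\lambda$-adjusted rewards and a subgradient update on $\lambda$. The paper's method instead trains per-arm PPO actors and a neural $\lambda$-network, so every quantity here is an illustrative assumption.

```python
# Minimal sketch of Lagrangian decoupling for an RMAB, not RMABPPO itself:
# each arm is solved independently on a lambda-adjusted reward, and lambda is
# updated by a subgradient step so the expected number of pulls matches the
# budget. All problem data below is synthetic.
import numpy as np

rng = np.random.default_rng(0)
N, S, A = 5, 4, 2          # arms, states per arm, actions (0 = passive, 1 = act)
B = 2.0                    # budget: expected number of arms acted on per step
gamma = 0.95

P = rng.dirichlet(np.ones(S), size=(N, A, S))   # per-arm kernels P[i, a, s, s']
R = rng.random((N, S))                          # per-arm state rewards
cost = np.array([0.0, 1.0])                     # acting costs one unit of budget

def solve_arm(i, lam, iters=200):
    """Value iteration for arm i on the reward R[i, s] - lam * cost[a]."""
    V = np.zeros(S)
    for _ in range(iters):
        Q = R[i][None, :] - lam * cost[:, None] + gamma * P[i] @ V   # (A, S)
        V = Q.max(axis=0)
    return Q.argmax(axis=0)          # greedy per-state action for this arm

lam, step = 1.0, 0.05
for _ in range(300):
    policies = [solve_arm(i, lam) for i in range(N)]
    # Expected acting rate under a uniform state distribution (crude estimate)
    act_rate = sum(cost[pi].mean() for pi in policies)
    lam = max(0.0, lam + step * (act_rate - B))   # subgradient step on lambda

print("lambda =", round(lam, 3), "expected pulls per step ~", round(act_rate, 2))
```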
This paper proposes adversarial attacks on Reinforcement Learning (RL) and then uses these attacks to improve the robustness of Deep Reinforcement Learning (DRL) algorithms to parameter uncertainties. We show that even a naively engineered attack successfully degrades the performance of a DRL algorithm. We further improve the attack using gradient information from an engineered loss function, which leads to further degradation in performance. These attacks are then leveraged during training to improve the robustness of RL within a robust control framework. We show that this adversarial training of DRL algorithms such as Deep Double Q-learning and Deep Deterministic Policy Gradients leads to a significant increase in robustness to parameter variations on RL benchmarks such as the Cart-Pole, Mountain Car, Hopper, and Half Cheetah environments.
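The engineered loss is not specified above, so the following is only a minimal FGSM-style sketch of a gradient-based observation attack on a small Q-network; the network size, the cross-entropy loss toward the greedy action, and the epsilon value are all assumptions of this illustration, not the paper's attack.

```python
# FGSM-style attack sketch: perturb the observation in the direction that
# increases a loss penalizing confidence in the currently greedy action.
import torch
import torch.nn as nn
import torch.nn.functional as F

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # e.g. CartPole-sized

def fgsm_attack(state, epsilon=0.05):
    """Perturb the state to make the greedy action look as bad as possible."""
    s = state.clone().detach().requires_grad_(True)
    q = q_net(s)
    greedy = q.argmax(dim=-1)
    # Illustrative loss: cross-entropy toward the current greedy action; its
    # gradient tells us how to move the state to reduce confidence in it.
    loss = F.cross_entropy(q, greedy)
    loss.backward()
    return (s + epsilon * s.grad.sign()).detach()

clean = torch.randn(1, 4)
adv = fgsm_attack(clean)
print("action flipped:", q_net(clean).argmax().item() != q_net(adv).argmax().item())
```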
Multi-action restless multi-armed bandits (RMABs) are a powerful framework for constrained resource allocation in which $N$ independent processes are managed. However, previous work only studies the offline setting where the problem dynamics are known. We address this restrictive assumption by designing the first algorithms for learning good policies for multi-action RMABs online, using combinations of Lagrangian relaxation and Q-learning. Our first approach, MAIQL, extends a method for Q-learning the Whittle index in binary-action RMABs to the multi-action setting. We derive a generalized update rule and convergence proof and establish that, under standard assumptions, MAIQL converges to the asymptotically optimal multi-action RMAB policy as $t \rightarrow \infty$. However, MAIQL relies on learning Q-functions and indices on two timescales, which leads to slow convergence and requires problem structure to perform well. Thus, we design a second algorithm, LPQL, which learns the well-performing and more general Lagrange policy for multi-action RMABs by learning to minimize the Lagrange bound through a variant of Q-learning. To ensure fast convergence, we take an approximation strategy that enables learning on a single timescale, then give a guarantee relating the approximation's precision to an upper bound on LPQL's return as $t \rightarrow \infty$. Finally, we show that our approaches always outperform baselines across multiple settings, including one derived from real-world medication adherence data.
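To make the Lagrange-bound idea concrete, here is a toy tabular sketch (not LPQL itself): Q-values are learned on $\lambda$-adjusted rewards over a discretized grid of multipliers, and the resulting bound is minimized over that grid. The dynamics, grid, exploration, and learning schedule are synthetic assumptions of this sketch.

```python
# Toy sketch of minimizing a Lagrange bound over a discretized multiplier grid.
import numpy as np

rng = np.random.default_rng(1)
N, S, A, B, gamma = 3, 4, 3, 2.0, 0.9
lambdas = np.linspace(0.0, 2.0, 21)              # discretized multiplier grid
cost = np.arange(A, dtype=float)                 # action a costs a units of budget
P = rng.dirichlet(np.ones(S), size=(N, A, S))    # synthetic kernels P[i, a, s, s']
R = rng.random((N, S))

Q = np.zeros((len(lambdas), N, S, A))
state = rng.integers(S, size=N)
alpha = 0.1
for t in range(20000):
    a = rng.integers(A, size=N)                  # uniform exploration
    nxt = np.array([rng.choice(S, p=P[i, a[i], state[i]]) for i in range(N)])
    for k, lam in enumerate(lambdas):
        # Tabular Q-learning on the lambda-adjusted reward for every arm
        r = R[np.arange(N), state] - lam * cost[a]
        td = r + gamma * Q[k, np.arange(N), nxt].max(axis=1) - Q[k, np.arange(N), state, a]
        Q[k, np.arange(N), state, a] += alpha * td
    state = nxt

# Lagrange bound ~ sum of per-arm values plus the budget term; pick its minimizer
V = Q.max(axis=3)                                # (len(lambdas), N, S)
bound = V.mean(axis=2).sum(axis=1) + lambdas * B / (1 - gamma)
print("chosen lambda:", lambdas[int(bound.argmin())])
```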
Reinforcement learning agents are faced with two types of uncertainty. Epistemic uncertainty stems from limited data and is useful for exploration, whereas aleatoric uncertainty arises from stochastic environments and must be accounted for in risk-sensitive applications. We highlight the challenges involved in simultaneously estimating both of them, and propose a framework for disentangling and estimating these uncertainties on learned Q-values. We derive unbiased estimators of these uncertainties and introduce an uncertainty-aware DQN algorithm, which we show exhibits safe learning behavior and outperforms other DQN variants on the MinAtar testbed.
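One common way to realize such a decomposition on Q-values, shown here as an illustrative sketch rather than the paper's unbiased estimators, is an ensemble of heads that each predict a mean and a variance: disagreement between the means is read as epistemic uncertainty, and the average predicted variance as aleatoric uncertainty.

```python
# Ensemble-based uncertainty decomposition sketch (untrained heads, for shape only).
import torch
import torch.nn as nn

class QHead(nn.Module):
    def __init__(self, obs_dim=4, n_actions=3):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.mean = nn.Linear(64, n_actions)
        self.log_var = nn.Linear(64, n_actions)      # predicted return variance

    def forward(self, s):
        h = self.body(s)
        return self.mean(h), self.log_var(h).exp()

ensemble = [QHead() for _ in range(5)]
s = torch.randn(1, 4)
means, variances = zip(*[head(s) for head in ensemble])
means, variances = torch.stack(means), torch.stack(variances)

epistemic = means.var(dim=0)          # disagreement between ensemble members
aleatoric = variances.mean(dim=0)     # average predicted return noise
print("epistemic:", epistemic.squeeze().tolist())
print("aleatoric:", aleatoric.squeeze().tolist())
```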
Deep neural networks, including reinforcement learning agents, have been proven vulnerable to small adversarial changes in the input, making the deployment of such networks in the real world problematic. In this paper, we propose RADIAL-RL, a method to train reinforcement learning agents with improved robustness against any $l_p$-bounded adversarial attack. By simply minimizing an upper bound on the loss function under worst-case adversarial perturbation, derived from efficient robustness verification methods, we significantly improve the robustness of RL agents trained on Atari-2600 games and show that RADIAL-RL can beat state-of-the-art robust training algorithms when evaluated against PGD attacks. We also propose a new evaluation method, Greedy Worst-Case Reward (GWC), for measuring attack-agnostic robustness of RL agents. GWC can be evaluated efficiently and serves as a good estimate of the reward under the worst possible sequence of adversarial attacks; in particular, GWC accounts for the importance of each action and their temporal dependencies, improving upon previous approaches that only evaluate whether each single action can change under input perturbations. Our code is available at https://github.com/tuomaso/radial_rl.
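To illustrate the flavor of bounding Q-values under an $l_\infty$ perturbation and penalizing the worst case, the sketch below uses plain interval bound propagation on a two-layer network; RADIAL-RL's actual bounds and loss come from its robustness-verification machinery, so the network and the margin term here are assumptions of this sketch.

```python
# Interval bound propagation (IBP) sketch for a two-layer ReLU Q-network.
import torch

torch.manual_seed(0)
W1, b1 = torch.randn(64, 4), torch.randn(64)
W2, b2 = torch.randn(3, 64), torch.randn(3)

def q_net(s):
    """Nominal Q-values of the small two-layer ReLU network."""
    return torch.relu(s @ W1.T + b1) @ W2.T + b2

def ibp_bounds(s, eps):
    """Interval bounds on Q(s') for all s' with ||s' - s||_inf <= eps."""
    mu, r = s, torch.full_like(s, eps)
    mu, r = mu @ W1.T + b1, r @ W1.T.abs()
    lo, hi = torch.relu(mu - r), torch.relu(mu + r)   # ReLU is monotone
    mu, r = (lo + hi) / 2, (hi - lo) / 2
    return mu @ W2.T + b2 - r @ W2.T.abs(), mu @ W2.T + b2 + r @ W2.T.abs()

s = torch.randn(1, 4)
q_lo, q_hi = ibp_bounds(s, eps=0.05)
a = q_net(s).argmax(dim=-1)                           # nominal greedy action
# Worst-case margin of the chosen action's lower bound over each rival's upper
# bound; a training loss can penalize this margin being negative.
margin = q_lo.gather(1, a[:, None]) - q_hi
margin.scatter_(1, a[:, None], float("inf"))          # exclude the action itself
print("worst-case margin:", margin.min().item())
```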
Robust Reinforcement Learning aims to find an optimal policy with some degree of robustness to the environment dynamics. Existing learning algorithms usually enable robustness by disturbing the current state or simulating environmental parameters in a heuristic way, which lacks quantified robustness to the system dynamics (i.e., the transition probabilities). To overcome this issue, we leverage the Wasserstein distance to measure the disturbance to the reference transition kernel. With the Wasserstein distance, we are able to connect transition kernel disturbances to state disturbances, i.e., reduce an infinite-dimensional optimization problem to a finite-dimensional risk-aware problem. Through the derived risk-aware optimal Bellman equation, we show the existence of optimal robust policies, provide a sensitivity analysis for the perturbations, and then design a novel robust learning algorithm, the Wasserstein Robust Advantage Actor-Critic (WRAAC). The effectiveness of the proposed algorithm is verified in the Cart-Pole environment.
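A tabular sketch of how a kernel disturbance can be recast as a state disturbance is given below, using a penalized (dual) form of a Wasserstein-robust Bellman backup with a quadratic transport cost; it is an assumption-laden illustration on synthetic data, not the WRAAC actor-critic itself.

```python
# Penalized Wasserstein-robust Bellman backup sketch: the adversary may move a
# sampled next state s' to any state s'' but pays d(s', s'')^2 / (2 * rho).
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma, rho = 6, 2, 0.9, 0.5
P = rng.dirichlet(np.ones(S), size=(A, S))        # reference kernel P[a, s, s']
R = rng.random((S, A))
d2 = (np.arange(S)[:, None] - np.arange(S)[None, :]) ** 2   # squared state distance

def robust_backup(V):
    # Adversarially shifted value of landing in s': min over s'' of V + transport cost
    V_adv = (V[None, :] + d2 / (2 * rho)).min(axis=1)        # (S,)
    Q = R + gamma * np.einsum("asn,n->sa", P, V_adv)
    return Q.max(axis=1)

V = np.zeros(S)
for _ in range(200):
    V = robust_backup(V)
print("robust values:", np.round(V, 3))
```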