Secure-UCB: Saving Stochastic Bandits from Poisoning Attacks via Limited Data Verification


Abstract in English

This paper studies bandit algorithms under data poisoning attacks in a bounded reward setting. We consider a strong attacker model in which the attacker can observe both the selected actions and their corresponding rewards, and can contaminate the rewards with additive noise. We show that emph{any} bandit algorithm with regret $O(log T)$ can be forced to suffer a regret $Omega(T)$ with an expected amount of contamination $O(log T)$. This amount of contamination is also necessary, as we prove that there exists an $O(log T)$ regret bandit algorithm, specifically the classical UCB, that requires $Omega(log T)$ amount of contamination to suffer regret $Omega(T)$. To combat such poising attacks, our second main contribution is to propose a novel algorithm, Secure-UCB, which uses limited emph{verification} to access a limited number of uncontaminated rewards. We show that with $O(log T)$ expected number of verifications, Secure-UCB can restore the order optimal $O(log T)$ regret emph{irrespective of the amount of contamination} used by the attacker. Finally, we prove that for any bandit algorithm, this number of verifications $O(log T)$ is necessary to recover the order-optimal regret. We can then conclude that Secure-UCB is order-optimal in terms of both the expected regret and the expected number of verifications, and can save stochastic bandits from any data poisoning attack.

Download