No Arabic abstract
Large-scale screening for potential threats with limited resources and capacity for screening is a problem of interest at airports, seaports, and other ports of entry. Adversaries can observe screening procedures and arrive at a time when there will be gaps in screening due to limited resource capacities. To capture this game between ports and adversaries, this problem has been previously represented as a Stackelberg game, referred to as a Threat Screening Game (TSG). Given the significant complexity associated with solving TSGs and uncertainty in arrivals of customers, existing work has assumed that screenees arrive and are allocated security resources at the beginning of the time window. In practice, screenees such as airport passengers arrive in bursts correlated with flight time and are not bound by fixed time windows. To address this, we propose an online threat screening model in which screening strategy is determined adaptively as a passenger arrives while satisfying a hard bound on acceptable risk of not screening a threat. To solve the online problem with a hard bound on risk, we formulate it as a Reinforcement Learning (RL) problem with constraints on the action space (hard bound on risk). We provide a novel way to efficiently enforce linear inequality constraints on the action output in Deep Reinforcement Learning. We show that our solution allows us to significantly reduce screenee wait time while guaranteeing a bound on risk.
The combination of deep reinforcement learning and search at both training and test time is a powerful paradigm that has led to a number of successes in single-agent settings and perfect-information games, best exemplified by AlphaZero. However, prior algorithms of this form cannot cope with imperfect-information games. This paper presents ReBeL, a general framework for self-play reinforcement learning and search that provably converges to a Nash equilibrium in any two-player zero-sum game. In the simpler setting of perfect-information games, ReBeL reduces to an algorithm similar to AlphaZero. Results in two different imperfect-information games show ReBeL converges to an approximate Nash equilibrium. We also show ReBeL achieves superhuman performance in heads-up no-limit Texas holdem poker, while using far less domain knowledge than any prior poker AI.
The goal of offline reinforcement learning is to learn a policy from a fixed dataset, without further interactions with the environment. This setting will be an increasingly more important paradigm for real-world applications of reinforcement learning such as robotics, in which data collection is slow and potentially dangerous. Existing off-policy algorithms have limited performance on static datasets due to extrapolation errors from out-of-distribution actions. This leads to the challenge of constraining the policy to select actions within the support of the dataset during training. We propose to simply learn the Policy in the Latent Action Space (PLAS) such that this requirement is naturally satisfied. We evaluate our method on continuous control benchmarks in simulation and a deformable object manipulation task with a physical robot. We demonstrate that our method provides competitive performance consistently across various continuous control tasks and different types of datasets, outperforming existing offline reinforcement learning methods with explicit constraints. Videos and code are available at https://sites.google.com/view/latent-policy.
It is known that there are uncoupled learning heuristics leading to Nash equilibrium in all finite games. Why should players use such learning heuristics and where could they come from? We show that there is no uncoupled learning heuristic leading to Nash equilibrium in all finite games that a player has an incentive to adopt, that would be evolutionary stable or that could learn itself. Rather, a player has an incentive to strategically teach such a learning opponent in order secure at least the Stackelberg leader payoff. The impossibility result remains intact when restricted to the classes of generic games, two-player games, potential games, games with strategic complements or 2x2 games, in which learning is known to be nice. More generally, it also applies to uncoupled learning heuristics leading to correlated equilibria, rationalizable outcomes, iterated admissible outcomes, or minimal curb sets. A possibility result restricted to strategically trivial games fails if some generic games outside this class are considered as well.
Making decisions in the presence of a strategic opponent requires one to take into account the opponents ability to actively mask its intended objective. To describe such strategic situations, we introduce the non-cooperative inverse reinforcement learning (N-CIRL) formalism. The N-CIRL formalism consists of two agents with completely misaligned objectives, where only one of the agents knows the true objective function. Formally, we model the N-CIRL formalism as a zero-sum Markov game with one-sided incomplete information. Through interacting with the more informed player, the less informed player attempts to both infer, and act according to, the true objective function. As a result of the one-sided incomplete information, the multi-stage game can be decomposed into a sequence of single-stage games expressed by a recursive formula. Solving this recursive formula yields the value of the N-CIRL game and the more informed players equilibrium strategy. Another recursive formula, constructed by forming an auxiliary game, termed the dual game, yields the less informed players strategy. Building upon these two recursive formulas, we develop a computationally tractable algorithm to approximately solve for the equilibrium strategies. Finally, we demonstrate the benefits of our N-CIRL formalism over the existing multi-agent IRL formalism via extensive numerical simulation in a novel cyber security setting.
Many real-world systems possess a hierarchical structure where a strategic plan is forwarded and implemented in a top-down manner. Examples include business activities in large companies or policy making for reducing the spread during pandemics. We introduce a novel class of games that we call structured hierarchical games (SHGs) to capture these strategic interactions. In an SHG, each player is represented as a vertex in a multi-layer decision tree and controls a real-valued action vector reacting to orders from its predecessors and influencing its descendants behaviors strategically based on its own subjective utility. SHGs generalize extensive form games as well as Stackelberg games. For general SHGs with (possibly) nonconvex payoffs and high-dimensional action spaces, we propose a new solution concept which we call local subgame perfect equilibrium. By exploiting the hierarchical structure and strategic dependencies in payoffs, we derive a back propagation-style gradient-based algorithm which we call Differential Backward Induction to compute an equilibrium. We theoretically characterize the convergence properties of DBI and empirically demonstrate a large overlap between the stable points reached by DBI and equilibrium solutions. Finally, we demonstrate the effectiveness of our algorithm in finding emph{globally} stable solutions and its scalability for a recently introduced class of SHGs for pandemic policy making.