Benchmarking Perturbation-based Saliency Maps for Explaining Atari Agents


Abstract in English

Recent years have seen a plethora of work on explaining complex intelligent agents. One example is the development of several algorithms that generate saliency maps, which show how much each pixel contributed to the agent's decision. However, most evaluations of such saliency maps focus on image classification tasks. To the best of our knowledge, there is no work that thoroughly compares different saliency maps for Deep Reinforcement Learning agents. This paper compares four perturbation-based approaches to creating saliency maps for Deep Reinforcement Learning agents trained on four different Atari 2600 games. All four approaches work by perturbing parts of the input and measuring how much this affects the agent's output. The approaches are compared using three computational metrics: dependence on the learned parameters of the agent (sanity checks), faithfulness to the agent's reasoning (input degradation), and run-time. In particular, during the sanity checks we find issues with two approaches and propose a solution to fix one of these issues.
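The four approaches differ in how the input is perturbed and how the change in the agent's output is scored. The following sketch only illustrates the shared idea: occlude one patch of the frame at a time, re-query the agent, and record how much its output changes. The Gaussian-blur occlusion, the squared-difference score, and the `policy_fn` interface are illustrative assumptions, not the exact formulation of any of the compared approaches.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def perturbation_saliency(frame, policy_fn, patch_size=5, stride=5, sigma=3):
    """Generic perturbation-based saliency for a single (H, W) game frame.

    `policy_fn` is a hypothetical stand-in for the trained agent: it maps a
    frame to the agent's output vector (e.g., action logits or Q-values).
    """
    baseline = policy_fn(frame)              # agent output on the unmodified frame
    blurred = gaussian_filter(frame, sigma)  # blurred copy used as the perturbation source
    height, width = frame.shape
    saliency = np.zeros((height, width))

    for y in range(0, height, stride):
        for x in range(0, width, stride):
            perturbed = frame.copy()
            # Replace one patch with its blurred counterpart ("removing" local information).
            perturbed[y:y + patch_size, x:x + patch_size] = \
                blurred[y:y + patch_size, x:x + patch_size]
            # Saliency of the patch = how strongly the agent's output changes.
            diff = policy_fn(perturbed) - baseline
            saliency[y:y + patch_size, x:x + patch_size] = 0.5 * np.sum(diff ** 2)

    return saliency
```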
