No Arabic abstract
Planning under model uncertainty is a fundamental problem across many applications of decision making and learning. In this paper, we propose the Robust Adaptive Monte Carlo Planning (RAMCP) algorithm, which allows computation of risk-sensitive Bayes-adaptive policies that optimally trade off exploration, exploitation, and robustness. RAMCP formulates the risk-sensitive planning problem as a two-player zero-sum game, in which an adversary perturbs the agents belief over the models. We introduce t
Bayesian networks provide a probabilistic semantics for qualitative assertions about likelihood. A qualitative reasoner based on an algebra over these assertions can derive further conclusions about the influence of actions. While the conclusions are much weaker than those computed from complete probability distributions, they are still valuable for suggesting potential actions, eliminating obviously inferior plans, identifying important tradeoffs, and explaining probabilistic models.
Thanks to recent advances, AI Planning has become the underlying technique for several applications. Figuring prominently among these is automated Web Service Composition (WSC) at the capability level, where services are described in terms of preconditions and effects over ontological concepts. A key issue in addressing WSC as planning is that ontologies are not only formal vocabularies; they also axiomatize the possible relationships between concepts. Such axioms correspond to what has been termed integrity constraints in the actions and change literature, and applying a web service is essentially a belief update operation. The reasoning required for belief update is known to be harder than reasoning in the ontology itself. The support for belief update is severely limited in current planning tools. Our first contribution consists in identifying an interesting special case of WSC which is both significant and more tractable. The special case, which we term forward effects, is characterized by the fact that every ramification of a web service application involves at least one new constant generated as output by the web service. We show that, in this setting, the reasoning required for belief update simplifies to standard reasoning in the ontology itself. This relates to, and extends, current notions of message-based WSC, where the need for belief update is removed by a strong (often implicit or informal) assumption of locality of the individual messages. We clarify the computational properties of the forward effects case, and point out a strong relation to standard notions of planning under uncertainty, suggesting that effective tools for the latter can be successfully adapted to address the former. Furthermore, we identify a significant sub-case, named strictly forward effects, where an actual compilation into planning under uncertainty exists. This enables us to exploit off-the-shelf planning tools to solve message-based WSC in a general form that involves powerful ontologies, and requires reasoning about partial matches between concepts. We provide empirical evidence that this approach may be quite effective, using Conformant-FF as the underlying planner.
We consider the problem of designing policies for partially observable Markov decision processes (POMDPs) with dynamic coherent risk objectives. Synthesizing risk-averse optimal policies for POMDPs requires infinite memory and thus undecidable. To overcome this difficulty, we propose a method based on bounded policy iteration for designing stochastic but finite state (memory) controllers, which takes advantage of standard convex optimization methods. Given a memory budget and optimality criterion, the proposed method modifies the stochastic finite state controller leading to sub-optimal solutions with lower coherent risk.
Humans can learn and reason under substantial uncertainty in a space of infinitely many concepts, including structured relational concepts (a scene with objects that have the same color) and ad-hoc categories defined through goals (objects that could fall on ones head). In contrast, standard classification benchmarks: 1) consider only a fixed set of category labels, 2) do not evaluate compositional concept learning and 3) do not explicitly capture a notion of reasoning under uncertainty. We introduce a new few-shot, meta-learning benchmark, Compositional Reasoning Under Uncertainty (CURI) to bridge this gap. CURI evaluates different aspects of productive and systematic generalization, including abstract understandings of disentangling, productive generalization, learning boolean operations, variable binding, etc. Importantly, it also defines a model-independent compositionality gap to evaluate the difficulty of generalizing out-of-distribution along each of these axes. Extensive evaluations across a range of modeling choices spanning different modalities (image, schemas, and sounds), splits, privileged auxiliary concept information, and choices of negatives reveal substantial scope for modeling advances on the proposed task. All code and datasets will be available online.
This paper targets control problems that exhibit specific safety and performance requirements. In particular, the aim is to ensure that an agent, operating under uncertainty, will at runtime strictly adhere to such requirements. Previous works create so-called shields that correct an existing controller for the agent if it is about to take unbearable safety risks. However, so far, shields do not consider that an environment may not be fully known in advance and may evolve for complex control and learning tasks. We propose a new method for the efficient computation of a shield that is adaptive to a changing environment. In particular, we base our method on problems that are sufficiently captured by potentially infinite Markov decision processes (MDP) and quantitative specifications such as mean payoff objectives. The shield is independent of the controller, which may, for instance, take the form of a high-performing reinforcement learning agent. At runtime, our method builds an internal abstract representation of the MDP and constantly adapts this abstraction and the shield based on observations from the environment. We showcase the applicability of our method via an urban traffic control problem.