A Markov Decision Process (MDP) is a popular model for reinforcement learning. However, its commonly used assumption of stationary dynamics and rewards is too stringent and fails to hold in adversarial, nonstationary, or multi-agent problems. We study an episodic setting where the parameters of an MDP can differ across episodes. We learn a reliable policy of this potentially adversarial MDP by developing an Adversarial Reinforcement Learning (ARL) algorithm that reduces our MDP to a sequence of \emph{adversarial} bandit problems. ARL achieves $O(\sqrt{SATH^3})$ regret, which is optimal with respect to $S$, $A$, and $T$, and its dependence on $H$ is the best (even for the usual stationary MDP) among existing model-free methods.
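The abstract does not spell out the reduction itself; below is a minimal sketch of the general idea of reducing an adversarial MDP to a collection of adversarial bandit problems, here by running one independent EXP3 learner per (state, step) pair. All names (`EXP3Learner`, the parameters, the toy sizes) are hypothetical illustrations, not the paper's ARL algorithm.

```python
import math
import random

class EXP3Learner:
    """Adversarial bandit learner (EXP3) for a single (state, step) pair.

    Hypothetical sketch: rewards are assumed to lie in [0, 1], and the
    exploration rate gamma is fixed rather than tuned as in the paper.
    """

    def __init__(self, num_actions: int, gamma: float = 0.1):
        self.gamma = gamma                   # exploration rate
        self.weights = [1.0] * num_actions  # exponential weights

    def probabilities(self):
        total = sum(self.weights)
        k = len(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / k
                for w in self.weights]

    def select_action(self) -> int:
        probs = self.probabilities()
        return random.choices(range(len(probs)), weights=probs)[0]

    def update(self, action: int, reward: float):
        # Importance-weighted reward estimate keeps the update unbiased
        # even though only the played action's reward is observed.
        probs = self.probabilities()
        estimate = reward / probs[action]
        k = len(self.weights)
        self.weights[action] *= math.exp(self.gamma * estimate / k)

# One bandit learner per (state, step) pair; each episode, the current
# learners jointly induce the policy that is played.
num_states, horizon, num_actions = 5, 3, 2
learners = {(s, h): EXP3Learner(num_actions)
            for s in range(num_states) for h in range(horizon)}
```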
We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\tilde{\mathcal{O}}(L|X|\sqrt{|A|T})$ regret.
This work studies the problem of learning episodic Markov Decision Processes with known transition and bandit feedback. We develop the first algorithm with a ``best-of-both-worlds'' guarantee: it achieves $\mathcal{O}(\log T)$ regret when the losses are stochastic, while simultaneously retaining a worst-case guarantee when the losses are adversarial.
Multi-objective Neural Architecture Search (NAS) aims to discover novel architectures in the presence of multiple conflicting objectives. Despite recent progress, the problem of approximating the full Pareto front accurately and efficiently remains challenging.
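For readers unfamiliar with the target object, here is a minimal sketch of extracting the non-dominated (Pareto) set from a pool of candidate architectures scored on objectives to be minimized. The function name and the toy (error, latency) numbers are hypothetical and unrelated to any specific NAS method.

```python
def pareto_front(points):
    """Return the non-dominated subset of `points`.

    Each point is a tuple of objective values to be minimized
    (e.g., validation error and latency of an architecture).
    A point p is dominated if some other point q is <= p in every
    objective and differs from p (hence strictly better somewhere).
    """
    front = []
    for p in points:
        dominated = any(
            all(qi <= pi for qi, pi in zip(q, p)) and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# e.g., (error, latency_ms) for four candidate architectures
candidates = [(0.10, 30.0), (0.08, 45.0), (0.12, 28.0), (0.11, 50.0)]
print(pareto_front(candidates))  # [(0.10, 30.0), (0.08, 45.0), (0.12, 28.0)]
```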
In data stream mining, predictive models typically suffer drops in predictive performance due to concept drift. Since enough data representing the new concept must be collected before that concept can be learnt well, the predictive performance of existing models suffers during this transition period.
Off-policy learning is a framework for evaluating and optimizing policies without deploying them, using data collected by another policy. Real-world environments are typically non-stationary, and offline learned policies should adapt to these changes.
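Off-policy evaluation is typically built on importance weighting; the sketch below shows the standard importance-sampling estimator on a contextual-bandit-style log. It illustrates the general framework the abstract refers to, not the adaptive method the paper itself proposes, and all variable names are hypothetical.

```python
def is_estimate(logged_data, target_policy):
    """Standard importance-sampling estimate of a target policy's value.

    logged_data: iterable of (context, action, reward, logging_prob),
        where logging_prob = mu(action | context) under the behaviour
        policy that collected the data.
    target_policy: callable (context, action) -> pi(action | context).
    """
    total, n = 0.0, 0
    for context, action, reward, logging_prob in logged_data:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy was to take the logged action.
        weight = target_policy(context, action) / logging_prob
        total += weight * reward
        n += 1
    return total / n if n else 0.0

# Toy log: two contexts, binary actions, uniform logging policy.
log = [("x1", 0, 1.0, 0.5), ("x1", 1, 0.0, 0.5), ("x2", 1, 1.0, 0.5)]
greedy = lambda context, action: 1.0 if action == 1 else 0.0
print(is_estimate(log, greedy))  # estimated value of always playing action 1
```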