In this paper, we study the problem of regret minimization in reinforcement learning (RL) under differential privacy constraints. This work is motivated by the wide range of RL applications for providing personalized service, where privacy concerns are becoming paramount. In contrast to previous works, we take the first step towards non-tabular RL settings, while providing a rigorous privacy guarantee. In particular, we consider the adaptive control of differentially private linear quadratic (LQ) systems. We develop the first private RL algorithm, PRL, which is able to attain a sub-linear regret while guaranteeing privacy protection. More importantly, the additional cost due to privacy is only on the order of $\frac{\ln(1/\delta)^{1/4}}{\epsilon^{1/2}}$ given privacy parameters $\epsilon, \delta > 0$. Through this process, we also provide a general procedure for adaptive control of LQ systems under changing regularizers, which not only generalizes previous non-private controls, but also serves as the basis for general private controls.
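As a rough illustration of how the stated privacy cost scales, the sketch below evaluates the factor $\ln(1/\delta)^{1/4} / \epsilon^{1/2}$ for given privacy parameters. It assumes the expression is read as $(\ln(1/\delta))^{1/4}$ and that $\delta < 1$ so the logarithm is positive; the function name `privacy_cost_factor` is hypothetical, and constants and other problem-dependent terms in the regret bound are not modeled.

```python
import math


def privacy_cost_factor(epsilon: float, delta: float) -> float:
    """Evaluate (ln(1/delta))^(1/4) / epsilon^(1/2).

    Assumption: 0 < delta < 1 and epsilon > 0, so that the
    logarithm and the square root are both well defined.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    return math.log(1.0 / delta) ** 0.25 / math.sqrt(epsilon)


if __name__ == "__main__":
    # Stronger privacy (smaller epsilon) increases the factor.
    for eps in (0.1, 1.0, 10.0):
        print(eps, privacy_cost_factor(eps, delta=1e-5))
```

Under this reading, quadrupling $\epsilon$ halves the factor, while the dependence on $\delta$ is much weaker because of the fourth root of a logarithm.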
We address the problem of model selection for the finite horizon episodic Reinforcement Learning (RL) problem where the transition kernel $P^*$ belongs to a family of models $\mathcal{P}^*$ with finite metric entropy. In the model selection framework, instead of $\mathcal{P}^*$, we are given $M$ nested families of transition kernels $\mathcal{P}_1 \subset \mathcal{P}_2 \subset \ldots \subset \mathcal{P}_M$. We propose and analyze a novel algorithm, namely Adaptive Reinforcement Learning (General) (ARL-GEN), that adapts to the smallest such family where the true transition kernel $P^*$ lies. ARL-GEN uses the Upper Confidence Reinforcement Learning (UCRL) algorithm with value targeted regression as a blackbox and puts a model selection module at the beginning of each epoch. Under a mild separability assumption on the model classes, we show that ARL-GEN obtains a regret of $\tilde{\mathcal{O}}(d_{\mathcal{E}}^* H^2 + \sqrt{d_{\mathcal{E}}^* \mathbb{M}^* H^2 T})$, with high probability, where $H$ is the horizon length, $T$ is the total number of steps, $d_{\mathcal{E}}^*$ is the Eluder dimension and $\mathbb{M}^*$ is the metric entropy corresponding to $\mathcal{P}^*$. Note that this regret scaling matches that of an oracle that knows $\mathcal{P}^*$ in advance. We show that the cost of model selection for ARL-GEN is an additive term in the regret having a weak dependence on $T$. Subsequently, we remove the separability assumption and consider the setup of linear mixture MDPs, where the transition kernel $P^*$ has a linear function approximation. With this low rank structure, we propose novel adaptive algorithms for model selection, and obtain (order-wise) regret identical to that of an oracle with knowledge of the true model class.
We consider the regret minimization problem in reinforcement learning (RL) in the episodic setting. In many real-world RL environments, the state and action spaces are continuous or very large. Existing approaches establish regret guarantees through either a low-dimensional representation of the stochastic transition model or an approximation of the $Q$-functions. However, the understanding of function approximation schemes for state-value functions largely remains missing. In this paper, we propose an online model-based RL algorithm, namely the CME-RL, that learns representations of transition distributions as embeddings in a reproducing kernel Hilbert space while carefully balancing the exploitation-exploration tradeoff. We demonstrate the efficiency of our algorithm by proving a frequentist (worst-case) regret bound that is of order $\tilde{O}\big(H\gamma_N\sqrt{N}\big)$, where $H$ is the episode length, $N$ is the total number of time steps and $\gamma_N$ is an information theoretic quantity related to the effective dimension of the state-action feature space. Our method bypasses the need for estimating transition probabilities and applies to any domain on which kernels can be defined. It also brings new insights into the general theory of kernel methods for approximate inference and RL regret minimization.
We consider multi-objective optimization (MOO) of an unknown vector-valued function in the non-parametric Bayesian optimization (BO) setting, with the aim being to learn points on the Pareto front of the objectives. Most existing BO algorithms do not model the fact that the multiple objectives (equivalently, tasks) can share similarities, and even the few that do lack rigorous, finite-time regret guarantees that explicitly capture inter-task structure. In this work, we address this problem by modelling inter-task dependencies using a multi-task kernel and develop two novel BO algorithms based on random scalarizations of the objectives. Our algorithms employ vector-valued kernel regression as a stepping stone and belong to the upper confidence bound class of algorithms. Under a smoothness assumption that the unknown vector-valued function is an element of the reproducing kernel Hilbert space associated with the multi-task kernel, we derive worst-case regret bounds for our algorithms that explicitly capture the similarities between tasks. We numerically benchmark our algorithms on both synthetic and real-life MOO problems, and show the advantages offered by learning with multi-task kernels.
We develop algorithms with low regret for learning episodic Markov decision processes based on kernel approximation techniques. The algorithms are based on both the Upper Confidence Bound (UCB) and Posterior or Thompson Sampling (PSRL) philosophies, and work in the general setting of continuous state and action spaces when the true unknown transition dynamics are assumed to have smoothness induced by an appropriate Reproducing Kernel Hilbert Space (RKHS).
We present two algorithms for Bayesian optimization in the batch feedback setting, based on Gaussian process upper confidence bound and Thompson sampling approaches, along with frequentist regret guarantees and numerical results.
We consider black box optimization of an unknown function in the nonparametric Gaussian process setting when the noise in the observed function values can be heavy tailed. This is in contrast to existing literature that typically assumes sub-Gaussian noise distributions for queries. Under the assumption that the unknown function belongs to the Reproducing Kernel Hilbert Space (RKHS) induced by a kernel, we first show that an adaptation of the well-known GP-UCB algorithm with reward truncation enjoys sublinear $\tilde{O}(T^{\frac{2+\alpha}{2(1+\alpha)}})$ regret even with only the $(1+\alpha)$-th moments, $\alpha \in (0,1]$, of the reward distribution being bounded ($\tilde{O}$ hides logarithmic factors). However, for the common squared exponential (SE) and Matérn kernels, this is seen to be significantly larger than a fundamental $\Omega(T^{\frac{1}{1+\alpha}})$ lower bound on regret. We resolve this gap by developing novel Bayesian optimization algorithms, based on kernel approximation techniques, with regret bounds matching the lower bound in order for the SE kernel. We numerically benchmark the algorithms on environments based on both synthetic models and real-world data sets.
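To make the gap between the two rates concrete, the exponents can be compared directly. The following short calculation uses only the upper-bound and lower-bound exponents stated in the abstract; logarithmic factors and kernel-dependent terms are ignored.

```latex
\[
\frac{2+\alpha}{2(1+\alpha)} - \frac{1}{1+\alpha}
  = \frac{(2+\alpha) - 2}{2(1+\alpha)}
  = \frac{\alpha}{2(1+\alpha)} > 0
  \qquad \text{for } \alpha \in (0,1].
\]
% Example: \alpha = 1 (bounded second moment) gives T^{3/4} for truncated
% GP-UCB versus T^{1/2} for the lower bound, a gap of 1/4 in the exponent.
```

The gap shrinks as $\alpha \to 0$, where both exponents approach 1, and is largest at $\alpha = 1$.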
We consider online learning for minimizing regret in unknown, episodic Markov decision processes (MDPs) with continuous states and actions. We develop variants of the UCRL and posterior sampling algorithms that employ nonparametric Gaussian process priors to generalize across the state and action spaces. When the transition and reward functions of the true MDP are members of the associated Reproducing Kernel Hilbert Spaces of functions induced by symmetric psd kernels (frequentist setting), we show that the algorithms enjoy sublinear regret bounds. The bounds are in terms of explicit structural parameters of the kernels, namely a novel generalization of the information gain metric from kernelized bandits, and highlight the influence of transition and reward function structure on the learning performance. Our results are applicable to multidimensional state and action spaces with composite kernel structures, and generalize results from the literature on kernelized bandits, and the adaptive control of parametric linear dynamical systems with quadratic costs.
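For orientation, the sketch below computes the standard information gain from the kernelized bandit literature, $\tfrac{1}{2}\log\det(I + \sigma^{-2} K)$, for a fixed set of query points. This is the familiar bandit quantity only, not the generalization developed in the paper; the squared exponential kernel, the NumPy implementation, and the names `se_kernel` and `information_gain` are assumptions made for illustration.

```python
import numpy as np


def se_kernel(X, Y, lengthscale=1.0):
    """Squared exponential kernel matrix between rows of X and rows of Y."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)


def information_gain(X, noise_var=1.0, lengthscale=1.0):
    """Return 0.5 * log det(I + K / noise_var) for the points in X."""
    K = se_kernel(X, X, lengthscale)
    _, logdet = np.linalg.slogdet(np.eye(len(X)) + K / noise_var)
    return 0.5 * logdet


if __name__ == "__main__":
    repeated = np.zeros((5, 2))                        # same point five times
    spread = 100.0 * np.arange(10.0).reshape(5, 2)     # far-apart points
    print(information_gain(repeated), information_gain(spread))
```

Querying the same point repeatedly yields little information, while well-separated points yield close to the maximum for that number of queries, which is why this quantity acts as a measure of effective dimension in such bounds.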
We consider the problem of online learning in misspecified linear stochastic multi-armed bandit problems. Regret guarantees for state-of-the-art linear bandit algorithms such as Optimism in the Face of Uncertainty Linear bandit (OFUL) hold under the assumption that the arms' expected rewards are perfectly linear in their features. It is, however, of interest to investigate the impact of potential misspecification in linear bandit models, where the expected rewards are perturbed away from the linear subspace determined by the arms' features. Although OFUL has recently been shown to be robust to relatively small deviations from linearity, we show that any linear bandit algorithm that enjoys optimal regret performance in the perfectly linear setting (e.g., OFUL) must suffer linear regret under a sparse additive perturbation of the linear model. In an attempt to overcome this negative result, we define a natural class of bandit models characterized by a non-sparse deviation from linearity. We argue that the OFUL algorithm can fail to achieve sublinear regret even under models that have non-sparse deviation. We finally develop a novel bandit algorithm, comprising a hypothesis test for linearity followed by a decision to use either the OFUL or Upper Confidence Bound (UCB) algorithm. For perfectly linear bandit models, the algorithm provably exhibits OFUL's favorable regret performance, while for misspecified models satisfying the non-sparse deviation property, the algorithm avoids the linear regret phenomenon and falls back on UCB's sublinear regret scaling. Numerical experiments on synthetic data, and on recommendation data from the public Yahoo! Learning to Rank Challenge dataset, empirically support our findings.
We consider the stochastic bandit problem with a continuous set of arms, with the expected reward function over the arms assumed to be fixed but unknown. We provide two new Gaussian process-based algorithms for continuous bandit optimization, Improved GP-UCB (IGP-UCB) and GP-Thompson sampling (GP-TS), and derive corresponding regret bounds. Specifically, the bounds hold when the expected reward function belongs to the reproducing kernel Hilbert space (RKHS) that naturally corresponds to a Gaussian process kernel used as input by the algorithms. Along the way, we derive a new self-normalized concentration inequality for vector-valued martingales of arbitrary, possibly infinite, dimension. Finally, experimental evaluation and comparisons to existing algorithms on synthetic and real-world environments are carried out that highlight the favorable gains of the proposed strategies in many cases.
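The general shape of a GP-based upper confidence bound loop can be sketched as follows. This is a minimal generic GP-UCB sketch, not the IGP-UCB or GP-TS algorithms of the paper: it assumes a one-dimensional arm set discretized on a grid, a squared exponential kernel, and a constant exploration weight `beta` in place of any theoretically derived confidence width. All function names and default values are hypothetical.

```python
import numpy as np


def se_kernel(a, b, lengthscale=0.2):
    """Squared exponential kernel between 1-D arrays a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)


def gp_posterior(X, y, Xs, noise_var=0.01, lengthscale=0.2):
    """GP posterior mean and standard deviation at test points Xs."""
    K = se_kernel(X, X, lengthscale) + noise_var * np.eye(len(X))
    Ks = se_kernel(X, Xs, lengthscale)            # shape (n_train, n_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = 1.0 - np.sum(v**2, axis=0)              # prior variance k(x, x) = 1
    return mean, np.sqrt(np.maximum(var, 0.0))


def gp_ucb(f, grid, rounds=30, beta=2.0, noise_std=0.1, seed=0):
    """Run a simple UCB loop; beta is a fixed, assumed exploration weight."""
    rng = np.random.default_rng(seed)
    X = [grid[rng.integers(len(grid))]]           # random first arm
    y = [f(X[0]) + noise_std * rng.normal()]
    for _ in range(rounds - 1):
        mean, std = gp_posterior(np.array(X), np.array(y), grid, noise_std**2)
        x_next = grid[np.argmax(mean + beta * std)]   # optimistic choice
        X.append(x_next)
        y.append(f(x_next) + noise_std * rng.normal())
    return np.array(X), np.array(y)


if __name__ == "__main__":
    arms = np.linspace(0.0, 1.0, 101)
    queried, rewards = gp_ucb(lambda x: -(x - 0.7) ** 2, arms)
    print(queried[-5:])
```

A Thompson sampling variant would replace the optimistic index `mean + beta * std` with a draw from the GP posterior over the grid, followed by the same argmax.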