
A Unified Framework for Conservative Exploration

Posted by Yunchang Yang
Publication date: 2021
Language: English





We study bandits and reinforcement learning (RL) subject to a conservative constraint where the agent is asked to perform at least as well as a given baseline policy. This setting is particularly relevant in real-world domains such as digital marketing, healthcare, production, and finance. For multi-armed bandits, linear bandits and tabular RL, specialized algorithms and theoretical analyses were proposed in previous work. In this paper, we present a unified framework for conservative bandits and RL, in which our core technique is to calculate the necessary and sufficient budget obtained from running the baseline policy. For lower bounds, our framework gives a black-box reduction that turns a certain lower bound in the nonconservative setting into a new lower bound in the conservative setting. We strengthen the existing lower bound for conservative multi-armed bandits and obtain new lower bounds for conservative linear bandits, tabular RL and low-rank MDP. For upper bounds, our framework turns a certain nonconservative upper-confidence-bound (UCB) algorithm into a conservative algorithm with a simple analysis. For multi-armed bandits, linear bandits and tabular RL, our new upper bounds tighten or match existing ones with significantly simpler analyses. We also obtain a new upper bound for conservative low-rank MDP.
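The upper-bound reduction is easiest to illustrate in the simplest setting. Below is a minimal sketch (not the paper's exact algorithm) of a conservative UCB loop for multi-armed bandits: the agent plays the UCB arm only when a pessimistic lower-confidence estimate of cumulative reward still covers the conservative constraint, and otherwise plays the baseline arm to accumulate budget. The known baseline mean `mu_b` and rewards in [0, 1] are assumptions of the sketch.

```python
import numpy as np

def conservative_ucb(arms, baseline_arm, mu_b, T, alpha):
    """Minimal conservative-UCB sketch for a K-armed bandit.

    Plays the UCB arm only when a pessimistic (lower-confidence) estimate
    of cumulative reward still covers the conservative constraint
        cumulative reward >= (1 - alpha) * t * mu_b,
    and otherwise plays the baseline arm to build budget. `arms` is a
    list of callables returning rewards in [0, 1]; `mu_b` is the
    (assumed known) mean reward of the baseline arm.
    """
    K = len(arms)
    counts = np.zeros(K)
    sums = np.zeros(K)
    for t in range(1, T + 1):
        pulled = np.maximum(counts, 1)
        bonus = np.sqrt(2.0 * np.log(t + 1) / pulled)
        means = sums / pulled
        ucb = np.where(counts > 0, means + bonus, np.inf)
        lcb = np.clip(means - bonus, 0.0, 1.0) * (counts > 0)
        candidate = int(np.argmax(ucb))
        # Pessimistic total reward if the optimistic arm is played next.
        pessimistic = counts @ lcb + lcb[candidate]
        if pessimistic >= (1.0 - alpha) * t * mu_b:
            action = candidate      # enough budget: explore optimistically
        else:
            action = baseline_arm   # insufficient budget: play the baseline
        reward = arms[action]()
        counts[action] += 1
        sums[action] += reward
    return sums.sum()

# Example with Bernoulli arms; the baseline is arm 1 with known mean 0.5.
rng = np.random.default_rng(0)
arms = [lambda p=p: float(rng.random() < p) for p in (0.3, 0.5, 0.7)]
print(conservative_ucb(arms, baseline_arm=1, mu_b=0.5, T=5000, alpha=0.1))
```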




Read also

Huyan Huang, Yipeng Liu, Ce Zhu (2020)
Coupled tensor decomposition reveals the joint data structure by incorporating prior knowledge that comes from the latent coupled factors. The tensor ring (TR) decomposition is invariant under the permutation of tensors with different mode properties, which ensures the uniformity of decomposed factors and mode attributes. The TR has powerful expressive ability and has achieved success in some multi-dimensional data processing applications. To let coupled tensors help each other with missing-component estimation, in this paper we utilize TR for coupled completion by sharing parts of the latent factors. The optimization model for coupled TR completion is developed with a novel Frobenius norm. It is solved by a block coordinate descent algorithm which efficiently solves a series of quadratic problems resulting from the sampling pattern. The excess risk bound for this optimization model shows the theoretical performance enhancement in comparison with other coupled nuclear-norm-based methods. The proposed method is validated in numerical experiments on synthetic data, and experimental results on real-world data demonstrate its superiority over the state-of-the-art methods in terms of recovery accuracy.
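The tensor-ring algebra is heavy, but the block-coordinate-descent structure is easy to see on a simplified analogue. The sketch below is an assumption-laden reduction of the idea, coupled matrix factorization with one shared factor `U` and fully observed data, where each block update is exactly the kind of closed-form quadratic (least-squares) subproblem the paper's algorithm solves:

```python
import numpy as np

def coupled_als(X1, X2, rank, iters=50, lam=1e-3, seed=0):
    """Block coordinate descent for a coupled factorization
    X1 ~ U @ V1.T and X2 ~ U @ V2.T with a shared factor U.
    Each block update is a ridge least-squares (quadratic) problem,
    mirroring the subproblems in coupled TR completion (here on fully
    observed matrices rather than tensors with missing entries)."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((X1.shape[0], rank))
    I = lam * np.eye(rank)
    for _ in range(iters):
        P = np.linalg.inv(U.T @ U + I)
        V1 = X1.T @ U @ P            # update block V1 (U, V2 fixed)
        V2 = X2.T @ U @ P            # update block V2 (U, V1 fixed)
        G = np.linalg.inv(V1.T @ V1 + V2.T @ V2 + I)
        U = (X1 @ V1 + X2 @ V2) @ G  # update shared block U (V1, V2 fixed)
    return U, V1, V2
```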
The worst-case training principle that minimizes the maximal adversarial loss, also known as adversarial training (AT), has been shown to be a state-of-the-art approach for enhancing adversarial robustness against norm-ball-bounded input perturbations. Nonetheless, min-max optimization beyond the purpose of AT has not been rigorously explored in the research of adversarial attack and defense. In particular, given a set of risk sources (domains), minimizing the maximal loss induced from the domain set can be reformulated as a general min-max problem that is different from AT. Examples of this general formulation include attacking model ensembles, devising a universal perturbation under multiple inputs or data transformations, and generalized AT over different types of attack models. We show that these problems can be solved under a unified and theoretically principled min-max optimization framework. We also show that the self-adjusted domain weights learned from our method provide a means to explain the difficulty level of attack and defense over multiple domains. Extensive experiments show that our approach leads to substantial performance improvement over the conventional averaging strategy.
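A minimal sketch of the general template min_theta max_{w in simplex} sum_i w_i f_i(theta): alternate exponentiated-gradient ascent on the domain weights with gradient descent on the parameters, so harder domains receive more weight. This is an illustrative scheme, not the paper's exact solver; `losses_grads`, `eta`, and `lr` are assumptions of the sketch.

```python
import numpy as np

def min_max_over_domains(losses_grads, theta0, steps=300, lr=0.05, eta=0.5):
    """Alternating scheme for min_theta max_{w in simplex} sum_i w_i f_i(theta).
    `losses_grads` is a list of (f_i, grad_f_i) callable pairs. The weight
    update is exponentiated-gradient ascent (simplex projection for free),
    so the weights self-adjust toward the hardest domains."""
    w = np.full(len(losses_grads), 1.0 / len(losses_grads))
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        vals = np.array([f(theta) for f, _ in losses_grads])
        w = w * np.exp(eta * vals)       # upweight high-loss domains
        w /= w.sum()                     # renormalize onto the simplex
        grad = sum(wi * g(theta) for wi, (_, g) in zip(w, losses_grads))
        theta = theta - lr * grad        # descend on the weighted objective
    return theta, w

# Two toy "domains": theta settles nearer the harder of the two minima.
domains = [(lambda t: (t - 1.0) ** 2, lambda t: 2 * (t - 1.0)),
           (lambda t: (t + 2.0) ** 2, lambda t: 2 * (t + 2.0))]
theta, w = min_max_over_domains(domains, theta0=0.0)
```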
Safe exploration presents a major challenge in reinforcement learning (RL): when active data collection requires deploying partially trained policies, we must ensure that these policies avoid catastrophically unsafe regions while still enabling trial-and-error learning. In this paper, we target the problem of safe exploration in RL by learning a conservative safety estimate of environment states through a critic, and provably upper bound the likelihood of catastrophic failures at every training iteration. We theoretically characterize the tradeoff between safety and policy improvement, show that the safety constraints are likely to be satisfied with high probability during training, derive provable convergence guarantees for our approach, which is no worse asymptotically than standard RL, and demonstrate the efficacy of the proposed approach on a suite of challenging navigation, manipulation, and locomotion tasks. Empirically, we show that the proposed approach can achieve competitive task performance while incurring significantly lower catastrophic failure rates during training than prior methods. Videos are at https://sites.google.com/view/conservative-safety-critics/home
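One way such a critic could gate exploration at data-collection time is to resample actions from the partially trained policy until one passes a safety check. This is a hypothetical shielding sketch around an assumed `q_safe(state, action)` that returns a conservatively overestimated failure probability; it is not the paper's training procedure.

```python
def safe_action(policy_sample, q_safe, state, eps=0.1, n_resample=10,
                fallback_action=None):
    """Shielded action selection with a conservative safety critic.

    `q_safe(state, action)` is assumed to (over)estimate the probability
    of catastrophic failure; only actions estimated safer than `eps` are
    executed. If no sampled action passes, defer to a known-safe fallback.
    """
    for _ in range(n_resample):
        action = policy_sample(state)
        if q_safe(state, action) <= eps:
            return action
    return fallback_action
```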
In this paper, we develop a quadrature framework for large-scale kernel machines via a numerical integration representation. Considering that the integration domain and measure of typical kernels, e.g., Gaussian kernels and arc-cosine kernels, are fully symmetric, we leverage deterministic fully symmetric interpolatory rules to efficiently compute quadrature nodes and associated weights for kernel approximation. The developed interpolatory rules are able to reduce the number of needed nodes while retaining a high approximation accuracy. Further, we randomize the above deterministic rules using classical Monte Carlo sampling and control-variates techniques, with two merits: 1) the proposed stochastic rules make the dimension of the feature mapping flexibly varying, such that we can control the discrepancy between the original and approximate kernels by tuning the dimension; 2) our stochastic rules have nice statistical properties of unbiasedness and variance reduction with a fast convergence rate. In addition, we elucidate the relationship between our deterministic/stochastic interpolatory rules and current quadrature rules for kernel approximation, including the sparse grids quadrature and stochastic spherical-radial rules, thereby unifying these methods under our framework. Experimental results on several benchmark datasets show that our methods compare favorably with other representative kernel-approximation-based methods.
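For context, the plain Monte Carlo quadrature baseline that these rules refine is compact: for the Gaussian kernel, sampling the (fully symmetric) spectral measure yields standard random Fourier features. The sketch below shows that baseline only; the paper's deterministic and variance-reduced stochastic rules replace the sampled nodes and uniform weights with structured ones.

```python
import numpy as np

def rff_features(X, n_features, gamma=1.0, seed=0):
    """Monte Carlo quadrature (random Fourier features) approximating the
    Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2): frequencies are
    sampled from the kernel's spectral measure N(0, 2*gamma*I), so that
    Z @ Z.T approximates the kernel matrix."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_features)) * np.sqrt(2.0 * gamma)
    b = rng.uniform(0.0, 2.0 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Usage: compare the approximation against the exact Gaussian kernel.
X = np.random.default_rng(1).standard_normal((100, 5))
Z = rff_features(X, n_features=2048, gamma=0.5)
K_approx = Z @ Z.T
```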
Deep Neural Networks are well known to be vulnerable to adversarial attacks and backdoor attacks, where minor modifications of the input can mislead the models into giving wrong results. Although defenses against adversarial attacks have been widely studied, research on mitigating backdoor attacks is still at an early stage. It is unknown whether there are any connections and common characteristics between the defenses against these two attacks. In this paper, we present a unified framework for detecting malicious examples and protecting the inference results of Deep Learning models. This framework is based on our observation that both adversarial examples and backdoor examples exhibit anomalies during the inference process, highly distinguishable from benign samples. As a result, we repurpose and revise four existing adversarial defense methods for detecting backdoor examples. Extensive evaluations indicate these approaches provide reliable protection against backdoor attacks, with higher accuracy than detecting adversarial examples. These solutions also reveal the relations of adversarial examples, backdoor examples and normal samples in model sensitivity, activation space and feature space. This can enhance our understanding of the inherent features of these two attacks, as well as the defense opportunities.
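A generic instance of that observation (not one of the four repurposed defenses): fit a Gaussian to benign penultimate-layer activations and score new inputs by Mahalanobis distance; both adversarial and backdoor examples tend to land unusually far from the benign distribution.

```python
import numpy as np

def mahalanobis_scorer(benign_features):
    """Fit a Gaussian to benign-sample activations and return a scorer
    that assigns each input its squared Mahalanobis distance; large
    scores flag inference-time anomalies (adversarial or backdoor)."""
    mu = benign_features.mean(axis=0)
    prec = np.linalg.pinv(np.cov(benign_features, rowvar=False))
    def score(features):
        d = features - mu
        return np.einsum('ij,jk,ik->i', d, prec, d)
    return score

# Usage: threshold scores at a high benign quantile to flag anomalies.
benign = np.random.default_rng(2).standard_normal((500, 64))
score = mahalanobis_scorer(benign)
threshold = np.quantile(score(benign), 0.99)
```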
