The problem of exploration in unknown environments continues to pose a challenge for reinforcement learning algorithms, as interactions with the environment are usually expensive or limited. Setting subgoals with an intrinsic reward provides supplemental feedback that aids an agent in environments with sparse and delayed rewards. In fact, it can be an effective tool for directing the agent's exploration toward useful parts of the state space. In this paper, we consider problems where an agent faces an unknown task in the future and is given prior opportunities to ``practice'' on related tasks where the interactions are still expensive. We propose a one-step Bayes-optimal algorithm for selecting subgoal designs, along with the number of episodes and the episode length, to efficiently maximize the expected performance of an agent. We demonstrate its excellent performance on a variety of tasks and also prove an asymptotic optimality guarantee.
For partially unknown linear systems, we present a systematic control design approach based on data generated from measurements of closed-loop experiments with suitable test controllers. These experiments are used to improve the achieved performance and to reduce the uncertainty about the unknown parts of the system. This is achieved through a parametrization of promising controllers with convex relaxation techniques from robust control, which guarantees that their implementation on the unknown plant is safe. This approach permits the systematic incorporation of available prior knowledge about the system by employing the framework of linear fractional representations.
We present a deterministic exploration mechanism for sponsored search auctions, which enables the auctioneer to learn the relevance scores of advertisers and allows advertisers to estimate the true value of clicks generated at the auction site. This exploratory mechanism deviates only minimally from the mechanism currently used by Google and Yahoo!, in the sense that it retains the same pricing rule, a similar ranking scheme, and a similar mathematical structure of payoffs. In particular, the relevance scores and true values are estimated by giving lower-ranked advertisers a chance to obtain better slots. This allows the search engine to potentially test a new pool of advertisers and, correspondingly, enables new advertisers to estimate the value of clicks/leads generated via the auction. Both of these quantities are unknown a priori, and knowledge of them is necessary for the auction to operate efficiently. We show that such an exploration policy can be incorporated without any significant loss in revenue for the auctioneer. We compare the revenue of the new mechanism to that of the standard mechanism at their corresponding symmetric Nash equilibria (SNE) and compute the cost of uncertainty, which is defined as the relative loss in expected revenue per impression. We also bound the loss in efficiency, as well as in user experience, due to exploration under the same solution concept (i.e., SNE). Thus, the proposed exploration mechanism learns the relevance scores while incorporating the incentive constraints of the advertisers, who are selfish and are trying to maximize their own profits; the exploration is therefore essentially achieved via mechanism design. We also discuss variations of the new mechanism, such as truthful implementations.
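For readers unfamiliar with the baseline that the sponsored search abstract refers to, the following is a minimal sketch of a standard rank-by-revenue generalized second-price (GSP) auction. It is an illustration under stated assumptions, not the exploration mechanism proposed in the paper: the function name `gsp_rank_and_price`, the dictionary inputs, and the zero price for the last ranked advertiser are all hypothetical choices made for this example.

```python
def gsp_rank_and_price(bids, relevance, num_slots):
    """Rank advertisers by bid * relevance and charge GSP prices per click.

    Assumption: the advertiser in position i pays the smallest bid that
    would keep its score at or above that of the advertiser in position i+1.
    """
    # Score each advertiser by bid weighted with its relevance score.
    scores = {adv: bids[adv] * relevance[adv] for adv in bids}
    order = sorted(scores, key=scores.get, reverse=True)

    allocation = []
    for pos in range(min(num_slots, len(order))):
        adv = order[pos]
        if pos + 1 < len(order):
            # Next-ranked score divided by own relevance.
            price = scores[order[pos + 1]] / relevance[adv]
        else:
            # Hypothetical convention: no competitor below, so price is zero.
            price = 0.0
        allocation.append((adv, price))
    return allocation


bids = {"a": 4.0, "b": 2.0, "c": 1.0}
relevance = {"a": 0.5, "b": 0.5, "c": 0.5}
slots = gsp_rank_and_price(bids, relevance, num_slots=2)
```

Because the relevance scores enter both the ranking and the price, errors in their estimates affect allocation and revenue, which is why the abstract treats learning them as necessary for efficient operation.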
Goal-conditioned hierarchical reinforcement learning (HRL) serves as a successful approach to solving complex and temporally extended tasks. Recently, its success has been extended to more general settings by concurrently learning hierarchical policies and subgoal representations. However, online subgoal representation learning exacerbates the non-stationarity issue of HRL and introduces challenges for exploration in high-level policy learning. In this paper, we propose a state-specific regularization that stabilizes subgoal embeddings in well-explored areas while allowing representation updates in less explored state regions. Benefiting from this stable representation, we design measures of novelty and potential for subgoals, and develop an efficient hierarchical exploration strategy that actively seeks out new promising subgoals and states. Experimental results show that our method significantly outperforms state-of-the-art baselines in continuous control tasks with sparse rewards, and further demonstrate the stability and efficiency of the proposed subgoal representation learning, which promotes superior policy learning.
Reinforcement learning (RL) promises to enable autonomous acquisition of complex behaviors for diverse agents. However, the success of current reinforcement learning algorithms is predicated on an often under-emphasised requirement -- each trial needs to start from a fixed initial state distribution. Unfortunately, resetting the environment to its initial state after each trial requires a substantial amount of human supervision and extensive instrumentation of the environment, which defeats the purpose of autonomous reinforcement learning. In this work, we propose Value-accelerated Persistent Reinforcement Learning (VaPRL), which generates a curriculum of initial states such that the agent can bootstrap on the success of easier tasks to efficiently learn harder tasks. The agent also learns to reach the initial states proposed by the curriculum, minimizing the reliance on human interventions during learning. We observe that VaPRL reduces the interventions required by three orders of magnitude compared to episodic RL while outperforming prior state-of-the-art methods for reset-free RL, both in terms of sample efficiency and asymptotic performance, on a variety of simulated robotics problems.
Attackers incur various costs when manipulating the features of security classifiers. These costs are asymmetric across features and across the directions of change, which cannot be precisely captured by existing cost models based on $L_p$-norm robustness. In this paper, we utilize such domain knowledge to increase the attack cost of evading classifiers, specifically tree ensemble models, which are widely used in security tasks. We propose a new cost modeling method to capture the feature manipulation cost as a constraint, and then we integrate the cost-driven constraint into the node construction process to train robust tree ensembles. During the training process, we use the constraint to find data points that are likely to be perturbed given the feature manipulation cost, and we use a new robust training algorithm to optimize the quality of the trees. Our cost-aware training method can be applied to different types of tree ensembles, including gradient boosted decision trees and random forest models. Using Twitter spam detection as the case study, our evaluation results show that we can increase the attack cost by 10.6X compared to the baseline. Moreover, our robust training method using the cost-driven constraint can achieve higher accuracy, a lower false positive rate, and stronger cost-aware robustness than the state-of-the-art training method using the $L_\infty$-norm cost model. Our code is available at
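The contrast between a symmetric $L_\infty$ cost model and an asymmetric, per-feature cost constraint can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the per-feature budget arrays, and the split convention (a point goes left when its feature value is below the threshold) are assumptions made for this example.

```python
import numpy as np


def linf_box(x, eps):
    """Symmetric box: every feature may move by eps in either direction."""
    return x - eps, x + eps


def cost_aware_box(x, lower_budget, upper_budget):
    """Asymmetric box: each feature has its own decrease/increase budget.

    Hypothetical encoding: a small budget models an expensive direction.
    """
    return x - lower_budget, x + upper_budget


def can_cross_threshold(x_j, lo_j, hi_j, threshold):
    """Check whether a perturbation in [lo_j, hi_j] can flip the split side.

    Assumed split rule: the point goes left if x_j < threshold.
    """
    if x_j < threshold:
        return hi_j >= threshold  # attacker must push the value up
    return lo_j < threshold       # attacker must push the value down


x = np.array([0.5, 0.5])
# Feature 0 is costly to decrease; feature 1 is costly to increase.
lo, hi = cost_aware_box(x, np.array([0.0, 0.3]), np.array([0.1, 0.0]))
lo_sym, hi_sym = linf_box(x, 0.1)

# A split on feature 0 at threshold 0.45 requires decreasing the feature.
flip_cost_aware = can_cross_threshold(x[0], lo[0], hi[0], 0.45)
flip_linf = can_cross_threshold(x[0], lo_sym[0], hi_sym[0], 0.45)
```

In this toy setting the symmetric model treats the point as perturbable across the split, while the asymmetric model does not, because decreasing feature 0 is assumed to be infeasible for the attacker. A training procedure could use such a check to decide which points count as uncertain at a candidate split.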