ترغب بنشر مسار تعليمي؟ اضغط هنا

Fast Reinforcement Learning with Large Action Sets using Error-Correcting Output Codes for MDP Factorization

149   0   0.0 ( 0 )
 نشر من قبل Gabriel Dulac-Arnold
 تاريخ النشر 2012
والبحث باللغة English




اسأل ChatGPT حول البحث

The use of Reinforcement Learning in real-world scenarios is strongly limited by issues of scale. Most RL learning algorithms are unable to deal with problems composed of hundreds or sometimes even dozens of possible actions, and therefore cannot be applied to many real-world problems. We consider the RL problem in the supervised classification framework where the optimal policy is obtained through a multiclass classifier, the set of classes being the set of actions of the problem. We introduce error-correcting output codes (ECOCs) in this setting and propose two new methods for reducing complexity when using rollouts-based approaches. The first method consists in using an ECOC-based classifier as the multiclass classifier, reducing the learning complexity from O(A2) to O(Alog(A)). We then propose a novel method that profits from the ECOCs coding dictionary to split the initial MDP into O(log(A)) seperate two-action MDPs. This second method reduces learning complexity even further, from O(A2) to O(log(A)), thus rendering problems with large action sets tractable. We finish by experimentally demonstrating the advantages of our approach on a set of benchmark problems, both in speed and performance.



قيم البحث

اقرأ أيضاً

This paper introduces MDP homomorphic networks for deep reinforcement learning. MDP homomorphic networks are neural networks that are equivariant under symmetries in the joint state-action space of an MDP. Current approaches to deep reinforcement lea rning do not usually exploit knowledge about such structure. By building this prior knowledge into policy and value networks using an equivariance constraint, we can reduce the size of the solution space. We specifically focus on group-structured symmetries (invertible transformations). Additionally, we introduce an easy method for constructing equivariant network layers numerically, so the system designer need not solve the constraints by hand, as is typically done. We construct MDP homomorphic MLPs and CNNs that are equivariant under either a group of reflections or rotations. We show that such networks converge faster than unstructured baselines on CartPole, a grid world and Pong.
Multi-class classification is mandatory for real world problems and one of promising techniques for multi-class classification is Error Correcting Output Code. We propose a method for constructing the Error Correcting Output Code to obtain the suitab le combination of positive and negative classes encoded to represent binary classifiers. The minimum weight perfect matching algorithm is applied to find the optimal pairs of subset of classes by using the generalization performance as a weighting criterion. Based on our method, each subset of classes with positive and negative labels is appropriately combined for learning the binary classifiers. Experimental results show that our technique gives significantly higher performance compared to traditional methods including the dense random code and the sparse random code both in terms of accuracy and classification times. Moreover, our method requires significantly smaller number of binary classifiers while maintaining accuracy compared to the One-Versus-One.
Universal quantum computers require a large network of qubits robust against errors. Recent theoretical and experimental studies on donor nuclear spins in silicon, engineered on semiconductor platforms compatible with industrial fabrication, show the ir coherent behavior and potential for scalability. Here we present a hardware-efficient quantum protocol that corrects phase flips of a nuclear spin using explicit experimentally feasible operations. We introduce the MAUS encoding (Moment AngUlar System encoding) which uses the large Hilbert space provided by the nuclear spin of the donor to encode the information and employ the electron spin of the donor as an ancilla for error correction. Simulations using present-day experimental manipulation fidelities predict significant improvement in logical qubit fidelity over existing spin quantum-error-correction protocols. These results provides a realizable blueprint for a corrected spin-based qubit.
Most current distributed processing research deals with improving the flexibility and convergence speed of algorithms for networks of finite size with no constraints on information sharing and no concept for expected levels of signal privacy. In this work we investigate the concept of data privacy in unbounded public networks, where linear codes are used to create hard limits on the number of nodes contributing to a distributed task. We accomplish this by wrapping local observations in a linear code and intentionally applying symbol errors prior to transmission. If many nodes join the distributed task, a proportional number of symbol errors are introduced into the code leading to decoding failure if the codes predefined symbol error limit is exceeded.
Realistic environments often provide agents with very limited feedback. When the environment is initially unknown, the feedback, in the beginning, can be completely absent, and the agents may first choose to devote all their effort on exploring effic iently. The exploration remains a challenge while it has been addressed with many hand-tuned heuristics with different levels of generality on one side, and a few theoretically-backed exploration strategies on the other. Many of them are incarnated by intrinsic motivation and in particular explorations bonuses. A common rule of thumb for exploration bonuses is to use $1/sqrt{n}$ bonus that is added to the empirical estimates of the reward, where $n$ is a number of times this particular state (or a state-action pair) was visited. We show that, surprisingly, for a pure-exploration objective of reward-free exploration, bonuses that scale with $1/n$ bring faster learning rates, improving the known upper bounds with respect to the dependence on the horizon $H$. Furthermore, we show that with an improved analysis of the stopping time, we can improve by a factor $H$ the sample complexity in the best-policy identification setting, which is another pure-exploration objective, where the environment provides rewards but the agent is not penalized for its behavior during the exploration phase.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا