
Learning model-based strategies in simple environments with hierarchical Q-networks

Posted by: Alp Müyesser
Publication date: 2018
Research field: Informatics Engineering
Paper language: English





Recent advances in deep learning have allowed artificial agents to rival human-level performance on a wide range of complex tasks; however, the ability of these networks to learn generalizable strategies remains a pressing challenge. This critical limitation is due in part to two factors: the opaque information representation in deep neural networks and the complexity of the task environments in which they are typically deployed. Here we propose a novel Hierarchical Q-Network (HQN), motivated by theories of the hierarchical organization of the human prefrontal cortex, that attempts to identify lower-dimensional patterns in the value landscape that can be exploited to construct an internal model of rules in simple environments. We draw on combinatorial games, where there exists a single optimal strategy for winning that generalizes across other features of the game, to probe the strategy generalization of the HQN and other reinforcement learning (RL) agents using variations of Wythoff's game. Traditional RL approaches failed to reach satisfactory performance on variants of Wythoff's game; however, the HQN learned heuristic-like strategies that generalized across changes in board configuration. More importantly, the HQN allowed for transparent inspection of the agent's internal model of the game following training. Our results show how a biologically inspired hierarchical learner can facilitate learning abstract rules to promote robust and flexible action policies in simplified training environments with clearly delineated optimal strategies.
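For context, the "single optimal strategy" in Wythoff's game has a known closed form: the losing ("cold") positions are the pairs (⌊nφ⌋, ⌊nφ²⌋) for the golden ratio φ, and optimal play simply moves to a cold position. A minimal sketch of that rule (our illustration, not the paper's code):

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio

def is_cold(x, y):
    """Return True if (x, y) is a losing ("cold") position in Wythoff's game.

    Cold positions are the pairs (floor(n*phi), floor(n*phi^2)) for n >= 0;
    since phi^2 = phi + 1, a cold pair's difference equals its index n.
    The generalizable strategy is simply: always move to a cold position.
    """
    a, b = min(x, y), max(x, y)
    n = b - a
    return a == math.floor(n * PHI)

# Known cold positions: (0, 0), (1, 2), (3, 5), (4, 7), (6, 10)
assert all(is_cold(*p) for p in [(0, 0), (1, 2), (3, 5), (4, 7), (6, 10)])
assert not is_cold(2, 2)
```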



Read also

A* search is an informed search algorithm that uses a heuristic function to guide the order in which nodes are expanded. Since the computation required to expand a node and compute the heuristic values for all of its generated children grows linearly with the size of the action space, A* search can become impractical for problems with large action spaces. This computational burden becomes even more apparent when heuristic functions are learned by general, but computationally expensive, deep neural networks. To address this problem, we introduce DeepCubeAQ, a deep reinforcement learning and search algorithm that builds on the DeepCubeA algorithm and deep Q-networks. DeepCubeAQ learns a heuristic function that, with a single forward pass through a deep neural network, computes the sum of the transition cost and the heuristic value of all of the children of a node without explicitly generating any of the children, eliminating the need for node expansions. DeepCubeAQ then uses a novel variant of A* search, called AQ* search, that uses the deep Q-network to guide search. We use DeepCubeAQ to solve the Rubik's cube when formulated with a large action space that includes 1872 meta-actions and show that this 157-fold increase in the size of the action space incurs less than a 4-fold increase in computation time when performing AQ* search and that AQ* search is orders of magnitude faster than A* search.
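A minimal sketch of the AQ* idea described above: a single Q-network evaluation on the parent supplies f-values for all children, so a child state is only materialized when it is popped. The interfaces `q_net`, `apply_action`, and `is_goal`, and the unit-cost assumption for the path cost g, are illustrative placeholders, not the authors' implementation:

```python
import heapq
from itertools import count

def aq_star(start, q_net, apply_action, is_goal, num_actions):
    """Sketch of AQ* search: children are enqueued using Q-values from one
    forward pass on the parent, with no node expansion at push time."""
    tie = count()  # tie-breaker so heapq never compares raw states
    # Entries: (f, tiebreak, g_of_child, parent_state, action_into_child)
    open_list = [(0.0, next(tie), 0.0, None, None)]
    closed = set()
    while open_list:
        f, _, g, parent, action = heapq.heappop(open_list)
        # The child state is generated lazily, only when popped.
        state = start if parent is None else apply_action(parent, action)
        if is_goal(state):
            return g
        if state in closed:
            continue
        closed.add(state)
        q = q_net(state)  # q[a] ~ cost(state, a) + h(child_a), one pass
        for a in range(num_actions):
            # f(child) = g(state) + q[a]; g assumes unit move costs here.
            heapq.heappush(open_list, (g + q[a], next(tie), g + 1.0, state, a))
    return None
```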
We propose a simple SIR model in order to investigate the impact of various confinement strategies on a most virulent epidemic. Our approach is motivated by the current COVID-19 pandemic. The main hypothesis is the existence of two populations of susceptible persons, one which obeys confinement and for which the infection rate does not exceed 1, and a population which, being non confined for various imperatives, can be substantially more infective. The model, initially formulated as a differential system, is discretised following a specific procedure, the discrete system serving as an integrator for the differential one. Our model is calibrated so as to correspond to what is observed in the COVID-19 epidemic. Several conclusions can be reached, despite the very simple structure of our model. First, it is not possible to pinpoint the genesis of the epidemic by just analysing data from when the epidemic is in full swing. It may well turn out that the epidemic has reached a sizeable part of the world months before it became noticeable. Concerning the confinement scenarios, a universal feature of all our simulations is that relaxing the lockdown constraints leads to a rekindling of the epidemic. Thus we sought the conditions for the second epidemic peak to be lower than the first one. This is possible in all the scenarios considered (abrupt, progressive or stepwise exit) but typically a progressive exit can start earlier than an abrupt one. However, by the time the progressive exit is complete, the overall confinement times are not too different. From our results, the most promising strategy is that of a stepwise exit. And in fact its implementation could be quite feasible, with the major part of the population (minus the fragile groups) exiting simultaneously but obeying rigorous distancing constraints.
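The abstract does not reproduce the paper's specific discretisation procedure, but the two-population structure can be illustrated with a generic explicit update step; the parameter names and the simple Euler-style scheme below are assumptions for illustration only:

```python
def sir_step(s_c, s_n, i, r, beta_c, beta_n, gamma, dt=1.0):
    """One explicit step of a generic two-group SIR model: s_c is the
    confined susceptible fraction (low infection rate beta_c), s_n the
    non-confined fraction (higher beta_n); i and r are the infective and
    removed fractions. Not the paper's specific integrator."""
    new_c = beta_c * s_c * i * dt  # new infections among the confined
    new_n = beta_n * s_n * i * dt  # new infections among the non-confined
    rec = gamma * i * dt           # removals (recovery or death)
    return s_c - new_c, s_n - new_n, i + new_c + new_n - rec, r + rec
```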
Efficient decision-making over continuously changing data is essential for many application domains such as cyber-physical systems, industry digitalization, etc. Modern stream reasoning frameworks allow one to model and solve various real-world problems using incremental and continuous evaluation of programs as new data arrives in the stream. Applied techniques use, e.g., Datalog-like materialization or truth maintenance algorithms to avoid costly re-computations, thus ensuring low latency and high throughput of a stream reasoner. However, the expressiveness of existing approaches is quite limited and, e.g., they cannot be used to encode problems with constraints, which often appear in practice. In this paper, we suggest a novel approach that uses Conflict-Driven Constraint Learning (CDCL) to efficiently update legacy solutions by using intelligent management of learned constraints. In particular, we study the applicability of reinforcement learning to continuously assess the utility of learned constraints computed in previous invocations of the solving algorithm for the current one. Evaluations conducted on real-world reconfiguration problems show that providing a CDCL algorithm with relevant learned constraints from previous iterations results in significant performance improvements of the algorithm in stream reasoning scenarios. Under consideration for acceptance in TPLP.
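One plausible reading of "continuously assess the utility of learned constraints" is a bandit-style running score per constraint across solver invocations; the scheme below is an assumed sketch for illustration, not the paper's algorithm:

```python
def update_utilities(utilities, used, alpha=0.1, keep=1000):
    """Exponentially average a reward of 1 for constraints that participated
    in the last solver invocation and 0 otherwise, then retain only the
    `keep` highest-utility constraints for the next invocation."""
    for c in utilities:
        reward = 1.0 if c in used else 0.0
        utilities[c] += alpha * (reward - utilities[c])
    top = sorted(utilities, key=utilities.get, reverse=True)[:keep]
    return {c: utilities[c] for c in top}
```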
Learning to navigate in complex environments with dynamic elements is an important milestone in developing AI agents. In this work we formulate the navigation question as a reinforcement learning problem and show that data efficiency and task performance can be dramatically improved by relying on additional auxiliary tasks leveraging multimodal sensory inputs. In particular we consider jointly learning the goal-driven reinforcement learning problem with auxiliary depth prediction and loop closure classification tasks. This approach can learn to navigate from raw sensory input in complicated 3D mazes, approaching human-level performance even under conditions where the goal location changes frequently. We provide detailed analysis of the agent behaviour, its ability to localise, and its network activity dynamics, showing that the agent implicitly learns key navigation abilities.
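The joint objective described above can be sketched as a weighted sum of the RL loss and the two auxiliary losses; the loss forms chosen here (plain regression for depth, binary classification for loop closure) and the weights `w_depth` and `w_loop` are illustrative assumptions, not the paper's exact formulation:

```python
import torch.nn.functional as F

def joint_loss(rl_loss, depth_pred, depth_target, loop_logits, loop_label,
               w_depth=0.1, w_loop=0.1):
    """Combine the goal-driven RL loss with the auxiliary depth-prediction
    and loop-closure-classification losses from the abstract."""
    depth_loss = F.mse_loss(depth_pred, depth_target)
    loop_loss = F.binary_cross_entropy_with_logits(loop_logits, loop_label)
    return rl_loss + w_depth * depth_loss + w_loop * loop_loss
```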
An improvement of Q-learning is proposed in this paper. It differs from classic Q-learning in that the similarity between different states and actions is considered in the proposed method. During training, a new updating mechanism is used, in which the Q values of similar state-action pairs are updated synchronously. The proposed method can be used in combination with both tabular Q-learning and deep Q-learning. The results of numerical examples illustrate that, compared to classic Q-learning, the proposed method performs significantly better.
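A sketch of the synchronous update idea for the tabular case: the temporal-difference error computed at the visited pair is propagated to every sufficiently similar pair, scaled by its similarity. The similarity function `sim` and the threshold are placeholders, not the paper's exact scheme:

```python
def similar_update(Q, s, a, reward, s_next, pairs, sim,
                   alpha=0.5, gamma=0.9, threshold=0.8):
    """Tabular Q-learning step that also updates similar state-action pairs.

    Q is a dict of dicts (state -> action -> value) with s_next present;
    pairs is an iterable of known (state, action) pairs including (s, a)
    itself, so the visited pair receives the full update when sim == 1."""
    target = reward + gamma * max(Q[s_next].values())
    td = target - Q[s][a]  # temporal-difference error at the visited pair
    for (s2, a2) in pairs:
        w = sim((s, a), (s2, a2))  # similarity in [0, 1]
        if w >= threshold:
            Q[s2][a2] += alpha * w * td
```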
