PNS: Population-Guided Novelty Search Learning Method for Reinforcement Learning

Reinforcement Learning (RL) has achieved remarkable success, but it still suffers from inadequate exploration strategies, sparse reward signals, and deceptive reward functions. These problems call for more efficient and directed exploration. To address them, a Population-guided Novelty Search (PNS) parallel learning method is proposed. In PNS, the population is divided into multiple sub-populations, each consisting of one chief agent and several exploring agents. The chief agent evaluates the policies learned by the exploring agents and shares the optimal policy with all sub-populations. The exploring agents learn their policies under the guidance of this optimal policy and simultaneously upload their policies to the chief agent. To balance exploration and exploitation, Novelty Search (NS) is employed in the chief agents to encourage policies with high novelty while maximizing per-episode performance. Together, the sub-population structure and the NS mechanism promote directed exploration and enable better policy search. In the numerical experiments, the proposed scheme is applied to the twin delayed deep deterministic policy gradient (TD3) algorithm, and its effectiveness in promoting exploration and improving performance is demonstrated in both continuous and discrete control domains. Notably, the proposed method achieves rewards that far exceed those of state-of-the-art (SOTA) methods in Delayed MuJoCo environments.
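The chief agent's role of trading off novelty against per-episode performance can be illustrated with a small sketch. The abstract does not specify the behavior characterization, the novelty metric, or how novelty and return are combined, so everything below is an assumption: novelty is the standard NS score (mean Euclidean distance to the k nearest behaviors in an archive), both signals are min-max normalized, and they are mixed by a hypothetical `weight` parameter. The names `Candidate`, `novelty`, and `chief_select` are illustrative only and do not come from the PNS paper; the TD3 learning performed by exploring agents is not shown.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """A policy uploaded by an exploring agent (hypothetical structure)."""
    name: str
    episode_return: float        # per-episode performance
    behavior: tuple[float, ...]  # assumed behavior characterization (BC)


def novelty(bc, archive, k=3):
    """Standard NS score: mean Euclidean distance from bc to its
    k nearest neighbours in the archive (0.0 if the archive is empty)."""
    if not archive:
        return 0.0
    dists = sorted(math.dist(bc, other) for other in archive)
    nearest = dists[:k]
    return sum(nearest) / len(nearest)


def chief_select(candidates, archive, weight=0.5, k=3):
    """Hypothetical chief-agent step: score each uploaded policy by a
    weighted sum of normalized return and normalized novelty, add the
    candidates' behaviors to the archive, and return the best policy."""
    returns = [c.episode_return for c in candidates]
    novs = [novelty(c.behavior, archive, k) for c in candidates]

    def normalize(xs):
        # Min-max scaling to [0, 1]; constant lists map to all zeros.
        lo, hi = min(xs), max(xs)
        return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]

    scores = [
        (1 - weight) * r + weight * n
        for r, n in zip(normalize(returns), normalize(novs))
    ]
    best = max(range(len(candidates)), key=scores.__getitem__)
    archive.extend(c.behavior for c in candidates)  # remember visited behaviors
    return candidates[best]


if __name__ == "__main__":
    archive = [(0.0, 0.0), (0.1, 0.0)]
    cands = [Candidate("a", 10.0, (0.05, 0.0)), Candidate("b", 9.0, (3.0, 4.0))]
    print(chief_select(cands, list(archive), weight=0.7).name)  # b
    print(chief_select(cands, list(archive), weight=0.2).name)  # a
```

With `weight=0.7` the behaviorally distant policy `b` is chosen despite its lower return, while with `weight=0.2` the higher-return policy `a` wins. In a full PNS-style setup, the policy selected by each chief would then be compared across sub-populations and shared back to the exploring agents as guidance; that step is omitted here.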
