Title :
Policy Improvement by a Model-Free Dyna Architecture
Author :
Kao-Shing Hwang ; Chia-Yue Lo
Author_Institution :
Dept. of Electr. Eng., Nat. Sun Yat-Sen Univ., Kaohsiung, Taiwan
Abstract :
The objective of this paper is to accelerate the process of policy improvement in reinforcement learning. The proposed Dyna-style system combines two learning schemes: one uses a temporal difference method for direct learning; the other uses relative values for indirect learning during planning between two successive direct learning cycles. Instead of building a complicated world model, the approach adds a simple predictor of average rewards to an actor-critic architecture in the simulation (planning) mode. The relative value of a state, defined as the accumulated difference between the immediate reward and the average reward, is used to steer the improvement process in the right direction. The proposed learning scheme is applied to control a pendulum system tracking a desired trajectory to demonstrate its adaptability and robustness. Using reinforcement signals from the environment, the system takes appropriate actions to drive an unknown dynamic system to track the desired outputs within a few learning cycles. In labyrinth-exploration experiments, the proposed model-free method is compared with a connectionist adaptive heuristic critic and an advanced Dyna-Q learning method, and it outperforms both counterparts in terms of elapsed time and convergence rate.
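The abstract's central quantity is the relative value of a state, accumulated as the difference between the immediate reward and a running estimate of the average reward. The following is a minimal, hypothetical Python sketch of that bookkeeping only; the class name, step-size parameters, and update rule are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

class RelativeValuePredictor:
    """Hypothetical sketch: accumulate (reward - average reward) per state,
    the kind of relative value the abstract says is used to steer planning
    between direct temporal-difference learning cycles."""

    def __init__(self, alpha=0.1, beta=0.01):
        self.alpha = alpha                    # step size for relative values (assumed)
        self.beta = beta                      # step size for the average-reward predictor (assumed)
        self.avg_reward = 0.0                 # simple predictor of average reward
        self.rel_value = defaultdict(float)   # relative value per state

    def observe(self, state, reward):
        # Update the running average-reward predictor.
        self.avg_reward += self.beta * (reward - self.avg_reward)
        # Accumulate the difference between immediate and average reward.
        self.rel_value[state] += self.alpha * (reward - self.avg_reward)

    def planning_signal(self, state):
        # Relative value that a planning step could use to bias improvement.
        return self.rel_value[state]

# Toy usage: a state whose rewards exceed the average accrues positive relative value.
predictor = RelativeValuePredictor()
for _ in range(1000):
    s = random.choice(["A", "B"])
    r = 1.0 if s == "A" else 0.0
    predictor.observe(s, r)
print(predictor.avg_reward, predictor.planning_signal("A"), predictor.planning_signal("B"))
```

In this toy run, state "A" ends with a positive relative value and "B" with a negative one, which is the qualitative behavior the abstract relies on to steer policy improvement without a learned world model.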
Keywords :
learning (artificial intelligence); nonlinear control systems; pendulums; robust control; trajectory control; Dyna-Q learning; Dyna-style system; actor-critic architecture; adaptability; connectionist adaptive heuristic critic; convergence rate; elapsed time; indirect learning; labyrinth exploration; learning cycles; learning schemes; model-free Dyna architecture; pendulum system control; policy improvement; policy improvement process; reinforcement learning; reinforcement signals; simulation mode; successive direct learning cycles; temporal difference method; trajectory tracking; Acceleration; Computational modeling; Computer architecture; Learning systems; Neural networks; Predictive models; Trajectory; Critic-actor structure; Dyna-style reinforcement learning; POMDP; temporal difference;
Journal_Title :
Neural Networks and Learning Systems, IEEE Transactions on
DOI :
10.1109/TNNLS.2013.2244100