Title :
Policy Improvement by a Model-Free Dyna Architecture
Author :
Kao-Shing Hwang ; Chia-Yue Lo
Author_Institution :
Dept. of Electr. Eng., Nat. Sun Yat-Sen Univ., Kaohsiung, Taiwan
Abstract :
The objective of this paper is to accelerate the process of policy improvement in reinforcement learning. The proposed Dyna-style system combines two learning schemes: one uses a temporal difference method for direct learning; the other uses relative values for indirect learning during planning between two successive direct learning cycles. Instead of building a complicated world model, the approach adds a simple predictor of average rewards to an actor-critic architecture in the simulation (planning) mode. The relative value of a state, defined as the accumulated difference between the immediate reward and the average reward, is used to steer the improvement process in the right direction. The proposed learning scheme is applied to control a pendulum system tracking a desired trajectory to demonstrate its adaptability and robustness. Using reinforcement signals from the environment, the system takes appropriate actions to drive an unknown dynamic system to track the desired outputs within a few learning cycles. In labyrinth-exploration experiments, the proposed model-free method is compared with a connectionist adaptive heuristic critic and an advanced Dyna-Q learning method, and it outperforms both counterparts in terms of elapsed time and convergence rate.
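The abstract's central quantity is the relative value of a state, accumulated as the difference between the immediate reward and a running estimate of the average reward. The following is a minimal, hypothetical Python sketch of that bookkeeping only; the class name, step-size parameters, and update rule are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

class RelativeValuePredictor:
    """Hypothetical sketch: accumulate (reward - average reward) per state,
    the kind of relative value the abstract says is used to steer planning
    between direct temporal-difference learning cycles."""

    def __init__(self, alpha=0.1, beta=0.01):
        self.alpha = alpha                    # step size for relative values (assumed)
        self.beta = beta                      # step size for the average-reward predictor (assumed)
        self.avg_reward = 0.0                 # simple predictor of average reward
        self.rel_value = defaultdict(float)   # relative value per state

    def observe(self, state, reward):
        # Update the running average-reward predictor.
        self.avg_reward += self.beta * (reward - self.avg_reward)
        # Accumulate the difference between immediate and average reward.
        self.rel_value[state] += self.alpha * (reward - self.avg_reward)

    def planning_signal(self, state):
        # Relative value that a planning step could use to bias improvement.
        return self.rel_value[state]

# Toy usage: a state whose rewards exceed the average accrues positive relative value.
predictor = RelativeValuePredictor()
for _ in range(1000):
    s = random.choice(["A", "B"])
    r = 1.0 if s == "A" else 0.0
    predictor.observe(s, r)
print(predictor.avg_reward, predictor.planning_signal("A"), predictor.planning_signal("B"))
```

In this toy run, state "A" ends with a positive relative value and "B" with a negative one, which is the qualitative behavior the abstract relies on to steer policy improvement without a learned world model.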
Keywords :
learning (artificial intelligence); nonlinear control systems; pendulums; robust control; trajectory control; Dyna-Q learning; Dyna-style system; actor-critic architecture; adaptability; connectionist adaptive heuristic critic; convergence rate; elapsed time; indirect learning; labyrinth exploration; learning cycles; learning schemes; model-free Dyna architecture; pendulum system control; policy improvement; policy improvement process; reinforcement learning; reinforcement signals; simulation mode; successive direct learning cycles; temporal difference method; trajectory tracking; Acceleration; Computational modeling; Computer architecture; Learning systems; Neural networks; Predictive models; Trajectory; Critic-actor structure; Dyna-style reinforcement learning; POMDP; temporal difference;
Journal_Title :
Neural Networks and Learning Systems, IEEE Transactions on
DOI :
10.1109/TNNLS.2013.2244100