DocumentCode :
1047721
Title :
Kernel-Based Least Squares Policy Iteration for Reinforcement Learning
Author :
Xu, Xin ; Hu, Dewen ; Lu, Xicheng
Author_Institution :
National University of Defense Technology, Changsha
Volume :
18
Issue :
4
fYear :
2007
fDate :
7/1/2007
Firstpage :
973
Lastpage :
992
Abstract :
In this paper, we present a kernel-based least squares policy iteration (KLSPI) algorithm for reinforcement learning (RL) in large or continuous state spaces, which can be used to realize adaptive feedback control of uncertain dynamic systems. By using KLSPI, near-optimal control policies can be obtained without much a priori knowledge of the dynamic models of control plants. In KLSPI, Mercer kernels are used in the policy evaluation step of a policy iteration process, where a new kernel-based least squares temporal-difference algorithm called KLSTD-Q is proposed for efficient policy evaluation. To preserve the sparsity and improve the generalization ability of KLSTD-Q solutions, a kernel sparsification procedure based on approximate linear dependency (ALD) is performed. Compared with previous work on approximate RL methods, KLSPI makes two advances that address the main difficulties of existing approaches. One is an improved convergence and (near) optimality guarantee, obtained by using the KLSTD-Q algorithm for high-precision policy evaluation. The other is automatic feature selection through the ALD-based kernel sparsification. Therefore, the KLSPI algorithm provides a general RL method with both generalization ability and a convergence guarantee for large-scale Markov decision problems (MDPs). Experimental results on a typical RL task, a stochastic chain problem, demonstrate that KLSPI consistently achieves better learning efficiency and policy quality than the previous least squares policy iteration (LSPI) algorithm. Furthermore, the KLSPI method was also evaluated on two nonlinear feedback control problems: a ship heading control problem and the swing-up control of a double-link underactuated pendulum (the acrobot). Simulation results illustrate that the proposed method can optimize controller performance using little a priori information about the uncertain dynamic systems. It is also demonstrated that KLSPI can be applied to online learning control by incorporating an initial controller to ensure online performance.
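The ALD-based kernel sparsification mentioned in the abstract admits a compact illustration. The sketch below is not the authors' code; it assumes a Gaussian (Mercer) kernel, an illustrative threshold nu, and synthetic state samples, and only shows the dictionary-construction step: a sample is kept as a kernel feature only if it is not approximately linearly dependent, in feature space, on the samples already in the dictionary.

import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (Mercer) kernel between two state(-action) vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def ald_dictionary(samples, nu=1e-3, kernel=rbf_kernel):
    """Build a sparse dictionary via the ALD test (hypothetical parameters)."""
    dictionary = []
    for x in samples:
        if not dictionary:
            dictionary.append(x)
            continue
        # Kernel matrix of the current dictionary and kernel vector for x.
        K = np.array([[kernel(xi, xj) for xj in dictionary] for xi in dictionary])
        k_vec = np.array([kernel(xi, x) for xi in dictionary])
        # ALD test: delta = k(x, x) - k_vec^T K^{-1} k_vec;
        # a small ridge term keeps the solve numerically stable.
        coeffs = np.linalg.solve(K + 1e-9 * np.eye(len(dictionary)), k_vec)
        delta = kernel(x, x) - k_vec @ coeffs
        if delta > nu:
            dictionary.append(x)
    return dictionary

# Example: states sampled from a continuous 2-D state space (illustrative data).
rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=(200, 2))
D = ald_dictionary(states, nu=0.05)
print(f"Kept {len(D)} of {len(states)} samples as kernel features.")

In KLSPI, a dictionary built this way supplies the kernel features used by KLSTD-Q during policy evaluation, which is how the method keeps its solutions sparse while retaining generalization ability.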
Keywords :
Markov processes; adaptive control; feedback; iterative methods; learning (artificial intelligence); least squares approximations; nonlinear control systems; optimal control; state-space methods; uncertain systems; Mercer kernels; acrobot; adaptive feedback control; approximate linear dependency; continuous state spaces; control plants; double-link underactuated pendulum; generalization ability; kernel sparsification procedure; kernel-based least squares policy iteration; kernel-based least squares temporal-difference algorithm; large-scale Markov decision problems; least squares policy iteration algorithm; near-optimal control policy; nonlinear feedback control; online learning control; policy evaluation; reinforcement learning; ship heading control problem; stochastic chain problem; swing up control; uncertain dynamic systems; Adaptive control; Convergence; Feedback control; Kernel; Learning; Least squares approximation; Least squares methods; Linear approximation; Programmable control; State-space methods; Approximate dynamic programming; Markov decision problems (MDPs); kernel methods; least squares; reinforcement learning (RL); Algorithms; Artificial Intelligence; Biomimetics; Computer Simulation; Decision Support Techniques; Feedback; Least-Squares Analysis; Markov Chains; Models, Theoretical; Reinforcement (Psychology);
fLanguage :
English
Journal_Title :
IEEE Transactions on Neural Networks
Publisher :
IEEE
ISSN :
1045-9227
Type :
jour
DOI :
10.1109/TNN.2007.899161
Filename :
4267723