Title :
Reinforcement Learning for Partially Observable Dynamic Processes: Adaptive Dynamic Programming Using Measured Output Data
Author :
Lewis, F. L.; Vamvoudakis, Kyriakos G.
Author_Institution :
Autom. & Robot. Res. Inst., Univ. of Texas at Arlington, Fort Worth, TX, USA
Abstract :
Approximate dynamic programming (ADP) is a class of reinforcement learning methods that have proven important in a variety of applications, including feedback control of dynamical systems. ADP generally requires full information about the system's internal states, which is usually not available in practical situations. In this paper, we show how to implement ADP methods using only measured input/output data from the system. Deterministic linear dynamical systems, a class of great interest to the control systems community, are considered herein. In control system theory, such methods are referred to as output feedback (OPFB). The stochastic equivalent of the systems dealt with in this paper is a class of partially observable Markov decision processes. We develop both policy iteration and value iteration algorithms that converge to an optimal controller requiring only OPFB. It is shown that, as with Q-learning, the new methods have the important advantage that knowledge of the system dynamics is not needed to implement these learning algorithms or the resulting OPFB control. Only the order of the system and an upper bound on its "observability index" must be known. The learned OPFB controller takes the form of a polynomial autoregressive moving-average controller whose performance is equivalent to that of the optimal state variable feedback gain.
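To make the policy iteration idea concrete, the following is a minimal sketch in the classical full-state discrete-time LQR setting. It is not the paper's algorithm: it uses the model matrices (A, B) directly (Hewer-style iteration), whereas the paper's contribution is performing the policy evaluation step from measured input/output data alone. The example system, solver choices, and tolerances below are all illustrative assumptions.

```python
# Illustrative sketch only: model-based policy iteration for discrete-time LQR.
# The paper replaces the model-based evaluation step with least-squares fits
# to measured input/output data; here we use (A, B) for brevity.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov, solve_discrete_are

# Hypothetical example system (assumption; any stabilizable (A, B) works)
A = np.array([[0.9, 0.2],
              [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)          # state cost weight
R = np.array([[1.0]])  # control cost weight

K = np.zeros((1, 2))   # initial stabilizing gain (A itself is stable here)
for _ in range(50):
    # Policy evaluation: solve P = (A - B K)^T P (A - B K) + Q + K^T R K
    Ac = A - B @ K
    P = solve_discrete_lyapunov(Ac.T, Q + K.T @ R @ K)
    # Policy improvement: K <- (R + B^T P B)^{-1} B^T P A
    K_new = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    if np.linalg.norm(K_new - K) < 1e-10:
        K = K_new
        break
    K = K_new

# The fixed point coincides with the discrete-time algebraic Riccati solution.
P_star = solve_discrete_are(A, B, Q, R)
K_star = np.linalg.solve(R + B.T @ P_star @ B, B.T @ P_star @ A)
print("PI gain:     ", K)
print("Riccati gain:", K_star)
```

In the paper's OPFB setting the state x is not measurable, so the value function is instead parameterized over a finite window of past inputs and outputs, with the window length bounded by the system's observability index mentioned above; the iteration structure (evaluate, then improve) is the same.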
Keywords :
Markov processes; autoregressive moving average processes; control engineering computing; data handling; decision theory; dynamic programming; feedback; iterative methods; knowledge based systems; learning (artificial intelligence); observability; optimal control; polynomial approximation; adaptive dynamic programming; stochastic equivalent system; control system community; control system theory; decision processes; deterministic behavior; iteration algorithm; learning algorithm; linear dynamical system; observable Markov decision process; optimal controller; optimal state variable feedback gain; output feedback; partially observable dynamic processes; polynomial autoregressive moving average controller; reinforcement learning; Control systems; Dynamic programming; Feedback control; Learning; Optimal control; Output feedback; Polynomials; State feedback; Stochastic systems; Upper bound; Approximate dynamic programming (ADP); data-based optimal control; output feedback (OPFB); policy iteration (PI); value iteration (VI); Algorithms; Artificial Intelligence; Feedback; Learning; Markov Chains; Reinforcement (Psychology)
Journal_Title :
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics
DOI :
10.1109/TSMCB.2010.2043839