Title :
Simulation-based uniform value function estimates of discounted and average-reward MDPs
Author :
Jain, Rahul ; Varaiya, Pravin
Author_Institution :
Dept. of Electr. Eng. & Comput. Sci., Univ. of California, Berkeley, CA, USA
Abstract :
The value function of a Markov decision problem assigns to each policy its expected discounted reward. This expected reward can be estimated as the empirical average of the reward over many independent simulation runs. We derive bounds on the number of runs needed for the empirical average to converge to the expected reward uniformly over a class of policies, in terms of the Vapnik-Chervonenkis (VC) or pseudo-dimension of the policy class. Uniform convergence results are also obtained for the average-reward case, and they extend to partially observable MDPs (POMDPs) and Markov games. The results can be viewed as extending probably approximately correct (PAC) learning theory to POMDPs and Markov games.
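Illustration :
As an illustration of the estimator the abstract describes (not taken from the paper), the following Python sketch runs independent, truncated simulation rollouts of a small hypothetical MDP under one fixed policy and averages the discounted rewards. The toy MDP, the policy, and all parameter values are assumptions made for illustration; the single-policy Hoeffding-style bound at the end is only a baseline, whereas the paper's bounds hold uniformly over a policy class in terms of its VC or pseudo-dimension.

    # Sketch only: Monte Carlo estimate of a fixed policy's discounted value
    # in a hypothetical toy MDP (assumed, not from the paper).
    import math
    import random

    GAMMA = 0.9       # discount factor (assumed)
    HORIZON = 200     # truncation horizon; tail error <= GAMMA**HORIZON / (1 - GAMMA)
    N_STATES = 5      # size of the toy state space (assumed)

    def step(state, action, rng):
        """Toy transition/reward kernel: move left/right on a ring, reward = state index."""
        next_state = (state + (1 if action == 1 else -1)) % N_STATES
        if rng.random() < 0.1:                  # small amount of transition noise
            next_state = rng.randrange(N_STATES)
        reward = float(next_state)              # bounded reward in [0, N_STATES - 1]
        return next_state, reward

    def policy(state):
        """A fixed stationary policy: always move right (assumed for the sketch)."""
        return 1

    def discounted_return(rng):
        """One independent simulation run: truncated discounted reward along a trajectory."""
        state, total, discount = 0, 0.0, 1.0
        for _ in range(HORIZON):
            state, reward = step(state, policy(state), rng)
            total += discount * reward
            discount *= GAMMA
        return total

    def estimate_value(n_runs, seed=0):
        """Empirical average of the discounted reward over n_runs independent runs."""
        rng = random.Random(seed)
        return sum(discounted_return(rng) for _ in range(n_runs)) / n_runs

    if __name__ == "__main__":
        n = 2000
        v_hat = estimate_value(n)
        # Hoeffding-style bound for a SINGLE policy: returns lie in [0, R_max/(1-GAMMA)],
        # so |v_hat - E[v]| <= eps with probability >= 1 - delta once
        # n >= (range**2 / (2 * eps**2)) * log(2 / delta).  The paper strengthens this
        # to hold uniformly over a policy class, with n depending on its pseudo-dimension.
        value_range = (N_STATES - 1) / (1 - GAMMA)
        delta = 0.05
        eps = value_range * math.sqrt(math.log(2 / delta) / (2 * n))
        print(f"estimated value: {v_hat:.3f}  (+/- {eps:.3f} with prob. {1 - delta})")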
Keywords :
Markov processes; decision theory; game theory; Markov games; VC dimension; average-reward Markov decision problem; discounted-reward Markov decision problem; partially observable Markov decision problem; probably approximately correct learning theory; pseudo-dimension; simulation-based uniform value function estimates; uniform convergence; Computational modeling; Convergence; Dynamic programming; Game theory; Optimal control; State estimation; State-space methods; Stochastic processes
Conference_Title :
43rd IEEE Conference on Decision and Control (CDC), 2004
Print_ISBN :
0-7803-8682-5
DOI :
10.1109/CDC.2004.1429444