• DocumentCode
    2832323
  • Title
    PAC bounds for simulation-based optimization of Markov decision processes

  • Author
    Jain, Rahul

  • Author_Institution
    IBM T. J. Watson Research Center, Hawthorne, NY
  • fYear
    2007
  • fDate
    12-14 Dec. 2007
  • Firstpage
    3466
  • Lastpage
    3471
  • Abstract
    We generalize the PAC learning framework for Markov decision processes developed in [18]. We allow the reward function to depend on both the state and the action, and both the state and action spaces may be countably infinite. We obtain an estimate for the value function of a Markov decision process, which assigns to each policy its expected discounted reward, by taking the empirical average of the reward over many independent simulation runs. We derive bounds on the number of runs needed for the empirical average to converge to the expected reward uniformly over a class of policies, in terms of the V-C or pseudo-dimension of the policy class. We then propose a framework for obtaining an ε-optimal policy from simulation and give the sample complexity of this approach.
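
    A minimal sketch, in Python, of the scheme the abstract describes: the value of a policy (its expected discounted reward) is approximated by the empirical average of a truncated discounted reward over independent simulation runs, and a Hoeffding-style calculation gives a sufficient number of runs for a single fixed policy. The toy chain MDP, the names step, reward, and policy, and all parameter values below are illustrative assumptions; the paper's uniform bounds over a policy class (via the V-C or pseudo-dimension) are not reproduced here.

      import math
      import random

      def estimate_value(step, reward, policy, s0, gamma=0.95,
                         n_runs=2000, horizon=100):
          # Empirical average of the discounted reward over n_runs
          # independent simulated trajectories; truncating at `horizon`
          # drops a tail of order gamma**horizon / (1 - gamma).
          total = 0.0
          for _ in range(n_runs):
              s, acc = s0, 0.0
              for t in range(horizon):
                  a = policy(s)
                  acc += (gamma ** t) * reward(s, a)  # reward depends on (s, a)
                  s = step(s, a)
              total += acc
          return total / n_runs

      def runs_needed(eps, delta, r_max=1.0, gamma=0.95):
          # Hoeffding bound for ONE fixed policy (an assumption; the paper's
          # result is uniform over a policy class). Per-step rewards in
          # [0, r_max] give discounted sums in [0, r_max / (1 - gamma)], so
          # n >= (b**2 / (2 * eps**2)) * ln(2 / delta) runs suffice for an
          # eps-accurate estimate with probability at least 1 - delta.
          b = r_max / (1.0 - gamma)
          return math.ceil(b * b / (2.0 * eps * eps) * math.log(2.0 / delta))

      # Toy example: a random walk on the integers whose reward favors
      # staying near the origin, with a policy that drifts back toward 0.
      step = lambda s, a: s + a + random.choice((-1, 0, 1))
      reward = lambda s, a: 1.0 / (1.0 + abs(s))
      policy = lambda s: -1 if s > 0 else 1
      print(estimate_value(step, reward, policy, s0=0))
      print(runs_needed(eps=0.1, delta=0.05))
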
  • Keywords
    Markov processes; convergence; optimisation; Markov decision processes; PAC learning; simulation-based optimization; Dynamic programming; Upper bound
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    2007 46th IEEE Conference on Decision and Control
  • Conference_Location
    New Orleans, LA
  • ISSN
    0191-2216
  • Print_ISBN
    978-1-4244-1497-0
  • Type
    conf
  • DOI
    10.1109/CDC.2007.4435050
  • Filename
    4435050