• DocumentCode
    110096
  • Title

    Online Markov Decision Processes Under Bandit Feedback

  • Author

Neu, Gergely ; György, András ; Szepesvári, Csaba ; Antos, András

  • Author_Institution
SequeL Team, INRIA Lille - Nord Europe, Villeneuve d'Ascq, France
  • Volume
    59
  • Issue
    3
  • fYear
    2014
  • fDate
March 2014
  • Firstpage
    676
  • Lastpage
    691
  • Abstract
    We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in hindsight in terms of the total reward received. Specifically, in each time step the agent observes the current state and the reward associated with the last transition; however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state-of-the-art result for this setting is an algorithm with an expected regret of O(T^{2/3} ln T). In this paper, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of this algorithm (more precisely, a slightly modified version thereof) is O(T^{1/2} ln T), giving the first rigorously proven, essentially tight regret bound for the problem.
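The interaction protocol described in the abstract can be illustrated with a toy simulation. The sketch below is a hypothetical minimal instance (state space, transition probabilities, and the epsilon-greedy learner are all illustrative assumptions, not the paper's algorithm): an oblivious adversary fixes a reward function for every time step in advance, the agent knows the transitions, and after each step it observes only the reward of the state-action pair it actually visited (bandit feedback).

```python
import random

random.seed(0)
states, actions = [0, 1], [0, 1]
T = 1000

# Known transition probabilities: P[s][a] is a distribution over next states.
P = {0: {0: [0.9, 0.1], 1: [0.1, 0.9]},
     1: {0: [0.5, 0.5], 1: [0.8, 0.2]}}

# Oblivious adversary: reward functions r_t(s, a) in [0, 1] fixed up front,
# before the interaction starts (they cannot react to the agent's choices).
rewards = [{(s, a): random.random() for s in states for a in actions}
           for _ in range(T)]

# A naive epsilon-greedy learner (illustrative only, NOT the paper's method):
# it maintains running averages of per-pair rewards from bandit feedback.
est = {(s, a): 0.0 for s in states for a in actions}
cnt = {(s, a): 0 for s in states for a in actions}

s, total = 0, 0.0
for t in range(T):
    if random.random() < 0.1:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda b: est[(s, b)])
    r = rewards[t][(s, a)]   # bandit feedback: only this reward is revealed
    cnt[(s, a)] += 1
    est[(s, a)] += (r - est[(s, a)]) / cnt[(s, a)]
    total += r
    s = random.choices(states, weights=P[s][a])[0]

print(round(total / T, 3))   # average per-step reward collected by the agent
```

Regret, as in the abstract, would compare `total` against the total reward of the best stationary policy in hindsight; the paper's contribution is an algorithm whose expected gap grows only as O(T^{1/2} ln T).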
  • Keywords
    Markov processes; decision theory; feedback; learning (artificial intelligence); multi-agent systems; probability; bandit feedback; finite stochastic Markovian environments; learning agent; online Markov decision processes; online learning; reward function; stationary policy; total reward received; transition probabilities; adversarial environment; Markov decision process; robust control
  • fLanguage
    English
  • Journal_Title
IEEE Transactions on Automatic Control
  • Publisher
IEEE
  • ISSN
    0018-9286
  • Type

    jour

  • DOI
    10.1109/TAC.2013.2292137
  • Filename
    6675002