DocumentCode :
854389
Title :
Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays - Part II: Markovian rewards
Author :
Anantharam, Venkatachalam ; Varaiya, Pravin ; Walrand, Jean
Author_Institution :
Cornell University, Ithaca, NY, USA
Volume :
32
Issue :
11
fYear :
1987
fDate :
11/1/1987
Firstpage :
977
Lastpage :
982
Abstract :
At each instant of time we are required to sample a fixed number m \geq 1 out of N Markov chains whose stationary transition probability matrices belong to a family suitably parameterized by a real number \theta . The objective is to maximize the long-run expected value of the samples. The learning loss of a sampling scheme under a parameter configuration C = (\theta_{1}, \ldots, \theta_{N}) is quantified by the regret R_{n}(C) , the difference between the maximum expected reward that could be achieved if C were known and the expected reward actually achieved. We provide a lower bound on the regret of any uniformly good scheme, and construct a sampling scheme that attains the lower bound for every C . The lower bound is given explicitly in terms of the Kullback-Leibler number between pairs of transition probabilities.
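Illustration (not from the paper; a minimal sketch, assuming the standard definition of the Kullback-Leibler number between two irreducible finite-state transition matrices, I(\theta, \lambda) = \sum_{x} \pi_{\theta}(x) \sum_{y} p(x,y;\theta) \log [ p(x,y;\theta) / p(x,y;\lambda) ] , where \pi_{\theta} is the stationary distribution under \theta ; the function names below are hypothetical):

import numpy as np

def stationary_distribution(P):
    # Left eigenvector of P for eigenvalue 1, normalized to sum to 1
    # (valid for an irreducible stochastic matrix P).
    evals, evecs = np.linalg.eig(P.T)
    v = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return v / v.sum()

def kl_number(P_theta, P_lam):
    # I(theta, lambda) = sum_x pi_theta(x) sum_y P_theta(x,y) * log(P_theta(x,y)/P_lam(x,y)),
    # assuming P_lam(x,y) > 0 wherever P_theta(x,y) > 0; 0 * log 0 is treated as 0.
    pi = stationary_distribution(P_theta)
    mask = P_theta > 0
    terms = np.where(mask,
                     P_theta * np.log(np.where(mask, P_theta, 1.0) /
                                      np.where(mask, P_lam, 1.0)),
                     0.0)
    return float(pi @ terms.sum(axis=1))

# Example: two 2-state chains with different transition matrices.
P_theta = np.array([[0.9, 0.1],
                    [0.2, 0.8]])
P_lam = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
print(kl_number(P_theta, P_lam))  # strictly positive since the chains differ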
Keywords :
Adaptive control; Markov processes; Optimal stochastic control; Resource management; Stochastic optimal control; Arm; Computer science; Laboratories; Probability distribution; Random variables; Sampling methods; State-space methods; Statistical distributions; Statistics; Stochastic processes;
fLanguage :
English
Journal_Title :
Automatic Control, IEEE Transactions on
Publisher :
IEEE
ISSN :
0018-9286
Type :
jour
DOI :
10.1109/TAC.1987.1104485
Filename :
1104485