DocumentCode
1535838
Title
A Structured Multiarmed Bandit Problem and the Greedy Policy
Author
Mersereau, Adam J.; Rusmevichientong, Paat; Tsitsiklis, John N.
Author_Institution
Kenan-Flagler Bus. Sch., Univ. of North Carolina, Chapel Hill, NC, USA
Volume
54
Issue
12
fYear
2009
Firstpage
2787
Lastpage
2802
Abstract
We consider a multiarmed bandit problem where the expected reward of each arm is a linear function of an unknown scalar with a prior distribution. The objective is to choose a sequence of arms that maximizes the expected total (or discounted total) reward. We demonstrate the effectiveness of a greedy policy that takes advantage of the known statistical correlation structure among the arms. In the infinite horizon discounted reward setting, we show that the greedy and optimal policies eventually coincide, and both settle on the best arm. This is in contrast with the Incomplete Learning Theorem for the case of independent arms. In the total reward setting, we show that the cumulative Bayes risk after T periods under the greedy policy is at most O(log T), which is smaller than the lower bound of Ω(log² T) established by Lai for a general, but different, class of bandit problems. We also establish the tightness of our bounds. Theoretical and numerical results show that the performance of our policy scales independently of the number of arms.
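The abstract's model admits a compact illustration. Below is a minimal sketch of the greedy policy it describes, assuming a Gaussian prior on the unknown scalar Z, Gaussian reward noise, and known arm coefficients u[i]; these modeling choices, the function name run_greedy_bandit, and all parameter names are illustrative assumptions, not the paper's notation.

    import numpy as np

    def run_greedy_bandit(u, z_true, T, mu0=0.0, var0=1.0, noise_var=1.0, seed=0):
        # Greedy policy for arms whose reward is u[i] * Z + Gaussian noise,
        # with a N(mu0, var0) prior on the unknown scalar Z. (Illustrative
        # sketch; names and distributional choices are assumptions.)
        rng = np.random.default_rng(seed)
        mu, var = mu0, var0  # posterior mean and variance of Z
        total_reward = 0.0
        for _ in range(T):
            # Greedy step: pull the arm with the largest posterior mean reward u[i] * mu.
            i = int(np.argmax(u * mu))
            x = u[i] * z_true + rng.normal(0.0, np.sqrt(noise_var))
            total_reward += x
            # Conjugate Gaussian update of the posterior on Z, since x ~ N(u[i] * Z, noise_var).
            prec = 1.0 / var + u[i] ** 2 / noise_var
            mu = (mu / var + u[i] * x / noise_var) / prec
            var = 1.0 / prec
        return total_reward, mu

For example, with u = np.array([0.5, -1.0, 2.0]) and z_true sampled from the prior, the posterior mean converges and the policy settles on a single arm, mirroring the coincidence result stated above; because every pull of an arm with a nonzero coefficient is informative about the shared scalar, the per-arm exploration cost that drives the Incomplete Learning Theorem for independent arms does not arise.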
Keywords
Bayes methods; Markov processes; greedy algorithms; statistical distributions; cumulative Bayes risk; discounted total reward; expected total reward; greedy policy; incomplete learning theorem; infinite horizon discounted reward setting; linear function; prior distribution; statistical correlation structure; structured multiarmed bandit; Markov decision process (MDP)
fLanguage
English
Journal_Title
IEEE Transactions on Automatic Control
Publisher
IEEE
ISSN
0018-9286
Type
jour
DOI
10.1109/TAC.2009.2031725
Filename
5308361