A Structured Multiarmed Bandit Problem and the Greedy Policy

Author

Mersereau, Adam J. ; Rusmevichientong, Paat ; Tsitsiklis, John N.

Author_Institution

Kenan-Flagler Bus. Sch., Univ. of North Carolina, Chapel Hill, NC, USA

Volume

54

Issue

12

fYear

2009

Firstpage

2787

Lastpage

2802

Abstract

We consider a multiarmed bandit problem where the expected reward of each arm is a linear function of an unknown scalar with a prior distribution. The objective is to choose a sequence of arms that maximizes the expected total (or discounted total) reward. We demonstrate the effectiveness of a greedy policy that takes advantage of the known statistical correlation structure among the arms. In the infinite horizon discounted reward setting, we show that the greedy and optimal policies eventually coincide, and both settle on the best arm. This is in contrast with the Incomplete Learning Theorem for the case of independent arms. In the total reward setting, we show that the cumulative Bayes risk after T periods under the greedy policy is at most O(log T), which is smaller than the lower bound of Ω(log² T) established by Lai for a general, but different, class of bandit problems. We also establish the tightness of our bounds. Theoretical and numerical results show that the performance of our policy scales independently of the number of arms.
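
As an illustration of the kind of greedy policy the abstract describes, the following minimal Python sketch simulates a bandit in which arm m has expected reward eta[m] + u[m] * z for an unknown scalar z. It is not taken from the paper: the Gaussian prior on z, the Gaussian reward noise, the conjugate posterior update, and all names (`greedy_structured_bandit`, `eta`, `u`, `z_true`, `noise_var`) are assumptions made for the example. At each period the policy pulls the arm with the largest expected reward under the current posterior mean of z, then updates that posterior from the observed reward.

```python
import random


def greedy_structured_bandit(eta, u, z_true, horizon,
                             prior_mean=0.0, prior_var=1.0,
                             noise_var=1.0, seed=0):
    """Simulate a greedy policy on a linearly structured bandit.

    Assumed model (illustrative only): pulling arm m yields
    eta[m] + u[m] * z_true + Gaussian noise with variance noise_var,
    and z has a Gaussian prior N(prior_mean, prior_var).
    Returns the list of pulled arms and the final posterior mean and
    variance of z.
    """
    rng = random.Random(seed)
    mean = prior_mean
    precision = 1.0 / prior_var  # posterior precision of z
    choices = []
    for _ in range(horizon):
        # Greedy step: maximize expected reward under the posterior mean.
        arm = max(range(len(eta)), key=lambda m: eta[m] + u[m] * mean)
        reward = eta[arm] + u[arm] * z_true + rng.gauss(0.0, noise_var ** 0.5)
        # Conjugate Gaussian update: reward - eta[arm] = u[arm] * z + noise.
        new_precision = precision + u[arm] ** 2 / noise_var
        mean = (precision * mean
                + u[arm] * (reward - eta[arm]) / noise_var) / new_precision
        precision = new_precision
        choices.append(arm)
    return choices, mean, 1.0 / precision


if __name__ == "__main__":
    arms, post_mean, post_var = greedy_structured_bandit(
        eta=[0.0, 0.5], u=[1.0, -1.0], z_true=1.0, horizon=200,
        noise_var=0.01)
    print("last arm pulled:", arms[-1])
    print("posterior mean of z:", post_mean)
    print("posterior variance of z:", post_var)
```

Because every pull of an arm with nonzero u[m] is informative about the single scalar z, the posterior in this sketch tightens no matter which arm is chosen. This is one way to read the abstract's point that the correlation structure among the arms is what a greedy policy exploits.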

Keywords

Bayes methods; Markov processes; greedy algorithms; statistical distributions; cumulative Bayes risk; discounted total reward; expected total reward; greedy policy; incomplete learning theorem; infinite horizon discounted reward setting; linear function; prior distribution; statistical correlation structure; structured multiarmed bandit; Arm; Convergence; Costs; Infinite horizon; Laboratories; Operations research; Prototypes; Markov decision process (MDP)

fLanguage

English

Journal_Title

Automatic Control, IEEE Transactions on

Publisher

ieee

ISSN

0018-9286

Type

jour

DOI

10.1109/TAC.2009.2031725

Filename

5308361