Achieving complete learning in Multi-Armed Bandit problems

Author

Vakili, Sattar; Zhao, Qing

Author_Institution

Dept. of Electr. & Comput. Eng., Univ. of California, Davis, Davis, CA, USA

fYear

2013

fDate

3-6 Nov. 2013

Firstpage

1778

Lastpage

1782

Abstract

In the classic Multi-Armed Bandit (MAB) problem, there is a given set of arms with unknown reward distributions. At each time, a player selects one arm to play, aiming to maximize the total expected reward over a horizon of length T. It is known that the minimum growth rate of regret (defined as the total expected loss with respect to the ideal scenario of known reward models of all arms) is logarithmic in T. In other words, mistakes in selecting suboptimal arms occur infinitely often, and the player never converges to the arm with the largest reward mean. In this paper, we ask whether side information on the reward model can lead to bounded regret, and thus complete learning, and what minimum side information is needed to achieve complete learning. We show that knowledge of a value η between the largest and the second largest reward means (among all arms) leads to complete learning, by constructing an online learning policy with bounded regret. This result applies to both light-tailed and heavy-tailed reward distributions.
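
To make the role of η concrete, the following is a minimal Python sketch of one way a known threshold could be used. It is not the policy constructed in the paper, which the abstract does not describe, and no regret guarantee is claimed for it. The function name `threshold_policy`, the Bernoulli reward model, and the fallback rule (play the least-played arm when no sample mean exceeds η) are all assumptions made for illustration. The returned regret is the loss in expected reward relative to always playing the best arm.

```python
import random


def threshold_policy(means, eta, horizon, seed=0):
    """Simulate a hypothetical threshold-based policy on Bernoulli arms.

    means   -- true reward means (unknown to the player, used only to
               draw rewards and to score regret afterwards)
    eta     -- side information: a value assumed to lie between the
               largest and the second largest mean
    horizon -- number of plays T
    Returns (pulls per arm, regret against the best arm).
    """
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    totals = [0.0] * k

    for t in range(horizon):
        if t < k:
            # Play every arm once so each sample mean is defined.
            arm = t
        else:
            avgs = [totals[i] / pulls[i] for i in range(k)]
            above = [i for i in range(k) if avgs[i] > eta]
            if above:
                # Exploit: some arm currently looks better than eta.
                arm = max(above, key=lambda i: avgs[i])
            else:
                # Explore: no arm clears eta, so play the least-played one.
                arm = min(range(k), key=lambda i: pulls[i])

        # Bernoulli reward (an assumption of this sketch).
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward

    best = max(means)
    regret = sum(n * (best - m) for n, m in zip(pulls, means))
    return pulls, regret


if __name__ == "__main__":
    pulls, regret = threshold_policy([0.8, 0.3], eta=0.55, horizon=2000)
    print("pulls per arm:", pulls)
    print("regret:", regret)
```

Running the sketch for several values of `horizon` and comparing the reported regret gives an informal sense of how the threshold limits plays of the suboptimal arm; it does not substitute for the analysis in the paper.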

Keywords

game theory; learning (artificial intelligence); statistical distributions; MAB problem; bounded regret; complete learning; heavy-tailed reward distribution; light-tailed reward distribution; multiarmed bandit problem; online learning policy; regret minimum growth rate; reward mean; reward model; side information; total expected reward; Computers; Educational institutions; Random variables; Round robin; Sequential analysis; Upper bound;

fLanguage

English

Publisher

IEEE

Conference_Titel

Signals, Systems and Computers, 2013 Asilomar Conference on

Conference_Location

Pacific Grove, CA

Print_ISBN

978-1-4799-2388-5

Type

conf

DOI

10.1109/ACSSC.2013.6810607

Filename

6810607