On the value of learning for Bernoulli bandits with unknown parameters

Author

Bhulai, Sandjai ; Koole, Ger

Author_Institution

Vrije Univ., Amsterdam, Netherlands

Volume

45

Issue

11

fYear

2000

fDate

11/1/2000 12:00:00 AM

Firstpage

2135

Lastpage

2140

Abstract

Investigates the multiarmed bandit problem, where each arm generates an infinite sequence of Bernoulli distributed rewards. The parameters of these Bernoulli distributions are unknown and initially assumed to be beta-distributed. Every time a bandit is selected, its beta-distribution is updated to new information in a Bayesian way. The objective is to maximize the long-term discounted rewards. We study the relationship between the necessity of acquiring additional information and the reward. This is done by considering two extreme situations, which occur when a bandit has been played N times: the situation where the decision maker stops learning and the situation where the decision maker acquires full information about that bandit. We show that the difference in reward between this lower and upper bound goes to zero as N grows large.

Keywords

Bayes methods; Markov processes; adaptive control; decision theory; game theory; probability; Bernoulli bandits; Bernoulli distributed rewards; Bernoulli distributions; beta-distribution; decision maker; infinite sequence; long-term discounted rewards; multiarmed bandit problem; Adaptive control; Arm; Bayesian methods; Clinical trials; Closed-form solution; Diseases; Dynamic programming; Equations; Minimax techniques; Upper bound;

fLanguage

English

Journal_Title

Automatic Control, IEEE Transactions on

Publisher

ieee

ISSN

0018-9286

Type

jour

DOI

10.1109/9.887641

Filename

887641