Title :
Annealing-pareto multi-objective multi-armed bandit algorithm
Author :
Yahyaa, Saba Q. ; Drugan, Madalina M. ; Manderick, Bernard
Author_Institution :
Dept. of Comput. Sci., Vrije Univ. Brussel, Pleinlaan 2, Brussels, Belgium
Abstract :
In the stochastic multi-objective multi-armed bandit (MOMAB), each arm generates a vector of stochastic rewards, one per objective, instead of a single scalar reward. As a result, there is not a single optimal arm but a set of optimal arms, the Pareto front, identified by applying the Pareto dominance relation to the reward vectors, and there is a trade-off between finding the set of optimal arms (exploration) and selecting the optimal arms fairly, i.e. evenly (exploitation). To trade off exploration and exploitation, either the Pareto knowledge gradient (Pareto-KG for short) or the Pareto upper confidence bound (Pareto-UCB1 for short) can be used; they combine the KG-policy and the UCB1-policy, respectively, with the Pareto dominance relation. In this paper, we propose Pareto Thompson sampling, which uses the Pareto dominance relation to find the Pareto front. We also propose the annealing-Pareto algorithm, which trades off exploration and exploitation by combining a decaying parameter ϵ_t with the Pareto dominance relation: the decaying parameter drives the exploration of the Pareto optimal arms, while the Pareto dominance relation is used to exploit the Pareto front. We experimentally compare Pareto-KG, Pareto-UCB1, Pareto Thompson sampling, and the annealing-Pareto algorithm on multi-objective Bernoulli distribution problems, and we conclude that annealing-Pareto is the best-performing algorithm.
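The two ingredients the abstract describes, the Pareto dominance relation over reward vectors and an ϵ_t-decaying choice between exploring a random arm and exploiting the current Pareto front, can be sketched as follows. This is a minimal illustration, not the paper's exact pseudocode; the function names, the multiplicative decay schedule for ϵ_t, and the use of estimated mean vectors are assumptions made for the sketch.

```python
import random

def pareto_dominates(u, v):
    """True if reward vector u Pareto-dominates v: u is at least as
    good as v in every objective and strictly better in at least one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pareto_front(means):
    """Indices of arms whose estimated mean reward vectors are
    not Pareto-dominated by any other arm."""
    return [i for i, u in enumerate(means)
            if not any(pareto_dominates(v, u)
                       for j, v in enumerate(means) if j != i)]

def annealing_pareto_select(means, t, decay=0.99):
    """Annealing-style selection (illustrative schedule, not the paper's):
    with probability eps_t = decay**t explore a uniformly random arm;
    otherwise exploit by picking uniformly among the current Pareto front,
    so the Pareto-optimal arms are sampled fairly."""
    eps_t = decay ** t                         # decaying exploration parameter
    if random.random() < eps_t:
        return random.randrange(len(means))    # explore any arm
    return random.choice(pareto_front(means))  # exploit the Pareto front

# Example with two objectives: arms 0 and 1 are mutually non-dominated,
# arm 2 is dominated by arm 0, so the Pareto front is [0, 1].
means = [(0.6, 0.4), (0.4, 0.6), (0.3, 0.3)]
print(pareto_front(means))  # → [0, 1]
```

As t grows, ϵ_t shrinks and the selection shifts from uniform exploration toward even exploitation of the Pareto-optimal arms.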
Keywords :
Pareto optimisation; sampling methods; simulated annealing; stochastic programming; KG-policy; MOMAB; Pareto Thompson sampling; Pareto dominance relation; Pareto front; Pareto knowledge gradient; Pareto optimal arms; Pareto upper confidence bound; Pareto-KG; Pareto-UCB1; UCB1-policy; annealing-Pareto multiobjective multiarmed bandit algorithm; decaying parameter; multiobjective Bernoulli distribution problems; multiobjective multiarmed bandit; reward vectors; stochastic rewards; Annealing; Entropy; Heuristic algorithms; Pareto optimization; Probability distribution; Vectors;
Conference_Titel :
Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2014 IEEE Symposium on
Conference_Location :
Orlando, FL
DOI :
10.1109/ADPRL.2014.7010619