At each instant of time we are required to sample a fixed number $m$ out of $N$ Markov chains whose stationary transition probability matrices belong to a family suitably parameterized by a real number $\theta$. The objective is to maximize the long-run expected value of the samples. The learning loss of a sampling scheme under a parameter configuration $C = (\theta_1, \ldots, \theta_N)$ is quantified by the regret $R_n(C)$: the difference between the maximum expected reward that could be achieved by time $n$ if $C$ were known and the expected reward actually achieved by the scheme. We provide a lower bound on the regret associated with any uniformly good scheme, and construct a sampling scheme which attains the lower bound for every configuration $C$. The lower bound is given explicitly in terms of the Kullback-Leibler numbers between pairs of transition probability matrices.
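As a point of reference, one way to write the regret under the notation above is the following sketch; the stationary mean reward $\mu(\theta)$, the ordering permutation $\sigma$, the sampled set $S_t$, and the observations $X_i(t)$ are illustrative notation introduced here rather than taken from the text, and transient effects of the Markov dynamics are ignored.

% Sketch of the regret, assuming \mu(\theta) is the stationary mean reward of a
% chain with parameter \theta and \sigma orders the chains so that
% \mu(\theta_{\sigma(1)}) \ge \cdots \ge \mu(\theta_{\sigma(N)}).
\[
  R_n(C) \;=\; n \sum_{j=1}^{m} \mu\bigl(\theta_{\sigma(j)}\bigr)
  \;-\; \mathbb{E}_C\!\left[ \sum_{t=1}^{n} \sum_{i \in S_t} X_i(t) \right],
\]
% where S_t denotes the set of m chains sampled at time t and X_i(t) the value
% observed from chain i at that time.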
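The Kullback-Leibler number between two members of the family can be written in the standard Markov-chain form below; this is a sketch under assumed notation, with $p(x,y;\theta)$ the transition probabilities and $\pi_\theta$ the stationary distribution of the chain with parameter $\theta$.

\[
  I(\theta, \theta') \;=\; \sum_{x} \pi_\theta(x) \sum_{y} p(x, y; \theta)\,
  \log \frac{p(x, y; \theta)}{p(x, y; \theta')}.
\]

With this notation, a lower bound of the kind described typically takes the Lai-Robbins form sketched below; the precise statement depends on the regularity assumptions imposed on the parameterized family, so this should be read as indicative rather than as the exact result.

\[
  \liminf_{n \to \infty} \frac{R_n(C)}{\log n}
  \;\ge\; \sum_{j \,:\, \mu(\theta_j) < \mu(\theta_{\sigma(m)})}
  \frac{\mu(\theta_{\sigma(m)}) - \mu(\theta_j)}{I\bigl(\theta_j, \theta_{\sigma(m)}\bigr)},
\]
% i.e., the regret of any uniformly good scheme grows at least logarithmically,
% with a constant determined by the gaps to the m-th best chain and the
% Kullback-Leibler numbers I(\theta_j, \theta_{\sigma(m)}).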