Graph Bandits: Multi-Armed Bandits with Locality Constraints

This is a Master's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Abstract: Multi-armed bandits (MABs) have been studied extensively in the literature and have applications in a wealth of domains, including recommendation systems, dynamic pricing, and investment management. On the one hand, the current MAB literature largely focuses on the setting where every arm is available to play at each time step, and ignores how agents move between arms. On the other hand, the work that does account for movement between arms models the problem as a Markov decision process and applies generic reinforcement learning (RL) algorithms, such as Q-learning. This thesis examines an extension of the MAB problem to a setting where the set of available arms at each round depends on which arm was played in the previous round. In this formulation the arms are nodes in a graph, and arms that can be played in succession are connected by edges. We call this the Graph Bandit (GB) problem. We show that under certain conditions the optimal action is governed by a stationary policy. Furthermore, we develop an algorithm, called Q-graph, that leverages the graphical structure of the problem to find this policy when the reward distributions are perfectly known. When the reward distributions are unknown, we show how to combine the Q-graph algorithm with standard sampling algorithms, such as Thompson sampling and the upper confidence bound (UCB), to obtain an online learning algorithm that provably achieves logarithmic regret. Finally, this regret bound is supported by numerical simulations, which also illustrate how the proposed Q-graph algorithm outperforms generic algorithms from the MAB and RL communities.
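The following is a minimal, illustrative sketch of the graph-bandit setting described in the abstract: arms are nodes of a graph, and the set of arms available in a round consists of the neighbors of the arm played in the previous round. The class name, interface, and Gaussian reward model are assumptions made for illustration and are not taken from the thesis.

import numpy as np

class GraphBandit:
    """Arms are nodes of a graph; after playing arm a, only the arms
    connected to a by an edge may be played in the next round."""

    def __init__(self, adjacency, means, rng=None):
        self.adjacency = adjacency   # adjacency[a] = list of arms reachable from a
        self.means = means           # mean reward of each arm (unknown to the learner)
        self.rng = rng or np.random.default_rng()
        self.current = 0             # arm played in the previous round

    def available_arms(self):
        # The action set at this round depends on the previously played arm.
        return self.adjacency[self.current]

    def step(self, arm):
        assert arm in self.available_arms(), "arm not reachable from current node"
        reward = self.rng.normal(self.means[arm], 1.0)  # assumed Gaussian rewards
        self.current = arm
        return reward

# Example: a 3-node path graph 0 -- 1 -- 2 (staying on the same arm is allowed here).
env = GraphBandit(adjacency={0: [0, 1], 1: [0, 1, 2], 2: [1, 2]},
                  means=[0.1, 0.5, 0.9])
r = env.step(1)  # from node 0, the learner may move to node 1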
