AlphaZero to Alpha Hero: A Pre-Study on Additional Tree Sampling within Self-Play Reinforcement Learning

This is a Bachelor's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Authors: Fredrik Carlsson; Joey Öhman; [2019]


Abstract: In self-play reinforcement learning, an agent plays games against itself and, with the help of hindsight and retrospection, improves its policy over time. Using this premise, AlphaZero famously managed to become the strongest known Go, Shogi, and Chess entity by training a deep neural network on data collected solely from self-play. AlphaZero couples this deep neural network with a Monte Carlo Tree Search algorithm that drastically improves the network's initial policy and state evaluation. During training, AlphaZero relies on the final outcome of the game to generate training labels. By altering the learning target to instead use the improved state evaluation acquired after the tree search, it becomes possible to create training labels for states visited exclusively by the tree search. We propose the extension Additional Tree Sampling, which exploits this change of learning target, and provide theoretical arguments and counterarguments for the validity of the approach. Further, an empirical analysis is performed on the game Connect Four, yielding results that justify the change in learning target: the altered learning target appears to have no negative impact on final player strength or on the behavior of the learning algorithm over time. Based on these positive results, we encourage further research on Additional Tree Sampling in order to validate or reject the usefulness of this method.
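To make the change of learning target concrete, the following is a minimal Python sketch, not the thesis's actual implementation; the `Node` structure and the `collect_labels` function are invented for illustration. It contrasts the AlphaZero value label (the final game outcome z) with the search-improved value estimate, and shows how the latter also yields labels for states visited only inside the tree search:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

import numpy as np


@dataclass
class Node:
    """Hypothetical MCTS node: a game state plus statistics backed up by search."""
    state: np.ndarray                 # encoded board position
    visit_counts: np.ndarray          # visit count per legal move
    value: float                      # mean value estimate from MCTS backups
    children: List["Node"] = field(default_factory=list)


def collect_labels(root: Node, outcome_z: float,
                   use_search_value: bool = True,
                   extra_tree_samples: int = 0) -> List[Tuple]:
    """Build (state, policy, value) training labels from one search tree.

    With use_search_value=False this mimics AlphaZero, whose value label is
    the final game outcome z. With use_search_value=True the label is the
    search-improved value estimate, which also allows labeling states that
    were only ever visited inside the tree search.
    """
    labels = []

    # Policy target: visit counts normalized into a distribution (as in AlphaZero).
    policy = root.visit_counts / root.visit_counts.sum()

    # Value target: final outcome z (AlphaZero) or the MCTS value estimate.
    value = root.value if use_search_value else outcome_z
    labels.append((root.state, policy, value))

    # Additional Tree Sampling: tree-only states have no final outcome,
    # but the search value provides a usable training label for them.
    # (Perspective/sign handling between plies is elided for brevity.)
    if use_search_value and extra_tree_samples > 0:
        visited = [c for c in root.children if c.visit_counts.sum() > 0]
        for node in visited[:extra_tree_samples]:
            node_policy = node.visit_counts / node.visit_counts.sum()
            labels.append((node.state, node_policy, node.value))

    return labels
```

The sketch deliberately leaves open how many and which tree states to sample, since those choices are part of what further research on Additional Tree Sampling would need to examine.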
