Improving robustness of beyond visual range strategies with adapted training distributions

This is a Master's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Abstract: A key obstacle to training an autonomous agent for real air-to-air combat is the lack of available training data, which makes it difficult to apply supervised learning techniques. Self-play is a method in which an agent trains against itself, or against earlier versions of itself, without imitation data or human instruction. Agents that train only against themselves learn brittle strategies that generalize poorly, which is why training against a distribution of strategies is necessary to improve robustness. In this thesis, we study two problems. First, what is a robust strategy, and how do we evaluate it? Second, how do we increase the robustness of strategies learned in a self-play setting by adapting the training distribution? These problems are significant to study because self-play is a very promising training method not only for air combat but for any non-cooperative problem setting where a simulator can gather training data with no human in the loop. In the aircraft industry specifically, the cost of gathering samples is extremely high. To evaluate the robustness of a population of strategies, we turned to evolutionary game theory and connected the α-rank algorithm to what we perceive as robustness. α-rank induces a strict ordering over the population, which we take as an evaluation of the robustness of its strategies. We validated that a high α-rank correlates well with strong performance in an out-of-population evaluation. To study how the robustness of a population depends on the training distribution, we trained populations against four different training distributions. We used the uniform, δ-uniform, and α-rank distributions, which rely on no information, on information about the training process, and on information about the robustness of agents, respectively. We also designed a novel amortized α-rank training distribution that combines the information behind the δ-uniform and α-rank distributions, and we showed that it induces superior robustness properties in the learned strategies. Our results indicate that even better training distributions can be constructed, which is useful for future applications of self-play.
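To make the evaluation and sampling machinery concrete, the sketch below implements a simplified single-population α-rank (the stationary distribution of an evolutionary Markov chain built from an empirical payoff matrix) and the three baseline training distributions named above. This is a minimal illustration under stated assumptions, not the thesis code: the payoff matrix, the selection-intensity and population-size parameters, the convention that δ-uniform samples uniformly over the most recent fraction of checkpoints, and all function names are illustrative. The novel amortized α-rank distribution is deliberately omitted, since the abstract does not specify its exact construction.

import numpy as np

def fixation_prob(delta_payoff, alpha=5.0, m=50):
    """Probability that a single mutant strategy takes over a resident
    population of size m, given its payoff advantage and selection
    intensity alpha (higher alpha makes selection more deterministic)."""
    if abs(delta_payoff) < 1e-12:
        return 1.0 / m                       # neutral drift
    x = -alpha * delta_payoff
    if m * x > 700.0:                        # hopeless mutant; avoid overflow
        return 0.0
    return np.expm1(x) / np.expm1(m * x)

def alpha_rank(M, alpha=5.0, m=50):
    """Simplified single-population alpha-rank: build the evolutionary
    Markov chain over monomorphic populations and return its stationary
    distribution, read here as a robustness score per strategy.
    M[i, j] is the average payoff of strategy i against strategy j."""
    n = M.shape[0]
    eta = 1.0 / (n - 1)                      # uniform mutation over rivals
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                C[i, j] = eta * fixation_prob(M[j, i] - M[i, j], alpha, m)
        C[i, i] = 1.0 - C[i].sum()           # stay monomorphic otherwise
    # Stationary distribution = left eigenvector of C for eigenvalue 1.
    w, v = np.linalg.eig(C.T)
    pi = np.real(v[:, np.argmax(np.real(w))])
    return np.abs(pi) / np.abs(pi).sum()

def opponent_distribution(payoffs, mode, delta=0.5):
    """Sampling weights over the checkpoint population for the three
    baseline training distributions from the abstract. The window
    convention for delta-uniform is our assumption."""
    n = payoffs.shape[0]
    if mode == "uniform":
        return np.full(n, 1.0 / n)           # no information used
    if mode == "delta_uniform":
        k = max(1, int(np.ceil(delta * n)))  # most recent k checkpoints
        p = np.zeros(n)
        p[-k:] = 1.0 / k
        return p
    if mode == "alpha_rank":
        return alpha_rank(payoffs)           # weight rivals by robustness
    raise ValueError(f"unknown training distribution: {mode}")

# Toy usage: a rock-paper-scissors cycle plus one dominated strategy.
M = np.array([[ 0.0,  1.0, -1.0, 1.0],
              [-1.0,  0.0,  1.0, 1.0],
              [ 1.0, -1.0,  0.0, 1.0],
              [-1.0, -1.0, -1.0, 0.0]])
print(alpha_rank(M))                         # dominated agent gets ~zero mass
rng = np.random.default_rng(0)
p = opponent_distribution(M, "alpha_rank")
print(rng.choice(len(p), p=p))               # sample one training opponent

On the toy matrix, the dominated fourth strategy receives near-zero stationary mass while the three cyclic strategies share it evenly, matching the intuition that a high α-rank score signals robustness against the rest of the population.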
