A metaheuristic for vehicle routing problems based on reinforcement learning

This is a Master's thesis from KTH/Optimization and Systems Theory

Author: David Ödling; [2018]


Abstract: The vehicle routing problem is an old and well-studied problem that arises in last-mile logistics. The rapid growth of e-commerce, in particular the increasing demand for time-scheduled home deliveries on the customer's terms, is making the problem ever more relevant. Minimizing cost and environmental impact under these constraints yields a mathematical problem called the vehicle routing problem with time windows. Since the problem is NP-hard, heuristic methods are often used. In practice, they work very well and typically offer a good trade-off between speed and quality. However, since heuristics are often tailor-made to fit the needs of the underlying problem, no known algorithm dominates all others on all problems. One way to overcome the need for specialization is to produce heuristics that are adaptive. In this thesis, an offline learning method is proposed to generate an adaptive heuristic using local search heuristics and reinforcement learning. The reinforcement learning agents explored in this thesis operate in both discrete and continuous state representations. Common to all state spaces is that they are inspired by human-crafted reference models, where the last action and the result of that action define the state. Four different reinforcement learning models are evaluated in the various environments. Evaluating the models on a subset of the Solomon benchmark instances, we find that all models but one improve upon a random baseline. The average learner from each of the successful models was slightly worse than the human-crafted baseline. However, the best of the generated models was an actor-critic based model, which outperformed the best human baseline model. Due to the scalar objective function, the results are not directly comparable to the Solomon benchmark results, which use hierarchical objectives. Nonetheless, the results are encouraging as a proof of principle, with performance in line with the human-crafted baseline. The results indicate two clear paths for further work: first, applying the formulation to more complex environments with more actions and more powerful state spaces; second, investigating models based on stochastic policies and recurrent neural networks to cope with the inherently partially observed environment.
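To make the idea concrete, below is a minimal, self-contained sketch (not the thesis code) of an adaptive heuristic of the kind described: a tabular Q-learning agent chooses which local-search operator to apply next, and the state is simply the last action together with whether it improved the incumbent solution, as in the abstract. The three operator names and the scalar "cost surrogate" are illustrative assumptions standing in for real VRPTW neighbourhood moves and a real objective.

import random
from collections import defaultdict

random.seed(0)

OPERATORS = ["relocate", "swap", "two_opt"]   # assumed neighbourhood moves

def apply_operator(op, x):
    """Toy stand-in for a local-search move on a VRPTW solution: each
    operator perturbs a scalar cost surrogate with a different bias."""
    bias = {"relocate": -0.5, "swap": 0.0, "two_opt": -1.0}[op]
    return x + bias + random.uniform(-1.0, 1.0)

alpha, gamma, epsilon = 0.1, 0.9, 0.2         # learning/discount/exploration
Q = defaultdict(float)                        # Q[(state, action)], default 0

for episode in range(200):
    x = 100.0                                 # initial objective value
    best = x                                  # incumbent (best seen so far)
    state = (None, False)                     # (last action, improved?)
    for step in range(50):
        # epsilon-greedy choice of the next operator
        if random.random() < epsilon:
            a = random.randrange(len(OPERATORS))
        else:
            a = max(range(len(OPERATORS)), key=lambda b: Q[(state, b)])
        x = apply_operator(OPERATORS[a], x)
        improved = x < best
        reward = best - x if improved else 0.0
        best = min(best, x)
        nxt = (a, improved)
        # standard one-step Q-learning update
        Q[(state, a)] += alpha * (
            reward
            + gamma * max(Q[(nxt, b)] for b in range(len(OPERATORS)))
            - Q[(state, a)])
        state = nxt

# The learned greedy policy reads off which operator to try next,
# given what was just done and whether it helped:
for s in [(None, False), (0, True), (0, False), (2, True), (2, False)]:
    pick = max(range(len(OPERATORS)), key=lambda b: Q[(s, b)])
    print(s, "->", OPERATORS[pick])

Replacing the toy surrogate with a real VRPTW objective and real moves such as relocate or 2-opt recovers the offline-learning setting the thesis studies; the continuous-state and actor-critic variants mentioned above would swap the Q-table for a function approximator over richer state features.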
