Supervised Learning for Prediction of Tumour Mutational Burden

This is a Master's thesis from KTH/Mathematical Statistics

Abstract: Tumour Mutational Burden (TMB) is a promising biomarker for predicting response to immunotherapy. In this thesis, statistical methods of supervised learning were used to predict TMB: generalized linear models (GLM), decision trees and support vector machines (SVM). Predictions were based on data from targeted DNA sequencing, using variants found in the exonic, intronic, UTR and intergenic regions of human DNA. The project was exploratory and was performed in a pan-cancer setting. Both regression and classification were considered. The purpose was to investigate whether variants found in these regions of the DNA sequence are useful for predicting TMB. Poisson regression and negative binomial regression were used within the GLM framework. The results indicated deficiencies in the model assumptions and that the use of GLM for this application is questionable. The single regression tree did not yield satisfactory prediction accuracy. However, performance improved with variance-reducing methods such as bagging and random forests. Boosted regression trees did not yield any significant improvement in prediction accuracy. In the classification setting, both binary and multi-class problems were considered. The classes were defined by thresholds commonly used in clinical care when deciding on immunotherapy. SVM and classification trees yielded high prediction accuracy in the binary case: misclassification rates of 0.0242 and 0, respectively, on the independent test set. In the multi-class setting, bagging and random forests were implemented but did not improve performance over the single classification tree. SVM produced a misclassification rate of 0.103, and the corresponding figure for the single classification tree was 0.109. It was concluded that SVM and decision trees are suitable methods for predicting TMB based on targeted gene panels. However, to obtain reliable predictions, there is a need to move from a pan-cancer setting to a diagnosis-based setting. Furthermore, parameters affecting TMB, such as pre-analytical factors, need to be included in the statistical analysis.
