Exploring DeepSEA CNN and DNABERT for Regulatory Feature Prediction of Non-coding DNA

This is a Master's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Abstract: Predicting and understanding the regulatory effects of non-coding DNA is an extensive research area in genomics. Convolutional neural networks (CNNs) have previously been used with success to predict regulatory features, making chromatin feature predictions based solely on non-coding DNA sequences. Non-coding DNA shares various similarities with human language, which makes language models such as the transformer attractive candidates for deciphering it. This thesis investigates how well the transformer model, usually applied to natural language processing (NLP) problems, predicts chromatin features from genome sequences compared to convolutional neural networks. More specifically, the CNN DeepSEA, which is used for regulatory feature prediction based on non-coding DNA, is compared with the transformer DNABERT. The study also explores the impact that different parameters and training strategies have on performance. Two further models, DeeperDeepSEA and DanQ, are evaluated on the same tasks to give a broader basis for comparison. Lastly, the same experiments are conducted on modified versions of the dataset in which the labels cover different amounts of the DNA sequence. This could benefit the transformer model, which can capture long-range dependencies in natural language problems. The replication of DeepSEA was successful and gave results similar to those of the original model. The experiments run on DeepSEA were also conducted on DNABERT, DeeperDeepSEA, and DanQ. All models were trained on different datasets, and their results were compared. Finally, a prediction voting mechanism was implemented, which gave better results than any of the models individually. The results showed that DeepSEA performed slightly better than DNABERT in terms of AUC ROC. The Wilcoxon signed-rank test showed that, even though the two models achieved similar AUC ROC scores, there is a statistically significant difference between the distributions of their predictions.
This means that the models look at the dataset differently, which might be why combining their predictions gives good results. Due to the time constraints imposed by training the computationally heavy DNABERT, the best hyperparameters and training strategies for the model were not found, only improved upon. The datasets used in this thesis were severely imbalanced, which is something that needs to be addressed in future projects. This project serves as a good continuation of the paper "Whole-genome deep-learning analysis identifies contribution of non-coding mutations to autism risk", which uses the DeepSEA model to learn more about how specific mutations correlate with autism spectrum disorder.
