SARS-CoV-2 Lineage Clustering: Using Unsupervised Machine Learning

This is a Bachelor's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Authors: Amanda Hedlund; Fonzie Forsman; [2022]


Abstract: Methods for sequencing genetic information, and access to that information, have proved very useful in the research and understanding of viruses. They can, for example, be used to develop vaccines, manage pandemics, and attempt to map a virus's spread and development. During the SARS-CoV-2 pandemic, a nomenclature for the virus has been created by the Pango database with the help of the GISAID database and other genetic databases. This study examines whether a new grouping of the SARS-CoV-2 genomes from Sweden and Spain could provide new information or show trends in the genetic data, using two different clustering algorithms: k-means and agglomerative clustering. The k-means algorithm was chosen because it is scalable, which suits the large dataset. The agglomerative algorithm was chosen because it is a hierarchical algorithm that can also serve as a summarization of the data. The results mainly indicated a bias in the GISAID database: the collected samples are not representative of the population or of the true spread of the SARS-CoV-2 virus. The results also showed that the k-means algorithm can create groupings of similar quality to the Pango lineages in some respects, but also that it is hard to quantify how good a grouping is with this type of data. The agglomerative clustering showed that the sequences are similar overall, but that there are some differences between the larger variants of the virus. To further test and evaluate these conclusions, a larger dataset covering multiple countries should be tested.
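To make the two algorithms named in the abstract concrete, here is a minimal sketch in plain Python. It is not the thesis's pipeline. The abstract does not say how genomes were turned into feature vectors, so the 2-mer count encoding (`kmer_vector`), the toy sequences, the fixed starting centroids, and the use of averaged squared Euclidean distance as the linkage are all assumptions made for illustration.

```python
from itertools import product


def kmer_vector(seq, k=2):
    """Count overlapping k-mers over ACGT (an assumed, hypothetical encoding)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        sub = seq[i:i + k]
        if sub in counts:  # skip k-mers containing ambiguous bases such as N
            counts[sub] += 1
    return [counts[m] for m in kmers]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, init_centroids, iters=20):
    """Lloyd's algorithm with caller-supplied starting centroids."""
    centroids = [list(c) for c in init_centroids]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def agglomerative(points, n_clusters):
    """Bottom-up merging until n_clusters remain (average of squared distances)."""
    clusters = [[i] for i in range(len(points))]

    def link(a, b):
        total = sum(sq_dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters.
        pairs = (
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        a, b = min(pairs, key=lambda ab: link(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] += clusters[b]
        del clusters[b]

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy sequences standing in for genomes; real genomes are roughly 30,000 bases long.
toy = ["AAAAAAAAAA", "AAAAAAAAAC", "GGGGGGGGGG", "GGGGGGGGGT"]
vectors = [kmer_vector(s) for s in toy]
km_labels = kmeans(vectors, init_centroids=[vectors[0], vectors[2]])
ag_labels = agglomerative(vectors, n_clusters=2)
print(km_labels, ag_labels)
```

The agglomerative loop compares every pair of clusters at every merge, while each k-means pass only compares points against a small number of centroids, which reflects the scalability argument the abstract gives for k-means. For data at the scale the abstract describes, library implementations such as scikit-learn's `KMeans` and `AgglomerativeClustering` would be the more practical choice.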
