Semi-Supervised Learning for Predicting Biochemical Properties

This is a Bachelor's thesis from Uppsala University, Department of Information Technology (Institutionen för informationsteknologi)

Author: Travis Persson [2021]

Abstract: The predictive performance of supervised learning methods relies on large amounts of labeled data. Data sets used in Quantitative Structure-Activity Relationship (QSAR) modeling often contain a limited amount of labeled data, while unlabeled data is abundant. Semi-supervised learning can improve the performance of supervised methods by incorporating a larger set of unlabeled samples alongside the few labeled instances. A semi-supervised learning method known as Label Spreading was compared to a Random Forest in its effectiveness at correctly classifying the binding properties of molecules on ten different sets of compounds. Label Spreading with a k-Nearest Neighbors kernel (LS-KNN) was found to outperform the Random Forest on average. With randomly sampled labeled sets of sizes 50 and 100, LS-KNN achieved a mean accuracy 4.03% and 1.97% higher, respectively, than that of the Random Forest. The outcome was similar for the mean area under the Receiver Operating Characteristic curve (AUC). For large sets of labeled data, the performances of the methods were indistinguishable. It was also found that sampling labeled data from clusters generated by a k-Means clustering algorithm, as opposed to random sampling, increased the performance of all applied methods. For a labeled data set of size 50, Label Spreading with a Radial Basis Function kernel increased its mean accuracy and AUC by 7.52% and 3.08%, respectively, when sampling from clusters. In conclusion, semi-supervised learning could be beneficial when applied to similar modeling scenarios. However, the improvements depend heavily on the underlying data, suggesting that there is no one-size-fits-all method.
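To make the setup described in the abstract concrete, the following is a minimal sketch (not the thesis code) of comparing Label Spreading with a k-NN kernel against a Random Forest when only a small subset of the training data is labeled, plus the cluster-based alternative to random label sampling. It uses scikit-learn's LabelSpreading, RandomForestClassifier, and KMeans on synthetic placeholder data; the data set, labeled-set size of 50, and all hyperparameters are illustrative assumptions rather than the actual QSAR data and settings used in the thesis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelSpreading
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, roc_auc_score, pairwise_distances_argmin

# Placeholder data standing in for a QSAR descriptor matrix and binary binding labels.
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Randomly keep only 50 labels; mark the rest as unlabeled (-1) for Label Spreading.
rng = np.random.default_rng(0)
labeled_idx = rng.choice(len(y_train), size=50, replace=False)
y_semi = np.full_like(y_train, -1)
y_semi[labeled_idx] = y_train[labeled_idx]

# Label Spreading with a k-Nearest Neighbors kernel (LS-KNN) propagates the few
# known labels through all training points, labeled and unlabeled alike.
ls_knn = LabelSpreading(kernel="knn", n_neighbors=7)
ls_knn.fit(X_train, y_semi)

# The Random Forest baseline can only learn from the 50 labeled samples.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train[labeled_idx], y_train[labeled_idx])

# Compare accuracy and AUC on the held-out test set.
for name, model in [("LS-KNN", ls_knn), ("Random Forest", rf)]:
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    print(f"{name}: accuracy={accuracy_score(y_test, pred):.3f}, "
          f"AUC={roc_auc_score(y_test, proba):.3f}")

# Cluster-based alternative to random label sampling: label the training point
# closest to each of 50 k-Means centroids so the labeled set spans the feature
# space (duplicate picks are possible in this simple sketch).
centroids = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X_train).cluster_centers_
cluster_labeled_idx = pairwise_distances_argmin(centroids, X_train)
```

The key design point illustrated here is that both models see the same 50 labels, but Label Spreading additionally exploits the geometry of the unlabeled training points, which is where the reported gains for small labeled sets come from.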
