Classifying Previous Covid-19 Infection : Advanced Logistic Regression Approach

This is a thesis for a professional degree at advanced level from Umeå universitet/Institutionen för matematik och matematisk statistik

Author: Daniel Westerholm; [2023]

Keywords: Covid-19; logistic regression; clustering;

Abstract: The study aimed to develop a logistic regression model, based on antibody proteins, vaccinations and demographic factors, that predicts previous Covid-19 infection. The data set comprised 2750 individuals from eldercare homes in Sweden, with four test dates between October 2021 and August 2022. Exploratory data analysis revealed a bimodal pattern in the antibodies against nucleocapsid protein within the non-infected group, raising suspicions of false negatives in the data. Because the response is binary and the model needed to be interpretable for further research, logistic regression models were used to model the relationship between the predictors and the logit of the response. Because of low performance scores and the high probability of false negatives, the K-means clustering algorithm was applied to the data. The base-2 logarithm of the nucleocapsid protein was used as the clustering variable because of its theoretical relationship with previous Covid-19 infection. Observations were reclassified using the clustering technique, and two new logistic models were fitted to the data. The final model contained polynomial terms to handle the non-linear relationship between the logit of the response and the predictors. In the final model we found a significant relationship between the base-2 logarithm of nucleocapsid protein and previous Covid-19 infection, with high prediction results: an F1-score of 0.94, indicating a well-performing model. Additionally, an algorithm was created to predict the days since infection. It uses the change in nucleocapsid protein from one test date to the next, together with a GAM that fits a smooth curve of nucleocapsid protein (the response) against days since infection. With this algorithm, the mean absolute error between predicted and actual days since infection was 23 days. The algorithm was later applied to the observations reclassified in the clustering process.
In conclusion, the study successfully reclassified false-negative observations as having a previous Covid-19 infection and fitted a logistic model with a high prediction score (F1-score of 0.94). Finally, an algorithm was created that estimated the days since infection with a mean absolute error of 23 days.
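
The reclassification step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis code: the antibody values and recorded labels are synthetic, the function names `two_means_1d` and `f1_score` are hypothetical, and the rule "a recorded negative that falls in the high cluster is relabelled as infected" is an assumption about how the clustering was used. The sketch runs K-means with two clusters on the base-2 logarithm of a nucleocapsid measurement and then computes an F1-score.

```python
import math


def two_means_1d(values, n_iter=100):
    """K-means (Lloyd's algorithm) with k=2 on one-dimensional data.

    Centres start at the minimum and maximum, so both clusters stay
    non-empty as long as the data contains at least two distinct values.
    Returns (low_centre, high_centre).
    """
    lo, hi = min(values), max(values)
    for _ in range(n_iter):
        mid = (lo + hi) / 2  # nearest-centre boundary in one dimension
        low = [v for v in values if v <= mid]
        high = [v for v in values if v > mid]
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):  # converged
            break
        lo, hi = new_lo, new_hi
    return lo, hi


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn)


# Synthetic nucleocapsid antibody levels (arbitrary units), NOT study data.
n_protein = [0.5, 1.0, 2.0, 1.0, 64.0, 128.0, 256.0, 128.0]
# Recorded infection status; the last two are suspected false negatives.
recorded = [0, 0, 0, 0, 1, 1, 0, 0]

log2_n = [math.log2(v) for v in n_protein]
low_centre, high_centre = two_means_1d(log2_n)
threshold = (low_centre + high_centre) / 2

# Assumed rule: keep recorded positives, relabel high-cluster negatives.
reclassified = [
    1 if (label == 1 or x > threshold) else 0
    for label, x in zip(recorded, log2_n)
]

print("cluster centres (log2 scale):", low_centre, high_centre)
print("reclassified labels:", reclassified)
# Scoring the recorded labels against the reclassified ones shows how
# suspected false negatives lower an F1-score.
print("F1 of recorded vs reclassified:", f1_score(reclassified, recorded))
```

In one dimension with two clusters, assigning each point to its nearest centre is the same as thresholding at the midpoint between the centres, which is why the sketch needs no clustering library. A real analysis would more likely use an established K-means implementation and fit the logistic model on the reclassified labels.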
