Extracting Adverse Drug Reactions from Product Labels using Deep Learning and Natural Language Processing

This is a Master's thesis from KTH/School of Engineering Sciences in Chemistry, Biotechnology and Health (CBH)

Author: Shachi Bista; [2020]

Abstract: Pharmacovigilance covers the activities involved in monitoring drug safety in the post-marketing phase of the drug development life cycle. Despite the rigorous trials and experiments that drugs undergo before they reach the market, they can still cause previously unobserved side effects (also known as adverse events) due to drug–drug interactions or genetic, physiological, or demographic factors. The Uppsala Monitoring Centre (UMC), in collaboration with the World Health Organization (WHO), is the custodian of VigiBase, the global reporting system for adverse drug reactions. VigiBase houses over 20 million case reports of suspected adverse drug reactions from around the world. However, not all case reports that the UMC receives pertain to adverse reactions that are novel in the safety profile of the drugs. In fact, many of the reported reactions in the database are known adverse events for the reported drugs. With more than 3 million potential associations between all possible drugs and all possible adverse events present in the database, identifying associations that are likely to represent previously unknown safety concerns requires powerful statistical methods and knowledge of the known safety profiles of the drugs. Therefore, there is a need for a knowledge base that maps drugs to their known adverse reactions. To date, such a knowledge base does not exist. The purpose of this thesis is to develop a deep-learning model that learns to extract adverse reactions from product labels (regulatory documents providing the current state of knowledge of the safety profile of a given product) and map them to a standardized terminology with high precision.
To achieve this, I propose a two-phase algorithm: a first scanning phase that finds regions of the text representing adverse reactions, and a second mapping phase that normalizes the detected text fragments into Medical Dictionary for Regulatory Activities (MedDRA) terms, the terminology used at the UMC to represent adverse reactions. A previous dictionary-based algorithm developed at the UMC achieved a scanning F1 of 0.42 (0.31 precision, 0.66 recall) and a mapping macro-averaged F1 of 0.43 (0.39 macro-averaged precision, 0.64 macro-averaged recall). State-of-the-art methods achieve F1 above 0.8 and above 0.7 for the scanning and mapping problems respectively. To develop algorithms for adverse reaction extraction, I use the 2019 ADE Evaluation Challenge data, a dataset created by the FDA containing 100 product labels annotated for adverse events and their mappings to MedDRA. This thesis explores three architectures for the scanning problem: 1) a Bidirectional Long Short-Term Memory (BiLSTM) encoder followed by a softmax classifier, 2) a BiLSTM encoder with a Conditional Random Field (CRF) classifier, and 3) a BiLSTM encoder with a CRF classifier and Embeddings from Language Model (ELMo) embeddings. For the mapping problem, I explore Information Retrieval techniques using the search engines whoosh and Solr, as well as a Learning to Rank algorithm. The BiLSTM encoder with CRF gave the highest performance at finding adverse events in the texts, with an F1 of 0.67 (0.75 precision, 0.61 recall), a 0.06 absolute increase in F1 over the simpler BiLSTM encoder with softmax. Using the ELMo embeddings proved detrimental and lowered the F1 to 0.62. Error analysis revealed that the adopted Inside, Beginning, Outside (IOB2) labelling scheme is poorly suited to denoting discontinuous and compound spans and introduces ambiguity into the training data.
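The limitation of IOB2 noted above can be illustrated with a minimal sketch. The example sentence, the `ADR` label name, and both helper functions are hypothetical illustrations, not taken from the thesis or the challenge data; the sketch only assumes that each mention is encoded as a contiguous token span.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def spans_to_iob2(n_tokens: int, spans: List[Span], label: str = "ADR") -> List[str]:
    """Encode contiguous, non-overlapping token spans as IOB2 tags."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        tags[start] = f"B-{label}"          # first token of a mention
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the mention
    return tags


def iob2_to_spans(tags: List[str]) -> List[Span]:
    """Decode IOB2 tags back into contiguous token spans."""
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "O" or tag.startswith("B-"):
            if start is not None:           # close the mention in progress
                spans.append((start, i))
                start = None
            if tag.startswith("B-"):
                start = i
        elif start is None:                 # stray I- tag: treat as a new mention
            start = i
    if start is not None:
        spans.append((start, len(tags)))
    return spans


# Hypothetical sentence with two reactions: "pain in the arms" and "pain in the legs".
tokens = ["pain", "in", "the", "arms", "and", "legs"]
# The second reaction is discontinuous ("pain ... legs"), so the closest IOB2
# encoding tags only "legs" as a separate mention.
tags = spans_to_iob2(len(tokens), [(0, 4), (5, 6)])
decoded = [" ".join(tokens[s:e]) for s, e in iob2_to_spans(tags)]
print(tags)     # ['B-ADR', 'I-ADR', 'I-ADR', 'I-ADR', 'O', 'B-ADR']
print(decoded)  # ['pain in the arms', 'legs']
```

In this sketch the second mention decodes to "legs" alone, because one tag per token cannot attach the shared word "pain" to two mentions. This is one way a discontinuous or compound span can become ambiguous in IOB2-labelled training data.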
Based on the gold standard annotated mappings, I also evaluated the whoosh and Solr search engines, with and without Learning to Rank. The best performing search engine on this data was Solr, with a macro-averaged F1 of 0.49, compared to 0.47 for the whoosh search engine. Adding a Learning to Rank algorithm on top of each engine did not improve mapping performance: the macro-averaged F1 of both engines dropped by over 0.1 with the re-ranking approach. Finally, the best performing scanning and mapping algorithms beat the aforementioned dictionary-based baseline F1 by 0.25 in the scanning phase and 0.06 in the mapping phase. A large source of error for the Solr search engine came from tokenisation issues, which had a detrimental impact on the performance of the entire pipeline. In conclusion, modern Natural Language Processing (NLP) techniques can significantly improve the performance of adverse event detection from free-form text compared to dictionary-based approaches, especially in cases where context is important.
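For readers checking the reported figures, the following sketch shows how F1 relates to precision and recall, and how a macro-averaged F1 can differ from the harmonic mean of macro-averaged precision and recall. The abstract does not state the unit over which scores are macro-averaged, so the per-unit counts below are invented toy values and `macro_scores` is a hypothetical helper, not the thesis's evaluation code.

```python
from typing import List, Tuple


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_scores(counts: List[Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Macro-average precision, recall and F1 over units.

    Each entry in `counts` is (true positives, false positives, false negatives)
    for one unit; scores are computed per unit and then averaged.
    """
    precisions, recalls, f1s = [], [], []
    for tp, fp, fn in counts:
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1(p, r))
    n = len(counts)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Scanning scores from the abstract follow directly from precision and recall.
print(round(f1(0.75, 0.61), 2))  # 0.67 (BiLSTM encoder with CRF)
print(round(f1(0.31, 0.66), 2))  # 0.42 (dictionary-based baseline)

# The harmonic mean of the baseline's macro precision and recall is about 0.48,
# not the reported macro F1 of 0.43. That is consistent with F1 being averaged
# per unit rather than computed from the averaged precision and recall.
print(round(f1(0.39, 0.64), 2))  # 0.48

# Toy counts (invented) showing the same effect.
macro_p, macro_r, macro_f1 = macro_scores([(8, 2, 2), (1, 9, 0)])
```

With these toy counts, the macro-averaged F1 is about 0.49, while the harmonic mean of the macro-averaged precision (0.45) and recall (0.90) is 0.60.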
