Quality Control and Analysis of RNA-seq Data from Breast Cancer Tumor Samples

This is a Master's thesis from Lund University / Degree Projects in Bioinformatics

Author: Christian Brueffer; [2013]

Keywords: Biology and Life Sciences

Abstract:

Background: Breast cancer is the most common kind of cancer among women in Sweden. While the short- to mid-term survival chances are good, the long-term survival chances are poor, and a large number of women are also likely being overtreated and thus suffer from unnecessary side effects. The South Sweden Cancerome Analysis Network - Breast (SCAN-B) Initiative aims to improve breast cancer outcomes by developing new diagnostics and predictive tests based on RNA sequencing (RNA-seq) technology. Because RNA-seq is a complicated technology with many sources of error, quality control is needed to gain confidence in the obtained data.

Results: During this project an RNA-seq quality control pipeline was built and integrated into the existing SCAN-B RNA-seq analysis pipeline. The quality control pipeline was used to evaluate the quality of 2547 RNA-seq libraries. The evaluation showed good overall quality of the data. While the quality of the first sequenced libraries is not optimal, quality has increased steadily and settled at a high level.

Conclusions: Quality control is essential for the RNA-seq analysis process. The metrics used in this project provide good insight into the quality of the evaluated datasets. However, cancer cells feature a distinct genomic landscape which can make the interpretation of metrics difficult. Thus, care has to be taken when drawing conclusions about the quality of RNA-seq data from cancer-derived samples.

Popular science summary: Quality Control of Data from Breast Cancer Tumor Samples

Breast cancer is the most common kind of cancer among women in Sweden. Challenges remain in improving long-term survival and personalizing the most effective treatments with the fewest side effects. To improve this situation, new techniques based on profiling the gene expression of individual tumors are being developed, one example being RNA sequencing.
This project was about quality control for the data produced by RNA sequencing.

Breast cancer is the most common kind of cancer among women in Sweden. While the short- to mid-term survival chances are good, the long-term survival chances are much worse, and thus certain patients require more personalized and effective therapies. Furthermore, a large number of women whose disease has a very good prognosis are likely being overtreated and thus suffer from unnecessary side effects. The South Sweden Cancerome Analysis Network - Breast Initiative (SCAN-B; http://scan.bmc.lu.se) aims to improve breast cancer outcomes by developing new diagnostics and treatment-predictive tests based on RNA sequencing (RNA-seq) of patient tumors.

RNA-seq is a tool to determine the RNA sequences in a sample and their abundance. It can be used to analyze the specific characteristics of different tumors, such as gene expression levels and gene mutations. However, the technology is complex and includes many potential sources of noise. Quality control is needed to gain confidence in the obtained data.

During this project a computational RNA-seq quality control pipeline was built and integrated into the existing SCAN-B RNA-seq analysis pipeline. This quality control pipeline was used to evaluate the quality of RNA-seq datasets from breast cancer tumor samples of approximately 600 patients according to a variety of metrics. These metrics measure the quality of the sample prepared for sequencing, the sequencing process, and the basic analysis steps. The evaluation showed good overall quality of the data. While the quality of the data from the first sequenced samples is not optimal, quality has increased steadily and settled at a high level. Different problems that occurred during the sequencing process could be correlated with specific low metrics. However, care has to be taken when interpreting quality metrics from cancer-derived data.
Cancer is a disease that arises from accumulated changes at the genome level. In principle, cancer-associated genomic changes can make some quality metrics appear poorer, even when all steps of the RNA-seq process worked correctly. Quality control is essential for the RNA-seq analysis process. The metrics used in this project provide good insight into the quality of the evaluated datasets. However, cancer cells feature a distinct genomic landscape which can make the interpretation of metrics difficult. Thus, care has to be taken when drawing conclusions about the quality of RNA-seq data from cancer-derived samples.

Advisor: Lao Saal
Master's Degree Project, 30 credits in Bioinformatics, 2013
Department of Oncology, Lund University
