Seminar: Ensemble Learning for Detecting Gene-Gene Interactions in Colorectal Cancer
Supervisor: Dr. Ting Hu
Department of Computer Science
Wednesday, December 13, 2017, 11:00 a.m., Room EN 2022
The fundamental task of human genetics is to detect the genetic variations that primarily contribute to a disease phenotype. Genome-wide association studies (GWAS) are the most popular method for understanding the etiology of heritable human diseases such as cancer. Colorectal cancer (CRC) is a common cause of death in developed countries and has a particularly high incidence rate in the province of Newfoundland and Labrador. Identifying the genetic factors associated with CRC can therefore help us better understand the disease and treat and prevent it more effectively. This study seeks to identify genetic variations associated with CRC using machine learning, including feature selection and ensemble learning algorithms. We analyze a CRC GWAS dataset collected from the Newfoundland population. First, we perform quality control on the raw genetic data and prepare it for the machine learning methods. Second, we compare six feature selection methods by applying them to a simulated dataset and to the CRC GWAS data. The method that performs best at capturing gene-gene interactions is then used to select a subset of the most relevant features for the next step of the analysis. Subsequently, two ensemble algorithms, Random Forests and Gradient Boosting Machine, are applied to the reduced data to identify significant interacting genetic markers associated with CRC. Last, the findings from the machine learning methods are biologically validated using online databases and enrichment analysis tools. From the results of the ensemble algorithms, 44 significant SNPs are detected, 29 of which have corresponding genes. Among these, the genes DCC, ALK and ITGA1 have previously been found to be associated with CRC. In addition, the genes E2F3 and NID2 are potentially associated with CRC, given their known associations with other types of cancer. The biological interpretation of these genes reveals biological pathways that may help predict the risk of the disease and better explain its etiology.
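
The abstract does not name the six feature selection methods it compares. As a rough illustration of why a method needs to be interaction-aware, the sketch below scores SNP pairs on simulated data by information gain: the mutual information between a pair and the phenotype, minus each SNP's individual contribution. Everything here is an assumption made for illustration and is not taken from the study: the function names (`simulate`, `rank_pairs`), the epistasis model (case status depends on whether exactly one of SNP 0 and SNP 1 is heterozygous), the sample size, and the 10% label noise.

```python
import math
import random
from collections import Counter
from itertools import combinations


def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """Estimate I(X;Y) = H(X) + H(Y) - H(X,Y) from paired samples."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def simulate(n_samples=1000, n_snps=6, noise=0.1, seed=0):
    """Hypothetical simulated data: genotypes coded 0/1/2, binary phenotype.

    Each genotype is the sum of two fair-coin alleles. The phenotype is 1
    when exactly one of SNP 0 and SNP 1 is heterozygous, so neither SNP
    has a marginal effect on its own (an assumed pure-interaction model).
    """
    rng = random.Random(seed)
    genotypes = [
        [rng.randint(0, 1) + rng.randint(0, 1) for _ in range(n_snps)]
        for _ in range(n_samples)
    ]
    labels = []
    for row in genotypes:
        y = int((row[0] == 1) != (row[1] == 1))
        if rng.random() < noise:  # flip a fraction of labels as noise
            y = 1 - y
        labels.append(y)
    return genotypes, labels


def rank_pairs(genotypes, labels):
    """Return per-SNP main effects and SNP pairs sorted by interaction gain."""
    n_snps = len(genotypes[0])
    cols = [[row[j] for row in genotypes] for j in range(n_snps)]
    main = [mutual_information(col, labels) for col in cols]
    scores = []
    for i, j in combinations(range(n_snps), 2):
        joint = mutual_information(list(zip(cols[i], cols[j])), labels)
        # Gain beyond what the two SNPs explain individually.
        scores.append(((i, j), joint - main[i] - main[j]))
    scores.sort(key=lambda item: item[1], reverse=True)
    return main, scores


if __name__ == "__main__":
    genotypes, labels = simulate()
    main, ranked = rank_pairs(genotypes, labels)
    print("single-SNP information:", [round(m, 4) for m in main])
    for pair, gain in ranked[:3]:
        print("pair", pair, "interaction gain", round(gain, 4))
```

In this simulated setting, a filter that scores SNPs one at a time would see almost no signal for SNP 0 or SNP 1, while the pairwise score ranks the pair first. Tree-based ensembles such as Random Forests and Gradient Boosting can pick up this kind of joint effect through successive splits, which is one plausible reason to pair them with an interaction-aware selection step.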