SEMINAR: Utilizing RNNs and Ensemble Learning for Enhanced Bacterial sRNA Classification
Moustafa Elsisy
Honours Project
Supervisor: Dr. Lourdes Pena-Castillo
Department of Computer Science
Friday, April 5, 2019, 1:30 p.m., Room EN-2022
Abstract
Bacterial small RNAs (sRNAs) are involved in the regulation of gene expression in the majority of bacterial species. These RNA molecules are traditionally validated using laboratory-based techniques. Unfortunately, the literature reports a large number of putative sRNAs, which makes validating all of them with such methods prohibitive. While a number of models have recently been proposed to identify sRNAs computationally, they do not achieve both high precision and high recall when evaluated on real subsequences of the genome. Here we propose a machine-learning model that combines sequence-based features of RNA molecules with features of their genomic context in order to distinguish sRNAs from random genomic sequences. Using a dataset covering five bacterial species, we built a feature space from the tetranucleotide composition, the free energy of the predicted secondary structure, the distances to the closest predicted promoter and terminator sites, the distances to the closest left and right open reading frames (ORFs), and whether the sRNA is transcribed on the same strand as each ORF. The proposed model achieves an Average Precision score of 0.75 on real genomic subsequences, thus improving on state-of-the-art performance in sRNA identification. Our results indicate that our feature space is conserved across bacterial species, and that combining sequence-based features with genomic-context features yields a model that outperforms preceding models.
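As an illustration of the sequence-based part of the feature space, the sketch below computes a tetranucleotide composition vector for a single sequence. It is a minimal example and not the project's actual code: the function name `tetranucleotide_frequencies`, the use of overlapping windows, the normalisation to relative frequencies, and the decision to skip windows containing ambiguous bases are all assumptions made here for clarity.

```python
from collections import Counter
from itertools import product

# All 4^4 = 256 possible tetranucleotides over the DNA alphabet, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(seq):
    """Return the relative frequency of each tetranucleotide in seq.

    Assumptions (hypothetical, not taken from the project):
    overlapping windows are counted, RNA 'U' is treated as 'T',
    and windows containing other characters (e.g. 'N') are ignored.
    """
    seq = seq.upper().replace("U", "T")
    # Count every overlapping 4-mer; sequences shorter than 4 yield no windows.
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Only windows made purely of A, C, G, T contribute to the total.
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]


features = tetranucleotide_frequencies("ACGUACGU")
```

The resulting 256-element vector would then be concatenated with the genomic-context features (free energy, distances to promoter/terminator sites and ORFs, strand agreement) before being passed to a classifier.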