Seminar: Scalable Feature Selection Methods by Augmenting Sparse Least Squares

Hanieh Marvikhorasani
M.Sc. Candidate
Supervisor: Dr. Hamid Usefi

Department of Computer Science
Tuesday, November 19, 2019, 10:00 a.m., Room EN 2022


Feature selection is widely used to select a subset of genes (features) from microarray datasets that helps discriminate healthy samples from those with a particular disease. However, most feature selection methods suffer from high computational complexity when applied to these datasets because of the large number of genes present. Usually, only a small subset of these genes contributes to the disease, and the rest are irrelevant to the condition. This thesis proposes a sparse method (SLS), based on singular value decomposition and least squares, to filter out irrelevant features. We also consider reducing the size of datasets by clustering genes and selecting representative genes from each cluster based on two different metrics. These dataset size-reduction methods are incorporated into three state-of-the-art feature selection methods, namely mRMR, SVM-RFE, and HSIC-Lasso. These methods are applied to three Inflammatory Bowel Disease (IBD) datasets and combined with support vector machine (SVM) and random forest classifiers. Experimental results show that the proposed SLS method significantly reduces the running time of feature selection algorithms and improves the predictive power of the machine learning models. SLS is integrated into a novel feature selection method (DRPT), which, when combined with SVM, generates models that discriminate between healthy subjects and subjects with Ulcerative Colitis (UC) based on the expression values of genes in colon samples. The best models were validated on two validation datasets and achieved higher predictive performance than a model generated by a recently published biomarker discovery tool (BioDiscML).
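
For readers unfamiliar with the general idea of filtering features with least squares and the singular value decomposition, the following is a minimal illustrative sketch. It is not the SLS or DRPT method described in the abstract, whose details are not given here. The function name `rank_features_by_least_squares`, the `keep` parameter, and the synthetic data are assumptions made for illustration only. The sketch solves X w ≈ y with NumPy's SVD-based pseudoinverse and keeps the features whose coefficients have the largest magnitude. The toy data has more samples than features so that the solution is unique; microarray data typically has far more genes than samples, in which case the pseudoinverse returns the minimum-norm solution.

```python
import numpy as np


def rank_features_by_least_squares(X, y, keep):
    """Hypothetical helper: return the indices of the `keep` features with
    the largest absolute least-squares coefficient, plus the coefficients."""
    # np.linalg.pinv computes the Moore-Penrose pseudoinverse via the SVD,
    # so w is the (minimum-norm) least-squares solution of X w ~ y.
    w = np.linalg.pinv(X) @ y
    # Sort features by decreasing coefficient magnitude.
    order = np.argsort(-np.abs(w))
    return np.sort(order[:keep]), w


# Synthetic toy data (an assumption, not one of the IBD datasets):
# 60 samples, 20 "genes", of which only genes 5 and 17 affect the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
y = 3.0 * X[:, 5] - 2.0 * X[:, 17]

selected, w = rank_features_by_least_squares(X, y, keep=2)
print(selected)
```

In this noise-free toy setting the coefficients of the irrelevant features are zero up to rounding error, so a magnitude threshold separates them cleanly; real expression data would need a more careful criterion.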