Seminar: Cryptographic hardness of feature selection in strict and pure synthetic genetic datasets
Majid Beheshti Mohtasham
Supervisors: Drs. Saeed Samet and Antonina Kolokolova
Cryptographic hardness of feature selection in strict and pure synthetic genetic datasets
Department of Computer Science
Tuesday, November 6, 2018, 3:30 p.m., Room EN 2022
A common task in knowledge discovery is finding a few features correlated with the outcome in a sea of mostly irrelevant data. This task is particularly formidable in genetic datasets containing thousands to millions of Single Nucleotide Polymorphisms (SNPs) for each individual; the goal here is to find a small subset of SNPs that is correlated with whether an individual has a disease. Although determining a correlation between any given SNP and the disease label is relatively straightforward, detecting subsets of SNPs such that the correlation is only apparent when the whole subset is considered seems to be much harder. In this thesis, we study the hardness of this problem, in particular for a widely used method of generating synthetic SNP datasets.
More specifically, we consider datasets generated by ”pure and strict” models, such as ones produced by the popular GAMETES software. In these datasets, there is a high correlation between a predefined target set of features (SNPs) and the label, however any subset of the target set appears uncorrelated with the outcome (marginal probabilities of the disease for any value in each strict subset of a target set, including a single SNP, are indistinguishable from the average disease probability).
Our main result is showing feature selection in such pure and strict datasets is hard under cryptographic assumptions. In this way, we give a reduction from the problem of Learning Parity with Noise (LPN) to feature selection in pure and strict datasets. LPN, which has been used as a basis of secret-key cryptosystems and other cryptographic primitives, is considered to be hard for a variety of distributions; the best-known algorithms for LPN are exponential in the number of bits of the secret, even for quantum algorithms. Thus, our results uncover a surprising connection: we show that an efficient heuristic feature selection in pure and strict datasets, could be used to break cryptosystems believed to be resistant even to quantum attacks!