Representing Transcription Factor Dimers Using Forked-Position Weight Matrices

Aida Ghayour
M.Sc. Thesis Proposal

Co-Supervisors: Drs. Hamid Usefi and Touati Benoukraf

Department of Computer Science
Thursday, May 30, 2019, 11:00 a.m., Room EN 2022


Abstract

Transcription factors (TFs) are proteins that control gene expression. They bind to particular DNA sequences, the TF binding sites (TFBSs), and are thereby attracted to particular locations in the genome, where they can recruit transcriptional co-factors and/or chromatin regulators to influence spatiotemporal gene regulation. Thus, identifying TFBSs in genomic sequences and representing them quantitatively is of high importance for understanding and predicting gene expression. Currently, the most common method for identifying TF binding sites is ChIP-seq, which applies chromatin immunoprecipitation (ChIP) to extract all DNA fragments that interact with a protein of interest, followed by high-throughput DNA sequencing (seq). With the help of existing motif discovery applications, we can then characterize the DNA motif targeted by a specific TF. These binding sites are modeled using Position Weight Matrices (PWMs), commonly used 4 × n matrices that represent the consensus binding sequence by recording the frequency of each nucleotide at each position of the sequence. These matrices are powerful tools for predicting TF binding sites in gene regulation studies. Although the present method of modeling TFBSs is effective, it has some drawbacks. Most transcription factors do not work alone. Because multiple TFs contribute to binding, a single sequence matrix or logo is not sufficiently expressive, as it is not designed to illustrate the impact of other TFs engaged in the transcription process. To tackle this issue, we propose a model consisting of multiple PWMs (or sequence logos), obtained by segregating a single motif, that better represents the binding site of a TF of interest in the presence of other factors.
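
To make the 4 × n matrix described above concrete, the following is a minimal sketch of how per-position nucleotide frequencies could be computed from a set of aligned binding sites. It is an illustration only, not the method proposed in this thesis: the function names `position_frequency_matrix` and `consensus` and the toy sequences in `toy_sites` are hypothetical and do not come from any real ChIP-seq dataset.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def position_frequency_matrix(sites):
    """Build a 4 x n frequency matrix from equal-length aligned sites.

    Returns a dict mapping each nucleotide to a list of n frequencies.
    Assumes upper-case input; symbols other than A, C, G, T are not
    counted, so a column containing them will sum to less than 1.
    """
    if not sites:
        raise ValueError("at least one site is required")
    n = len(sites[0])
    if any(len(site) != n for site in sites):
        raise ValueError("all sites must have the same length")

    matrix = {nt: [0.0] * n for nt in NUCLEOTIDES}
    for pos in range(n):
        # Count how often each nucleotide occurs at this position.
        counts = Counter(site[pos] for site in sites)
        for nt in NUCLEOTIDES:
            matrix[nt][pos] = counts[nt] / len(sites)
    return matrix


def consensus(matrix):
    """Return the most frequent nucleotide at each position."""
    n = len(matrix["A"])
    return "".join(
        max(NUCLEOTIDES, key=lambda nt: matrix[nt][pos]) for pos in range(n)
    )


# Hypothetical aligned sites, for illustration only.
toy_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
pfm = position_frequency_matrix(toy_sites)

for nt in NUCLEOTIDES:
    print(nt, " ".join(f"{freq:.2f}" for freq in pfm[nt]))
print("consensus:", consensus(pfm))
```

In practice, frequency matrices of this kind are often converted to log-odds scores against a background nucleotide distribution before being used to scan sequences; that step is omitted here to keep the sketch short.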