COMP 4550: Bioinformatics: Biological Data Analysis

This course is an elective for the Data-centric Computing Stream.

The course is designed as an advanced interdisciplinary bioinformatics course for both Computer Science and Biology students, and as a bridge between the two disciplines.

This advanced course gives students the basis to perform their own analyses of high-throughput data using R and Bioconductor. Students who succeed in this course should be comfortable programming in R and able to use available Bioconductor packages to analyse a variety of biological data, such as expression data, high-throughput cell-based assay data and mass spectrometry protein data. They should also be able to apply a variety of approaches available within the R environment, including clustering, graphs, classification approaches (such as random forests and support vector machines) and enrichment analysis methods.

Lab: In addition to classes, this course has one structured laboratory session per week.

Prerequisites: Biology 3951 or COMP 3550, and Statistics 2500 or Statistics 2550, or permission of the course instructor.

Availability: This course is usually offered once per year, in Fall or Winter.

Course Objectives

This course provides students with the basis to analyse a variety of biological data within an integrated programming environment for data manipulation, calculation and graphical display. Students will learn to extract meaningful information from data generated by high-throughput experimentation. The course will introduce one such integrated programming environment and will explore the computational and statistical foundations of the most commonly used biological data analysis procedures.

In the introductory Bioinformatics course (Computer Science 3550), students will have:

  1. Understood the basis of bioinformatics methods, for example, how multiple sequence aligners actually construct alignments, what steps are involved in the analysis of gene expression, and what multiple testing correction is and how it is done;
  2. Achieved basic Perl programming skills; and
  3. Used online databases and computational tools.

In this advanced course, by contrast, although some of the same topics (such as gene expression, enrichment analysis and proteomics) are covered again, students will learn to do the analysis on their own, that is, without relying on a friendly graphical program that performs the required analysis once the user chooses the appropriate parameters and clicks some buttons.

Representative Workload
  • Assignments and Project 25%
  • Lab Work and Quizzes 20%
  • In-class Exams 30%
  • Final Exam 25%

Representative Course Outline
  • Introduction to R and Bioconductor
  • Exploratory data analysis and hypothesis testing
  • Gene Expression data analysis
  • Mass Spectrometry Protein data analysis
  • Clustering and visualization
  • Machine learning: concepts and packages
    • Feature selection
    • Cross-validation
    • Multiclass problems
    • Ensemble methods
    • Bayesian methods
  • Graphs and Networks
    • Protein interactions
    • Pathways
    • Co-expression graphs
  • Biological Annotation
  • Gene set enrichment analysis

Students will perform hands-on analysis of experimental biological data, mainly using R and Bioconductor. Additional software that may be used includes Cytoscape.

  • R programming exercises
  • Exploratory data analysis: graphics/plots generation
  • Processing expression data
  • Processing proteomics data
  • Clustering data and cluster visualization
  • Data classification using supervised machine learning
  • Using graphs for data visualization
  • Annotating data
  • Performing enrichment analysis
  • Introduction to Cytoscape

Students can receive credit for only one of Computer Science 4550 or Biology 4606.

Page last updated May 24th 2021