CSE 555/655 - Biological and Linguistic Sequence Analysis

Instructor: Brian Roark

Class time: Tuesday/Thursday 4:00-5:30pm    Jan. 4 - Mar. 17, 2011

Class location: Wilson Clark Center 403, scheduled to be videoconf'd to BICC 131B on the main campus

Office hours: Tu 10-12, Central Building 115, or by appointment

Required textbooks:

Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology.
Richard Durbin, et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.



The goal of this course is to give a broad but detailed introduction to the key algorithms and modeling techniques used for sequence processing in both biological and linguistic applications, with an emphasis on exact and approximate sequence matching problems.


There is no official programming language for this course, but some programming will be required to complete assignments, so facility with some programming language (or willingness to acquire such facility) is assumed.


10% of your grade will depend on in-class discussion and participation, 15% on two in-class presentations (for HW1 and the final project), 40% on the homeworks, 15% on the midterm, and 20% on the final project. Note that late homeworks won't get much credit.

What we'll cover and an approximate schedule

Date     Topic     Reading     Assignment     FAQs
Jan.4 Introduction to biological and linguistic strings/sequences; formal representation; overview of main problems Gusfield Ch.10
Durbin Ch.1
Jan.6 Introduction to string edit distance, dynamic programming and approximate alignment; motivation for efficient exact match Gusfield Ch.11
Durbin Ch.2
Jan.11 Deterministic exact string matching (a): simple approaches; intro to Knuth-Morris-Pratt and Boyer-Moore algorithms
Jan.13 Deterministic exact string matching (b): Knuth-Morris-Pratt and Boyer-Moore algorithms Gusfield
Jan.18 Deterministic exact string matching (c): Aho-Corasick algorithm for sets of patterns; regular expression patterns Gusfield
Jan.20 No class     
Jan.25 Suffix trees: introduction and linear-time construction algorithms; some applications (exact string matching); suffix automata and suffix arrays Gusfield Ch.5-6   
Jan.27 Student presentations on HW1 variations   HW2 
Feb.1 Efficient approximate matching: linear space, bounded approximate matching and exclusion methods. Brief introduction to HMM alignment models Gusfield Ch.12
Durbin Ch.3
Feb.3 Hidden Markov models for tagging, bracketing, segmentation and pairwise alignment; dynamic programming; finite-state transducers Durbin Ch.4   
Feb.8 HMM parameter re-estimation; forward-backward for Expectation Maximization; Learning HMM alignment models; pronunciation alignments Ristad and Yianilos (1998)   
Feb.10 In-class midterm   HW 3
Feb.15 Discriminative Modeling for Gene Prediction Bernal et al. (2007)   
Feb.17 Introduction to multiple sequence alignments; families; profile HMMs; Perceptron Algorithm for profiles Durbin Ch.5    
Feb.22 Aligning multiple sequences; Minimum sum-of-pairs alignment; higher dimensional dynamic programming; iterative pairwise alignment Durbin Ch.6
Gusfield Ch.14
Feb.24 Introduction to phylogenetic tree building; ultrametric and additive distance trees; distance-based tree construction; parsimony Durbin Ch.7
Gusfield Ch.17
HW 4
Final proj
Mar.1 Probabilistic models of phylogeny Durbin Ch.8   
Mar.3 Context free modeling for Protein and RNA secondary structure; Context free inference; RNA structure prediction Durbin Ch.9
Searls (2002)
Mar.8 RNA structure prediction (cont.); Protein folding; Mildly context-sensitive models Durbin Ch.10
Hockenmaier et al. (2006)
Mar.10 Guest lecture (Chris Whelan) on topics in sequencing      
Mar.15 In-class final presentations
Mar.17 In-class final presentations


Bernal et al. (2007)   Bernal A, Crammer K, Hatzigeorgiou A, Pereira F (2007) Global Discriminative Learning for Higher-Accuracy Computational Gene Prediction. PLoS Comput Biol 3(3): e54. doi:10.1371/journal.pcbi.0030054
Hockenmaier et al. (2006)    Julia Hockenmaier, Aravind K. Joshi and Ken A. Dill. Routes are trees: The parsing perspective on protein folding. Proteins: Structure, Function, and Bioinformatics, 66(1):1-15, 2006.
Ristad and Yianilos (1998)    Eric Sven Ristad and Peter N. Yianilos. Learning String Edit Distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):522-532, 1998.
Searls (2002)   David B. Searls. The language of genes. Nature, 420:211-217, 2002.