CSE 554/654 - Text-Based Language Processing Systems

Instructor: Brian Roark

Class time: M/W   4:00 - 5:30 PM    Jan. 5 - Mar. 18, 2009

Class location: Wilson Clark Center - Room 403,
videoconf'd to OHSU's Marquam Hill Campus, BICC 131B

Office hours: Th 10-12, Central Building 115, or by appointment

Required textbooks:

None; readings will come from papers available online.



With a focus on bio-medical text, the goal of this course is to present the current best practices in building systems that cluster, label or transform raw text to improve information access. Such systems are often chained together within larger applications that retrieve documents, extract information, summarize, answer questions and translate to other languages. This course will provide a hands-on, project-oriented introduction to such applications.


There is no official programming language for this course, but some amount of scripting or programming will be required to complete assignments, hence facility with some programming language (or willingness to acquire such facility) is assumed.

Homework and term projects

The course will be structured around end-to-end query-directed text processing systems that perform document retrieval, query-directed summarization/question answering, and automatic translation. Simple baseline components will be in place within a baseline system, each of which can be independently improved. For homework projects, students will select particular components, try to improve performance over the baseline using various techniques, and evaluate the impact on system performance. For the term project, students will be given more leeway in selecting a topic for further investigation.


10% of your grade will depend on in-class discussion, 15% on in-class presentations, 15% each on 3 homework projects and 30% on a term project and presentation.

What we'll cover and an approximate schedule (in progress, may change)

Date     Topic     Tentative Reading     Lecture video/slides
Jan.5 Overview of class structure; introduction to the text processing "pipeline", including IR, IE, QA, summarization and MT; homework and term project options      
Jan.7 Introduction to Information Retrieval (IR) and Information Extraction (IE)
Jan.12 Introduction to Question Answering (QA) and Automatic Summarization Tutorial    
Jan.14 Introduction to Machine Translation (MT)      
Jan.19 No class, Martin Luther King, Jr. Day    
Jan.21 Statistical methods and knowledge-based methods; finite-state automata and transducers; pipelining systems HR07   
Jan.26 Raw text processing; text normalization; domain specific text processing; key issues in bio-medical text processing Norm01   
Jan.28 Topics in text normalization and IR; student HW project presentations TMTA07
Feb.2 Topics in IE; student HW project presentations GKM05   
Feb.4 Topics in QA; student HW project presentations RH02   
Feb.9 Topics in QA DFL07   
Feb.11 Topics in IE CoHer05
Feb.16 No class, Presidents Day    
Feb.18 Topics in Summarization; student HW project presentations LJHMZS07   
Feb.23 Topics in Summarization OER05
Feb.25 Topics in MT; student HW project presentations GolSan01    
Mar.2 Topics in MT Chi05    
Mar.4 Topics in MT; student HW project presentations      
Mar.9 Topics in natural language processing (NLP) for text-based applications; student HW project presentations
Mar.11 Generalizing methods for use with uncertain input (e.g., spoken language); surveying the state of the art: large research programs and system competitions; open problems; likely future directions
Mar.16,18 Term project presentations      

Chi05   David Chiang. A Hierarchical Phrase-Based Model for Statistical Machine Translation. Proceedings of the Annual Meeting of the ACL, pp. 263-270, 2005.
CoHer05   Aaron M. Cohen and William Hersh. A Survey of Current Work in Biomedical Text Mining. Briefings in Bioinformatics, 6(1):57-71, 2005.
CoHun08 K. Bretonnel Cohen and Lawrence Hunter. Getting started in text mining. PLoS Computational Biology, 4(1), 2008.
DFL07   Dina Demner-Fushman and Jimmy Lin. Answering Clinical Questions with Knowledge-Based and Statistical Techniques. Computational Linguistics, 33(1):63-103, 2007.
GolSan01   Tim Gollins and Mark Sanderson. Improving Cross Language Retrieval with Triangulated Translation. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 90-95, 2001.
GKM05   Trond Grenager, Dan Klein and Chris Manning. Unsupervised Learning of Field Segmentation Models for Information Extraction. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 371-378, 2005.
HR07   Kristy Hollingshead and Brian Roark. Pipeline Iteration. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 952-959, 2007.
LJHMZS07   Xu Ling, Jing Jiang, Xin He, Qiaozhu Mei, Chengxiang Zhai, and Bruce Schatz. Generating Gene Summaries from Biomedical Literature: A Study of Semi-Structured Summarization. Information Processing and Management, 43(6):1777-1791, 2007.
Mil05   Rada Mihalcea. Unsupervised Large-Vocabulary Word Sense Disambiguation with Graph-based Algorithms for Sequence Data Labeling. Proceedings of HLT-EMNLP, 2005.
OER05   Jahna Otterbacher, Gunes Erkan and Dragomir R. Radev. Using Random Walks for Question-focused Sentence Retrieval. Proceedings of HLT-EMNLP, 2005.
RH02   Deepak Ravichandran and Eduard Hovy. Learning Surface Text Patterns for a Question Answering System. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 41-47, 2002.
SH03   Ariel Schwartz and Marti Hearst. A simple algorithm for identifying abbreviation definitions in biomedical text. Proceedings of the 8th Pacific Symposium on Biocomputing, pp. 451-462, 2003.
Norm01   Richard Sproat, Alan Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. Normalization of non-standard words. Computer Speech and Language, 15(3):287-333, 2001.
TMTA07   Yoshimasa Tsuruoka, John McNaught, Jun'ichi Tsujii, and Sophia Ananiadou. Learning string similarity measures for gene/protein name dictionary look-up using logistic regression. Bioinformatics, 23(20):2768-2774, 2007.