Fall 2019 / DM847
Introduction to Bioinformatics
General Information
Course introduction
The purpose of this course is to give an understanding of computational problems in modern biomedical research. We will start with concrete medical questions, develop a formal problem description, set up an algorithmic or statistical model, solve it, and then derive real-world answers from the solved model. The course aims to give a basic understanding of which problems arise in modern molecular biology and clinical research, and how these problems can be solved with appropriate computational tools. The class requires regular attendance. Admission to the exam requires preparation of the exercise sheets as well as the course project.
Expected learning outcome
- Explain and understand the central dogma of molecular biology, central aspects of gene regulation, the basic principle of epigenetic DNA modifications, and specialties w.r.t. bacteria & phage genetics
- Model ontologies for biomedical data dependencies
- Design systems biology databases
- Explain and implement DNA & amino acid sequence analysis methods (HMMs, scoring matrices, and efficient statistics with them on data structures like suffix arrays)
- Explain and implement statistical learning methods on biological networks (network enrichment)
- Explain the specialties of bacterial genetics (the operon prediction trick)
- Explain and implement methods for suffix trees, suffix arrays, and the Burrows-Wheeler transformation
- Explain de novo sequence pattern screening with the EM algorithm and entropy models
- Explain and implement basic methods for supervised and unsupervised data mining, as well as their application to biomedical OMICS data sets
Topics Covered
The following main topics are contained in the course:
- Central dogma of molecular genetics, epigenetics, and bacterial and phage genetics
- Design of online databases for molecular biology content (ontologies, and example databases: NCBI, CoryneRegNet, ONDEX)
- DNA and amino acid sequence pattern models (HMMs, scoring matrices, mixed models, efficient statistics with them on big data sets)
- Specialties in bacterial genetics (sequence models and functional models for operon prediction)
- De novo identification of transcription factor binding motifs (recursive expectation maximization, entropy-based models)
- Analysis of next-generation DNA sequencing data sets (memory-aware mapping of short sequence reads with the Burrows-Wheeler transformation and suffix arrays, bi-modal peak calling)
- Visualization of biological networks (graph layout: small but highly variable graphs vs. huge but rather static graphs)
- Systems biology and statistics on networks (network enrichment with CUSP, jActiveModules and KeyPathwayMiner)
- Basic supervised and unsupervised classification methods for OMICS data analysis
Requirements
During the course, students have to complete exercise sheets and participate in one large project at the end of the semester. The project is evaluated as pass/fail and must be passed in order to be eligible for the oral exam at the end of the semester.
Lectures
# | Date | Content | Slides | Readings |
1 | Tue, 10.09.2019 | Introduction to Molecular Biology 1 | here | |
2 | Wed, 11.09.2019 | Introduction to Molecular Biology 2 | | |
3 | Tue, 24.09.2019 | Statistics 1 | here | |
4 | Thu, 26.09.2019 | Statistics 2 | here | |
5 | Tue, 01.10.2019 | Exact String Matching | here | |
6 | Thu, 03.10.2019 | Sequence Alignment | here | |
7 | Tue, 08.10.2019 | Short Sequence Mapping (BWT) | here | |
8 | Thu, 10.10.2019 | TFBS | here (updated 22.10) | |
9 | Tue, 22.10.2019 | TFBS / MEME | here | |
10 | Thu, 24.10.2019 | MEME | (continuation) | |
11 | Tue, 29.10.2019 | Hidden Markov Models | here | |
12 | Thu, 31.10.2019 | Canceled | | |
13 | Tue, 05.11.2019 | Statistics 3 | here | |
14 | Thu, 07.11.2019 | Evolution & GWAS | here | |
15 | Tue, 12.11.2019 | Evolution & GWAS | | |
16 | Thu, 14.11.2019 | Project | | |
17 | Tue, 19.11.2019 | Machine Learning 1 | here (updated) | |
18 | Thu, 21.11.2019 | Machine Learning 2 | (slides above) | |
19 | Tue, 26.11.2019 | Networks & Systems Medicine | here | |
20 | Thu, 28.11.2019 | Network Enrichment | (slides above) | |
Exercises
The exercises do not have to be handed in; nevertheless, we expect you to participate actively during the exercise session. If you want specific feedback on your solution, send it to the TA before the session.
# | TA Session | Topic | Download | Solution |
1 | Fri, 04.10.2019 | Statistics | Exercise | |
2 | Fri, 11.10.2019 | Alignment | Exercise | |
3 | Wed, 24.10.2019 | BWT & Suffix Trees | Exercise | |
4 | Fri, 01.11.2019 | MEME | Exercise, Sequences | Solution |
5 | Fri, 08.11.2019 | HMM | Exercise | |
6 | Fri, 15.11.2019 | Survival Analysis | Exercise | |
7 | Fri, 22.11.2019 | Population Genetics | Exercise | |
8 | Fri, 29.11.2019 | Q&A for Project | | |
9 | Tue, 03.12.2019 | Recap Session | | |
10 | Wed, 04.12.2019 | Q&A for Project | | |
Assignment
General Notes
Here you will find all necessary information for the mandatory assignment for the course. Please note that passing this assignment is necessary in order to be eligible to take the oral exam. Grading will be pass/fail with an internal censor.
There will be no extensions to the deadline!
Deadline Final Hand-In: December 20th.
You are allowed and encouraged to work in teams of 5 students. When submitting your reports and code, make sure that the names of all team members are included.
Materials
Test Your Result
Once you have implemented the software and trained a classifier, you can test its accuracy on the classification of File7. Please use the following web service to check your results:
Make sure your file has the following format:
raw_file_name TAB class_label
raw_file_name is the filename of the raw-data file, while class_label refers to either halls or citrus.
An example file can be found here: example
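As a rough illustration of the format described above, the following Python sketch writes one `raw_file_name TAB class_label` line per prediction. The function name `write_predictions` and the file names in the example are hypothetical and are not part of the course material; only the tab-separated layout and the two class labels (halls, citrus) come from the description above.

```python
import csv
import io

# The two class labels named in the assignment description.
ALLOWED_LABELS = {"halls", "citrus"}


def write_predictions(predictions, handle):
    """Write (raw_file_name, class_label) pairs as tab-separated lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for raw_file_name, class_label in predictions:
        if class_label not in ALLOWED_LABELS:
            raise ValueError(f"unknown class label: {class_label!r}")
        writer.writerow([raw_file_name, class_label])


# Hypothetical file names, for illustration only.
example = [("sample_01.csv", "halls"), ("sample_02.csv", "citrus")]
buffer = io.StringIO()
write_predictions(example, buffer)
submission_text = buffer.getvalue()
```

To produce a file instead of an in-memory string, the same function can be given a handle opened with `open(path, "w", newline="")`. Checking the labels before writing is a design choice of this sketch, meant to catch typos before the file is submitted.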
Additional Links
Procedure of the oral exam
Examination Dates & Times
- Jan 13th: MM552
- Jan 14th: DM847
- Jan 15th: DM847
Further, I have posted lists with estimated time slots on my door. If you prefer to be examined at a certain time, please put your name next to the corresponding time slot. We will try our best to accommodate your wishes, but the times may be subject to change.
Procedure
The exam will last between 15 and 20 minutes. At the beginning, one topic from the list below will be drawn randomly. For each topic, the examinee should be prepared to give a short presentation of at most 5 minutes. It is allowed to bring one page of hand-written notes (DIN A4 or US Letter) for each of the topics. The examinee will have 2 minutes to briefly study the notes for the drawn topic before the presentation. The notes may be consulted during the exam if needed, but doing so will negatively influence the evaluation of the examinee's performance. During the presentation, only the blackboard can be used (you cannot use overhead transparencies or PowerPoint slides, for instance).
After the short presentation, additional questions about the presentation's topic, but also about other topics in the curriculum, will be asked.
Below is the list of possible topics and some suggested content. The listed content is only a suggestion; it is not necessarily complete, nor must everything be covered in the short presentation. It is the responsibility of the examinee to gather and select among all relevant information for each topic from the course material. Parts of the lecture not mentioned below might still be part of the oral exam but are not deemed suitable for an oral presentation.
Topics for the short presentation
- Exact String Matching
- Sequence Alignment
  - Scoring Schemes
  - Needleman-Wunsch
  - Smith-Waterman
  - BLAST
  - ...
- Short Read Mapping
  - ChIP-Seq
  - Burrows Wheeler Transformation
  - ...
- Transcription Factor Binding Sites
  - PWMs
  - Suffix Trees
  - Enhanced Suffix Arrays
  - ...
- Sequence Logos and Motif Discovery
  - Sequence Logos / Information Content
  - Expectation Maximization
  - MEME
  - ...
- Hidden Markov Models
  - Markov Chain
  - Classic HMM Problems
  - Modeling of Proteins
  - ...
- Statistics and GWAS
  - Evolutionary Models
  - GWAS Analysis
  - Multiple Testing
  - ...
- Data Mining
  - k-means
  - Decision Trees
  - Random Forest
  - Performance Evaluation
  - ...
- Network Enrichment
  - Active Modules
  - CUPS
  - KPM
  - ...
Materials
All lecture slides are relevant for the exams.
Books:
- Gusfield, Dan. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, 1997.
- Durbin, Richard, et al. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.