DM813 Mandatory Assignment 1: Dynamic Programming and Sequence Alignment

Biological Background

Homology is very informative for understanding the biological role of proteins, DNA sequences, and RNA sequences. If an evolutionary relationship can be inferred between a sequence of interest and another well characterized sequence, a functional relationship often follows. For example, if a new protein is found to contain a portion of its sequence that is homologous to a domain of known function, it is likely the protein may possess the catalytic properties associated with the domain. Furthermore, if the two sequences can be aligned, functional relations can be extended to specific amino acids or nucleotides. Dynamic programming is one way to compare related biological sequences, aligning them in an optimal way. For this project, you will write a program that performs local and global alignment of two sequences.

Input Data

Your program will read two arguments from the command line, which are the names of the two input files. The first input file is the sequence data; the second input file is the scoring matrix. These files are further described below.

Sequence data

Sample input files for this project will be provided. The actual input files used with your program will be different than these files. All input files will have the same FASTA format. The FASTA format is a common, simple format for biological sequence data. Each sequence record in the file contains a title line, which always begins with the character > , followed by multiple lines of sequence. Here is an example file with two short sequences:

>HBD_HUMAN Hemoglobin subunit delta
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPK
VKAHGKKVLGAFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFG
KEFTPQMQAAYQKVVAGVANALAHKYH 
>NGB_HUMAN Neuroglobin
MERPEPELIRQSWRAVSRSPLEHGTVLFARLFALEPDLLPLFQYNCRQFSSPEDCLSSPE
FLDHIRKVMLVIDAAVTNVEDLSSLEEYLASLGRKHRAVGVKLSSFSTVGESLLYMLEKC
LGPAFTPATRAAWSQLYGAVVQAMSRGWDGE

Your program will always read in two sequences to align. As such, all input files will have exactly two sequences. Your program should read in two sequences of any length, and allow the sequence lines of each record to be any length.

Scoring matrix

Your program will read in a scoring matrix (whose name is passed as the second argument to your program). The scoring matrix has several lines of comments, followed by a header line of space-separated amino acids, followed by several lines of the substituted amino acids. A snippet of this file should make things clear:

... 
# Entropy = 0.6979, Expected = -0.5209 
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  * 
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4 
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4 
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4 
...

For example, in this scoring matrix, a A-A match receives a score of +4, while a A-R mismatch receives a score of -1.

The following files are provided:

Sample input file: Sample sequence input for your program. Your program will be run with many other input files, not just this one.
testcase01.fasta: Two seqs are identical. Global and local alignment should be trivial, and the same.
testcase02.fasta: identical sequences with a large internal deletion in one sequence and a small deletion in the other. Should test affine gap penalties.
testcase03.fasta: terminal deletions in one of two identical sequences
testcase04.fasta: Real world alignment of the AMACR proteins from human and M. tuberculosis.
testcase05.fasta: Real world alignment of two largely dissimilar proteins that share one small zinc finger domain. The global alignment of these two proteins should be more or less impossible, though a local alignment should reveal a region of homology.
Correct Output: This file contains concatenated output from a program producing correct alignments for the five test cases given above.
BLOSUM62: Your program will use this scoring matrix (which is BLOSUM62).
Archive: All of the above files in one tar file.

What your program must do

Your program has two command line arguments. They are:
1. Name of sequence file.
2. Name of score matrix file.
Your program must read in two sequences from the first file and a scoring matrix from the second file, perform dynamic programming to find the optimal local and global alignment of the two sequences, and write the optimal local and global alignment to standard output, along with the alignment score. Pay attention to the output format section described below. Details of the global alignment, local alignment, and affine gap penalties are described below.

Global Alignment

Recall that global alignment finds the optimal path in the dynamic programming matrix that traverses from one corner of the dynamic programming matrix to the opposite corner of the matrix. Review for example Jones and Pevzner Chapter 6.6 for details about global alignment.

Local Alignment

Recall that local alignment attempts to find the best aligned subsequence between our two sequences. To achieve this, you will have to consider a fourth move at each step in the dynamic programming matrix. This "start" move represents starting the local alignment at that point in the matrix with a score of zero. Review Jones and Pevzner Chapter 6.8 for details about local alignment.

Gap Penalties

Up to this point, we have only considered linear gap penalties that sum with increasing numbers of gaps. That is, for a gap of length 5, the penalty for that gap is 5d for a gap penalty of d. A more realistic way to model biological insertions and deletions is to use affine gap penalties. Affine gap penalties charge a different penalty for beginning a gap than they do for extending a gap. The gap-open penalty b is usually much more than the gap-extend penalty d. Our gap of length 5 would now be assessed a penalty of (b+5d). With small d, long gaps are not penalized disproportionately when compared to short gaps. Your program must use these affine gap penalties when performing global and local alignment. Review Jones and Pevzner Chapter 6.9 . You are required to use the approach given in Chapter 6.9 for the implementation of handling affine gap penalties.

The gap penalties your program will use are: -10 for gap open (begin) and -1 for gap extension.

Output format

Your program must write to standard output. It will write a header for global alignment: (#GLOBAL SCORE=xxx) including the score, the global alignment, a header for local alignment (#LOCAL SCORE=xxx) including the score, and the local alignment. Each of the alignments must be of the format:

SequenceTitle XXXXXX---XXXX-XX...
SequenceTitle X--XXXXXXX-XXXXX...

where each line consists of the sequence title followed by a tab, followed by the aligned sequence. This can continue for as many lines as necessary to print the entire aligned sequences. Use the "-" character for a gap. Here is the local alignment section from a sample output file:

#LOCAL SCORE=770 
NGB_HUMAN    MERPEPELIRQSWRAVSRSPLEHGTVLFARLFALEPDLLPLFQYNCRQFSSPEDCLSSPE 
HBD_HUMAN    LTPEEKTAVNALWGKVNVDAV--GGEALGRLLVVYPWTQRFFE-SFGDLSSPDAVMGNPK 

NGB_HUMAN    FLDHIRKVMLVIDAAVTNVEDLSSLEEYLASLGRKHRAVGVKLSSFSTVGESLLYMLEKC  
HBD_HUMAN    VKAHGKKVLGAFSDGLAHLDNLKGTFSQLSEL--HCDKLHVDPENFRLLGNVLVCVLARN

NGB_HUMAN    LGPAFTPATRAAWSQLYGAVVQAMSRGWD 
HBD_HUMAN    FGKEFTPQMQAAYQKVVAGVANALAHKYH

Report / Submission

This assignment should be done alone. The submission of the source code has to be supplemented by a write-up (max. 10 pages). The report has to be written in English. The write-up should contain in any case at least:

A title page.
An introduction to the problem.
A description and discussion of design decisions you made (including programming language).
A discussion on the asymptotic runtime and asymptotic memory requirements of your implementation (let m and n be the length of the two input sequences, use big-O notation using n and m for this discussion).
A discussion on the test that you performed.
A discussion on how to achieve o(m×n) memory requirements (note the little-o notation) to compute the optimal score. (Note that reporting the optimal alignment is a more complicated problem than reporting the optimal score only; feel free to discuss also how to report an optimal alignment with optimal space requirements and non-optimal and/or an optimal runtime). W.r.t. the memory requirements, please comment on the following: Does your implementation achieve an asymptotically minimal memory requirement? What would the advantages and disadvantages of such an implementation be?
A conclusion.
References used.

Create the following directory structure on one of IMADAs pool machines

mandatory1/
mandatory1/report/
mandatory1/source/
mandatory1/results/

In the source directory there should be the following files:

Your (well commented) source code. You are free to choose the programming language.
A file named make which compiles your program, if your program requires compilation.
Running make should produce an executable named align. If your program requires no compilation (e.g. a scripting language), create a shell script named align that simply runs your program. Note that your program has to be executable on the IMADAs pool machines.

The results directory should contain a README file, in which you briefly describe the tests you made and what files can be found in the corresponding directory.

After putting all the data and the report (use pdf file format) to the correct location, change to the directory mandatory1 and type aflever DM813.

Additionally, you shall hand in a printout of your report and of your source code (the 10 pages limit applies only to the report and not to the source code). (Department secretaries office).

The strict deadline for submission (electronically and printouts) is Friday, November, 26th, 10:00

Useful References

An Introduction to Bioinformatics Algorithms, Chapter 6 (The first and necessary part of this Chapter is freely available as a Sample Chapter from the books webpage).
Understanding Bioinformatics, 2007, by Marketa Zvelebil and Jeremy O. Baum, Chapter 4, Chapter 5 (available via Blackboard).
Bioinformatics and Functional Genomics, 2009, Second Edition, by Jonathan Pevsner, Chapter 3.
Wikipedia entry for the FASTA format.

Frequently Asked Questions (FAQ)

Are we allowed to work in groups?

No, this assignment has to be done individually. However for the second mandatory assignment it is planned that groups of max. 2 students are allowed.

Design by 1234.info | Modified by Daniel Merkle | CSS 2.0

Algorithms for Biological Sequence Analysis

DM813, Autumn 2010

Daniel Merkle