DM813 Mandatory Assignment 1: Dynamic Programming and Sequence Alignment
Biological Background
Homology is very informative for understanding the biological role of proteins, DNA sequences, and RNA sequences. If an evolutionary relationship can be inferred between a sequence of interest and another well characterized sequence, a functional relationship often follows. For example, if a new protein is found to contain a portion of its sequence that is homologous to a domain of known function, it is likely the protein may possess the catalytic properties associated with the domain. Furthermore, if the two sequences can be aligned, functional relations can be extended to specific amino acids or nucleotides. Dynamic programming is one way to compare related biological sequences, aligning them in an optimal way. For this project, you will write a program that performs local and global alignment of two sequences.
Input Data
Your program will read two arguments from the command line, which are the names of the two input files. The first input file is the sequence data; the second input file is the scoring matrix. These files are further described below.
Sequence data
Sample input files for this project will be provided. The actual input files used with your program will be different than these files. All input files will have the same FASTA format. The FASTA format is a common, simple format for biological sequence data. Each sequence record in the file contains a title line, which always begins with the character > , followed by multiple lines of sequence. Here is an example file with two short sequences:
>HBD_HUMAN Hemoglobin subunit delta MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPK VKAHGKKVLGAFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFG KEFTPQMQAAYQKVVAGVANALAHKYH >NGB_HUMAN Neuroglobin MERPEPELIRQSWRAVSRSPLEHGTVLFARLFALEPDLLPLFQYNCRQFSSPEDCLSSPE FLDHIRKVMLVIDAAVTNVEDLSSLEEYLASLGRKHRAVGVKLSSFSTVGESLLYMLEKC LGPAFTPATRAAWSQLYGAVVQAMSRGWDGE
Your program will always read in two sequences to align. As such, all input files will have exactly two sequences. Your program should read in two sequences of any length, and allow the sequence lines of each record to be any length.
Scoring matrix
Your program will read in a scoring matrix (whose name is passed as the second argument to your program). The scoring matrix has several lines of comments, followed by a header line of space-separated amino acids, followed by several lines of the substituted amino acids. A snippet of this file should make things clear:
... # Entropy = 0.6979, Expected = -0.5209 A R N D C Q E G H I L K M F P S T W Y V B Z X * A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4 ...
For example, in this scoring matrix, a A-A match receives a score of +4, while a A-R mismatch receives a score of -1.
The following files are provided:
|
What your program must do
Your program has two command line arguments. They are:
1. Name of sequence file.
2. Name of score matrix file.
Your program must read in two sequences from the first file and a
scoring matrix from the second file, perform dynamic programming to
find the optimal local and global alignment of the two sequences, and
write the optimal local and global alignment to standard output, along
with the alignment score. Pay attention to the output format section
described below. Details of the global alignment, local alignment, and
affine gap penalties are described below.
Global Alignment
Recall that global alignment finds the optimal path in the dynamic programming matrix that traverses from one corner of the dynamic programming matrix to the opposite corner of the matrix. Review for example Jones and Pevzner Chapter 6.6 for details about global alignment.
Local Alignment
Recall that local alignment attempts to find the best aligned subsequence between our two sequences. To achieve this, you will have to consider a fourth move at each step in the dynamic programming matrix. This "start" move represents starting the local alignment at that point in the matrix with a score of zero. Review Jones and Pevzner Chapter 6.8 for details about local alignment.
Gap Penalties
Up to this point, we have only considered linear gap penalties that sum with increasing numbers of gaps. That is, for a gap of length 5, the penalty for that gap is 5d for a gap penalty of d. A more realistic way to model biological insertions and deletions is to use affine gap penalties. Affine gap penalties charge a different penalty for beginning a gap than they do for extending a gap. The gap-open penalty b is usually much more than the gap-extend penalty d. Our gap of length 5 would now be assessed a penalty of (b+5d). With small d, long gaps are not penalized disproportionately when compared to short gaps. Your program must use these affine gap penalties when performing global and local alignment. Review Jones and Pevzner Chapter 6.9 . You are required to use the approach given in Chapter 6.9 for the implementation of handling affine gap penalties.
The gap penalties your program will use are: -10 for gap open (begin) and -1 for gap extension.
Output format
Your program must write to standard output. It will write a header
for global alignment: (#GLOBAL SCORE=xxx
) including the score, the
global alignment, a header for local alignment (#LOCAL SCORE=xxx
)
including the score, and the local alignment. Each of the alignments
must be of the format:
SequenceTitle XXXXXX---XXXX-XX... SequenceTitle X--XXXXXXX-XXXXX...
where each line consists of the sequence title followed by a tab, followed by the aligned sequence. This can continue for as many lines as necessary to print the entire aligned sequences. Use the "-" character for a gap. Here is the local alignment section from a sample output file:
#LOCAL SCORE=770 NGB_HUMAN MERPEPELIRQSWRAVSRSPLEHGTVLFARLFALEPDLLPLFQYNCRQFSSPEDCLSSPE HBD_HUMAN LTPEEKTAVNALWGKVNVDAV--GGEALGRLLVVYPWTQRFFE-SFGDLSSPDAVMGNPK NGB_HUMAN FLDHIRKVMLVIDAAVTNVEDLSSLEEYLASLGRKHRAVGVKLSSFSTVGESLLYMLEKC HBD_HUMAN VKAHGKKVLGAFSDGLAHLDNLKGTFSQLSEL--HCDKLHVDPENFRLLGNVLVCVLARN NGB_HUMAN LGPAFTPATRAAWSQLYGAVVQAMSRGWD HBD_HUMAN FGKEFTPQMQAAYQKVVAGVANALAHKYH
Report / Submission
This assignment should be done alone. The submission of the source code has to be supplemented by a write-up (max. 10 pages). The report has to be written in English. The write-up should contain in any case at least:
- A title page.
- An introduction to the problem.
- A description and discussion of design decisions you made (including programming language).
- A discussion on the asymptotic runtime and asymptotic memory requirements of your implementation (let m and n be the length of the two input sequences, use big-O notation using n and m for this discussion).
- A discussion on the test that you performed.
- A discussion on how to achieve o(m×n) memory requirements (note the little-o notation) to compute the optimal score. (Note that reporting the optimal alignment is a more complicated problem than reporting the optimal score only; feel free to discuss also how to report an optimal alignment with optimal space requirements and non-optimal and/or an optimal runtime). W.r.t. the memory requirements, please comment on the following: Does your implementation achieve an asymptotically minimal memory requirement? What would the advantages and disadvantages of such an implementation be?
- A conclusion.
- References used.
Create the following directory structure on one of IMADAs pool machines
mandatory1/ mandatory1/report/ mandatory1/source/ mandatory1/results/
In the source
directory there should be the following files:
- Your (well commented) source code. You are free to choose the programming language.
- A file named
make
which compiles your program, if your program requires compilation. - Running
make
should produce an executable namedalign
. If your program requires no compilation (e.g. a scripting language), create a shell script namedalign
that simply runs your program. Note that your program has to be executable on the IMADAs pool machines.
The results
directory should contain a README
file, in which you briefly describe the tests you made and what files can be found in the corresponding directory.
After putting all the data and the report (use pdf file format) to the correct location, change to the directory mandatory1
and type aflever DM813
.
Additionally, you shall hand in a printout of your report and of your source code (the 10 pages limit applies only to the report and not to the source code). (Department secretaries office).
The strict deadline for submission (electronically and printouts) is Friday, November, 26th, 10:00
Useful References
- An Introduction to Bioinformatics Algorithms, Chapter 6 (The first and necessary part of this Chapter is freely available as a Sample Chapter from the books webpage).
- Understanding Bioinformatics, 2007, by Marketa Zvelebil and Jeremy O. Baum, Chapter 4, Chapter 5 (available via Blackboard).
- Bioinformatics and Functional Genomics, 2009, Second Edition, by Jonathan Pevsner, Chapter 3.
- Wikipedia entry for the FASTA format.
Frequently Asked Questions (FAQ)
- Are we allowed to work in groups?
No, this assignment has to be done individually. However for the second mandatory assignment it is planned that groups of max. 2 students are allowed.