DM823 String Algorithms
Fall 2016 Rolf Fagerberg
Official Course Description
See the
course
description at the web pages of the faculty.
Time and Place
The course runs over the entire semester, but is only 5 ECTS. Most weeks
we will meet only once, a few weeks we will meet twice (this includes
the first week). We have been given three slots: Mon 10-12, Tue 15-17
and Wed 14-16, but we will mainly use the Wed 14-16 slot. Actual times
will appear below at least one week ahead. The room will often be the
IMADA seminar room, but
not always.
Teaching Material
The course material will be handouts (no textbook).
Examination
The exam is oral, with grades on the 7-point marking scale. A
programming project must be passed in order to attend the exam.
The exam date is January 26.
At the oral exam, you will draw an exam topic
delineating a part of curriculum which you are to present in the first
half of the examination. More details of the exam form are described at
the bottom of the list of exam topics.
Lectures
Date |
Time |
Room |
Contents |
Reading |
Monday, September 5 |
10-12 |
Imada Computer Lab (section at end, past folding wall, which used to
be U49D) |
Introduction to course. String
terminology. Borders. Naive algorithm for exact pattern
matching, and some heuristics for it. |
Course introduction slides. The teacher's notes
on basic terminology,
simple solution to exact pattern
matching problem.
|
Wednesday, September 7 |
14-16 |
Imada seminar room |
The Knuth-Morris-Pratt (KMP) linear time algorithm for exact pattern
matching.
|
The teacher's notes on the KMP algorithm.
The teacher's notes on finding the border table
used in the KMP algorithm.
|
Monday, September 19 |
10-12 |
U61 |
Periods. Primitivity, cyclic shifts. The Boyer-Moore (BM)
algorithm. Proof of linearity (for finding first match) of BM
algorithm.
|
The teacher's notes on the BM algorithm.
The teacher's notes on a property of
primitive strings. The proof of linearity (for finding first match) is
removed from curriculum (as I have not found the time to write notes for
it).
|
Wednesday, September 21 |
14-16 |
Imada seminar room |
The Galil variant of the BM algoritm with linear time for finding all
occurrences. First part of finding the shift table for the BM algorithm:
finding the prefix tabel.
|
The teacher's notes on the Galil variant of the BM
algorithm.
The teacher's notes on the prefix table.
|
Wednesday, September 28 |
14-16 |
Imada seminar room |
End of finding the shift table for the BM algorithm. Searching in a
sorted set of strings.
|
The teacher's notes on finding the BM shift
table.
Start of the teacher's notes on
searching a sorted set of strings.
|
Wednesday, October 5 |
14-16 |
Imada seminar room |
Handout of the programming project. Associated
file with data sets. More on searching in a
sorted set of strings.
|
End of the teacher's notes on
searching a sorted set of strings.
|
Wednesday, October 12 |
14-16 |
Imada seminar room |
Suffix trees and suffix arrays. O(n log n) time construction of suffix
arrays by the doubling algorithm.
|
The teacher's notes on suffix arrays and
suffix trees. The doubling algorithm is described in
Scalable Construction of Text
Indexes, T. Bingmann, S. Gog, F. Kurpicz, arXiv:1610.03007, pages 4-6.
|
Monday, October 24 |
10-12 |
U12 |
Linear time construction of suffix arrays.
|
Linear work suffix
array construction, J. Kärkkäinen, P. Sanders, S. Burkhardt G, Journal of
the ACM, vol 53 (6), 918 -936, 2005 (pdf-link available from an SDU
IP-address), sections 1.0, 1.1, 2, and 3.
|
Wednesday, November 9 |
14-16 |
Imada seminar room |
Creation of the lcp-array from the suffix array in linear time.
|
Section 3 (two pages) in
Linear-Time
Longest-Common-Prefix Computation in Suffix Arrays and Its
Applications, T. Kasai, G. Lee, H. Arimura, S. Arikawa and K. Park, in
Proceedings of Combinatorial Pattern Matching 2001, Lecture Notes in
Computer Science, vol 2089, 181-192, Springer 2001 (pdf-link available
from an SDU IP-address. Note a few errors in the paper: In the
statements of Fact 2, Fact 3, Lemma 3, and Theorem 1, ">1" should be
"=>1", and in the code in Fig. 3, the last IF-statement should be
executed at each iteration of the FOR-loop (so it should be moved out of
the first IF-statement)).
|
Monday, November 14 |
10-12 |
U31 |
Suffix + LCP arrays vs. suffix trees. Uses of suffix trees.
|
Pages 566-576 of the book Data Structures and Algorithms in
Java (1st ed.) by M.T. Goodrich and R. Tamassia (John Wiley & Sons,
1998, ISBN 0471193089), given as handout.
The teacher's notes on suffix arrays and
suffix trees.
|
Monday, November 28 |
10-12 |
U146 |
String compression: LZ78, LZ77, the
Burrows-Wheeler transform, the move-to-front heuristic.
|
The description of the
Burrows-Wheeler transform and the move-to-front heuristic on this
project
description from a Princeton course. You may also look at Sections 1-3
in the
original
paper by Burrows and Wheeler. Note that the encoding essentially can be
found by constructing the suffix array on the string (with '$'
appended). The information on the LZ78 method in Section 14.1.6 (pages
576-578) in the book Data Structures and Algorithms in Java (1st
ed.) by M.T. Goodrich and R. Tamassia (John Wiley & Sons, 1998, ISBN
0471193089), given as handout. The LZ77 method differs from LZ78 in that
the substrings pointed
to by the generated codewords may be any substring appearing in the
part of the string scanned so far. To find such maximal substrings equal
to a prefix of the remaining part of the string, one can use a suffix
tree on the scanned part of the string. This needs to be
updated/extended after scanning the next part of the string, which can
be achieved by a linear time construction algorithm for suffix trees
by Ukkonen (not covered in course).
|
Wednesday, November 30 |
14-16 |
Imada seminar room |
Sorting and searching strings.
|
The teacher's notes on the RadixQuicksort
algorithm. You may also look at the original paper: Fast Algorithms for
Sorting and Searching Strings, J. Bentley and R. Sedgewick,
Proceedings of the 8th annual ACM-SIAM symposium on Discrete algorithms,
360-369, 1997.
|
Monday December 5 |
10-12 |
U10 |
Dynamic programming algorithms: Hirschbergs method for longest common
subsequence in linear space.
|
The teacher's notes on Hirschberg's method.
|
Wednesday December 14 |
14-16 |
Imada seminar room |
More on dynamic programming algorithms: Hunt-Szymanski algorithm for
longest common subsequence, start on edit distance.
|
The teacher's notes on the Hunt-Szymanski
algorithm, and first part of the teacher's
notes on Edit Distance.
|
Tuesday December 20 |
15-17 |
Imada seminar room |
More on edit distance (global alignment). Alignments with
gaps. Discussion of oral exam.
|
Rest of the teacher's notes on Edit Distance.
Exam topics
|
Maintained by Rolf Fagerberg
(rolf@imada.sdu.dk)
|
|