DM823 String Algorithms
Fall 2016 Rolf Fagerberg
Official Course Description
See the
course
description at the web pages of the faculty.
Time and Place
The course runs over the entire semester, but is only 5 ECTS. Most weeks
we will meet only once, a few weeks we will meet twice (this includes
the first week). We have been given three slots: Mon 1012, Tue 1517
and Wed 1416, but we will mainly use the Wed 1416 slot. Actual times
will appear below at least one week ahead. The room will often be the
IMADA seminar room, but
not always.
Teaching Material
The course material will be handouts (no textbook).
Examination
The exam is oral, with grades on the 7point marking scale. A
programming project must be passed in order to attend the exam.
The exam date is January 26.
At the oral exam, you will draw an exam topic
delineating a part of curriculum which you are to present in the first
half of the examination. More details of the exam form are described at
the bottom of the list of exam topics.
Lectures
Date 
Time 
Room 
Contents 
Reading 
Monday, September 5 
1012 
Imada Computer Lab (section at end, past folding wall, which used to
be U49D) 
Introduction to course. String
terminology. Borders. Naive algorithm for exact pattern
matching, and some heuristics for it. 
Course introduction slides. The teacher's notes
on basic terminology,
simple solution to exact pattern
matching problem.

Wednesday, September 7 
1416 
Imada seminar room 
The KnuthMorrisPratt (KMP) linear time algorithm for exact pattern
matching.

The teacher's notes on the KMP algorithm.
The teacher's notes on finding the border table
used in the KMP algorithm.

Monday, September 19 
1012 
U61 
Periods. Primitivity, cyclic shifts. The BoyerMoore (BM)
algorithm. Proof of linearity (for finding first match) of BM
algorithm.

The teacher's notes on the BM algorithm.
The teacher's notes on a property of
primitive strings. The proof of linearity (for finding first match) is
removed from curriculum (as I have not found the time to write notes for
it).

Wednesday, September 21 
1416 
Imada seminar room 
The Galil variant of the BM algoritm with linear time for finding all
occurrences. First part of finding the shift table for the BM algorithm:
finding the prefix tabel.

The teacher's notes on the Galil variant of the BM
algorithm.
The teacher's notes on the prefix table.

Wednesday, September 28 
1416 
Imada seminar room 
End of finding the shift table for the BM algorithm. Searching in a
sorted set of strings.

The teacher's notes on finding the BM shift
table.
Start of the teacher's notes on
searching a sorted set of strings.

Wednesday, October 5 
1416 
Imada seminar room 
Handout of the programming project. Associated
file with data sets. More on searching in a
sorted set of strings.

End of the teacher's notes on
searching a sorted set of strings.

Wednesday, October 12 
1416 
Imada seminar room 
Suffix trees and suffix arrays. O(n log n) time construction of suffix
arrays by the doubling algorithm.

The teacher's notes on suffix arrays and
suffix trees. The doubling algorithm is described in
Scalable Construction of Text
Indexes, T. Bingmann, S. Gog, F. Kurpicz, arXiv:1610.03007, pages 46.

Monday, October 24 
1012 
U12 
Linear time construction of suffix arrays.

Linear work suffix
array construction, J. Kärkkäinen, P. Sanders, S. Burkhardt G, Journal of
the ACM, vol 53 (6), 918 936, 2005 (pdflink available from an SDU
IPaddress), sections 1.0, 1.1, 2, and 3.

Wednesday, November 9 
1416 
Imada seminar room 
Creation of the lcparray from the suffix array in linear time.

Section 3 (two pages) in
LinearTime
LongestCommonPrefix Computation in Suffix Arrays and Its
Applications, T. Kasai, G. Lee, H. Arimura, S. Arikawa and K. Park, in
Proceedings of Combinatorial Pattern Matching 2001, Lecture Notes in
Computer Science, vol 2089, 181192, Springer 2001 (pdflink available
from an SDU IPaddress. Note a few errors in the paper: In the
statements of Fact 2, Fact 3, Lemma 3, and Theorem 1, ">1" should be
"=>1", and in the code in Fig. 3, the last IFstatement should be
executed at each iteration of the FORloop (so it should be moved out of
the first IFstatement)).

Monday, November 14 
1012 
U31 
Suffix + LCP arrays vs. suffix trees. Uses of suffix trees.

Pages 566576 of the book Data Structures and Algorithms in
Java (1st ed.) by M.T. Goodrich and R. Tamassia (John Wiley & Sons,
1998, ISBN 0471193089), given as handout.
The teacher's notes on suffix arrays and
suffix trees.

Monday, November 28 
1012 
U146 
String compression: LZ78, LZ77, the
BurrowsWheeler transform, the movetofront heuristic.

The description of the
BurrowsWheeler transform and the movetofront heuristic on this
project
description from a Princeton course. You may also look at Sections 13
in the
original
paper by Burrows and Wheeler. Note that the encoding essentially can be
found by constructing the suffix array on the string (with '$'
appended). The information on the LZ78 method in Section 14.1.6 (pages
576578) in the book Data Structures and Algorithms in Java (1st
ed.) by M.T. Goodrich and R. Tamassia (John Wiley & Sons, 1998, ISBN
0471193089), given as handout. The LZ77 method differs from LZ78 in that
the substrings pointed
to by the generated codewords may be any substring appearing in the
part of the string scanned so far. To find such maximal substrings equal
to a prefix of the remaining part of the string, one can use a suffix
tree on the scanned part of the string. This needs to be
updated/extended after scanning the next part of the string, which can
be achieved by a linear time construction algorithm for suffix trees
by Ukkonen (not covered in course).

Wednesday, November 30 
1416 
Imada seminar room 
Sorting and searching strings.

The teacher's notes on the RadixQuicksort
algorithm. You may also look at the original paper: Fast Algorithms for
Sorting and Searching Strings, J. Bentley and R. Sedgewick,
Proceedings of the 8th annual ACMSIAM symposium on Discrete algorithms,
360369, 1997.

Monday December 5 
1012 
U10 
Dynamic programming algorithms: Hirschbergs method for longest common
subsequence in linear space.

The teacher's notes on Hirschberg's method.

Wednesday December 14 
1416 
Imada seminar room 
More on dynamic programming algorithms: HuntSzymanski algorithm for
longest common subsequence, start on edit distance.

The teacher's notes on the HuntSzymanski
algorithm, and first part of the teacher's
notes on Edit Distance.

Tuesday December 20 
1517 
Imada seminar room 
More on edit distance (global alignment). Alignments with
gaps. Discussion of oral exam.

Rest of the teacher's notes on Edit Distance.
Exam topics

Maintained by Rolf Fagerberg
(rolf@imada.sdu.dk)

