DM823 String Algorithms
Fall 2013, 2nd quarter Rolf Fagerberg
Official Course Description
See the
course
description at the web pages of the faculty.
Time and Place
The course starts Monday, November 4.
Textbook
As textbook, we will use the following:
Algorithms
on Strings
By M. Crochemore, C. Hancart, and Thierry Lecroq
Published by Cambridge University Press, 2007
ISBN 978-0-521-84899-2
Examination
The exam is oral, with grades on the 7-point marking scale. The exam
date is January 23 (was January 21, but is now moved to make rational
use of external examinator time), in rooms
O95 and
O96. A
programming project must be passed in order to
attend the exam. There is a separate document
describing the experiments to be performed on the
data sets provided.
There will be a spørgetime (session for asking questions on the exam and
the curriculum) Wednesday, January 22, at 10:15 in
U49.
At the oral exam, you will draw an exam topic
delineating a part of curriculum which you are to present in the first half
of the examination. More details of the exam form are described at the
bottom of the list of exam topics.
The grades at the exam ended up with the following
distribution.
Lectures
Date |
Time |
Room |
Contents |
Reading |
Monday, November 4 |
12-14 |
Imada seminar room |
Introduction to course. String
terminology. Borders. Naive algorithm for exact pattern
matching, and some heuristics for it. The Knuth-Morris-Pratt (KMP)
linear time algorithm for exact pattern matching. |
Pages 2-4, 12 (top half), 28-32 (top half), 40 in textbook. The teachers
notes on the KMP algorithm.
|
Thursday, November 7 |
12-14 |
Imada seminar room |
Complexity of the KMP algorithm. Computing the border table used in
the KMP algorithm.
|
Pages 40 to 42 in textbook.
|
Monday, November 11 |
12-14 |
Imada seminar room |
Periods, primitivity. The Boyer-Moore (BM) algorithm. Proof of linearity
(for finding first match) of BM algorithm. Note: the Boyer-Moore
algorithm is called W-Memoryless-Suffix-Search in the textbook
(page 107).
|
Pages 10 (bottom), 11 (top), 16 (middle) in textbook. Pages 102-113
in textbook.
|
Thursday, November 14 |
12-14 |
Imada seminar room |
End of proof of linearity (for finding first match) of BM algorithm.
Proposition on primitive words and cyclic shifts.
|
Same as last time. The teachers notes on a
property of primitive strings.
|
Monday, November 18 |
12-14 |
Imada seminar room |
The Galil variant of the BM algoritm having linear time for finding all
occurrences. Note: the algorithm is called WL-Memoryless-Suffix-Search
in the textbook (page 112). Start of preprocessing for the BM algorithm:
finding the prefix table in linear time.
|
Page 112 in textbook. The teachers notes on the
Galil variant of the BM algorithm. Pages 42 (middle) to 45 (bottom) in
textbook.
|
Thursday, November 21 |
12-14 |
Imada seminar room |
Handout of the programming project. End of
preprocessing for the BM algorithm.
|
Section 3.3 (pages 113-118) in textbook.
|
Monday, November 25 |
12-14 |
Imada seminar room |
Searching in a sorted set of strings.
|
Sections 4.1-3 (pages 146-158) in textbook.
|
Thursday November 28 |
12-14 |
Imada seminar room |
Further details or searching in a sorted set of strings. Start on
suffix arrays.
|
Sections 4.3 (pages 155-158) and start of 4.4 (pages 158-159) in textbook.
|
Monday December 2 |
12-14 |
Imada seminar room |
Suffix arrays and their construction. Handout of
separate document describing the experiments to
be performed on the data sets provided.
|
Sections 4.4-5 (pages 158-169) in textbook. As an alternative (or
compliment) to Section 4.5, you can use the very readable original paper:
Linear work suffix
array construction, J. Kärkkäinen, P. Sanders, S. Burkhardt G, Journal of
the ACM, vol 53 (6), 918 -936, 2005 (pdf-link available from an SDU
IP-address) - it is enough to read Sections 1.0, 1.1, 2, and 3 (less
than four pages).
|
Thursday December 5 |
12-14 |
Imada seminar room |
End of suffix array construction in linear time. Creation of the
lcp-array from the suffix array in linear time.
|
Section 4.5 (pages 164-169) in textbook (or, as last time, the
original paper).
Section 4.6 (pages 169-174) in textbook. As an alternative/complement,
you may read Section 3 (two pages) in the original paper
Linear-Time
Longest-Common-Prefix Computation in Suffix Arrays and Its Applications,
T. Kasai, G. Lee, H. Arimura, S. Arikawa and K. Park, in Proceedings of
Combinatorial Pattern Matching 2001, Lecture Notes in Computer Science,
vol 2089, 181-192, Springer 2001 (pdf-link available from an SDU
IP-address,
note a few errors in the paper: In the statements of Fact 2, Fact 3, Lemma 3,
and Theorem 1, ">1" should be "=>1", and in the code in Fig. 3, the last
IF-statement should be executed at each iteration of the FOR-loop (so it
should be moved out of the first IF-statement)).
|
Monday December 9 |
12-14 |
Imada seminar room |
End of creation of the lcp-array from the suffix array in linear
time. Tries and suffix trees.
|
Section 4.6 (pages 169-174) in textbook (or, as last time, Section 3 in
the
original
paper). Pages 566-576 of the book Data Structures and Algorithms in
Java (1st ed.) by M.T. Goodrich and R. Tamassia (John Wiley & Sons,
1998, ISBN 0471193089), given as handout. Pages 177-186 and 219-223 in
the textbook (Answering the four queries on page 222 via a suffix tree
is easy by appropriate annotation (preprocessing via a DFS traversal of
the tree) of the nodes of the tree, e.g. by the number of leaves in the
subtree of a node, or by the minimal index in a leaf in the subtree of
the node. This remark replaces pages 224-227, which discusses how to do
it in various "automaton" versions of the suffix tree). Section 6.4
(pages 230-231) in the textbook.
|
Thursday December 12 |
12-14 |
Imada seminar room |
More uses of suffix trees. String compression: LZ78, LZ77, the
Burrows-Wheeler transform, the move-to-front heuristic.
|
Section 6.4 (pages 230-231) in the textbook. The description of the
Burrows-Wheeler transform and the move-to-front heuristic on this
project
description from a Princeton course. You may also look at Sections 1-3
in the
original
paper by Burrows and Wheeler. Note that the encoding essentially can be
found by constructing the suffix array on the string (with '$'
appended). The information on the LZ78 method in Section 14.1.6 (pages
576-578) in the book Data Structures and Algorithms in Java (1st
ed.) by M.T. Goodrich and R. Tamassia (John Wiley & Sons, 1998, ISBN
0471193089), previously given as handout. The LZ77 method was only given
on the blackboard. It differs from LZ78 in that the substrings pointed
to by the generated codewords may be any substring appearing in the
part of the string scanned so far. To find such maximal substrings equal
to a prefix of the remaining part of the string, one can use a suffix
tree on the scanned part of the string. This needs to be
updated/extended after scanning the next part of the string, which can
be achieved by a linear time construction algorithm for suffix trees
by Ukkonen (not covered in course).
|
Monday December 16 |
12-14 |
Imada seminar room |
Sorting and searching strings. Start on dynamic programming
algorithms (recap of longest common subsequence).
|
Fast Algorithms for
Sorting and Searching Strings, J. Bentley and R. Sedgewick,
Proceedings of the 8th annual ACM-SIAM symposium on Discrete algorithms,
360-369, 1997. Section 7.3 (pages 262-267) in the textbook.
|
Thursday December 19 |
12-14 |
Imada seminar room |
Handout of exam topics. More dynamic programming
algorithms: Hirschbergs method for longest common subsequence in linear
space, edit distance (alignment), alignment with gaps.
|
Rest of sections 7.1-4 (pages 243-276) in the textbook.
|
Maintained by Rolf Fagerberg
(rolf@imada.sdu.dk)
|
|