DM79 Algorithms for Web Indexing and Searching
Fall 2007
Rolf Fagerberg
Official Course Description
See the
course
description at the web pages of the faculty.
Time and Place
The course runs from September 3 to December 20, except for the fall break
in Week 42 (October 1519). Hence, the first lecture takes place Monday,
September 3.
The schedule is as follows:
 Mondays 08.15 to 10.00.
 Thursdays 10.15 to 12.00.
 Fridays 10.15 to 12.00.
We will use a mix of Mondays, Thursdays, and Fridays, announced at least a
week ahead. Since there is a programming project involving a nontrivial
amount of work, we will only use approximately 1.5 double lectures per week
on average.
The Monday and Thursday lessons take place in Imadas
seminar room. The Friday
lectures in a room to be announced for individually each time.
The course will be based on handouts (no textbook).
Examination
The exam date will be January 14, and the exam form is:
Programming project with report, and oral examination. External examiner,
grades according to the 7point marking scale. The report and the oral
examination each counts 50% towards the grade.
There will be a spørgetime (session for asking questions on the exam and
the curriculum) Friday, January 11, at 13.15 in Imadas
seminar room.
At the oral exam, you will draw an exam
question delineating a part of curriculum which you are to present. Note
that some questions have overlap (but different focuses). The details of
the exam form are described at the bottom of the list of exam questions,
but if you have questions on the exam form, email me these or ask at the
spørgetime.
The curriculum is everything mentioned under the heading Reading in
the list of lectures below (respecting the exclusions and reading
directions stated).
Project
Regarding the format of the project report, see
this mail sent to all participants
earlier. Regarding deadline, I prefer (and you should too, I think) to have
it delivered in print in my box at the department office by Friday,
December 21, at noon. The very last deadline I will accept is via email,
January 2, at noon.
Lectures
 September 3. Introduction to course
(slides). Overview of search engines
(slides). Discussion of possible other
placements of the lecture hours, and of the exam form.
Reading: The
Anatomy of a LargeScale Hypertextual Web Search Engine. Sergey
Brin and Lawrence Page. Computer Networks 30 (17): 107117, 1998. Also in
Proceedings of the 7th International WWW conference, 1998.
Internet
Searching. Peter Norvig. In Computer Science: Reflections on the
Field, Reflections from the Field. The National Academies Press, 2004.
 September 6. Continuation of overview of search engines
(slides from last time). The I/Omodel
(slides).
Reading: The slides.
 September 10. Sorting in the I/Omodel
(slides). Discussion of project
(slides), for which we decided there will
be one large group.
Reading: The slides.
 September 20. Crawling (slides).
Reading: The slides.
HighPerformance
Web Crawling. Marc Najork and Allan Heydon. Compaq SRC Research
Report 173, 2001. Also appeared in Handbook of Massive Data Sets,
Kluwer, 2001.
BreadthFirst
Search Crawling Yields HighQuality Pages. Marc Najork and
Janet L. Wiener. In Proceedings of the Tenth Internal World Wide Web
Conference, pages 114118, 2001.
 September 21. More on crawling.
Reading: Slides above. Pages 239241 in HTTP: The Definitive
Guide. David Gourley, Brian Totty. O'Reilly, 2002. ISBN:
1565925092.
 September 27. Lecture this Thursday: More on crawling.
Reading:
Efficient URL Caching for World Wide Web Crawling . Andrei
Z. Broder and Marc Najork and Janet L. Wiener. In Proceedings of the 12th
international conference on World Wide Web, pages 679689, 2003. .
Practical Issues of Crawling Large Web Collections. Carlos
Castillo and Ricardo BaezaYates. Technical Report, 2005.
The following papers give further details of highquality crawlers. They
are optional reading, and not part of curriculum.
Design and
Implementation of a HighPerformance Distributed Web
Crawler. Vladislav Shkapenyuk and Torsten Suel. In Proceedings of
IEEE International Conference on Data Engineering (ICDE), 2002
Ubicrawler: A scalable fully distributed web crawler. Paolo Boldi,
Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. Software: Practice
& Experience, 34(8):711726, 2004.
Mercator: A Scalable, Extensible Web Crawler. Allan Heydon and
Marc Najork. World Wide Web 2(4):219229, 1999.
 September 28. Indexing (slides).
Reading: Building
a Distributed FullText Index for the Web. Sergey Melnik, Sriram
Raghavan, Beverly Yang, and Hector GarciaMolina. Stanford Technical
Report 200055. Short version of paper appeared in the proceedings of the
10th International WWW conference, 2001. Section 3.1.1 can be read lightly,
and Section 4.2 can be skipped.
Pages 109122, 145151, 156161, 169170 and 175179 of: Managing Gigabytes: Compressing and
Indexing Documents and Images, 2nd edition. Ian H. Witten,
Alistair Moffat, and Timothy C. Bell. Morgan Kaufmann Publishing, San
Francisco, 1999. ISBN 1558605703.
 October 4. More on indexing.
Reading: See above.
 October 11. Wordbased ranking: the vector space model.
Reading: Pages 3141 in Understanding
Search Engines: Mathematical Modeling and Text Retrieval.
M.W. Berry and M. Browne. SIAM Book Series: Software, Environments,
and Tools, (June 1999), ISBN: 0898714370.
Pages 2430 in Modern
Information Retrieval. Ricardo BaezaYates, Berthier
RibieroNeto. Addison Wesley Higher Education, 1999. ISBN: 020139829X.
 October 12. More on the vector space model.
Reading: As above.
 No lectures in the fall break in Week 42 (October 15 to 19).
 No lectures in the exam period for quartercourses (weeks 43 and 44,
October 22 to November 2), since several participants have exams.
 November 5. Lecture this Monday. Algorithms for Pagerank.
Reading: Efficient
Computation of PageRank. Taher H. Haveliwala. Stanford
Technical Report 199931.
 November 8. Lecture this Thursday. More on algorithms for
PageRank.
Reading: I/OEfficient
Techniques for Computing Pagerank. YenYu Chen, Qingqing Gan, and
Torsten Suel. ACM Conference on Information and Knowledge Engineering
(CIKM), 2002.
Sections 3.12 of Optimal Sparse Matrix Dense
Vector Multiplication in the I/OModel. Michael A. Bender, Gerth
Stølting Brodal, Rolf Fagerberg, Riko Jacob, and Elias Vicari. In
Proc. 19th ACM Symposium on Parallelism in Algorithms and Architectures,
pages 6170, 2007.
 November 12. Convergence and other issues of PageRank.
Reading: Deeper
Inside PageRank. Amy N. Langville and Carl D. Meyer. Internet
Mathematics Vol. 1, No. 3: 335380, 2005.
The first theorem stated on the Wikipedia page on the
PerronFrobenius
theorem.
 November 16. Other link based ranking schemes: HITS and SALSA.
Reading: A Survey
of Eigenvector Methods for Web Information Retrieval. Amy
N. Langville and Carl D. Meyer. SIAM Review, Vol. 47, No. 1, pp. 135161,
2005.
 November 19.
Advanced lexicon structures: tries, compressed tries, suffix tries and
generalized suffix tries. Prefix search, edit distance based search and
substring search.
Reading: Pages 566576 in Data
Structures and Algorithms in Java. Michael T. Goodrich and
Roberto Tamassia. Wiley, 1998. ISBN: 0471193089.
Pages 116119 in Algorithms
on Strings, Trees, and Sequences. Dan Gusfield. Cambridge
University Press, 1997. ISBN: 0521585198.
 November 26. String sorting and searching by Ternary Search
Trees. Suffix arrays, searching in suffix arrays.
Reading: Fast
Algorithms for Sorting and Searching Strings. Jon Bentley and
Robert Sedgewick. Proceedings of the eighth annual ACMSIAM symposium on
Discrete algorithms. New Orleans, January, 1997. Pages 360369. (This
paper also describes the idea behind the edit distance based search method
in the last lecture, although in the slightly simpler version of matching
with dont'care characters).
Pages 149155 in Algorithms
on Strings, Trees, and Sequences. Dan Gusfield. Cambridge
University Press, 1997. ISBN: 0521585198.
 December 3. The lcp array and its use for efficient searching
in suffix arrays. From suffix arrays and lcp arrays to suffix trees and
back. Linear time construction of suffix arrays.
Reading: Pages 149155 in Algorithms
on Strings, Trees, and Sequences. Dan Gusfield. Cambridge
University Press, 1997. ISBN: 0521585198.
Sections 2 and 3 in Linear
work suffix array construction. Juha Kärkkäinen and
Peter Sanders and Stefan Burkhardt. J. ACM, 53(6), 918936,
2006. Conference version in Proc. of ICALP 2003.
 December 6. End of linear time construction of suffix
arrays. Linear time construction of lcp array from suffix array.
Reading: Section 3 in LinearTime
LongestCommonPrefix Computation in Suffix Arrays and Its
Applications.. Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo
Arikawa, Kunsoo Park. Proceedings of Combinatorial Pattern Matching (CPM),
Springer, 2001, pages 181192. Link to text is free from an Imada IP. The
text contains a few confusing errors.
 December 7. The lectures were cancelled due to lack of
students.
 December 10. Empirical investigations of the structure of the
web graph. Models for the web graph.
Reading:
Graph structure in the web . Andrei Broder, Ravi Kumar,
Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata,
Andrew Tomkins, and Janet Wiener. In proceedings of Ninth
International World Wide Web Conference (WWW9), 2000.
Nonproof parts of Stochastic
models for the Web graph . R. Kumar, P. Raghavan, S. Rajagopalan,
D. Sivakumar, A. Tomkins, and E. Upfal. Proceedings of the 41th IEEE
Symp. on Foundations of Computer Science. November 2000, pp. 5765.
 December 17. Documents similarity.
Reading: On the
resemblance and containment of documents. Andrei Broder. In
Compression and Complexity of Sequences (SEQUENCES'97), pages 2129. IEEE
Computer Society, 1998. (Section 4 can be read lightly. We did only discuss
the resemblance, not containment, and only the MINbased measure.)
 December 21. Algorithms for intersection of inverted file
lists.
Reading: Intersection
in Integer Inverted Indices. Peter Sanders and Frederik
Transier. In Proceedings of the Ninth Workshop on Algorithm Engineering and
Experiments (ALENEX'07) 2007.
Maintained by Rolf Fagerberg
(rolf@imada.sdu.dk)

