DM79 Algorithms for Web Indexing and Searching

Fall 2007
Rolf Fagerberg


Official Course Description

See the course description at the web pages of the faculty.

Time and Place

The course runs from September 3 to December 20, except for the fall break in Week 42 (October 15-19). Hence, the first lecture takes place Monday, September 3.

The schedule is as follows:

  • Mondays 08.15 to 10.00.
  • Thursdays 10.15 to 12.00.
  • Fridays 10.15 to 12.00.

We will use a mix of Mondays, Thursdays, and Fridays, announced at least a week ahead. Since there is a programming project involving a non-trivial amount of work, we will only use approximately 1.5 double lectures per week on average.

The Monday and Thursday lessons take place in Imadas seminar room. The Friday lectures in a room to be announced for individually each time.

The course will be based on handouts (no textbook).

Examination

The exam date will be January 14, and the exam form is:

Programming project with report, and oral examination. External examiner, grades according to the 7-point marking scale. The report and the oral examination each counts 50% towards the grade.

There will be a spørgetime (session for asking questions on the exam and the curriculum) Friday, January 11, at 13.15 in Imadas seminar room.

At the oral exam, you will draw an exam question delineating a part of curriculum which you are to present. Note that some questions have overlap (but different focuses). The details of the exam form are described at the bottom of the list of exam questions, but if you have questions on the exam form, email me these or ask at the spørgetime.

The curriculum is everything mentioned under the heading Reading in the list of lectures below (respecting the exclusions and reading directions stated).

Project

Regarding the format of the project report, see this mail sent to all participants earlier. Regarding deadline, I prefer (and you should too, I think) to have it delivered in print in my box at the department office by Friday, December 21, at noon. The very last deadline I will accept is via email, January 2, at noon.

Lectures

  • September 3. Introduction to course (slides). Overview of search engines (slides). Discussion of possible other placements of the lecture hours, and of the exam form.

    Reading: The Anatomy of a Large-Scale Hypertextual Web Search Engine. Sergey Brin and Lawrence Page. Computer Networks 30 (1-7): 107-117, 1998. Also in Proceedings of the 7th International WWW conference, 1998.

    Internet Searching. Peter Norvig. In Computer Science: Reflections on the Field, Reflections from the Field. The National Academies Press, 2004.

  • September 6. Continuation of overview of search engines (slides from last time). The I/O-model (slides).

    Reading: The slides.

  • September 10. Sorting in the I/O-model (slides). Discussion of project (slides), for which we decided there will be one large group.

    Reading: The slides.

  • September 20. Crawling (slides).

    Reading: The slides.

    High-Performance Web Crawling. Marc Najork and Allan Heydon. Compaq SRC Research Report 173, 2001. Also appeared in Handbook of Massive Data Sets, Kluwer, 2001.

    Breadth-First Search Crawling Yields High-Quality Pages. Marc Najork and Janet L. Wiener. In Proceedings of the Tenth Internal World Wide Web Conference, pages 114-118, 2001.

  • September 21. More on crawling.

    Reading: Slides above. Pages 239-241 in HTTP: The Definitive Guide. David Gourley, Brian Totty. O'Reilly, 2002. ISBN: 1-56592-509-2.

  • September 27. Lecture this Thursday: More on crawling.

    Reading:

    Efficient URL Caching for World Wide Web Crawling . Andrei Z. Broder and Marc Najork and Janet L. Wiener. In Proceedings of the 12th international conference on World Wide Web, pages 679-689, 2003. .

    Practical Issues of Crawling Large Web Collections. Carlos Castillo and Ricardo Baeza-Yates. Technical Report, 2005.

    The following papers give further details of high-quality crawlers. They are optional reading, and not part of curriculum.

    Design and Implementation of a High-Performance Distributed Web Crawler. Vladislav Shkapenyuk and Torsten Suel. In Proceedings of IEEE International Conference on Data Engineering (ICDE), 2002

    Ubicrawler: A scalable fully distributed web crawler. Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. Software: Practice & Experience, 34(8):711-726, 2004.

    Mercator: A Scalable, Extensible Web Crawler. Allan Heydon and Marc Najork. World Wide Web 2(4):219-229, 1999.

  • September 28. Indexing (slides).

    Reading: Building a Distributed Full-Text Index for the Web. Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina. Stanford Technical Report 2000-55. Short version of paper appeared in the proceedings of the 10th International WWW conference, 2001. Section 3.1.1 can be read lightly, and Section 4.2 can be skipped.

    Pages 109-122, 145-151, 156-161, 169-170 and 175-179 of: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edition. Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Morgan Kaufmann Publishing, San Francisco, 1999. ISBN 1-55860-570-3.

  • October 4. More on indexing.

    Reading: See above.

  • October 11. Word-based ranking: the vector space model.

    Reading: Pages 31-41 in Understanding Search Engines: Mathematical Modeling and Text Retrieval. M.W. Berry and M. Browne. SIAM Book Series: Software, Environments, and Tools, (June 1999), ISBN: 0-89871-437-0.

    Pages 24-30 in Modern Information Retrieval. Ricardo Baeza-Yates, Berthier Ribiero-Neto. Addison Wesley Higher Education, 1999. ISBN: 020139829X.

  • October 12. More on the vector space model.

    Reading: As above.

  • No lectures in the fall break in Week 42 (October 15 to 19).

  • No lectures in the exam period for quarter-courses (weeks 43 and 44, October 22 to November 2), since several participants have exams.

  • November 5. Lecture this Monday. Algorithms for Pagerank.

    Reading: Efficient Computation of PageRank. Taher H. Haveliwala. Stanford Technical Report 1999-31.

  • November 8. Lecture this Thursday. More on algorithms for PageRank.

    Reading: I/O-Efficient Techniques for Computing Pagerank. Yen-Yu Chen, Qingqing Gan, and Torsten Suel. ACM Conference on Information and Knowledge Engineering (CIKM), 2002.

    Sections 3.1-2 of Optimal Sparse Matrix Dense Vector Multiplication in the I/O-Model. Michael A. Bender, Gerth Stølting Brodal, Rolf Fagerberg, Riko Jacob, and Elias Vicari. In Proc. 19th ACM Symposium on Parallelism in Algorithms and Architectures, pages 61-70, 2007.

  • November 12. Convergence and other issues of PageRank.

    Reading: Deeper Inside PageRank. Amy N. Langville and Carl D. Meyer. Internet Mathematics Vol. 1, No. 3: 335-380, 2005.

    The first theorem stated on the Wikipedia page on the Perron-Frobenius theorem.

  • November 16. Other link based ranking schemes: HITS and SALSA.

    Reading: A Survey of Eigenvector Methods for Web Information Retrieval. Amy N. Langville and Carl D. Meyer. SIAM Review, Vol. 47, No. 1, pp. 135-161, 2005.

  • November 19.

    Advanced lexicon structures: tries, compressed tries, suffix tries and generalized suffix tries. Prefix search, edit distance based search and substring search.

    Reading: Pages 566-576 in Data Structures and Algorithms in Java. Michael T. Goodrich and Roberto Tamassia. Wiley, 1998. ISBN: 0-471-19308-9.

    Pages 116-119 in Algorithms on Strings, Trees, and Sequences. Dan Gusfield. Cambridge University Press, 1997. ISBN: 0521585198.

  • November 26. String sorting and searching by Ternary Search Trees. Suffix arrays, searching in suffix arrays.

    Reading: Fast Algorithms for Sorting and Searching Strings. Jon Bentley and Robert Sedgewick. Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms. New Orleans, January, 1997. Pages 360-369. (This paper also describes the idea behind the edit distance based search method in the last lecture, although in the slightly simpler version of matching with dont'-care characters).

    Pages 149-155 in Algorithms on Strings, Trees, and Sequences. Dan Gusfield. Cambridge University Press, 1997. ISBN: 0521585198.

  • December 3. The lcp array and its use for efficient searching in suffix arrays. From suffix arrays and lcp arrays to suffix trees and back. Linear time construction of suffix arrays.

    Reading: Pages 149-155 in Algorithms on Strings, Trees, and Sequences. Dan Gusfield. Cambridge University Press, 1997. ISBN: 0521585198.

    Sections 2 and 3 in Linear work suffix array construction. Juha Kärkkäinen and Peter Sanders and Stefan Burkhardt. J. ACM, 53(6), 918-936, 2006. Conference version in Proc. of ICALP 2003.

  • December 6. End of linear time construction of suffix arrays. Linear time construction of lcp array from suffix array.

    Reading: Section 3 in Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications.. Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, Kunsoo Park. Proceedings of Combinatorial Pattern Matching (CPM), Springer, 2001, pages 181-192. Link to text is free from an Imada IP. The text contains a few confusing errors.

  • December 7. The lectures were cancelled due to lack of students.

  • December 10. Empirical investigations of the structure of the web graph. Models for the web graph.

    Reading: Graph structure in the web . Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. In proceedings of Ninth International World Wide Web Conference (WWW9), 2000.

    Non-proof parts of Stochastic models for the Web graph . R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Proceedings of the 41th IEEE Symp. on Foundations of Computer Science. November 2000, pp. 57-65.

  • December 17. Documents similarity.

    Reading: On the resemblance and containment of documents. Andrei Broder. In Compression and Complexity of Sequences (SEQUENCES'97), pages 21-29. IEEE Computer Society, 1998. (Section 4 can be read lightly. We did only discuss the resemblance, not containment, and only the MIN-based measure.)

  • December 21. Algorithms for intersection of inverted file lists.

    Reading: Intersection in Integer Inverted Indices. Peter Sanders and Frederik Transier. In Proceedings of the Ninth Workshop on Algorithm Engineering and Experiments (ALENEX'07) 2007.


Maintained by Rolf Fagerberg (rolf@imada.sdu.dk)