DM79 Algorithms for Web Indexing and Searching
Fall 2004
Rolf Fagerberg
Official Course Description
See the
course
description at the web pages of the faculty.
Time and Place
The course runs from September 2 to December 16, except for the fall
break in week 42 (October 11-15).
The schedule is as follows:
- Tuesdays (some) 14.15 to 16.00. We will use
approximately every second Tuesday. These will be announced one week
ahead.
- Thursdays 14.15 to 16.00.
All lessons take place in Imadas seminar room (across
room U49A).
Most lessons will be lectures, but a few may be used for exercises.
Exam
The exam is a 30 minutes oral exam (no preparation time) and will take
place on Thursday, January 27. A list of questions (ps, pdf), from which you will draw one
at the exam, is now available. Curriculum (ps,
pdf) is also available.
There is a "spørgetime" (session allowing you to ask questions to
the lecturer on the curriculum and the exam) on January 25 at 10:15 in
room U49D.
The grades at the exam ended up as follows:
ps,
pdf.
Lectures
- September 2. Introduction to course (slides:
ps.gz,
pdf). Some Google facts (slides:
ps.gz,
pdf). Overview of search engines
(slides: ps.gz,
pdf).
Reading: The
Anatomy of a Large-Scale Hypertextual Web Search Engine.
Sergey Brin and Lawrence Page. Proceedings of the 7th International
WWW conference, 1998.
- September 7. Discussion of project (slides:
ps.gz,
pdf). Further discussion of PageRank.
We also decided to do double lectures (14.15 to 16.00) on Tuesdays,
approximately every second Tuesday (instead of one lecture every
Tuesday). Tuesday lectures will be announce the week before.
Reading: The slides.
- September 9. Dealing with data on disk: the I/O-model
(slides: ps.gz,
pdf), external sorting (slides:
ps.gz,
pdf). Start on crawling.
Reading: The slides (Note: the lower bound for external sorting is
just optional reading and is NOT part of the curriculum. The same goes
for the paper mentioned at the start of the slides (which was not
handed out)). High-Performance
Web Crawling. Marc Najork and Allan Heydon. Compaq SRC
Research Report 173, 2001. Also appeared in Handbook of Massive
Data Sets, Kluwer, 2001.
We also settled the project groups, and
discussed the process a bit more: An obvious placement of global team
discussions (i.e. all sub-groups of a team) is the Tuesdays where we
are not having lectures.
- September 14. Crawling (slides:
ps.gz,
pdf).
Reading: Breadth-First
Search Crawling Yields High-Quality Pages. Marc Najork and
Janet L. Wiener. In Proceedings of the Tenth Internal World Wide Web
Conference, pages 114-118, May 2001.
- September 16. More on crawling. Start on indexing
(slides: ps.gz,
pdf).
Reading:
Pages 239-241 in HTTP: The Definitive
Guide. David Gourley, Brian Totty. O'Reilly, 2002. ISBN:
1-56592-509-2.
Breadth-First
Search Crawling Yields High-Quality Pages. Marc Najork and
Janet L. Wiener. In Proceedings of the Tenth Internal World Wide Web
Conference, pages 114-118, May, 2001.
- September 21. The will be no lectures this
Tuesday. Instead, global team meetings should be arranged (possible at
another day/time if this suits the team better).
- September 23 More on indexing.
Reading:
All except Sections 4.2-3 of: Building a
Distributed Full-Text Index for the Web. Sergey Melnik,
Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina. Stanford
Technical Report 2000-55. Short version of paper appeared in the
proceedings of the 10th International WWW conference, May 2-5, 2001,
Hong Kong.
Pages 109-122, 145-151, 156-161, 169-170 and 175-179 of: Managing Gigabytes: Compressing and
Indexing Documents and Images, 2nd edition. Ian H. Witten,
Alistair Moffat, and Timothy C. Bell. Morgan Kaufmann Publishing, San
Francisco, 1999. ISBN 1-55860-570-3.
- September 28 More on indexing: index
compression. PageRank computation.
Reading: Efficient
Computation of PageRank. Taher H. Haveliwala. Stanford
Technical Report 1999-31.
- September 30 More on computing PageRank. Start on
querying: The vector space model.
Reading: Pages 31-41 in Understanding
Search Engines: Mathematical Modeling and Text Retrieval.
M.W. Berry and M. Browne. SIAM Book Series: Software, Environments,
and Tools, (June 1999), ISBN: 0-89871-437-0.
- October 5 The will be no lectures this Tuesday.
- October 7 More on the vector space model. More bits and
pieces on querying. Retrieval performance evaluation.
Reading: Pages 24-30, 74-81, 99-106, 118-119, and 207-208 in
Modern Information
Retrieval. Ricardo Baeza-Yates, Berthier
Ribiero-Neto. Addison Wesley Higher Education, 1999. ISBN: 020139829X.
- No lectures during the fall break in week 42 (Oct. 11 to 15).
- October 19. There will be no lectures this Tuesday (as
the seminar room is used for a computer science colloquium).
- October 21. Tries. Sorting strings. Ternary search
trees. It was agreed that the dates for remaining hand-ins of meeting
minutes from the project work are: October 28, November 25, and
December 20 (all of which are Thursdays).
Reading:
Pages 566-576 in Data
Structures and Algorithms in Java. Michael T. Goodrich and
Roberto Tamassia. Wiley, 1998. ISBN: 0-471-19308-9.
Pages 116-119 in Algorithms
on Strings, Trees, and Sequences. Dan Gusfield. Cambridge
University Press, 1997. ISBN: 0521585198.
Fast
Algorithms for Sorting and Searching Strings. Jon Bentley
and Robert Sedgewick. Proceedings of the eighth annual ACM-SIAM
symposium on Discrete algorithms. New Orleans, January, 1997. Pages
360-369.
- October 26. Lectures this day was cancelled (in last
moment) because a kandidateksamen had been scheduled in the room
(actions are now being taken to better prevent double-booking for the
rest of the semester).
- October 28. Suffix tree construction. We will also gather
information in order to settle exam dates for DM79 - bring you
existing exam dates for other courses.
Reading: Pages 94-107 in Algorithms
on Strings, Trees, and Sequences. Dan Gusfield. Cambridge
University Press, 1997. ISBN: 0521585198.
- November 2. No lectures this Tuesday.
- November 4. End of proof of Ukkonens suffix tree
construction algorithm. Second lesson was a
guest lecture
by Lawrence Lessig on copyright issues relating to the Internet.
- November 9. There will be no lectures this Tuesday, as
the seminar room is used for a computer science colloquium.
- November 11. Suffix arrays. Latent Semantic Indexing.
Reading:
Pages 149-155 in Algorithms
on Strings, Trees, and Sequences. Dan Gusfield. Cambridge
University Press, 1997. ISBN: 0521585198.
Pages 33, 41-45, and 53-58 in Understanding
Search Engines: Mathematical Modeling and Text Retrieval.
M.W. Berry and M. Browne. SIAM Book Series: Software, Environments,
and Tools, (June 1999), ISBN: 0-89871-437-0.
- November 16. More on latent semantic indexing.
Reading: Sections 1, 2, 4, and 5 in Spectral Analysis of
Data. Yossi Azar, Amos Fiat, Anna R. Karlin, Frank McSherry,
and Jared Saia. Proceedings of the 33rd Annual ACM Symposium on Theory
of Computing (STOC), pages 619--626, 2001.
- November 18. Other types of link based ranking: HITS and
(briefly) SALSA.
Reading: A
Survey of Eigenvector Methods of Web Information
Retrieval. Amy N. Langville and Carl D. Meyer, 2004. To
appear in SIAM Review.
- November 23. No lectures this Tuesday.
- November 25. Structure of the Web Graph. The date for the
exam was moved to either January 26 or January 27 (depending on the
forthcoming teaching schedule of censor, which will be settled during
December).
Reading:
Graph structure in the web . Andrei Broder, Ravi Kumar,
Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata,
Andrew Tomkins, and Janet Wiener. In proceedings of Ninth
International World Wide Web Conference (WWW9), 2000.
- November 30. Details of SALSA.
Reading: A
Survey of Eigenvector Methods of Web Information
Retrieval. Amy N. Langville and Carl D. Meyer, 2004. To
appear in SIAM Review.
- December 2. Models for the web graph.
Reading: Non-proof parts of Stochastic
models for the Web graph . R. Kumar, P. Raghavan,
S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Proceedings of
the 41th IEEE Symp. on Foundations of Computer Science. November 2000,
pp. 57-65.
- December 7. No lectures this Tuesday.
Note that you need to sign up for next semesters courses at the
administration web pages. If you are
taking elective courses at Imada next semester, you should also give
information on your selected courses
to Imada if you wish to
influence the schedule of the elective courses. Deadline for both is
Friday, December 10. As a new thing, signing up for a course also
means signing up for its exam (i.e. there will not be any signing up
for exams in March, and you will loose an exam attempt if you sign up
for a course but do not show up for the exam, unless you in the mean
time sign off for the exam).
- December 9. Bipartite cliques in the web graph. Document
similarity.
Reading:
Trawling the Web for emerging cyber-communities. Ravi Kumar,
Prabhakar Raghavan, Rajagopalan Rajagopalan, and Andrew
Tomkins. Computer Networks, 31 (11-16), pages 1481-1493, 1999. Also in
Proceedings of the 8th WWW conference.
On
the resemblance and containment of documents. Andrei
Broder. In Compression and Complexity of Sequences (SEQUENCES'97),
pages 21-29. IEEE Computer Society, 1998. (Section 4 can be read
lightly).
- December 14 Document similarity. Google cluster
architecture. The exam date has now been finally settled to Thursday,
January 27. The deadline for the project has been moved to Tuesday,
January 11, 11:00, where a small inauguration reception for the
crawler projects will take place (room to be announced later).
Reading: Web Search for a
Planet: The Google Cluster Architecture. Luiz André Barroso,
Jeffrey Dean, and Urs Hölzle. IEEE Micro, 23(2), pp. 22-28, 2003.
- December 16. Game theory and the Internet. There will be
a "spørgetime" (session allowing you to ask questions to the lecturer
on the curriculum and the exam) on January 25 at 10:15 in room U49D.
Reading: Algorithms,
games, and the Internet. Christos Papadimitriou. Proceedings
of the 33rd STOC, pp. 749-753, ACM Press, 2001.
Maintained by Rolf Fagerberg
(rolf@imada.sdu.dk)
|
|