CS 430 / INFO 430
Information Retrieval
Fall 2007
Books and Readings
Course Book
Many of the lectures are linked to a new book, which is due to be published shortly:
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, An introduction to information retrieval. Cambridge University Press, 2008.
A preliminary version of this book is online at http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html.

General Books
The following books cover much of the material for this course.

Discussion Classes
Readings for discussion classes are to be studied in preparation for the classes on Wednesday evenings.

In preparation for this class, explore three information retrieval systems and compare them:
Use each service separately for the following information discovery task:
Evaluate each search service. What do you consider the strengths and weaknesses of each service? When would you use them?
(a) Does the service search full text or surrogates? What is the underlying corpus? What effect does this have on your results?
(b) Is fielded searching offered? What Boolean operators are supported? What regular expressions? How does it handle non-Roman character sets? What is the stop list? How are results ranked? Are they sorted and, if so, in what order?
(c) From a usability viewpoint: What style of user interface(s) is provided? What training or help services? If there are basic and advanced user interfaces, what does each offer?

Discussion Class 2, September 5
Read and be prepared to discuss:
G. Salton, A. Wong and C. S. Yang, "A vector space model for automatic indexing". Communications of the ACM, Volume 18, Issue 11 (November 1975), pages 613-620. http://doi.acm.org/10.1145/361219.361220
This paper describes many of the concepts behind the vector space model and the SMART system.
{Note that to access this paper from the ACM Digital Library, you need to use a computer with a Cornell IP address.}

Discussion Class 3, September 12
For this class, read and be prepared to discuss the following:
K. Sparck Jones, "A statistical interpretation of term specificity and its application in retrieval". Journal of Documentation 28, 11-21, 1972. http://www.soi.city.ac.uk/~ser/idfpapers/ksj_orig.pdf.
Letter by Stephen Robertson and reply by Karen Sparck Jones, Journal of Documentation 28, 164-165, 1972. http://www.soi.city.ac.uk/~ser/idfpapers/letters.pdf.
The first paper introduced the term weighting scheme known as inverse document frequency (IDF). Some of the terminology used in this paper will be introduced in the lectures. The letter describes a slightly different way of expressing IDF, which has become the standard form.
{Stephen Robertson has mounted these papers on his Web site with permission from the publisher.}

Discussion Class 4, September 19
Read and be prepared to discuss the following paper:
Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman, "Indexing by latent semantic analysis". Journal of the American Society for Information Science, Volume 41, Issue 6, 1990. http://www3.interscience.wiley.com/cgi-bin/issuetoc?ID=10049584
{Note that to access this paper from Wiley InterScience, you need to use a computer with a Cornell IP address.}

Read and be prepared to discuss the following paper, concentrating on Sections 1 to 4, and 5.3. You do not need to study the details of the methods described in Sections 5.1 and 5.2. Section 6 is for general interest only.
E. Voorhees, D. Harman, Overview of the Eighth Text REtrieval Conference (TREC-8). http://trec.nist.gov/pubs/trec8/papers/overview_8.ps.
This is one of a sequence of publications. The full sequence of TREC publications is at http://trec.nist.gov/pubs.html.
{Note that this paper is in PostScript format. You can view it using the GhostView viewer, which is available on the Web for downloading for all standard computer systems. The PDF version of the file on the TREC Web site is damaged. Here is a PDF version that was generated from the PostScript file.}

Discussion Class 6, October 10
Read and be prepared to discuss Sections 1 and 2 (up to page 16):
Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan, "Searching the Web". ACM Transactions on Internet Technology,
This paper describes the state of the art of research into Web searching in 2001.
{Note that to access this paper from the ACM Digital Library, you need to use a computer with a Cornell IP address.}

Discussion Class 7, October 17
Read and be prepared to discuss:
Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine. Seventh International World Wide Web Conference. Brisbane, Australia, 1998. http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm
Note. A second copy of this paper is available at http://www-db.stanford.edu/~backrub/google.html.
{Sections 4.1 and 4.2 are out of date. Browse through them for historic interest only.}

Discussion Class 8, October 24
Read and be prepared to discuss:
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. USENIX OSDI '04, 2004. http://www.usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf

Discussion Class 9, October 31
Read and be prepared to discuss:
T. Joachims, L. Granka, B. Pang, H. Hembrooke, and G. Gay, "Accurately Interpreting Clickthrough Data as Implicit Feedback", Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), 2005. http://www.cs.cornell.edu/People/tj/publications/joachims_etal_05a.pdf

Discussion Class 10, November 7
[Note. The list of sections for the discussion class has been changed.]
Read and be prepared to discuss:
Marti Hearst, User Interfaces and Visualization. In: Modern Information Retrieval, edited by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison-Wesley Longman, 1999.
This is a long chapter. You are encouraged to read the full chapter, but the discussion class will concentrate on Sections 10.1, 10.2, 10.3, 10.5, 10.7, and 10.9.

Discussion Class 11, November 14
Read and be prepared to discuss:
Howard Wactlar, Informedia - Search and Summarization in the Video Medium. Proceedings of Imagina 2000 Conference, Monaco, January 31 - February 2, 2000. http://www.informedia.cs.cmu.edu/documents/imagina2000.pdf.

Discussion Class 12, November 28
Read and be prepared to discuss:
Caroline R. Arms and William Y. Arms, "Mixed Content and Mixed Metadata: Information Discovery in a Messy World." In Metadata in Practice, edited by Diane Hillmann and Elaine Westbrooks, ALA Editions, 2004. http://www.cs.cornell.edu/wya/papers/ALA-2003.php.
William Y. Arms
(wya@cs.cornell.edu)
Last changed: November 5, 2007