Sunday, February 11, 2007

Class Wed. 2/7

I spent a lot of time finishing up the slides from Chapter 1, but also touched on some ideas that get discussed in detail in Chapter 2 and beyond.

The definitions of Precision and Recall are important. Precision is the fraction of retrieved documents that are relevant; recall is the fraction of all relevant documents in the collection that were retrieved. Precision is usually easy to measure, as long as you can tell whether a specific document is relevant to a specific query. Recall is much harder to calculate, since the number of relevant documents in a large collection may well be unknown. I also talked about the pooled approach to IR evaluation, in which only the top-ranked documents from the participating systems are pooled and judged for relevance. Without the pooled approach, evaluation of recall would be very difficult, if not impossible, on large collections. An overview of TREC, including a discussion of pooled document evaluation, is presented in Donna Harman's overview of TREC 4, available at http://trec.nist.gov/pubs/trec4/t4_proceedings.html.
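To make the definitions concrete, here is a minimal Python sketch of precision, recall, and a simple judgment pool. The function names, the document IDs, the pool depth, and the relevance judgments are all made up for illustration; this is not TREC's actual tooling, just one plausible way to write the idea down.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document IDs returned by the system
    relevant:  iterable of document IDs judged relevant
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    # Guard against empty sets so we never divide by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


def build_pool(runs, depth):
    """Union of the top `depth` documents from each system's ranking."""
    pool = set()
    for ranking in runs:
        pool.update(ranking[:depth])
    return pool


# Hypothetical rankings from two systems for the same query.
runs = [
    ["d1", "d2", "d3", "d4"],
    ["d2", "d5", "d1", "d6"],
]

# Only pooled documents are shown to the assessors.
pool = build_pool(runs, depth=2)

# Hypothetical assessor judgments over the pool; anything outside
# the pool is treated as not relevant.
judged_relevant = {"d1", "d5"}

p, r = precision_recall(runs[0], judged_relevant)
print(sorted(pool), p, r)
```

Note that the recall computed this way is relative to the relevant documents found in the pool, not to every relevant document in the collection, which is exactly the compromise pooling makes.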

In another post I mention pivoted document length normalization (PDLN), which is credited (in my mind at least) to Amit Singhal, then at Cornell and now at Google. PDLN is covered in Chapter 2. Somebody asked a question on Wednesday that brought query zoning to mind, and that is also credited to Singhal. His paper from SIGIR'97 is also worth reading, I think. This and related work can be found at http://singhal.info/publications.html.
