Tuesday, February 20, 2007

Monday 2/19/07

Distributed annotated copies of the onjava article dated 1/15/03 and the today.java article dated 7/30/03.

Most people seem to have finished homework 1, and we discussed it briefly. Getting Lucene to recompile was the hardest part, at least for me.

In response to questions, I talked about phrase-based retrieval and n-gram retrieval (both character and word n-grams) as alternatives to the bag of words model. Words, phrases, and n-grams each have their pros and cons, but all three are just ways of deciding which terms get indexed. Once the "term space" is identified, the vector space, probabilistic, and boolean models of retrieval are all still options.
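
To make the "term space" idea concrete, here is a minimal Python sketch of extracting word n-grams and character n-grams from a string. It is only an illustration, not what Lucene does: the function names `word_ngrams` and `char_ngrams` are made up for this post, and tokenizing by lowercasing and splitting on whitespace is a simplifying assumption.

```python
def word_ngrams(text, n):
    """Return the word n-grams in text as a list of tuples.

    Assumption: tokens are produced by lowercasing and splitting on whitespace.
    """
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return the character n-grams in text (spaces are kept as characters)."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


sample = "bag of words model"
print(word_ngrams(sample, 1))   # single words: the bag of words terms
print(word_ngrams(sample, 2))   # word bigrams: a crude stand-in for phrases
print(char_ngrams("words", 3))  # character trigrams
```

Whichever function supplies the terms, the retrieval model that sits on top of them can stay the same.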

Unix tools can be used to do "sanity checks" on IR results.
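
One way to picture such a check: count the documents containing a term by brute-force scanning, the way `grep -l -i -w term * | wc -l` would, and compare that count with what the index reports. The Python sketch below mimics that comparison on a toy in-memory collection. Everything in it (the `docs` dictionary, `build_index`, `scan_count`) is hypothetical and exists only for illustration; it is not Lucene's API.

```python
import re


def build_index(docs):
    """Build a toy inverted index: term -> set of ids of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index.setdefault(term, set()).add(doc_id)
    return index


def scan_count(docs, term):
    """Count documents containing term as a whole word by scanning the raw text."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return sum(1 for text in docs.values() if pattern.search(text))


# Hypothetical three-document collection.
docs = {
    "d1": "Lucene indexes terms",
    "d2": "Terms can be words or n-grams",
    "d3": "Boolean retrieval",
}
index = build_index(docs)

# Sanity check: the index's document frequency should match the brute-force scan.
for term in sorted(index):
    indexed, scanned = len(index[term]), scan_count(docs, term)
    status = "ok" if indexed == scanned else "MISMATCH"
    print(f"{term}: index={indexed} scan={scanned} {status}")
```

A mismatch does not say which side is wrong, but it tells you where to look, for example at a difference in tokenization or case handling.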
