Wednesday, February 28, 2007

Homework 2

This homework is due Monday, March 12.

1) Find the module or modules in Lucene that handle stopwords. Is there a static stopword list? If so, where is it?

2) How would Lucene need to be modified in order to count the occurrences of individual stopwords?

3) Make the necessary changes, rebuild the index on the Lucene src tree as in homework 1, and have it print a report saying how many times each stopword occurred (tf) and the total number of documents in which that stopword occurred (df). Then, using one of Salton and Buckley's suggested tf.idf formulae for documents, print the term weight that should be given to each stopword.
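
To make the expected report concrete, here is a rough sketch in Python (not Lucene code) of the bookkeeping involved. It assumes a tiny made-up stopword list, whitespace tokenization, and the plain unnormalized weight tf * log(N/df), which is only one of the variants Salton and Buckley discuss; the names stopword_stats, tfidf_weight and report are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stopword list for illustration only; not Lucene's actual list.
STOPWORDS = {"the", "a", "of", "and"}


def stopword_stats(docs, stopwords=STOPWORDS):
    """Return (tf, df) counters over all documents.

    tf[w] is the total number of occurrences of stopword w,
    df[w] is the number of documents containing w at least once.
    """
    tf = Counter()
    df = Counter()
    for text in docs:
        # Assumption: lowercase + whitespace split stands in for a real analyzer.
        hits = [t for t in text.lower().split() if t in stopwords]
        tf.update(hits)
        df.update(set(hits))  # count each document at most once per word
    return tf, df


def tfidf_weight(tf, df, n_docs):
    """Unnormalized tf * log(N / df); assumed variant, no cosine normalization."""
    if df == 0:
        return 0.0
    return tf * math.log(n_docs / df)


def report(docs):
    """Map each stopword seen to a (tf, df, weight) tuple."""
    tf, df = stopword_stats(docs)
    n_docs = len(docs)
    return {w: (tf[w], df[w], tfidf_weight(tf[w], df[w], n_docs))
            for w in sorted(tf)}


if __name__ == "__main__":
    docs = [
        "the cat and the dog",
        "a history of the index",
        "search and retrieval",
    ]
    for word, (tf, df, weight) in report(docs).items():
        print(f"{word}\ttf={tf}\tdf={df}\tweight={weight:.4f}")
```

Note that with this formula a word that appears in every document gets log(N/N) = 0, so its weight is zero no matter how large its tf is.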

1 comment:

Luke said...

Not to make more work for everyone, but I think it would be an interesting exercise to run the same experiment on the Reuters corpus to see what different results we'd get for the stopword term weights.