Wednesday, February 28, 2007

Homework 2

This homework is due Monday, March 12

1) Find the module or modules in Lucene that handle stopwords. Is there a static stopword list? If so, where is it?

2) How would Lucene be modified in order to count the occurrences of individual stopwords?

3) Make the necessary changes, rebuild the index on the Lucene src tree as in Homework 1, and have it print a report saying how many times each stopword occurred (tf) and how many documents that stopword occurred in (df). Then, using one of Salton and Buckley's suggested tf.idf formulae for documents, print the term weight that should be given to each stopword.
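
For the tf.idf part, here's a minimal standalone sketch of the bookkeeping involved. It doesn't touch Lucene's own analysis classes, the class and method names are invented for illustration, and the weight it prints is just one common tf.idf variant (tf times log(N/df)), not necessarily the exact Salton and Buckley formula you end up choosing.

// Standalone sketch, NOT Lucene code: tally tf and df for a fixed stopword set,
// then print a simple tf.idf-style weight for each stopword.
import java.util.*;

public class StopwordCounter {
    private final Set<String> stopwords;
    private final Map<String, Integer> tf = new HashMap<String, Integer>(); // total occurrences
    private final Map<String, Integer> df = new HashMap<String, Integer>(); // docs containing the word
    private int numDocs = 0;

    public StopwordCounter(Set<String> stopwords) {
        this.stopwords = stopwords;
    }

    // Call once per document, passing that document's tokens.
    public void addDocument(List<String> tokens) {
        numDocs++;
        Set<String> seenInThisDoc = new HashSet<String>();
        for (String token : tokens) {
            String w = token.toLowerCase();
            if (!stopwords.contains(w)) continue;
            tf.merge(w, 1, Integer::sum);
            if (seenInThisDoc.add(w)) {
                df.merge(w, 1, Integer::sum);
            }
        }
    }

    // Example weight: tf * log(N / df), where N is the number of documents.
    public void printReport() {
        for (String w : stopwords) {
            int t = tf.getOrDefault(w, 0);
            int d = df.getOrDefault(w, 0);
            double weight = (d == 0) ? 0.0 : t * Math.log((double) numDocs / d);
            System.out.printf("%-10s tf=%d df=%d weight=%.3f%n", w, t, d, weight);
        }
    }
}

In the real assignment the counting would happen wherever Lucene discards stopwords during analysis; the class above just shows the shape of the tally and the report.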

Notes from 2/26, plans for 2/28

On Monday I talked about some more writing project topics. I started talking about probabilistic IR, using the slides posted.

I'll do more with probabilistic IR this evening.

Monday, February 26, 2007

Notes from 2/21, plans for 2/26

Spent a LOT of time last Wednesday talking about writing project topics. Here are some more:

  • There are other packages besides Lucene, e.g. Lemur and Clairlib, and maybe others. A comparison of those packages would be a good topic.
  • The connection between IR and other areas, such as machine learning or NLP, can be explored.
For Monday evening 2/26, I'll talk about the Salton and Buckley paper, and introduce the concept of probabilistic IR.

Tuesday, February 20, 2007

writing project

My usual procedure is to wait until later in the semester, and assign a writing project that is due at the end of the semester. People rarely complain, but who needs more stress at the end of the semester anyway?

So let's get an early start. Within ten days, say by Monday, March 5, I'd like you to tell me, in a short email, the topic of your paper. It has to have something to do with IR, and NOT be something that we'll be going over in class in detail, although going into depth on some topic that we mention in class is fine. Describe your topic in a paragraph, and list at least three references that you're thinking of consulting.

The final paper should be about ten pages, with at least ten references. Don't let all the references be from Wikipedia. The paper will be due on Wednesday, April 18.

There are lots of possible topics! We can start with the SIGIR Call for Papers and move on to the CIKM Call for Papers.

I'm not expecting original research results of conference quality (although that'd be nice) but you'll need to do something more than just a rehash of existing work. It's always a good idea to summarize work in an area, and then suggest future work that somebody could do for a 698 project, or a thesis. Another approach is to study some technique, and then present a new example that would help people understand it. If you want to write a program (e.g. an extension or modification to Lucene) you can include that as an appendix, and it won't count towards the ten-page limit.

It doesn't bother me if your writing project happens to be related to your job, or dovetails with something you're doing in another class.

Monday 2/19/07

Distributed annotated copies of the onjava article dated 1/15/03 and the today.java article dated 7/30/03.

Most people seem to have finished Homework 1, and we discussed it a little. Getting Lucene to recompile was the hardest part, at least for me.

In response to questions, I talked about phrase-based retrieval and n-gram retrieval (both character and word n-grams) as alternatives to the bag-of-words model. Note that words, phrases, and n-grams each have their pros and cons; all three are just different ways of deciding what terms get indexed. Once the "term space" is identified, the vector space, probabilistic, and Boolean models of retrieval are all options.
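
For concreteness, here's a rough sketch (not taken from Lucene or any of the other packages) of extracting character n-grams from a string; word n-grams work the same way, just over a list of tokens instead of characters.

// Illustration only: character n-grams of a string.
import java.util.*;

public class NGrams {
    public static List<String> charNGrams(String text, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // "information" -> [inf, nfo, for, orm, rma, mat, ati, tio, ion]
        System.out.println(charNGrams("information", 3));
    }
}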

Unix tools can be used to do "sanity checks" on IR results.

Friday, February 16, 2007

Homework 1 now due Wednesday 2/21

No class on Wednesday February 14

Notes from Monday 2/12. Went over the slides on the vector space model, including a short example with tf.idf and the calculation of a similarity coefficient. Did NOT cover the problem of document (or query) length normalization, or the Boolean model.
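
As a reminder of what that example involves, here's a small standalone sketch (not Lucene code, and the weights are made-up numbers) of the cosine similarity coefficient between two tf.idf-weighted vectors over the same term space.

// Illustration only: cosine similarity of two term-weight vectors.
public class Cosine {
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical tf.idf weights over a three-term vocabulary.
        double[] doc   = {0.5, 0.0, 1.2};
        double[] query = {1.0, 0.8, 0.0};
        System.out.println(cosine(doc, query)); // prints a value between 0 and 1
    }
}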

Monday, February 12, 2007

gzipped tarfile

http://www.cs.umbc.edu/~nicholas/676/lucene-2.0.0.tar.gz

Lucene demo

I unpacked the Lucene package, and everything is available under the course web site. Under docs, look at "Getting Started" and the basic Lucene demo.

I have extracted and compiled Lucene. The gzipped tarfile is called lucene-2.0.0.tar.gz, and I'll put a link on the course web site in case you'd like (and have room for) your own copy.

Building Lucene required me to install Apache Ant, which I hope you won't need. The hardest part was modifying my CLASSPATH, as shown below:

# to build and use lucene
setenv ANT_HOME $HOME/www/676/apache-ant-1.7.0
setenv LUCHOME $HOME/www/676/lucene-2.0.0
setenv CLASSPATH .\:$LUCHOME/lucene-demos-2.0.0.jar\:$LUCHOME/lucene-core-2.0.0.jar


A tour of the Lucene directory might be a good idea. Building an index of the Lucene source files is gratifying!

Homework 1: Repeat this demo, as far as creating an index of the Lucene source. Then figure out how you would modify the IndexFiles program so that it counts the number of files indexed and the number of bytes read. If you know Java well enough and have the disk space to recompile it, you may demonstrate that your modifications work. You should all get the same answers! We'll talk about this a little in class on Wednesday, and this homework will be due next Monday.
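
If you're unsure where to begin, here's a standalone sketch of the kind of bookkeeping involved, written independently of Lucene's actual IndexFiles code (the class name and structure below are invented for illustration): walk the directory tree, counting files and bytes, the way IndexFiles would alongside its indexing calls.

// Illustration only, not Lucene's IndexFiles: count files and total bytes in a tree.
import java.io.File;

public class CountFiles {
    static int fileCount = 0;
    static long byteCount = 0;

    static void walk(File f) {
        if (f.isDirectory()) {
            File[] children = f.listFiles();
            if (children != null) {
                for (File child : children) {
                    walk(child);
                }
            }
        } else {
            fileCount++;
            byteCount += f.length(); // where the real program would also index the file
        }
    }

    public static void main(String[] args) {
        walk(new File(args[0]));
        System.out.println(fileCount + " files, " + byteCount + " bytes");
    }
}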

Sunday, February 11, 2007

Class Wed. 2/7

I spent a lot of time finishing up the slides from Chapter 1, but also touched on some ideas that get discussed in detail in Chapter 2 and beyond.

The definitions of Precision and Recall are important. It is usually easy to measure precision, as long as you can tell when a specific document is relevant to a specific query. Recall is much harder to calculate, since the number of relevant documents in a large collection may well be unknown. I also talked about the pooled approach to IR evaluation. Without the pooled approach, evaluation of recall would be very difficult, if not impossible, on large collections. An overview of TREC, including a discussion of pooled document evaluation, is presented in Donna Harman's overview of TREC 4, available at http://trec.nist.gov/pubs/trec4/t4_proceedings.html.
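
For reference, the standard definitions, with "retrieved" meaning the set of documents the system returns and "relevant" meaning the set of documents actually relevant to the query:

\[
\text{precision} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|},
\qquad
\text{recall} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|}
\]

The denominator of recall is exactly the quantity that is hard to know for a large collection, which is what the pooled approach tries to get around.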

In another post I mention pivoted document length normalization, which is credited (in my mind at least) to Amit Singhal, then at Cornell and now at Google. PDLN is covered in Chapter 2. Somebody asked a question on Wednesday that brought query zoning to mind, and that is also credited to Singhal. His paper from SIGIR'97 is also worth reading, I think. This and related work can be seen at http://singhal.info/publications.html

two important papers

In the discussion of the vector space model, Grossman and Frieder mention Salton's November 1975 CACM paper. I recommend that you read it. It's available through the ACM Digital Library, and through Google Scholar.

They also mention Pivoted Document Length Normalization, which appeared at the 1996 SIGIR conference. The main author is Amit Singhal. That paper is likely still the best explanation of PDLN, which was widely adopted in IR within a year or two of its introduction.

Monday, February 5, 2007

David D Lewis's web page

Reuters corpus

IR packages

So I mentioned the Lucene project last time.

Andrew McCallum (UMass) still makes the Bag of Words software available.

What other packages are available to those who want to build their own IR systems? They may or may not support multiple languages or multiple retrieval models. They may be general purpose, or restricted to Web search, for example.

The Lemur package from CMU

Two packages from NIST
PRISE
NIRVE

References to these and a couple more can be found at the SIGIR site. There's some good stuff out there!

MG is a little older, and is discussed in detail in the book Managing Gigabytes. Zettair is related to MG, I think.