Monday, April 9, 2007

Wednesday, April 4, 2007

On Monday 4/2 we began the clustering tutorial found at http://www.cs.umbc.edu/~nicholas/clustering

We finished it on Wednesday, April 4.

Two search engines that use clustering are Vivisimo and iBoogie. Both still seem to be around.

We also talked about the programming project. Basically, the project is to use the ngram package (located in Lucene's contrib directory) with the Reuters corpus.

Since the Reuters corpus uses SGML markup and stores several documents in a single file, parsing the documents is non-trivial. Originally I had said to index only the text inside paragraphs, but that may be too awkward.

So, what would it take to index the whole Reuters corpus using n-grams (with n=5)? One approach would be to use a Perl script to write each document to its own file, which would produce several thousand files. With n-grams, common n-grams (including those that come from the markup) will occur in each document with roughly the same frequency, so the markup shouldn't make much difference.
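The file-per-document approach above could also be sketched in Python rather than Perl. This is only a rough sketch under an assumption that is not stated in these notes: that each document is wrapped in a <REUTERS ...> ... </REUTERS> element, as in the Reuters-21578 distribution. The function names and the output file naming scheme are made up for illustration.

```python
import re
from pathlib import Path

# Assumption: each document is wrapped in <REUTERS ...> ... </REUTERS>,
# as in the Reuters-21578 distribution. Adjust the pattern for other layouts.
DOC_PATTERN = re.compile(r"<REUTERS\b.*?</REUTERS>", re.DOTALL)


def split_documents(sgml_text):
    """Return each <REUTERS>...</REUTERS> element as a string, markup included."""
    return DOC_PATTERN.findall(sgml_text)


def write_documents(sgml_path, out_dir):
    """Write every document found in one SGML file to its own file.

    Returns the number of documents written. The file names
    (<input stem>-<index>.sgm) are a hypothetical naming scheme.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # latin-1 maps every byte to a character, so stray bytes cannot
    # cause a decoding error.
    text = Path(sgml_path).read_text(encoding="latin-1")
    docs = split_documents(text)
    stem = Path(sgml_path).stem
    for i, doc in enumerate(docs):
        (out_dir / f"{stem}-{i:04d}.sgm").write_text(doc, encoding="latin-1")
    return len(docs)
```

The markup is deliberately left in each output file, since the point above is that it shouldn't make much difference to an n-gram index.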

So that's the project: use the ngrams package to index the Reuters corpus, and use some sample queries, maybe titles with empty documents, to show that it works. Queries may have SGML markup.

I had also described Homework 3: using a script, Lucene, or a program of your choice, tell me the ten most common 5-grams in the Reuters corpus.
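To make the task concrete, here is a minimal sketch of the counting step in Python. It assumes character 5-grams (overlapping windows of five characters), which these notes do not spell out, and the function names are made up for illustration.

```python
from collections import Counter


def char_ngrams(text, n=5):
    """Yield every overlapping character n-gram in text."""
    for i in range(len(text) - n + 1):
        yield text[i:i + n]


def most_common_ngrams(texts, n=5, k=10):
    """Return the k most frequent n-grams across texts as (ngram, count) pairs."""
    counts = Counter()
    for text in texts:
        # Counting one text at a time means no n-gram spans two documents.
        counts.update(char_ngrams(text, n))
    return counts.most_common(k)
```

Passing in the contents of each document (or of each corpus file) gives the ten most common 5-grams with the default arguments. Whether to lowercase the text or strip the markup first is a choice left open here.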
