Wednesday, April 18, 2007

schedule of talks

5/2
Ron Roff
Joel G.
Mike Wilson
JC Montminy

5/7
Chris
Mansi Radke
Beenish
Luke

5/9
Stephen
Ginny
Jason
Sayeed


5/14
Justin
Mike
Marcin
Sandor
Aparna

Tuesday, April 17, 2007

class Monday evening 4/16

So, you may have noticed that we had no class yesterday evening. The Catonsville area, and UMBC in particular, suffered a power outage that closed us down from noon until 6pm. The department mail servers were running, but none of my machines had power, so there was no way for me to notify you.

In hindsight, I should have put a notice on the door, but frankly that slipped my mind.

Anyway, I apologize to those who made the trip to campus for nothing.

I plan to cover cross-language IR on Wednesday. I'll be posting a paper or two shortly.

Wednesday, April 11, 2007

format for writing project presentations

max. 15 minutes - I suggest you rehearse

2-3 minutes/slide

title and your name
executive summary <= 50 words
what piqued your interest in this topic?

present an example (multiple slides, but make it snappy)
OR
explain what you learned

what assumptions are made? what are the advantages and disadvantages?

future work - questions that still need to be answered
conclusions

peer review is fine! don't be mean to each other

Monday, April 9, 2007

Monday April 9

The material on passage-based retrieval won't take too long to present, so we'll probably discuss the programming project a little. The Writing Project is due soon, i.e. Wednesday of next week, April 18. Homework 3 will be due the following Monday, April 23.

Wednesday, April 4, 2007

On Monday 4/2 we began the clustering tutorial found at http://www.cs.umbc.edu/~nicholas/clustering

We finished it on Wednesday, April 4.

Two search engines that use clustering are Vivisimo and iBoogie. Both still seem to be around.

We also talked about the programming project. Basically, the project is to use the ngram package (located in Lucene's contrib directory) with the Reuters corpus.

Since the Reuters corpus uses SGML markup and puts several documents in a single file, parsing the documents is non-trivial. Originally I had said to index only the text inside paragraphs, but that may be too awkward.

So, what would it take to index the whole Reuters corpus using n-grams (with n=5)? One approach would be to use a Perl script to make a file for each document, which would mean several thousand files. With n-grams, the common n-grams coming from the markup will occur in each document with roughly the same frequency, so the markup shouldn't make much difference.
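Here's a rough sketch of that one-file-per-document step, in Python rather than Perl. It assumes the Reuters-21578 layout (reut2-*.sgm files, each holding many documents delimited by <REUTERS ...> ... </REUTERS> tags); the directory names are just placeholders.

# Sketch: split each Reuters .sgm file into one file per <REUTERS> document.
# Assumes the Reuters-21578 layout; adjust SRC_DIR/OUT_DIR for your copy.
import glob
import os
import re

SRC_DIR = "reuters21578"   # placeholder: directory holding the .sgm files
OUT_DIR = "reuters_docs"   # placeholder: one output file per document
os.makedirs(OUT_DIR, exist_ok=True)

# Each document sits between <REUTERS ...> and </REUTERS>.
doc_pattern = re.compile(r"<REUTERS.*?</REUTERS>", re.DOTALL)

count = 0
for sgm_path in sorted(glob.glob(os.path.join(SRC_DIR, "*.sgm"))):
    # The corpus is mostly ASCII; Latin-1 is a safe way to read the odd bytes.
    with open(sgm_path, encoding="latin-1") as f:
        text = f.read()
    for doc in doc_pattern.findall(text):
        count += 1
        out_path = os.path.join(OUT_DIR, "doc%05d.txt" % count)
        with open(out_path, "w", encoding="latin-1") as out:
            out.write(doc)

print("wrote", count, "documents")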

So that's the project: use the ngram package to index the Reuters corpus, and run some sample queries, maybe titles from documents with empty bodies, to show that it works. Queries may have SGML markup.
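To give a feel for what the n-gram terms look like, here's a tiny illustration of breaking a query, markup and all, into character 5-grams. This is not the Lucene contrib code, just a sketch of the idea, and the sample query string is made up.

# Sketch: character 5-grams of a query string (illustration only; the project
# itself should use the ngram package from Lucene's contrib directory).
def char_ngrams(text, n=5):
    text = " ".join(text.split())   # collapse runs of whitespace
    return [text[i:i + n] for i in range(len(text) - n + 1)]

query = "<TITLE>COCOA PRICES</TITLE>"   # made-up query, SGML markup and all
print(char_ngrams(query)[:10])

Note that the markup contributes 5-grams like "<TITL" that show up all over the corpus, which is why it shouldn't hurt retrieval much.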

I had also described Homework 3: using a script, Lucene, or a program of your choice, tell me the ten most common 5-grams in the Reuters corpus.
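If you want a starting point, here's one way the counting might go, assuming character 5-grams over the raw text of the per-document files produced by the split script above (whether to lowercase or strip the markup first is your call):

# Sketch: the ten most common character 5-grams in the Reuters corpus.
# Assumes one file per document in reuters_docs/ (see the split script above);
# it counts raw 5-grams, markup included.
import glob
from collections import Counter

counts = Counter()
for path in glob.glob("reuters_docs/*.txt"):
    with open(path, encoding="latin-1") as f:
        text = " ".join(f.read().split())   # collapse whitespace
    counts.update(text[i:i + 5] for i in range(len(text) - 4))

for gram, freq in counts.most_common(10):
    print(repr(gram), freq)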

Monday, April 2, 2007

Wednesday 3/28
We went over the McNamee and Mayfield paper in some detail, and then finished Salton and Buckley's relevance feedback paper. Shannon's 1948 paper is available on the web and is still worth reading.

The use of LSA in cross-language IR is described in several places, e.g.
http://lsi.research.telcordia.com/lsi/papers/XLANG96.pdf
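Very roughly, the cross-language trick in that line of work is to train LSA on parallel documents whose two language versions are merged into a single column of the term-document matrix, take a truncated SVD, and then fold monolingual queries and documents into the reduced space, where terms from either language can land close together. Here's a toy sketch of that idea with a made-up six-term vocabulary (numpy for the SVD); it's just an illustration, not the paper's actual setup.

# Toy sketch of cross-language LSI: parallel training documents are merged
# (English + French terms share one column), a truncated SVD gives the latent
# space, and an English-only query is folded in to score a French-only document.
# The tiny vocabulary and counts below are invented for illustration.
import numpy as np

vocab = ["cocoa", "coffee", "price", "cacao", "cafe", "prix"]  # en + fr terms

# Columns = three parallel training documents, each containing both languages.
A = np.array([
    [2, 0, 1],   # cocoa
    [0, 2, 0],   # coffee
    [1, 1, 2],   # price
    [2, 0, 1],   # cacao
    [0, 2, 0],   # cafe
    [1, 1, 2],   # prix
], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk = U[:, :k], s[:k]

def fold_in(term_vector):
    """Project a term-count vector into the k-dimensional LSI space."""
    return (term_vector @ Uk) / sk

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1, 0, 1, 0, 0, 0], dtype=float)       # English: "cocoa price"
french_doc = np.array([0, 0, 0, 1, 0, 1], dtype=float)  # French: "cacao prix"
print(cosine(fold_in(query), fold_in(french_doc)))      # close to 1.0 here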