Monday, March 26, 2007

plans for Monday 3/26 and Wednesday 3/28

Tonight I'll be talking about relevance feedback. Let me know if you think this is a good topic or not :-)

Do people want to learn more about n-grams? An article on generalized n-grams just appeared in Information Processing and Management.

In my opinion, IP&M is one of the very best IR journals. You can access it online at
http://www.sciencedirect.com/science/journal/03064573

Wednesday is a little open - more on RF, more on n-grams, or maybe an introduction to clustering.

Information Retrieval: Data Structures & Algorithms

edited by William B. Frakes and Ricardo Baeza-Yates

http://www.pimpumpam.com/motoridiricerca/ir/toc.htm

Thursday, March 15, 2007

more on the writing project

You may implement something, but that's not necessary. You'll probably need to read up on your topic, focus QUICKLY on some particular subtopic, and explain it to me in ten pages... If you write something explaining that topic to me, maybe with a simple example, that would be fine.

A student wrote:

> Hello. I just had a question about the writing project.
>
> What exactly are you expecting for this project? I'm not entirely certain
> exactly what I'm supposed to write -- is this a research style project
> where I'm supposed to implement something and investigate a new topic, or
> am I going to go over a bunch of subtopics in the topic I have provided
> and inform you about them? I thought I had a better idea over what
> exactly was necessary, but I find myself a little confused about it right
> now.

Wednesday, March 14, 2007

Notes from March 12 and March 14

Introduced LSA on Monday.

Finished LSA on Wednesday, and talked about n-grams. I probably should have passed out Damashek 95 beforehand. I was asked what character set was used in the acquaintance plots; I believe it was all Unicode, but I'm not certain.

Friday, March 9, 2007

clarification for hw 2

The idea of hw 2 is to give experience with computing tf.idf weights, and to see how stopwords are treated. One approach is to compute the tf.idf score for each of the 35 stopwords, on a per-document basis. The output would be a 330 by 35 matrix, and most of the scores should be positive but close to zero. Reading the output may be a little cumbersome, but this would be fine.
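Here's a minimal sketch of that first approach in Python. The tiny stopword list, the placeholder documents, and the exact weighting (raw term count times log(N/df), no smoothing) are my assumptions for illustration, not requirements of the assignment.

import math
from collections import Counter

STOPWORDS = ["the", "of", "and", "a", "to"]   # stand-in for the real list of 35

def tfidf_matrix(docs, stopwords):
    """Return one row of tf.idf scores per document, one column per stopword."""
    counts = [Counter(d.lower().split()) for d in docs]
    n_docs = len(docs)
    # document frequency and idf for each stopword (idf = log(N/df) is an assumption)
    df = {w: sum(1 for c in counts if w in c) for w in stopwords}
    idf = {w: math.log(n_docs / df[w]) if df[w] else 0.0 for w in stopwords}
    return [[c[w] * idf[w] for w in stopwords] for c in counts]

# placeholder corpus; with 330 documents and 35 stopwords this is a 330 by 35 matrix
docs = ["the cat sat on the mat", "a dog and a cat"]
for row in tfidf_matrix(docs, STOPWORDS):
    print(["%.4f" % x for x in row])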

Another approach is to keep track of the min and max values of tf for each of the 35 stopwords as the documents are parsed and the index is built. Then calculate idf for each stopword, and print the min tf.idf and max tf.idf for each stopword.
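A sketch of this second approach, under the same assumed weighting as above: a single pass tracks min and max tf per stopword, and idf is applied only at the end.

import math
from collections import Counter

def minmax_tfidf(docs, stopwords):
    """Track min/max tf per stopword while parsing, then apply idf at the end."""
    n_docs = len(docs)
    min_tf = {w: float("inf") for w in stopwords}
    max_tf = {w: 0 for w in stopwords}
    df = {w: 0 for w in stopwords}
    for doc in docs:
        counts = Counter(doc.lower().split())
        for w in stopwords:
            tf = counts[w]
            min_tf[w] = min(min_tf[w], tf)
            max_tf[w] = max(max_tf[w], tf)
            if tf:
                df[w] += 1
    for w in stopwords:
        idf = math.log(n_docs / df[w]) if df[w] else 0.0   # assumed idf formula
        print("%-6s min tf.idf = %.4f   max tf.idf = %.4f"
              % (w, min_tf[w] * idf, max_tf[w] * idf))

This version never materializes the full 330 by 35 matrix, which keeps the output short: one line per stopword.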