Wednesday, May 9, 2007

programming project

When you turn in your project, I'd like to see: the code you wrote or modified; figures showing how large the indices are for words and n-grams (details from a directory listing, for example); and a couple of sample queries and the results generated.

April 23, 25 and 30

I note that my blogging has taken a break.

Web search could probably be a course on its own, covering search as well as web services, RSS, and maybe semantic web stuff. In an IR course, web search may be good as a running example, but it can't take over the course.

A newer textbook can cover new topics, and that's important in an IR course. Is it my imagination, though, or is it true that some older textbooks are better than some new ones, even if some material is dated?

Callan's paper is fine, but I feel the need to add a survey of peer-to-peer IR.

May 2, 7 and 9

So far, the student talks have gone well. People have been staying in the time limits very well, without too much prodding from me, and that's good. I'm learning from listening to the talks, and that's good too!

Wednesday, April 18, 2007

schedule of talks

5/2
Ron Roff
Joel G.
Mike Wilson
JC Montminy

5/7
Chris
Mansi Radke
Beenish
Luke

5/9
Stephen
Ginny
Jason
Sayeed


5/14
Justin
Mike
Marcin
Sandor
Aparna

Tuesday, April 17, 2007

class Monday evening 4/17

So, you may have noticed that we had no class yesterday evening. The Catonsville area and UMBC in particular suffered a power outage that closed us down from noon until 6pm. The department mail servers were running, but I had no machine that had power, so there was no way for me to notify you.

In hindsight, I should have put a notice on the door, but frankly that slipped my mind.

Anyway, I apologize to those who made the trip to campus for nothing.

I plan to cover cross-language IR on Wednesday. I'll be posting a paper or two shortly.

Wednesday, April 11, 2007

format for writing project presentations

max. 15 minutes - I suggest you rehearse

2-3 minutes/slide

title and your name
executive summary <= 50 words
what piqued your interest in this topic?

present an example (multiple slides, but make it snappy)
OR
explain what you learned

what assumptions are made? what are the advantages and disadvantages?

future work - questions that still need to be answered
conclusions

peer-review is fine! don't be mean to each other

Monday, April 9, 2007

Monday April 9

The material on passage-based retrieval won't take too long to present. We'll probably discuss the programming project a little. The Writing Project is due soon, on Wednesday of next week, April 18. Homework 3 will be due the following Monday, April 23.

Wednesday, April 4, 2007

On Monday 4/2 we began the clustering tutorial found at http://www.cs.umbc.edu/~nicholas/clustering

Finished that on Wednesday April 4

Two search engines that use clustering are Vivisimo and iBoogie. Both still seem to be around.

Also talked about the programming project. Basically, the project is to use the ngram package (located in Lucene's contrib directory) with the Reuters corpus.

Since the Reuters corpus uses SGML markup and several documents in a single file, parsing the documents is non-trivial. Originally I had said to index only the text inside paragraphs, but that may be too awkward.

So, what would it take to index the whole Reuters corpus using n-grams (with n=5)? One approach would be to use a Perl script to make a file for each document, which would involve several thousand files. With n-grams, common n-grams will occur in each document with roughly the same frequency, so the markup shouldn't make much difference.

So that's the project: use the ngram package to index the Reuters corpus, and run some sample queries, maybe titles with empty documents, to show that it works. Queries may contain SGML markup.

I had also described Homework 3: using a script, Lucene, or a program of your choice, tell me the ten most common 5-grams in the Reuters corpus.
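
If you take the script route, a minimal sketch in Python (the tokenization is deliberately naive, and for the homework you'd read the actual Reuters files into `text` rather than the toy string used here):

```python
from collections import Counter

def char_ngrams(text, n=5):
    """Yield overlapping character n-grams from a string."""
    for i in range(len(text) - n + 1):
        yield text[i:i + n]

def top_ngrams(text, n=5, k=10):
    """Return the k most common character n-grams with their counts."""
    return Counter(char_ngrams(text, n)).most_common(k)

# Toy example; substitute the full corpus text for the homework.
print(top_ngrams("to be or not to be", n=5, k=3))
```

Whether to lowercase, strip the SGML tags, or collapse whitespace first is up to you; just say what you did.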

Monday, April 2, 2007

Wednesday 3/28
We went over the McNamee and Mayfield paper in some detail, and then we finished Salton and Buckley's relevance feedback paper. Shannon's 1948 paper is available on the web, and is still worth reading.

The use of LSA in cross-language IR is described in several places, e.g.
http://lsi.research.telcordia.com/lsi/papers/XLANG96.pdf

Monday, March 26, 2007

plans for Monday 3/26 and Wednesday 3/28

Tonight I'll be talking about relevance feedback. Let me know if you think this is a good topic or not :-)

Do people want to learn more about n-grams? An article on generalized n-grams just appeared in Information Processing and Management.

In my opinion, IP&M is one of the very best IR journals. You can access it online at
http://www.sciencedirect.com/science/journal/03064573

Wednesday is a little open - more on RF, more on n-grams, or maybe an introduction to clustering.

Information Retrieval: Data Structures & Algorithms
edited by William B. Frakes and Ricardo Baeza-Yates
http://www.pimpumpam.com/motoridiricerca/ir/toc.htm

Thursday, March 15, 2007

more on the writing project

You may implement something, but that's not necessary. You'll probably need to read up on your topic, focus QUICKLY on some particular subtopic, and explain it to me in ten pages... If you write something explaining that topic to me, maybe with a simple example, that would be fine.

A student wrote:

> Hello. I just had a question about the writing project.
>
> What exactly are you expecting for this project? I'm not entirely certain
> exactly what I'm supposed to write -- is this a research style project
> where I'm supposed to implement something and investigate a new topic, or
> am I going to go over a bunch of subtopics in the topic I have provided
> and inform you about them? I thought I had a better idea over what
> exactly was necessary, but I find myself a little confused about it right
> now.

Wednesday, March 14, 2007

Notes from March 12 and March 14

Introduced LSA on Monday.

Finished LSA on Wednesday, and talked about n-grams. I probably should have passed out Damashek '95 beforehand. I was asked what character set was used in the acquaintance plots; I thought it was Unicode, but I'm not sure.

Friday, March 9, 2007

clarification for hw 2

The idea of hw 2 is to give experience with computing tf.idf weights, and to see how stop words are treated. One approach is to compute the tf.idf score for each of the 35 stopwords, on a per document basis. The output would be a 330 by 35 matrix, and most of the scores should be positive but close to zero. Reading the output may be a little cumbersome, but this would be fine.

Another approach is to keep track of the min and max values of tf, for each of the 35 stopwords, as the documents are parsed and the index built. Then calculate idf for each stopword, and print for each stop word the min tf.idf and max tf.idf.
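
That second approach might look like this sketch (the tf.idf variant shown is plain tf × log(N/df); the Salton and Buckley formulas offer other choices, and the document lists here are stand-ins for the parsed collection):

```python
import math

def stopword_tfidf_ranges(docs, stopwords):
    """For each stopword, report (min, max) tf.idf over the documents.

    docs: list of token lists; stopwords: iterable of terms.
    Uses the simple weight tf * log(N / df), where df is the number
    of documents containing the term.
    """
    N = len(docs)
    report = {}
    for w in stopwords:
        tfs = [d.count(w) for d in docs]
        df = sum(1 for tf in tfs if tf > 0)
        if df == 0:
            report[w] = (0.0, 0.0)
            continue
        idf = math.log(N / df)
        weights = [tf * idf for tf in tfs]
        report[w] = (min(weights), max(weights))
    return report

docs = [["the", "cat", "sat"], ["the", "the", "dog"], ["a", "dog"]]
print(stopword_tfidf_ranges(docs, ["the", "a"]))
```

Note that a stopword occurring in every document gets idf = log(1) = 0, so its weights vanish; that's exactly the "positive but close to zero" behavior described above.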

Wednesday, February 28, 2007

Homework 2

This homework is due Monday, March 12

1) find the module or modules in Lucene that handle stopwords. Is there a static stopword list? If so, where is it?

2) how would Lucene be modified in order to count the occurrences of individual stopwords?

3) Make the necessary changes, and rebuild the index on the Lucene src tree as in homework 1, and have it print a report saying how many times each stopword occurred (tf) and the total number of documents in which that stopword occurred (df). Then using (one of) Salton and Buckley's suggested tf.idf formulae for documents, print the term weight that should be given to each stopword.

Notes from 2/26, plans for 2/28

On Monday I talked about some more writing project topics. I started talking about probabilistic IR, using the slides posted.

I'll do more with probabilistic IR this evening.

Monday, February 26, 2007

Notes from 2/21, plans for 2/26

Spent a LOT of time last Wednesday talking about writing project topics. Here are some more:

  • There are other packages besides Lucene, e.g. Lemur, and Clairlib, and maybe others. A comparison of those packages would be a good topic.
  • The connection between IR and other areas, such as machine learning or NLP, can be explored.
For Monday evening 2/26, I'll talk about the Salton and Buckley paper, and introduce the concept of probabilistic IR.

Tuesday, February 20, 2007

writing project

My usual procedure is to wait until later in the semester, and assign a writing project that is due at the end of the semester. People rarely complain, but who needs more stress at the end of the semester anyway?

So let's get an early start. Within ten days, say by Monday March 5, I'd like you to tell me, in a short email, the topic of your paper. It has to have something to do with IR, and NOT something that we'll be going over in class in detail, although going in depth in some topic that we mention in class is fine. Describe your topic in a paragraph, and list at least three references that you're thinking of consulting.

The final paper should be about ten pages, with at least ten references. Don't let all the references be from Wikipedia. The paper will be due on Wednesday, April 18.

There are lots of possible topics! We can start with the
SIGIR Call for Papers

and move on to the
CIKM Call for Papers

I'm not expecting original research results of conference quality (although that'd be nice) but you'll need to do something more than just a rehash of existing work. It's always a good idea to summarize work in an area, and then suggest future work that somebody could do for a 698 project, or a thesis. Another approach is to study some technique, and then present a new example that would help people understand it. If you want to write a program (e.g. an extension or modification to Lucene) you can include that as an appendix, and it won't count towards the ten-page limit.

It doesn't bother me if your writing project happens to be related to your job, or dovetails with something you're doing in another class.

Monday 2/19/07

Distributed annotated copies of the onjava article dated 1/15/03, and the today.java dated 7/30/03.

Most people seem to have finished homework 1, and we discussed it a little. Getting Lucene to recompile was the hardest part, at least for me.

In response to questions, I talked about phrase-based retrieval and n-gram retrieval (both character and word n-grams) as alternatives to the bag of words model. Note that words, phrases, and n-grams have their pros and cons - all three are just the way you decide what terms are to be indexed. Once the "term space" is identified, the vector space, probabilistic, or boolean models of retrieval are options.
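
To make the "term space" point concrete, here is a toy sketch showing the three choices applied to the same text (the tokenization is my own simplification, and word bigrams stand in crudely for phrases):

```python
def words(text):
    """Bag-of-words terms: lowercase whitespace tokens."""
    return text.lower().split()

def word_ngrams(text, n=2):
    """Word n-gram terms, a rough stand-in for phrases."""
    toks = words(text)
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def char_ngrams(text, n=3):
    """Character n-gram terms."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

text = "Information retrieval is fun"
print(words(text))        # word terms
print(word_ngrams(text))  # word bigram terms
print(char_ngrams(text))  # character trigram terms
```

Whichever extractor you pick defines the term space; the vector space, probabilistic, or boolean model then operates over those terms unchanged.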

Unix tools can be used to do "sanity checks" on IR results.

Friday, February 16, 2007

Homework 1 now due Wednesday 2/21

No class on Wednesday February 14

Notes from Monday 2/12. Went over the slides on the vector space model, including a short example with tf.idf and calculation of a similarity coefficient. Did NOT cover the problem of document (or query) length normalization, nor the boolean model.
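
For anyone who missed the in-class example, a bare-bones version of the similarity calculation (the term weights are made up for illustration; imagine they came from a tf.idf computation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy tf.idf-weighted vectors (the weights are invented)
doc = {"gold": 1.2, "silver": 0.8, "truck": 0.5}
query = {"gold": 1.2, "truck": 0.5}
print(round(cosine(doc, query), 3))
```

Note that cosine builds in a simple length normalization (division by the vector norms); the subtler question of *document* length normalization is the one we skipped.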

Monday, February 12, 2007

gzipped tarfile

http://www.cs.umbc.edu/~nicholas/676/lucene-2.0.0.tar.gz

Lucene demo

I unpacked the Lucene package, and everything is available under the course web site. Under docs, look at "Getting Started", and the basic Lucene demo.

I have extracted Lucene, and compiled it. The gzipped tarfile is called lucene-2.0.0.tar.gz, and I'll put a link on the course web site in case you'd like (and have room for) your own copy.

Building Lucene required me to install Apache Ant, which I hope you won't need. The hardest part was modifying my CLASSPATH, as shown below:

# to build and use lucene
setenv ANT_HOME $HOME/www/676/apache-ant-1.7.0
setenv LUCHOME $HOME/www/676/lucene-2.0.0
setenv CLASSPATH .\:$LUCHOME/lucene-demos-2.0.0.jar\:$LUCHOME/lucene-core-2.0.0.jar


A tour of the Lucene directory might be a good idea. Building an index of the Lucene source files is gratifying!

Homework 1: Repeat this demo, as far as creating an index of the Lucene source. Then figure out how one would modify the IndexFiles program so that it counts the number of files indexed, and the number of bytes read. If you know Java well enough and have the disk space to recompile it, you may demonstrate that your modifications work. You should all get the same answers! We'll talk about this in class on Wednesday a little, and this homework will be due next Monday.

Sunday, February 11, 2007

Class Wed. 2/7

I spent a lot of time finishing up the slides from Chapter 1, but also touched on some ideas that get discussed in detail in Chapter 2 and beyond.

The definitions of Precision and Recall are important. It is usually easy to measure precision, as long as you can tell when a specific document is relevant to a specific query. Recall is much harder to calculate, since the number of relevant documents in a large collection may well be unknown. I also talked about the pooled approach to IR evaluation. Without the pooled approach, evaluation of recall would be very difficult if not impossible on large collections. An overview of TREC, including a discussion of pooled document evaluation, is presented in Donna Harman's overview of TREC 4, available at http://trec.nist.gov/pubs/trec4/t4_proceedings.html.
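
As a reminder of the definitions, a tiny sketch (the document IDs are invented):

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|.
    Recall = |retrieved ∩ relevant| / |relevant|."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d1", "d2", "d3", "d4"]   # what the system returned
relevant = ["d2", "d4", "d7"]          # the (usually unknown) truth
print(precision_recall(retrieved, relevant))
```

The catch, of course, is that `relevant` is exactly the set we rarely know in full for a large collection, which is why pooling matters.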

In another post I mention pivoted document length normalization, which is credited (in my mind at least) to Amit Singhal, then at Cornell and now at Google. PDLN is covered in Chapter 2. Somebody asked a question on Wednesday that brought query zoning to mind, and that is also credited to Singhal. His paper from SIGIR'97 is also worth reading, I think. This and related work can be seen at http://singhal.info/publications.html

two important papers

In the discussion of the vector space model, Grossman and Frieder mention Salton's November 1975 CACM paper. I recommend that you read it. It's available through the ACM Digital Library, and through Google Scholar.

They also mention Pivoted Document Length Normalization, which appeared in the 1996 SIGIR conference. The main author is Amit Singhal. That paper is likely still the best explanation of PDLN, which within a year or two of its introduction was widely accepted in IR.
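
From memory, the core idea of PDLN is to replace the plain normalization factor with one pivoted around the collection average, so that cosine's tendency to over-penalize long documents is corrected. A sketch, with an illustrative slope value (check Singhal's paper for the actual parameterization):

```python
def pivoted_norm(old_norm, pivot, slope=0.25):
    """Pivoted normalization factor: (1 - slope) * pivot + slope * old_norm,
    where old_norm is the document's original normalization (e.g. its
    vector length) and pivot is the average old_norm over the collection."""
    return (1.0 - slope) * pivot + slope * old_norm

norms = [4.0, 8.0, 12.0]            # made-up document vector lengths
pivot = sum(norms) / len(norms)     # average = 8.0
for n in norms:
    print(n, "->", pivoted_norm(n, pivot))
```

A document whose length sits at the pivot keeps its old factor; shorter documents get a larger divisor than before (scores drop) and longer ones a smaller divisor (scores rise), which is the intended correction.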

Monday, February 5, 2007

David D Lewis's web page

Reuters corpus

IR packages

So I mentioned the Lucene project last time.

Andrew McCallum (UMass) still makes the
Bag of Words software available.

What other packages are available to those who want to build their own IR systems? They may or may not support multiple languages or multiple retrieval models. They may be general purpose, or perhaps restricted to Web search for example.

The Lemur package from CMU

Two packages from NIST
PRISE
NIRVE

References to these and a couple more can be found at the
SIGIR site. There's some good stuff out there!

MG is a little older, and is discussed in detail in the book Managing Gigabytes. Zettair is related to MG, I think.

Wednesday, January 31, 2007

The project used a few years ago.

Fall 2002 project

Notes from class 1/29

As usual, setting up the computer support took some time; I should plan ahead for getting my laptop booted and connected. Now that I know how to use Skype better, talking to remote students should be easier.

I asked people to look at the project that was used a few years ago. I also asked people to try to post a comment to the blog, but I haven't seen anything yet.

On Wednesday 1/31, I'll take some Polaroids! If I can get Google Talk to work on my laptop, that will be good.

If this course were to go on-line or hybrid, access to the lecture slides would obviously be critical. A set time for conference calls with students might be a good idea, although awkward for students in distant time zones. Maybe Skype could be used to record the lecture, but breaking it up slide by slide would help even more.

So Wednesday evening I plan to take pictures, and finish the slides started last time. We'll also discuss the project.

Tuesday, January 9, 2007


Information Retrieval

MW 5:30-6:45pm
Room to be announced
Dr. Charles Nicholas
nicholas@umbc.edu

Dr. Ian Soboroff taught this course a few years back, and I like his introduction:

This course is an introduction to the theory and implementation of software systems designed to search through large collections of text. Ever wonder how World-Wide Web search engines work? Ever wondered why they don't? You'll learn about it here. Information retrieval (IR) is one of the oldest branches of computer science, and has influenced nearly every aspect of computer usage: "search and replace" in a word processor, querying a card catalog, grep'ing through your source code, filtering the spam out of your email, searching the Web.

This course will have two main thrusts. The first is to cover the fundamentals of IR: retrieval models, search algorithms, and IR evaluation. The second is to give a taste of the implementation issues by having you write (a good chunk of) your own text search engine and test it out on a sample text collection. This will be a semester-long project, details TBA.

You will need to have taken the equivalent of CMSC 341 (Data Structures), and an algorithms course (441 or 641) is recommended. Linear algebra (MATH 221) and Statistics (STAT 355) are recommended but not required; they give background which will be helpful in understanding many IR concepts.

Text
The text will be Grossman and Frieder, available at the UMBC bookstore (at least it's been ordered) as well as Amazon. We will follow this book fairly closely. Details about which chapters will be covered, and when, will follow. Other readings will be assigned, and made available.

Grading
There will be a multi-phase programming project, details to be announced, worth about 50% of the grade. Homeworks will be another 25%. There will also be a writing project, worth 25%. Presentations on the programming project will take the place of the final exam.

Academic Integrity
"By enrolling in this course, each student assumes the responsibilities of an active participant in UMBC's scholarly community in which everyone's academic work and behavior are held to the highest standards of honesty. Cheating, fabrication, plagiarism, and helping others to commit these acts are all forms of academic dishonesty, and they are wrong. Academic misconduct could result in disciplinary action that may include, but is not limited to, suspension or dismissal. To read the full Student Academic Conduct Policy, consult the UMBC Student Handbook, the Faculty Handbook, or the UMBC Policies section of the UMBC Directory [or for graduate courses, the Graduate School website]."

Welcome

Welcome to CS 676, a course on information retrieval!

It's not enough for a course to have a web site. Nowadays you must have a blog as well! So this blog will be used for much of the official and unofficial communication related to the course.

Only students in the class, and I, will be able to post to the blog. Readership alone is open to the public.