Wednesday, May 9, 2007
programming project
April 23, 25 and 30
Web search could probably be a course on its own, covering search as well as web services, RSS, and maybe semantic web stuff. In an IR course, web search may be good as a running example, but it can't take over the course.
A newer textbook can cover new topics, and that's important in an IR course. Is it my imagination, though, or is it true that some older textbooks are better than some new ones, even if some material is dated?
Callan's paper is fine, but I feel the need to add a survey of peer-to-peer IR.
May 2, 7 and 9
Wednesday, April 18, 2007
schedule of talks
Ron Roff
Joel G.
Mike Wilson
JC Montminy
5/7
Chris
Mansi Radke
Beenish
Luke
5/9
Stephen
Ginny
Jason
Sayeed
5/14
Justin
Mike
Marcin
Sandor
Aparna
Tuesday, April 17, 2007
class Monday evening 4/17
In hindsight, I should have put a notice on the door, but frankly that slipped my mind.
Anyway, I apologize to those who made the trip to campus for nothing.
I plan to cover cross-language IR on Wednesday. I'll be posting a paper or two shortly.
Wednesday, April 11, 2007
format for writing project presentations
2-3 minutes/slide
title and your name
executive summary <= 50 words
what piqued your interest in this topic?
present an example (multiple slides, but make it snappy)
OR
explain what you learned
what assumptions are made? what are the advantages and disadvantages?
future work - questions that still need to be answered
conclusions
peer-review is fine! don't be mean to each other
Monday, April 9, 2007
Monday April 9
Wednesday, April4, 2007
Finished that on Wednesday April 4
Two search engines that use clustering include vivisimo and iboogie. Both still seem to be around.
Also talked about the programming project. Basically, the project is to use the ngram package (located in lucene's contrib directory) with the Reuters corpus.
Since the Reuters corpus uses SGML markup and several documents in a single file, parsing the documents is non-trivial. Originally I had said to index only the text inside paragraphs, but that may be too awkward.
So, what would it take to index the whole Reuters corpus, using ngrams, (with n=5)? One approach would be to use a perl script to make a file for each document, which would involve several thousand files. With n-grams, common ngrams will occur in each document with roughly the same frequency, so the markup shouldn't make much difference.
So that's the project: use the ngrams package to index the Reuters corpus, and use some sample queries, maybe titles with empty documents, to show that it works. Queries may have SGML markup.
I had also described Homework 3: Using a script or lucene or a program of your choice, tell me the ten most common 5-grams in the Reuters corpus.
Monday, April 2, 2007
We went over the McNamee and Mayfield paper, in some detail, and then we finished Salton and Buckley's relevance feedback paper. Shannon's 1948 paper is available on the web, and is still worth reading.
The use of LSA in cross-lanaguage IR is described in several places, e.g.
http://lsi.research.telcordia.com/lsi/papers/XLANG96.pdf
Monday, March 26, 2007
plans for Monday 3/26 and Wednesday 3/28
Do people want to learn more about n-grams? An article on generalized n-grams just appeared in Information Processing and Management.
In my opinion, IP&M is one of the very best IR journals. You can access it online at
http://www.sciencedirect.com/science/journal/03064573
Wednesday is a little open - more on RF, more on n-grams, or maybe an introduction to clustering.
Information Retrieval: Data Structures & Algorithms
edited by William B. Frakes and Ricardo Baeza-Yates
http://www.pimpumpam.com/motoridiricerca/ir/toc.htm
Thursday, March 15, 2007
more on the writing project
A student wrote:
> Hello. I just had a question about the writing project.
>
> What exactly are you expecting for this project? I'm not entirely certain
> exactly what I'm supposed to write -- is this a research style project
> where I'm supposed to implement something and investigate a new topic, or
> am I going to go over a bunch of subtopics in the topic I have provided
> and inform you about them? I thought I had a better idea over what
> exactly was necessary, but I find myself a little confused about it right
> now.
Wednesday, March 14, 2007
Notes from March 12 and March 14
Finished LSA on Wednesday, and talked about n-grams. I probably should have passed out Damashek 95 beforehand. I was asked what character set was used in the acquaintance plots, and I thought it was all unicode but I don't know.
Friday, March 9, 2007
clarification for hw 2
Another approach is to keep track of the min and max values of tf, for each of the 35 stopwords, as the documents are parsed and the index built. Then calculate idf for each stopword, and print for each stop word the min tf.idf and max tf.idf.
Wednesday, February 28, 2007
Homework 2
1) find the module or modules in Lucene that handle stopwords. Is there a static stopword list? If so, where is it?
2) how would Lucene be modified in order to count the occurrences of individual stopwords?
3) Make the necessary changes, and rebuild the index on the Lucene src tree as in homework 1, and have it print a report saying how many times each stopword occurred (tf) and the total number of documents in which that stopword occurred (df). Then using (one of) Salton and Buckley's suggested tf.idf formulae for documents, print the term weight that should be given to each stopword.
Notes from 2/26, plans for 2/28
I'll do more with probabilistic IR this evening.
Monday, February 26, 2007
Notes from 2/21, plans for 2/26
- There are other packages besides Lucene, e.g. Lemur, and Clairlib, and maybe others. A comparison of those packages would be a good topic.
- The connection between IR and other areas, such as machine learning or NLP, can be explored.
Tuesday, February 20, 2007
writing project
So let's get an early start. Within ten days, say by Monday March 5, I'd like you to tell me, in a short email, the topic of your paper. It has to have something to do with IR, and NOT something that we'll be going over in class in detail, although going in depth in some topic that we mention in class is fine. Describe your topic in a paragraph, and list at least three references that you're thinking of consulting.
The final paper should be about ten pages, with at least ten references. Don't let all the references be from Wikipedia. The paper will be due on Wednesday, April 18.
There are lots of possible topics! We can start with the
SIGIR Call for Papers
and move on to the
CIKM Call for Papers
I'm not expecting original research results of conference quality (although that'd be nice) but you'll need to do something more than just a rehash of existing work. It's always a good idea to summarize work in an area, and then suggest future work that somebody could do for a 698 project, or a thesis. Another approach is to study some technique, and then present a new example that would help people understand it. If you want to write a program (e.g. an extension or modification to Lucene) you can include that as an appendix, and it won't count towards the ten-page limit.
It doesn't bother me if your writing project happens to be related to your job, or dovetails with something you're doing in another class.
Monday 2/19/07
Most people seem to have finished homework 1, and we discussed that a little. Getting lucene to recompile was the hardest part, at least for me.
In response to questions, I talked about phrase-based retrieval and n-gram retrieval (both character and word n-grams) as alternatives to the bag of words model. Note that words, phrases, and n-grams have their pros and cons - all three are just the way you decide what terms are to be indexed. Once the "term space" is identified, the vector space, probabilistic, or boolean models of retrieval are options.
Unix tools can be used to do "sanity checks" on IR results.
Friday, February 16, 2007
No class on Wednesday February 14
Wednesday, February 14, 2007
More on Lucene
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html
http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html
and this one,
http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html
A Perl version of Lucene is available at
http://search.cpan.org/~tmtm/Plucene-1.25/lib/Plucene.pm
Monday, February 12, 2007
Lucene demo
I have extracted Lucene, and compiled it. The gzipped tarfile is called lucene-2.0.0.tar.gz, and I'll put a link on the course web site in case you'd like (and have room for) your own copy.
Building Lucene required me to install apache ant, which I hope you won't need. The hardest part was modifying my CLASSPATH, as shown below:
# to build and use lucene
setenv ANT_HOME $HOME/www/676/apache-ant-1.7.0
setenv LUCHOME $HOME/www/676/lucene-2.0.0
setenv CLASSPATH .\:$LUCHOME/lucene-demos-2.0.0.jar\:$LUCHOME/lucene-core-2.0.0.jar
A tour of the Lucene directory might be a good idea. Building an index of the Lucene source files is gratifying!
Homework 1: Repeat this demo, as far as creating an index of the Lucene source. Then figure out how one would modify the IndexFiles program so that it counts the number of files indexed, and the number of bytes read. If you know Java well enough and have the disk space to recompile it, you may demonstrate that your modifications work. You should all get the same answers! We'll talk about this in class on Wednesday a little, and this homework will be due next Monday.
Sunday, February 11, 2007
Class Wed. 2/7
The definitions of Precision and Recall are important. It is usually easy to measure precision, as long as you can tell when a specific document is relevant to a specific query. Recall is much harder to calculate, since the number of relevant documents in a large collection may well be unknown. I also talked about the pooled approach to IR evaluation. Without the pooled approach, evaulation of recall would be very difficult if not impossible on large collections. An overview of TREC, including a discussion of pooled document evaluation, is presented in Donna Harman's overview of TREC 4, available at http://trec.nist.gov/pubs/trec4/t4_proceedings.html.
In another post I mention pivoted document length normalization, which is credited (in my mind at least) to Amit Singhal, then at Cornell and now at Google. PDLN is covered in Chapter 2. Somebody asked a question on Wednesday that brought query zoning to mind, and that is also credited to Singhal. His paper from SIGIR'97 is also worth reading, I think. This and related work can be seen at http://singhal.info/publications.html
two important papers
They also mention Pivoted Document Length Normalization, which appeared in the 1996 SIGIR conference. The main author is Amit Singhal. That paper is likely still the best explanation of PDLN, which within a year or two of its introduction was widely accepted in IR.
Monday, February 5, 2007
IR packages
Andrew McCallum (UMass) still makes the
Bag of Words software available.
What other packages are available to those who want to build their own IR systems? They may or may not support multiple languages or multiple retrieval models. They may be general purpose, or perhaps restricted to Web search for example.
The Lemur package from CMU
Two packages from NIST
PRISE
NIRVE
References to these and a couple more can be found at the
SIGIR site. There's some good stuff out here!
MG is a little older, and is discussed in detail in the book Managing Gigabytes. Zettair is related to MG, I think.
Wednesday, January 31, 2007
Notes from class 1/29
I asked people to look at the project that was used a few years ago. I also asked people to try to post a comment to the blog, but I haven't seen anything yet.
On Wednesday 1/31, I'll take some Polaroids! If I can get Google Talk to work on my laptop, that will be good.
If this course was going to be on-line or hybrid, access to the lecture slides is obviously critical. A set time for conference calls with students might be a good idea, although awkward for students in distant time zones. Maybe Skype could be used to record the lecture - but breaking it up slide by slide would help even more.
So Wednesday evening I plan to take pictures, and finish the slides started last time. We'll also discuss the project.
Monday, January 29, 2007
Tuesday, January 9, 2007
Information Retrieval
MW 5:30-6:45pmRoom to be announced
Dr. Charles Nicholas
nicholas@umbc.edu
Dr. Ian Soboroff taught this course a few years back, and I like his introduction:
This course is an introduction to the theory and implementation of software systems designed to search through large collections of text. Ever wonder how World-Wide Web search engines work? Ever wondered why they don't? You'll learn about it here. Information retrieval (IR) is one of the oldest branches of computer science, and has influenced nearly every aspect of computer usage: "search and replace" in a word processor, querying a card catalog, grep'ing through your source code, filtering the spam out of your email, searching the Web.
This course will have two main thrusts. The first is to cover the fundamentals of IR: retrieval models, search algorithms, and IR evaluation. The second is to give a taste of the implementation issues by having you write (a good chunk of) your own text search engine and test it out on a sample text collection. This will be a semester-long project, details TBA.
You will need to have taken the equivalent of CMSC 341 (Data Structures), and an algorithms course (441 or 641) is recommended. Linear algebra (MATH 221) and Statistics (STAT 355) are recommended but not required; they give background which will be helpful in understanding many IR concepts.
Text
The text will be Grossman and Frieder, available at the UMBC bookstore (at least it's been ordered) as well as Amazon. We will follow this book fairly closely. Details about which chapters will be covered, and when, will follow. Other readings will be assigned, and made available.
Grading
There will be a multi-phase programming project, details to be announced, worth about 50% of the grade. Homeworks will be another 25%. There will also be a writing project, worth 25%. Presentations on the programming project will take the place of the final exam.
Academic Integrity
"By enrolling in this course, each student assumes the responsibilities of an active participant in UMBC's scholarly community in which everyone's academic work and behavior are held to the highest standards of honesty. Cheating, fabrication, plagiarism, and helping others to commit these acts are all forms of academic dishonesty, and they are wrong. Academic misconduct could result in disciplinary action that may include, but is not limited to, suspension or dismissal. To read the full Student Academic Conduct Policy, consult the UMBC Student Handbook, the Faculty Handbook, or the UMBC Policies section of the UMBC Directory [or for graduate courses, the Graduate School website]."
Welcome
It's not enough for a course to have a web site. Nowadays you must have a blog as well! So this blog will be used for much of the official and unofficial communication related to the course.
Only students in the class, and I, will be able to post to the blog. Readership alone is open to the public.