The idea of hw 2 is to give you experience computing tf.idf weights, and to see how stop words are treated. One approach is to compute the tf.idf score for each of the 35 stop words on a per-document basis. The output would be a 330 by 35 matrix (one row per document, one column per stop word), and most of the scores should be positive but close to zero, since a word that appears in nearly every document has a low idf. Reading the output may be a little cumbersome, but this would be fine.
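As a rough illustration of the matrix approach, here is a minimal Python sketch. It rests on assumptions the assignment does not state: tf is the raw count of the word in the document, idf is log(N / df), and tokenizing is just lowercasing and splitting on whitespace. The function name `tfidf_matrix` and the toy documents are made up for the example.

```python
import math
from collections import Counter

def tfidf_matrix(docs, stopwords):
    """Return one row per document and one column per stop word.

    Assumes tf = raw count and idf = log(N / df); other weightings exist.
    """
    n_docs = len(docs)
    # Per-document term counts (naive whitespace tokenizer, an assumption).
    counts = [Counter(doc.lower().split()) for doc in docs]
    # df: number of documents that contain each stop word at least once.
    df = {w: sum(1 for c in counts if c[w] > 0) for w in stopwords}
    # idf is set to 0.0 for a word that appears in no document.
    idf = {w: math.log(n_docs / df[w]) if df[w] else 0.0 for w in stopwords}
    return [[c[w] * idf[w] for w in stopwords] for c in counts]

# Toy data standing in for the real collection and stop word list.
docs = ["the cat sat on the mat", "a dog and a cat", "the dog ran"]
stopwords = ["the", "a", "and"]
matrix = tfidf_matrix(docs, stopwords)
for row in matrix:
    print(" ".join(f"{score:.3f}" for score in row))
```

With the real collection the same call would give the 330 by 35 matrix. Under this weighting a stop word that occurs in every document gets a score of exactly zero, because log(N / N) = 0.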
Another approach is to keep track of the min and max values of tf for each of the 35 stop words as the documents are parsed and the index is built. Then calculate idf for each stop word, and print the min tf.idf and max tf.idf for each one.
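The min/max approach could be sketched in Python roughly as follows. As assumptions for illustration only, tf is the raw count, idf is log(N / df), and tokens come from a lowercase whitespace split. The `include_zero` flag is hypothetical: it controls whether a document that lacks the word contributes a tf of 0 to the minimum, which is a choice the assignment text leaves open.

```python
import math
from collections import Counter

def min_max_tfidf(docs, stopwords, include_zero=True):
    """Track min and max tf per stop word in one pass, then scale by idf.

    Assumes tf = raw count and idf = log(N / df).
    """
    n_docs = 0
    df = {w: 0 for w in stopwords}
    min_tf = {w: None for w in stopwords}  # None means "nothing seen yet"
    max_tf = {w: 0 for w in stopwords}
    for doc in docs:
        n_docs += 1
        counts = Counter(doc.lower().split())
        for w in stopwords:
            tf = counts[w]
            if tf > 0:
                df[w] += 1
                max_tf[w] = max(max_tf[w], tf)
            # Optionally let absent words pull the minimum down to zero.
            if tf > 0 or include_zero:
                min_tf[w] = tf if min_tf[w] is None else min(min_tf[w], tf)
    result = {}
    for w in stopwords:
        idf = math.log(n_docs / df[w]) if df[w] else 0.0
        lowest = min_tf[w] or 0  # treat "never seen" as zero
        result[w] = (lowest * idf, max_tf[w] * idf)
    return result

# Toy data standing in for the real collection and stop word list.
docs = ["the cat sat on the mat", "a dog and a cat", "the dog ran"]
stopwords = ["the", "a", "and"]
for word, (lo, hi) in min_max_tfidf(docs, stopwords).items():
    print(f"{word}: min={lo:.3f} max={hi:.3f}")
```

Because idf is the same for every document, multiplying the min and max tf by idf at the end gives the min and max tf.idf without storing per-document scores.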
1 comment:
Are you interested in a min tf of zero? For many of the stop words there are some documents that do not contain them at all, so the min tf will be 0.
As another comment, there are two ways of doing this assignment. One, the hard way, is to actually count words while building the index. The second is to index all of the stop words, then use the index to look up the tf and df.
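To make the second way concrete, here is a small Python sketch of reading tf and df off an index. The index layout (a dictionary mapping each term to a dictionary of document id to count) and the helper names `build_index`, `tf`, and `df` are assumptions for illustration; the indexing software used in the course may expose these numbers differently.

```python
from collections import defaultdict

def build_index(docs):
    """Build a toy inverted index: term -> {doc_id: count in that doc}."""
    index = defaultdict(dict)
    for doc_id, doc in enumerate(docs):
        for token in doc.lower().split():
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index

def tf(index, term, doc_id):
    """Term frequency from the postings; 0 if the document lacks the term."""
    return index.get(term, {}).get(doc_id, 0)

def df(index, term):
    """Document frequency is the length of the term's postings list."""
    return len(index.get(term, {}))

# Toy data standing in for the real collection.
docs = ["the cat sat on the mat", "a dog and a cat", "the dog ran"]
index = build_index(docs)
print(tf(index, "the", 0), df(index, "the"))
```

A document that does not contain the word simply has no entry in that word's postings, which is where the tf of 0 mentioned above comes from.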