- PI_for_all_terms ( lambda x P_doc(“term”) + (1-lambda) x P_coll(“term”) )
- 1 if the doc includes the term.
- 0 if the doc does not have that.
- MLE; that is, count(num of doc with term) / count all docs
- All stop words simply are excluded from the calculation.
- All OOV words are also simply excluded from the calculation.
- term can be extended to other features, such as bi-gram, and so on …
Big Inverted index of all n-grams. “n-gram” queried, all document ids returned. (this would be big. so let’s just do with upto-trigram only… hmm.)
Okay, suppose you have this table.
find(“gram”) : outputs “doc_id:count”, “doc_id:count” …
- somehow it generates some strange outputs.
- The following two cases are being patched after Splitta work.
- (From gigaword_split_file.pl)
s/.$/ . /;
s/. ” $/ . ” /;
- Note that new inputs should go through the same process.