Design of a simple, linearly interpolated Multivariate Bernoulli model

Basic equation

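  • PI_for_all_terms ( lambda * P_doc(“term”) + (1 - lambda) * P_coll(“term”) ), where PI is the product over all terms and lambda is the interpolation weight (between 0 and 1).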

P_doc(“term”)

  • 1 if the doc includes the term.
  • 0 if the doc does not include the term.

P_coll(“term”)

  • MLE; that is, (number of docs that contain the term) / (total number of docs).
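
For reference, the basic equation and the two definitions above can be written in one place as follows. This is only a restatement in LaTeX: the symbols T (the set of terms being scored), d (the document), and N (the total number of docs) are names introduced here for illustration and are not from the memo, and the `cases` environment assumes amsmath.

```latex
\[
  P(T \mid d) = \prod_{t \in T}
    \Bigl( \lambda \, P_{\mathrm{doc}}(t \mid d) + (1 - \lambda) \, P_{\mathrm{coll}}(t) \Bigr)
\]
\[
  P_{\mathrm{doc}}(t \mid d) =
    \begin{cases} 1 & \text{if } t \in d \\ 0 & \text{otherwise} \end{cases}
  \qquad
  P_{\mathrm{coll}}(t) = \frac{\lvert \{ d' : t \in d' \} \rvert}{N}
\]
% Worked example, assuming lambda = 0.5 and a term found in 1 of 4 docs
% (so P_coll = 0.25):
%   term in d:      0.5 * 1 + 0.5 * 0.25 = 0.625
%   term not in d:  0.5 * 0 + 0.5 * 0.25 = 0.125
```

Since P_doc is either 0 or 1, each factor is either lambda + (1 - lambda) * P_coll or just (1 - lambda) * P_coll, so a term missing from the doc never drives the whole product to zero as long as lambda < 1.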

Excluded terms

  • All stop words are simply excluded from the calculation.
  • All OOV (out-of-vocabulary) words are also simply excluded from the calculation.
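
A minimal Python sketch of the calculation described so far, including the two exclusion rules. The function names, the `doc_freq` dictionary (term to number of docs containing it), and the default `lam=0.5` are all assumptions made for illustration; the memo does not fix a value for lambda or say how OOV is detected, so here a term counts as OOV when it appears in no doc of the collection.

```python
def p_doc(term, doc_terms):
    """1.0 if the doc includes the term, 0.0 otherwise."""
    return 1.0 if term in doc_terms else 0.0


def p_coll(term, doc_freq, num_docs):
    """MLE: (number of docs containing the term) / (total number of docs)."""
    return doc_freq[term] / num_docs


def score(terms, doc_terms, doc_freq, num_docs, stop_words, lam=0.5):
    """Product over the kept terms of lam * P_doc + (1 - lam) * P_coll."""
    result = 1.0
    for term in terms:
        # Stop words are excluded from the calculation.
        if term in stop_words:
            continue
        # Assumption: OOV means the term occurs in no doc of the collection.
        if doc_freq.get(term, 0) == 0:
            continue
        result *= (lam * p_doc(term, doc_terms)
                   + (1.0 - lam) * p_coll(term, doc_freq, num_docs))
    return result


# Toy collection of 4 docs; doc_freq counts docs, not occurrences.
doc_freq = {"cat": 1, "sat": 2, "dog": 1, "mat": 1}
doc_terms = {"cat", "sat", "mat"}
example = score(["cat", "the", "unicorn", "sat"], doc_terms, doc_freq,
                num_docs=4, stop_words={"the"})
# "the" (stop word) and "unicorn" (OOV) are skipped:
# (0.5 + 0.5 * 0.25) * (0.5 + 0.5 * 0.5) = 0.625 * 0.75 = 0.46875
```

For long term lists the plain product can underflow; summing logs instead would be the usual workaround, provided lambda < 1 so that no factor is zero.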

MAYBE?

  • “term” could be extended to other features, such as bigrams, and so on …

Implementation (own)

A big inverted index of all n-grams: when an n-gram is queried, the ids of all documents containing it are returned. (This would be big, so let’s just go up to trigrams only… hmm.)

Okay, suppose you have this table.

find(“gram”) : outputs “doc_id:count”, “doc_id:count” …
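
A minimal in-memory sketch of such a table in Python, assuming whitespace-tokenized docs and n-grams up to trigrams. The names `ngrams`, `build_index`, and `MAX_N` are hypothetical, and `find` takes the index as an extra argument here; a real index over a large corpus would presumably live on disk rather than in a dictionary.

```python
from collections import Counter, defaultdict

MAX_N = 3  # assumption: index unigrams, bigrams and trigrams only


def ngrams(tokens, max_n=MAX_N):
    """Yield every n-gram of the token list, n = 1..max_n, as a space-joined string."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def build_index(docs):
    """Map each n-gram to a Counter of {doc_id: occurrences in that doc}."""
    index = defaultdict(Counter)
    for doc_id, tokens in docs.items():
        for gram in ngrams(tokens):
            index[gram][doc_id] += 1
    return index


def find(index, gram):
    """Return the postings of an n-gram as "doc_id:count" strings, sorted by doc_id."""
    postings = index.get(gram, {})
    return [f"{doc_id}:{count}" for doc_id, count in sorted(postings.items())]


docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the cat ran".split(),
}
index = build_index(docs)
postings = find(index, "the cat")  # ["d1:1", "d2:1"]
```

With this layout, the number of entries returned by `find` is the document frequency of the n-gram, which is the numerator needed for P_coll.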

Index structure

Splitta bug(?)

  • Somehow it generates some strange outputs.
  • The following two cases are patched after Splitta has run.
  • (From gigaword_split_file.pl)

    s/.$/ . /;

    s/. ” $/ . ” /;

  • Note that new inputs should go through the same process.