TTD.org


Current Memo

DEVEL

Ongoing Improvements

Future Improvements

[#A] (primitive) Indexing

  • At least one term must occur for a document to be included in the @doclist
  • Hmm. Maybe simple.

[#B] remove (or don’t count) too-short news articles

  • 100 bytes? There are some weird (not normal) news files even among the .story files
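
A minimal sketch of the filter (the glob pattern, the stand-in processing step, and the 100-byte threshold are just the guess above):

    use strict;
    use warnings;

    my $MIN_BYTES = 100;                  # threshold guessed above
    for my $file (glob '*.story') {
        next if -s $file < $MIN_BYTES;    # -s: file size in bytes
        print "$file\n";                  # stand-in for the real counting/indexing step
    }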

Previous Improvements

[#A] Collection model generate from subdirs

For the collection model: catall to dump everything in subdirs.

[#A] Per doc model generate with subdirs

For per-doc models: perstory_runner with subdir support.

  • Now it works on all files in the given dir and its direct subdirs

[#A] SubDir support (needed before doing more than one year of AFP)

P_t should traverse all subdirs.

  • P_t argument change (and all consequent callers)
  • P_t code change (to traverse and run)

Test of P_h_t_multithread with sketch

  • (with multiple subdirs)

Main line coding

Collection Model

(run) Get “target” news files (target corpus) all in one folder

(run) catall and generate collection LM model

[#C] (If subdir needed) TODO? (write script) recursively run catall and generate the collection model
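
A minimal sketch of the recursive catall (script name and the corpus-root argument are hypothetical):

    # usage: perl catall_recursive.pl corpus_root > all.txt
    use strict;
    use warnings;
    use File::Find;

    my $root = shift @ARGV or die "usage: $0 corpus_root\n";
    find(sub {
        return unless -f;                 # regular files only
        open my $fh, '<', $_ or die "$File::Find::name: $!";
        print while <$fh>;                # dump contents to STDOUT
        close $fh;
    }, $root);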

Document Model

(write script) For each file, build its LM model

Produce single sentence prob. (t)

(write matlab script) weighted-sum

  • input: weight (doc prob) and sentence prob of each document
  • output: weighted average.
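
A minimal sketch of the weighted sum, in Perl rather than matlab (the function name and example numbers are hypothetical). SRILM reports log10 probs, so the sum is shifted by the max logprob to stay in floating-point range:

    use strict;
    use warnings;
    use List::Util qw(max);

    # weighted mean of probabilities given as log10 logprobs:
    # log10( sum_i w_i * 10**lp_i ), computed with a max shift
    sub weighted_mean_log10 {
        my ($weights, $logprobs) = @_;    # parallel array refs
        my $m = max(@$logprobs);
        my $acc = 0;
        $acc += $weights->[$_] * 10 ** ($logprobs->[$_] - $m)
            for 0 .. $#{$logprobs};
        return $m + log($acc) / log(10);  # back to log10
    }

    # e.g. uniform weights over three per-doc sentence logprobs
    print weighted_mean_log10([1/3, 1/3, 1/3], [-12.1, -15.7, -13.0]), "\n";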

(write scripts) P(t) prob

(write debug3 reader) read_log_prob, read_prob
(write octave caller) lambda sum (interpolate)
check the code that feeds seq_prob into the lambda sum
(srilm caller) write ngram runner
  • model
  • options
  • sentence (input)
(write octave caller) weighted sum
  • (need): weighted-sum input format (simple matrix)?
  • (already have): weighted-sum matlab code
(write octave caller wrapper) logprob mean
  • use weighted sum with same weights. :-)
calc P_coll
  • check collection model file
  • get P_coll (t) (with -debug 3)
each P_doc(t)
  • get each pure P_d(t) (with -debug 3), for all docs
  • calculate lambda*P_d + (1-lambda)*P_coll for each by calling octave
calc P(t) by weighted sum
  • do the weighted-sum of the values, with uniform weight
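
Putting the list above together (as I read it), with uniform weights over the N documents:

    P(t) = (1/N) * sum_d [ lambda * P_d(t) + (1 - lambda) * P_coll(t) ]

(The second term is the same for every d, so this reduces to lambda * mean_d( P_d(t) ) + (1 - lambda) * P_coll(t).)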

Produce conditional prob.

(write scripts) P(h | t) prob

write script “evidence calculation code”

Wow. Finally.

sanity check, more with sketch.

  • on AFP 2009 May

[#A] Some possible “look-into” data saving.

  • Starting from P_t, P_h, P_h|t.
  • Output of result hash:
  • Debug 1 : output the hash into file, no sorting, file order
  • Debug 2 : sorting, higher value first.
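
A minimal sketch of the dump (names hypothetical; assumes the results come as [id, value] pairs in file order):

    use strict;
    use warnings;

    # debug 1: dump as-is (file order); debug 2: higher value first
    sub dump_result {
        my ($pairs, $debug, $outfile) = @_;
        open my $fh, '>', $outfile or die "$outfile: $!";
        my @p = $debug >= 2 ? sort { $b->[1] <=> $a->[1] } @$pairs : @$pairs;
        print {$fh} "$_->[0]\t$_->[1]\n" for @p;
        close $fh;
    }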

POC Experiments on this prototype conditional prob.

ONGOING [#A] Some test on small corpus

  • AFP_ENG 2009 05
  • WAY too small. Nothing to be seen. (e.g. “airplane” doesn’t occur, etc)
  • Maybe I should pick something from the news itself, and try it.
  • Note: a ±1 logprob fluctuation in the LM prob output is not reliable. For example, two documents that both report the same words as OOV can still report different probabilities. This must come from the discounting method (does a long document get a lower discount value, or a higher one?). -> Just add more documents and look for big jumps (say, ±10?), or remove too-short documents, if this keeps coming up.

[#A] Some test on not-so-small-but-still-small corpus

  • AFP ENG 2010

[#A] Some test on mid-size corpus

  • All AFP
  • This is not doable until there is better recursive-subdir visiting for model making (catall etc. for the collection model)

EXPERIMENTS

MODEL preparation

[#A] See how ngram-count works on large files

1) afp 2010 (no problem)

1-b) afp 2010 per doc (no problem)

2) all afp. (Gillespie, no problem)

2-b) all afp, per doc (Gillespie, ONGOING)

  • Way too slow (no need to run it, since 2010 alone takes 30+ min)

3) all of the gigaword?

  • Maybe we need something between 2) and 3).

Some additional ideas

some rough ideas & observations

  • A better baseline would be P(h|h) instead of P(h)? (topical relatedness gets some credit even before starting).
  • “gain” (P(h|t) / P(h)) seems to (generally) increase with the length of (t & h)
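  • Since SRILM reports log10 probs, the gain is just a difference there: log10(gain) = logP(h|t) - logP(h); in Perl (variable names hypothetical): my $gain = 10 ** ($lp_h_given_t - $lp_h);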

Notes

Currently used/tested SRILM call parameters

ngram-count

  • (CURRENT) “-text”, “-lm”, and “-write-binary-lm”; all others default
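  • e.g. (file names hypothetical): ngram-count -text corpus.txt -lm corpus.lm -write-binary-lm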

ngram

  • (CURRENT) all default: nothing other than “-ppl” (input designation) and “-lm”
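  • e.g. (file names hypothetical): ngram -ppl test.txt -lm corpus.lm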

Memo on efficiency

Testing on May 2009 AFP news (20k documents)

  • Running P_t sequentially currently takes about 3 min (2:48) on Moore.
  • Multi-threaded (6) on Gillespie: 58 seconds

RECORDS & POSTPONED

Past Improvements

Binary language model

add binary option as default option

collection model description (user’s own calling)

perstory_runner.pl (per document model)

  • I believe ngram loads binary models automatically, so no additional coding is needed on the model-user side.

[#A] Bug: splitta outputs the last “.” concatenated to the last word.

TODO? [#C] [??] feature for catall.pl: “do not print files with size less than X”

TODO? [#C] [Very hard - Possible?] Matrix-ize weighted_sum Octave code.

[#A] [Efficiency] Lambda sum in Perl space. (No octave call)

  • For each news “story” we call twice: once ngram (can’t reduce this), once octave. Maybe starting up octave each time is expensive. Consider this.
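
A minimal sketch of the Perl-space lambda sum (function name and example numbers hypothetical; same max-shift trick as the weighted-sum sketch above):

    use strict;
    use warnings;

    # log10( lambda * 10**lp_doc + (1 - lambda) * 10**lp_coll ),
    # shifted by the larger logprob to stay in range
    sub lambda_sum {
        my ($lambda, $lp_doc, $lp_coll) = @_;
        my $m = $lp_doc > $lp_coll ? $lp_doc : $lp_coll;
        my $s = $lambda       * 10 ** ($lp_doc  - $m)
              + (1 - $lambda) * 10 ** ($lp_coll - $m);
        return $m + log($s) / log(10);
    }

    print lambda_sum(0.1, -12.3, -10.8), "\n";   # lambda = 0.1, as in the SRILM notes below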

[#A] [Efficiency for response] Not using multiple threads/ngram processes

Postponed improvements: “Good to have, but not critical”

TODO? [#C] [Efficiency for throughput] Unable to call two or more instances.

  • Currently, the file to be passed to ngram -ppl is a fixed name.
  • should be improved to a temporary random name, or something like getName{sent}?
  • Not really important, since the code does use multithreading for P_t, and a single instance can utilize many threads.

TODO? [#C] If the log-sum is only needed as a “weighted sum” (use a not-too-small sum)

  • we may not need to do the costly log-space sums
  • (by scaling the weights so that the values stay within octave’s normal range)
  • (using reference_weightedsum, or an improved variation, etc.)
  • Not really important: this is only calculated two or three times per P(h|t), so not critical compared to other efficiency issues.
  • Well, “not needing octave anymore” would be nice, though.

====

Known problems

Discount related questions

  • When processing document-models;
  • “Warning: count of count x is zero – lowering maxcount”
  • “Warning: discount coeff n is out of range: 0”

It seems both are related to sparseness. Not critical, but it has some effect (e.g. less good smoothing?).

Side notes about tools

SRILM

Interpolate call parameters

  • “-bayes 0” mix-model is generally what I would expect from simple summation: (lambda * model 1 prob) + ((1-lambda) * model 2 prob), at each word position. (Well, if you ask me what a non-zero -bayes means … I don’t know.)
  • so the mixture model call is something like:
  • ngram -lm doc.model -mix-lm collection.model -ppl test.txt -bayes 0 -debug 3 -lambda 0.1

Perplexity (per word), as calculated in SRILM

  • ppl = 10^(-logprob / (words - OOVs + sentences))
  • ppl1 (without </s>) = 10^(-logprob / (words - OOVs))
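  • e.g. (made-up numbers): logprob = -20 over 10 words with 1 OOV and 1 sentence gives ppl = 10^(20 / (10 - 1 + 1)) = 100 and ppl1 = 10^(20 / 9) ≈ 167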

Discount methods in SRILM default

  • When no option is given, it does Good-Turing discounting (the warnings above come from that, when counting counts-of-counts, etc.)

Why different probs for all-OOV queries?

  • Q: They all share the same back-off interpolated model, so why different?
  • A: </s>
  • All-OOV docs still have at least one </s>, and each model gives </s> a different prob.
  • We now have an option to exclude this </s> from the calculation. (DEFAULT ON, in lamba_sumX)

Octave

Octave “precision” of double is one digit less (than SRILM)

  • Seems like this causes the small difference in the final result (try octave> a = 0.00409898).
  • Octave uses H/W floats. … hmm, no easy way around(?)
  • Eh, no. The example above is actually within HW float range, but octave cuts it. Probably some precision-cutting mechanism at work. What is it?
  • “Symbolic toolbox”, vpa(something)? Hmm, no need yet.
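  • (My guess:) this looks like display precision, not storage. Octave’s default format short prints only about 5 significant digits, so values that round-trip through printed output lose digits; octave> format long (or output_precision) should show the full double.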

Theoretical crosspoints / decisions

THEORETICAL

[#A] Word level model, or Sentence level model?

  • Basically, what I am trying to do is a weighted sum of probabilities. There are two ways of doing it:
  • word-level weighted sum and sentence-level weighted sum.
  • Say the sentence prob is P(w_1, …, w_n).

Sentence level weighted-sum

  • At sentence level, this can be calculated by weighted_mean_all_d( P_d(w_1, …, w_n) )

Word-level weighted-sum

  • At word level, this can be calculated by
  • product over i of { weighted_mean_all_d( P_d(w_i | w_{i-1}, w_{i-2}, w_{i-3}) ) }, …, ending with weighted_mean_all_d( P_d(</s> | …) )

Not compatible

  • The problem is that the two values are different. A weighted mean at the sentence level (whole-sentence probs, each calculated by one document model) produces one value; the product of word-level probabilities, each obtained by a per-word weighted mean, produces another. They are generally not far apart, but not the same (see the made-up example below).
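  • A tiny made-up example: two docs, uniform weights, a two-word sentence, per-word probs 0.1/0.1 under doc 1 and 0.9/0.9 under doc 2. Sentence level: (0.01 + 0.81) / 2 = 0.41. Word level: 0.5 * 0.5 = 0.25.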

Which one should we use?

  • If we want to use “per-word predictability” power, we need to work at the word level. Maybe this is more powerful (and a bit slower).
  • If we are not interested in the word level: our assumption is simply that the underlying document model generates a probability for each given sentence, so the sentence level is good enough.
  • Try both? Hmm.

For now?

  • Try both? No: sentence level.
  • Sentence level, sticking strictly to P_d(sentence).
  • Basic premise: a sentence, a probability. Each document model is independent (weakly linked through the coll-model, but that is not relevant here).
  • Word-level might be useful/needed for “dynamic/better LM”.