- All AFP documents on Gillespie, as collection & per document model
- Current coding: SubDir handling (needed before doing more than one year of AFP)
- Test experiments with AFP 2010!
- At least one term must occur, to be included in the @doclist
- Hmm. maybe simple.
- 100 bytes? There are some weird (non-standard) news files even among the .story files
- now it works on all files in the given dir and its direct sub dirs
- P_t argument change (and all consequent callers)
- P_t code change (to traverse and run)
- (with multiple subdirs)
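- (sketch) The runner's language is not stated here; a minimal Octave sketch, assuming .story inputs, of "all files in the given dir and its direct subdirs" (placeholder directory name):
    d = "afp_eng_2010";                              % example top directory (placeholder)
    files = [glob(fullfile(d, "*.story"));           % files directly in d
             glob(fullfile(d, "*", "*.story"))];     % files in its direct subdirs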
- input: weight (doc prob) and sentence prob of each document
- output: weighted average.
- model
- options
- sentence (input)
- (need): weighted-sum input format (simple matrix)?
- (already have): weighted-sum matlab code
- use weighted sum with same weights. :-)
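- (sketch) A toy Octave illustration of the weighted average; the real weighted-sum matlab code already exists, so the names and numbers here are made up:
    w = [0.5; 0.3; 0.2];            % doc weights (doc probs), summing to 1
    p = [1e-12; 3e-13; 8e-12];      % sentence prob under each document model
    p_sent = sum(w .* p) / sum(w)   % weighted average sentence probability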
- check collection model file
- get P_coll (t) (with -debug 3)
- get each pure P_d(t) (with -debug 3), on all docs
- calculate lambda*P_d + (1-lambda)*P_coll for each, by calling Octave
- do the weighted-sum of the values, with uniform weight
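- (sketch) Hedged Octave sketch of the last three steps, assuming the -debug 3 logprobs are base-10 (consistent with the ppl formula further down); numbers are examples only:
    lambda  = 0.1;                      % interpolation weight (example value)
    lp_coll = -15.2;                    % log10 P_coll(t) from the collection model
    lp_docs = [-14.8; -16.1; -15.5];    % log10 P_d(t) from each document model
    p_mix   = lambda .* 10.^lp_docs + (1 - lambda) .* 10.^lp_coll;  % per-doc interpolation
    p_final = mean(p_mix)               % weighted sum with uniform weights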
Wow. Finally.
- on AFP 2009 May
- Starting from P_t, P_h, P_h|t.
- Output of result hash:
- Debug 1 : output the hash into file, no sorting, file order
- Debug 2 : sorting, higher value first.
- AFP_ENG 2009 05
- WAY too small. Nothing to be seen. (e.g. “airplane” doesn’t occur, etc)
- Maybe I should pick something from the news itself, and try it.
- Note that a +-1 logprob fluctuation in the LM prob output is not reliable; for example, documents report two words both as OOV yet give them different probabilities. This must come from the discounting method (does a longer document get a lower discount value, or a higher one, etc.?) --> just add more documents and look for big jumps (say, +-10?), or maybe remove too-short documents, if this keeps coming up?
- AFP ENG 2010
- All AFP
- This is not doable until there is a better “recursive-sub-dir-visit” for:
- Model making: (catall etc., for the collection model)
- Way too slow (no need to do it yet, since 2010 alone takes 30+ min)
- Maybe we need something between 2), 3).
- a better baseline would be P(h|h) instead of P(h)? (topical relatedness gets some credit even before starting).
- “gain” (P(h|t) / P(h)) seems to (generally) increase with the length of (t & h)
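- (sketch) With base-10 logprobs from ngram, the gain can be computed as below (illustrative numbers, not real output):
    lp_h_given_t = -42.0;               % log10 P(h|t)
    lp_h         = -45.5;               % log10 P(h)
    gain = 10^(lp_h_given_t - lp_h)     % > 1 means t helps predict h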
- (CURRENT) model building: “-text”, “-lm”, and “-write-binary-lm”; all other options left at default
- (CURRENT) evaluation: all defaults; nothing other than “-ppl” (input designation) and “-lm”.
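- (sketch) Presumably these correspond to calls like the following (ngram-count for the model, ngram for evaluation; file names are placeholders):
- model building: ngram-count -text docs.txt -lm doc.model -write-binary-lm
- evaluation: ngram -ppl test.txt -lm doc.model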
- Running P_t sequentially currently takes about 3 min (2:48) on Moore.
- Multi-threaded (6 threads) on Gillespie: 58 seconds
- I believe that ngram loads a binary model automatically, so no additional coding is needed on the model-user side.
- For each news “story” we call twice: once ngram (can’t reduce this), once Octave. Maybe starting up Octave each time is expensive. Consider this.
- Currently, the file passed to ngram -ppl has a fixed name.
- should be improved to a temporary random name, or something like getName{sent}?
- Not really important, since the code does use multithreading for P_t, and a single instance can utilize many threads.
- we may not need to do the costly log-space sums (see the sketch after this block).
- (by multiplying the weights by a suitable factor, so the values stay within Octave's normal floating-point range).
- (using reference_weightedsum, or an improved variation of it, etc.).
- Not really important: only calculated two or three times per P(h|t); not critical compared to the other efficiency issues.
- Well, “not needing Octave anymore” would be nice, though.
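- (sketch) A hedged Octave sketch of what “stay within the normal range” could look like: shift by the max logprob (log-sum-exp style) instead of exponentiating the raw log-space terms:
    w  = [0.5; 0.3; 0.2];        % doc weights
    lp = [-320; -315; -330];     % example log10 probs; 10.^lp alone would underflow
    m  = max(lp);                % shift so the largest term becomes 10^0 = 1
    lp_sum = m + log10(sum(w .* 10.^(lp - m)))   % log10 of the weighted sum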
====
- When processing document models:
- “Warning: count of count x is zero – lowering maxcount”
- “Warning: discount coeff n is out of range: 0”
It seems both are related to sparseness. Not critical, but it has some effect (e.g. less good smoothing?)
- “-bayes 0” mixing is generally what I would expect from simple summation: (lambda * model 1 prob) + ((1-lambda) * model 2 prob), at each word position. (Well, if you ask me what a non-zero -bayes value means … I don’t know.)
- so the mixture model call is something like:
- ngram -lm doc.model -mix-lm collection.model -ppl test.txt -bayes 0 -debug 3 -lambda 0.1
- ppl = 10^(-logprob / (words - OOVs + sentences))
- ppl1 (without </s>) = 10^(-logprob / (words - OOVs))
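- (sketch) Quick Octave check of the two formulas with made-up counts (not real -ppl output):
    logprob = -2641.3;                                  % total log10 prob
    words = 1000; oovs = 12; sentences = 45;            % example counts
    ppl  = 10^(-logprob / (words - oovs + sentences))   % perplexity including </s>
    ppl1 = 10^(-logprob / (words - oovs))               % perplexity excluding </s>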
- When no discounting option is given, it uses Good-Turing discounting (the warnings above come from that, when counting the counts-of-counts, etc.).
- Q: They all share the same back-off/interpolated model, so why the difference?
- A: </s>
- Even an all-OOV doc has at least one </s>, and each model assigns </s> a different probability.
- We now have an option to exclude this </s> from the calculation. (DEFAULT ON, in lamba_sumX)
- Seems like this causes the small amount of difference in the final result. (try octave> a = 0.00409898)
- Octave uses hardware floats. … hmm, no easy way around it(?)
- Eh, no. The above example is actually within hardware-float range, but Octave cuts the output. Probably some precision-cut mechanism at work. What is it?
- “Symbolic toolbox”, vpa(...)? Hmm, no need yet.
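- Likely it is just Octave's default display precision (the short format prints only about 5 significant digits), not a loss in the stored double; a quick check:
    a = 0.00409898          % short format truncates the displayed digits
    format long             % switch display to full double precision
    a
    output_precision(10)    % or set the number of displayed significant digits directly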
- Basically, what I am trying to do is a weighted sum of probabilities. There are two ways of doing it.
- Word Level weighted sum and Sentence Level weighted sum
- Say the sentence is w_1, …, w_n, with probability P(w_1, …, w_n).
- At sentence level, this can be calculated as weighted_mean_all_d( P_d(w_1, …, w_n) ).
- At word level, this can be calculated as
- product over i of { weighted_mean_all_d( P_d(w_i | w_{i-1}, w_{i-2}, w_{i-3}) ) }, including the final weighted_mean_all_d( P_d(</s> | …) ) factor.
- The problem is that the two values are different. The weighted mean at sentence level (each document model computes the whole-sentence prob, then we average) gives one value; the product of per-word weighted means gives another. They are generally not far apart, but not the same.
- If we want to use the “per-word predictability” power, we need to work at the word level. Maybe this is more powerful (and a bit slower).
- If we are not interested in the word level, and since our assumption is simply that the underlying document model generates a probability for each given sentence, then the sentence level is good enough.
- Try both? Hmm.
- Try both? No: go with the sentence level.
- Sentence level, following P_d(sentence) strictly.
- Basic premise: a sentence, a probability. Each document model is independent (weakly linked via the collection model, but that is not relevant here).
- Word-level might be useful/needed for “dynamic/better LM”.
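- (sketch) A toy Octave check that the two levels really do disagree (made-up numbers; rows = document models, columns = word positions of one sentence):
    P = [0.10 0.20 0.05;
         0.30 0.10 0.02];
    w = [0.6; 0.4];                            % document weights (sum to 1)
    p_sentence_level = sum(w .* prod(P, 2))    % weighted mean of whole-sentence probs
    p_word_level     = prod(sum(w .* P, 1))    % product of per-word weighted means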