- #test 677
- why? (what a coincidence: 677 - 677) Is T or H missing?!? hmm.
- we have one such case in the DEV data too. (check #dev 141)
- only the first sentence of T is being used (probably the reason for the search result / maxprob mismatch?), for now.
- how to cope with this?:
- using “both sentences” in the T search.
- do two queries.
- There are some tricky aspects to this.... well, put both of them in the query? hmm
- See dev #25, #26, test #4, etc.
- sometimes. (test #16, #24)
- Why? Just “lack of corpus observation on content terms?”
- #25 (multiple sentence)
- #26 (same)
- #15 TH
- #14 T
- #21 T
- 574 - mountain name? hmm.
- per-word P(h), per-word P(t), per-word P(h|t), per-word P(h|t) - P(h)
- Note: “effective number of words” in h and t, from P(t)_methods.
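- One way to pin these down (an assumption on my part: “per-word” = the probability normalized by the effective word count, i.e. a geometric mean, in the spirit of the ppl formulas noted further below):
    % per-word (length-normalized) probabilities; |h| = effective number of words in h
    P_{pw}(h) = P(h)^{1/|h|}, \qquad P_{pw}(h \mid t) = P(h \mid t)^{1/|h|}
    % per-word gain, in the difference form listed above
    \mathrm{gain}_{pw} = P_{pw}(h \mid t) - P_{pw}(h)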
- Can one fixed “boundary” (threshold) on the gain be okay? E.g. a highly covered topic vs. a rarely covered topic: “not related”, “somehow related”, “I’ve seen them” will differ between them.... or not?
- Just … cover all of the topics, and hope for the best?
- Some indicators? (Top search result #10 P(t) and P(h)? hmm)
- Think about this: this is not simple
- I guess careful “analysis” would be needed. (The PtPh list was prepared just for this.)
- I mean, analyzing why something doesn’t work (or works) is more important.
- Then, we can add more “factors”.
- For example, per-word gain? “more than expected” gain?
- evaluation related
Sentence as sum of lexicals.
- Heidelberg castle
- “… was destroyed in XX”
Can you find the “…” by searching “Heidelberg”, or “Castle”?
- Boeing headquarters, Boeing factory, Boeing wing assembly for the 777
- located in Seattle, in Oklahoma, in Yokohama, Japan,
Word-level “relatedness” can’t answer much. But the approach can.
“Boeing HQ” -> “Boeing factory” -> “Boeing 777 wing assembly” ->
- Grab one news article, twist something in T to make it the opposite (still highly similar). Give a sentence as H, extracted from the same doc. (But do we want this? TE is meant to answer “real-world” cases, so no, I guess.)
- More likely: cases with one additional “explanatory” sentence. #596 #593 #687. Almost impossible for this approach to solve.
- 100 bytes? There are some weird (abnormal) news files even among the .story files
- now it works on all files in the given dir and its direct sub dirs
- P_t argument change (and all consequent callers)
- P_t code change (to traverse and run)
- (with multiple subdirs)
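- A minimal sketch of the “dir + direct subdirs” traversal (not the actual P_t code; the helper name is made up):
    use strict;
    use warnings;

    # collect files in a dir and in its direct subdirs (no deeper recursion)
    sub files_in_dir_and_subdirs {
        my ($dir) = @_;
        my @files   = grep { -f } glob("$dir/*");
        my @subdirs = grep { -d } glob("$dir/*");
        push @files, grep { -f } glob("$_/*") for @subdirs;
        return @files;
    }

    my @all = files_in_dir_and_subdirs($ARGV[0] // ".");
    print scalar(@all), " files found\n";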
- check the Plucene::Index::Writer docs
- call optimize before closing the writer.
- output the number of indexed files via $writer->doc_count;
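- A small sketch of the indexing call along those lines (the index path, field name “content”, and analyzer choice are assumptions, not the actual indexer code):
    use strict;
    use warnings;
    use Plucene::Index::Writer;
    use Plucene::Analysis::SimpleAnalyzer;
    use Plucene::Document;
    use Plucene::Document::Field;

    # third argument 1 = create a new index
    my $writer = Plucene::Index::Writer->new(
        "index_dir", Plucene::Analysis::SimpleAnalyzer->new, 1);

    for my $file (@ARGV) {
        open my $fh, '<', $file or die "$file: $!";
        my $text = do { local $/; <$fh> };
        my $doc = Plucene::Document->new;
        $doc->add(Plucene::Document::Field->Text(content => $text));
        $writer->add_document($doc);
    }

    $writer->optimize;                                   # optimize before closing
    print "indexed docs: ", $writer->doc_count, "\n";    # number of indexed files
    undef $writer;                                       # dropping the object closes the writer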
- run something (on temp already)
- check & compare, make sure it really works. (It seems so. “Seems so” doesn’t sound very rigorous, but I have no reason to believe it won’t work :-).)
- with test code. Yeah!
- … and how?
- … spend some time …
- Is it Okay to use top_N? say, 10k? Spend some time.
- Approximation will (artificially) lower P_t(hypo).
- But it will also lower P_t(text) and everything (?)
- What we finally do is compare P(hypo) and P(hypo | text): both of them get lowered. Is this acceptable? …
- Need more testing.
- It drops “too much” (it very easily hits the “min” value). Very large tail.
- why do the following two queries return different results?
“a bus collision with a truck in uganda has resulted in at least 30 fatalities and has left a further 21 injured” “30 die in a bus collision in uganda”
- write a simple script and test: “bus”, “bus collision”, “bus collision in uganda” (a sketch of such a script is below)
- (I am expecting an all-OR relation. Is it something else?)
- it was because of “and”. :-(
- More complex queries may contain nested queries with ‘and’, ‘or’, ‘not’ or ‘phrase’ relations. (Plucene::Search::Query)
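- A sketch of such a test script (the index dir and the default field name “text” are assumptions):
    use strict;
    use warnings;
    use Plucene::QueryParser;
    use Plucene::Analysis::SimpleAnalyzer;
    use Plucene::Search::IndexSearcher;
    use Plucene::Search::HitCollector;

    my $parser = Plucene::QueryParser->new({
        analyzer => Plucene::Analysis::SimpleAnalyzer->new,
        default  => "text",
    });
    my $searcher = Plucene::Search::IndexSearcher->new("index_dir");

    for my $q ("bus", "bus collision", "bus collision in uganda") {
        my $query = $parser->parse($q);
        my $hits  = 0;
        my $hc = Plucene::Search::HitCollector->new(
            collect => sub { my ($self, $doc, $score) = @_; $hits++ });
        $searcher->search_hc($query, $hc);
        # if the terms really combine as OR, the hit counts should only grow
        print "query [$q] -> $hits hits\n";
    }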
- input: weight (doc prob), sentence prob, of each document
- output: weighted average.
- model
- options
- sentence (input)
- (need): weighted-sum input format (simple matrix)?
- (already have): weighted-sum matlab code
- use weighted sum with the same weights. :-)
- check collection model file
- get P_coll (t) (with -debug 3)
- get pure P_d(t) for each document (with -debug 3), on all docs
- calculate lambda*P_d + (1-lambda)*P_coll for each, by calling octave
- do the weighted sum of the values, with uniform weights (a plain-Perl sketch of this check is below)
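- A plain-Perl sketch of that check, doing the lambda mixture and the uniform weighted sum directly rather than through octave (lambda and the probability values are placeholders):
    use strict;
    use warnings;

    my $lambda = 0.1;                            # placeholder mixing weight
    my $p_coll = 1.2e-30;                        # P_coll(t), read from ngram -debug 3 output
    my @p_docs = (3.4e-28, 5.6e-31, 7.8e-29);    # pure P_d(t) per document, same source

    # lambda*P_d + (1-lambda)*P_coll for each document model
    my @mixed = map { $lambda * $_ + (1 - $lambda) * $p_coll } @p_docs;

    # weighted sum with uniform weights (a plain average here)
    my $w   = 1 / @mixed;
    my $sum = 0;
    $sum += $w * $_ for @mixed;
    print "weighted sum: $sum\n";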
Wow. Finally.
- on AFP 2009 May
- Starting from P_t, P_h, P_h|t.
- Output of result hash:
- Debug 1 : output the hash into file, no sorting, file order
- Debug 2 : sorting, higher value first.
- a large number of files (>10k) in a dir makes file locating very, very slow.
- GOAL: make the per-doc “ngram” calls as fast as the “non-indexed” calls.
- Main cause was the big number of files per dir. Patched by using month/day subdirs.
- THIS HAS BEEN CANCELED. (see testing)
- It makes this even SLOWER!!!! (Memory was too FULL to do other things :-( strange…).
- Reverted. Maybe on the servers… then again, maybe not.
- path recorder, as a global (same as the index). It will be loaded only once, when it is null (see the sketch below).
- looks to be working well. Keep using this. (20 sec per trial? good)
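- Roughly this pattern (names are made up; only the “load once if null” idea matters):
    use strict;
    use warnings;

    our $PATH_RECORDER;    # global, same idea as the index handle

    sub load_path_recorder {
        # hypothetical loader, standing in for reading the real path list from disk
        return { loaded => 1 };
    }

    sub path_recorder {
        # first call loads it; later calls reuse the already-loaded (non-null) copy
        $PATH_RECORDER //= load_path_recorder();
        return $PATH_RECORDER;
    }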
- make the gzset unzipper use “months” too. This will reduce the number of files per dir.
- this wasn’t useful/impactful, and has some side effects. won’t use it.
- this will (maybe) make it faster to process indexed ones. (test on gillespie afp2010)
- It saves only a few dozen seconds. It has an effect, but not enough.
- Eh, it didn’t really help with the memory issue. Maybe I should call Devel::Size on important items.
- P_t_multithread_index return value (hash -> hashref)
- P_h_t_multithread_index (hash anon ref -> hashref)
- P_d_runner, return value (hash -> hashref)
(IGNORE THIS. path-base would be bigger)
- pick one or two “paragraph”-level “Text”s. Test them.
- Way too slow (no need to do, since 2010 takes 30+ min)
- Maybe we need something between 2), 3).
- a better baseline would be P(h|h), instead of P(h)? (topical relatedness gets some credit even before starting).
- the “gain” (P(h|t) / P(h)) seems to (generally) increase with the length of (t & h)
- (CURRENT) “-text” and “-lm”, and “-write-binary-lm”, all other default
- (CURRENT) all default: no other than “-ppl” (input designation) and “-lm”.
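- For reference, the two calls presumably look like this (the model build goes through ngram-count; file names are placeholders):
- ngram-count -text doc.txt -lm doc.model -write-binary-lm
- ngram -lm doc.model -ppl test.txt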
- Running P_t sequentially currently takes about 3 min (2:48) on Moore.
- Multiple threads (6) on Gillespie: 58 seconds
- I believe that ngram loads binary models automatically, so no additional coding is needed on the model-user side.
- For each news “story” we call twice: once ngram (can’t reduce this), once octave. Maybe starting up octave each time is expensive. Consider this.
- Currently, the file to be passed to ngram -ppl is a fixed name.
- should be improved to a temporary random name, or something like getName{sent}?
- Not really important, since the code does use multithreading for P_t, and a single instance can utilize many threads.
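- Still, a small sketch of the random-name option using File::Temp (model and sentence contents are placeholders):
    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # per-call random file name instead of the fixed one
    my ($fh, $tmpname) = tempfile("ppl_input_XXXXXX", SUFFIX => ".txt", UNLINK => 1);
    print $fh "30 die in a bus collision in uganda\n";   # sentence(s) to be scored
    close $fh;

    system("ngram", "-lm", "doc.model", "-ppl", $tmpname, "-debug", "3");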
- we may not need to do the costly log-space sums.
- (by multiplying the weights by a certain factor, so things stay within octave’s normal range; sketched below).
- (using reference_weightedsum, or an improved variation, etc).
- Not really important: it is only calculated two or three times per P(h|t). Not critical, compared to other efficiency issues.
- Well, “not needing octave anymore” would be nice, though.
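- A sketch of that scaling idea (not reference_weightedsum itself): shift the log probs by a common factor so the plain values stay in normal float range, do the weighted sum there, then shift back.
    use strict;
    use warnings;
    use List::Util qw(max sum);

    sub weighted_sum_of_logprobs {
        my ($logprobs, $weights) = @_;    # parallel array refs: log10 probs and their weights
        my $shift = max(@$logprobs);      # common scaling factor
        my $s = sum map { $weights->[$_] * 10 ** ($logprobs->[$_] - $shift) } 0 .. $#$logprobs;
        return $shift + log($s) / log(10);   # log10 of the weighted sum, shifted back
    }

    # example: three document-model log10 probs, uniform weights
    my $lp = weighted_sum_of_logprobs([-27.5, -30.2, -28.9], [1/3, 1/3, 1/3]);
    print "log10(weighted sum) = $lp\n";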
====
- When processing document-models;
- “Warning: count of count x is zero – lowering maxcount”
- “Warning: discount coeff n is out of range: 0”
It seems that both are related to sparseness. Not critical, but it has an effect (e.g. less good smoothing?)
- The “-bayes 0” mix-model is generally what I would expect from a simple summation: (lambda * model 1 prob) + ((1-lambda) * model 2 prob), at each word position. (Well, if you ask me what a non-zero -bayes means … I don’t know.)
- so the mixture model call is something like:
- ngram -lm doc.model -mix-lm collection.model -ppl test.txt -bayes 0 -debug 3 -lambda 0.1
- ppl = 10^(-logprob / (words - OOVs + sentences))
- ppl1 (without </s>) = 10^(-logprob / (words - OOVs))
- When no option is given, it does Good-Turing discounting. (The warnings come from that, when counting counts of counts, etc.)
- Q: They all share the same back-off interpolated model, so why are the results different?
- A: </s>
- Even an all-OOV doc still has at least one </s>, and the </s> prob differs per model.
- We now have an option to exclude this </s> from the calculation. (DEFAULT ON, on lamba_sumX)
- Seems like this causes the small difference in the final result. (try octave> a = 0.00409898)
- Octave uses H/W floats. … hmm. No easy way around it(?)
- Eh, no. The above example is actually within HW float range, but octave cuts it. Probably some precision-cutting mechanism at work. What is it?
- “Symbolic toolbox”. vpa(something)? Hmm. no need yet.
- Basically, what I am trying to do is a weighted sum of probabilities. There are two ways of doing it:
- word-level weighted sum and sentence-level weighted sum
- Say the sentence probability is P(w_1, …, w_n).
- At the sentence level, this can be calculated as weighted_mean_all_d( P_d(w_1, …, w_n) )
- At the word level, this can be calculated as
- the product over all words of { weighted_mean_all_d( P_d(w_i | w_i-3, w_i-2, w_i-1) ) }, up to weighted_mean_all_d( P_d(</s> | …) )
- The problem is that the two values are different. A weighted mean at the sentence level (each sentence probability computed by each document model) produces one value. The product of word-level probabilities, each obtained by a per-word weighted mean, produces another. They are generally not far apart, but not the same (see the formulas below).
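- In symbols (one way to write the two quantities; lambda_d are the per-document weights):
    % sentence level: weight whole-sentence probabilities across document models d
    P_{sent}(w_1,\dots,w_n) = \sum_d \lambda_d \, P_d(w_1,\dots,w_n)
    % word level: weight each conditional word probability, then take the product
    P_{word}(w_1,\dots,w_n) = \prod_i \sum_d \lambda_d \, P_d(w_i \mid w_{i-3}, w_{i-2}, w_{i-1})
    % in general P_{sent} != P_{word}: a weighted sum of products is not a product of weighted sums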
- If we want to use the “per-word predictability” power, we need to do things at the word level. Maybe this is more powerful (and a bit slower).
- If we are not interested in the word level, and since our assumption simply says the underlying document model generates a probability for each given sentence… then the sentence level is good enough.
- Try both? Hmm.
- Try both?: no. on sentence level.
- Sentence level. Following strictly to P_d(sentence).
- Basic premise: a sentence, a probability. Each document model is independent (weakly linked through the coll-model, but that is not relevant here).
- Word-level might be useful/needed for “dynamic/better LM”.