TTD.org

Known bugs (to be fixed)

proto_condprob.pm dies at line 677 on test #677

  • test #677
  • Why? (What a coincidence: line 677, test #677.) Is T or H missing?!? Hmm.
  • There is one such case in the DEV data, too. (check dev #141)

Two or more sentences in T.

  • For now, only the first sentence of T is being used (probably the reason for the search result / maxprob mismatch?).
  • How to cope with this?
    • Use “both sentences” in the T search.
    • Do two queries.
  • There are some tricky aspects to this… well, put both of them in one query? Hmm.
  • See dev #25, #26, test #4, etc.

Top search results sometimes give a generative probability lower than the cut point?

  • Sometimes. (test #16, #24)
  • Why? Just a lack of corpus observations on the content terms?

Things to Check

Check: is a query containing “\n” okay? (Remove any newlines?)

  • #25 (multiple sentence)
  • #26 (same)

Check: Maxprob is lower? Search ordering error? (DEBUG2)

  • #15 TH
  • #14 T
  • #21 T

Check: RTE data same? (RTE3) - #26

  • 574 - mountain name? hmm.

CONSIDER: EXPR (removing all punctuation: , . !)

IMPROVE: Add more feature outputs

  • per-word P(h), per-word P(t), per-word P(h|t), per-word P(h|t) - P(h)
  • Note: the “effective number of words” in h and t comes from the P(t) methods.
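
To make those features concrete, here is a minimal hypothetical sketch of how the per-word variants could be derived from raw log10 probabilities (function and field names are made up, not the module’s actual API):

#+BEGIN_SRC perl
# Hypothetical per-word feature computation: divide each log10
# probability by the (effective) word count, then take the gain.
sub perword_features {
    my ($logP_h, $logP_t, $logP_h_t, $len_h, $len_t) = @_;
    my %f = (
        perword_P_h   => $logP_h   / $len_h,
        perword_P_t   => $logP_t   / $len_t,
        perword_P_h_t => $logP_h_t / $len_h,
    );
    # per-word P(h|t) - P(h), i.e. the per-word gain in log space
    $f{perword_gain} = $f{perword_P_h_t} - $f{perword_P_h};
    return \%f;
}

my $f = perword_features(-8.2, -35.0, -6.9, 7, 30);
print $f->{perword_gain}, "\n";
#+END_SRC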

Things to Ponder

“Dynamic” boundary?

  • Can one fixed “boundary” on the gain be okay? E.g., for a highly covered topic vs. a rarely covered topic, “not related”, “somehow related”, and “I’ve seen them” will sit at different levels… or not?
  • Just… cover all of the topics and hope for the best?
  • Some indicators? (Top search result #10’s P(t) and P(h)? Hmm.)
  • Think about this: it is not simple.
  • I guess a careful “analysis” would be needed. (The PtPh list is prepared just for this.)
  • I mean, analyzing why something doesn’t work (or works) is more important.
  • Then we can add more “factors”.
  • For example, per-word gain? “More than expected” gain?

“Category” of the problem

  • evaluation related

Some MEMOs (thoughts for the paper write-up)

Why does it work?

Lexical-level co-occurrence can’t answer P(h|t).

A sentence as a sum of lexical items.

Example 1

  • Heidelberg Castle
  • “… was destroyed in XX”

Can you find the “…” from searching “Heidelberg”, or “Castle”?

Example 2

  • Boeing headquarters, Boeing factory, Boeing wing assembly for the 777
  • located in Seattle, in Oklahoma, in Yokohama, Japan, …

Word-level “relatedness” can’t answer much, but this approach can.

“Boeing HQ” -> “Boeing factory” -> “Boeing 777 wing assembly” ->

Weakness?

One easy way to fool it.

  • Grab one news article and twist something in T to make it the opposite (still highly similar). Then give, as H, a sentence extracted from the same doc. (But do we want this? TE is supposed to answer “real-world” entailment, so no, I guess.)
  • More likely: cases with one additional “explanatory” sentence. #596 #593 #687. Almost impossible for this approach to solve.

When the “T” sentence wasn’t found. (Can this be overcome with more text?)

Proper-names (entity names), ups and downs

RTE3 with AFP2010

RTE3 with AFP2009+2010

RTE3 with early AFP2010 (Jan-June)

======== PRESENT (above) ========

======== PAST (below) ========

DEVEL History

(Stored) Future Improvements

[#B] Remove (or don’t count) news articles that are too short

  • Under 100 bytes? There are some weird (abnormal) news files even among the .story files.

Previous Improvements

[#A] Collection model generation from subdirs

For the collection model: catall to dump everything in the subdirs.

[#A] Per-doc model generation with subdirs

For per-doc models: perstory_runner with subdirs.

  • It now works on all files in the given dir and its direct subdirs.

[#A] SubDir plays (needed before doing more than one year of AFP)

P_t should traverse all subdirs.

  • P_t argument change (and all consequent callers)
  • P_t code change (to traverse and run)

Test of P_h_t_multithread with sketch

  • (with multiple subdirs)

[#A] Index Work

add an index optimizer at the end of indexing.pl

  • check the Plucene::Index::Writer docs
  • call optimize before closing the writer.
  • output the number of indexed files via $writer->doc_count;
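
A minimal sketch of what that end-of-run step might look like (the index path and analyzer here are assumptions; optimize and doc_count are used as the notes above suggest):

#+BEGIN_SRC perl
use Plucene::Index::Writer;
use Plucene::Analysis::SimpleAnalyzer;

# assumed index location; 0 = open an existing index, don't recreate
my $writer = Plucene::Index::Writer->new(
    'plucene_index', Plucene::Analysis::SimpleAnalyzer->new, 0);

# ... the per-file add_document() calls of indexing.pl happen here ...

$writer->optimize;                                 # merge segments once, at the end
print 'indexed docs: ', $writer->doc_count, "\n";  # sanity output
undef $writer;                                     # writer flushes/closes on destruction
#+END_SRC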

run temp.pl query on 2009 data

  • run something (on temp already)
  • check & compare, make sure it really works. (It seems so. Seems so doesn’t sound so strict, but I have no other reason to belive it won’t work so :-).

recreate the work environment in home …

query method. (Text in, ordered result out)

  • with test code. Yeah!

Code P_t with index.

  • … and how?
  • … spend some time …

Implement the “top N” approximation. Check some approx vs. non-approx results.

Some more tests on P_t; “try the approximation”?

  • Is it okay to use top_N? Say, 10k? Spend some time on this.
  • The approximation will (artificially) lower P_t(hypo).
  • But it will also lower P_t(text) and everything else (?)
  • What we finally do is compare P(hypo) and P(hypo | text), and both get lowered. Is this acceptable? …
  • Needs more testing.
  • It drops “too much” (it very easily hits the “min” value). A very big tail.

Implement P_t_h_index with N approximation.

Test P_t_h_index with test code.

Play with some more simple texts on the newest implementation.

SEARCH ERROR PATCH

  • why do the following two queries return different results?

“a bus collision with a truck in uganda has resulted in at least 30 fatalities and has left a further 21 injured” “30 die in a bus collision in uganda”

  • write a simple script and test: “bus”, “bus collision”, “bus collision in uganda”
  • (I am expecting a pure OR relation between terms; is it something else?)

result return script

test those queries in sequence

  • it was because of “and”. :-(

check all “special words” for (P)lucene queries.

  • More complex queries may contain nested queries with ‘and’, ‘or’, ‘not’ or ‘phrase’ relations. (Plucene::Search::Query)

improve plucene_query() by removing those terms from the given query (see the sketch below)
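
A rough sketch of that sanitizing step (the helper below is hypothetical, not the real plucene_query()):

#+BEGIN_SRC perl
# Hypothetical pre-filter: drop the words Plucene's query parser
# treats as operators, so the query becomes a plain OR-term list.
my %special = map { $_ => 1 } qw(and or not);

sub sanitize_query {
    my ($text) = @_;
    $text =~ tr/\n/ /;                # no raw newlines in the query
    $text =~ s/["',.!?]/ /g;          # strip quotes and punctuation
    my @terms = grep { length && !$special{lc $_} } split /\s+/, $text;
    return join ' ', @terms;
}

print sanitize_query("30 die in a bus collision in uganda"), "\n";
#+END_SRC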

Main line coding

Collection Model
(run) Get “target” news files (target corpus) all in one folder
(run) catall and generate collection LM model
[#C] (If subdir needed) TODO? (write script) recursively catall and generate collection model
Document Model
(write script) For each file, make each LM model
Produce single sentence prob. (t)
(write matlab script) weighted-sum
  • input: weight (doc prob), sentence prob, of each document
  • output: weighted average.
(write scripts) P(t) prob
(write debug3 reader) read_log_prob, read_prob
(write octave caller) lambda sum (interpolate)
check code for getting seq_prob into the lambda sum
(srilm caller) write ngram runner
  • model
  • options
  • sentence (input)
(write octave caller) weighted sum
  • (need): weighted-sum input format (simple matrix)?
  • (already have): weighted-sum matlab code
(write octave caller wrapper) logprob mean
  • use weighted sum with same weights. :-)
calc P_coll
  • check collection model file
  • get P_coll (t) (with -debug 3)
each P_doc(t)
  • get each pure P_d(t) (with -debug 3), over all docs
  • calculate lambda*P_d + (1-lambda)*P_coll for each, by calling octave
calc P_(t) by weighted sum (see the sketch after this list)
  • do the weighted sum of the values, with uniform weights
Produce conditional prob.
(write scripts) P(h | t) prob
write script “evidence calculation code”
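
A compact sketch of the P(t) computation described in this list, with the interpolation and the uniform-weight mean done in one place (all names are illustrative; the real pipeline reads the values from ngram -debug 3 output):

#+BEGIN_SRC perl
# Hypothetical end-to-end P(t): interpolate each per-doc log10 prob
# with the collection model, then take the uniform weighted mean.
sub P_t {
    my ($lambda, $lp_coll, @lp_docs) = @_;    # log10 sentence probs
    my ($sum, $w) = (0, 1 / @lp_docs);        # uniform doc weights
    for my $lp_d (@lp_docs) {
        # lambda*P_d + (1-lambda)*P_coll, in probability space
        $sum += $w * ($lambda * 10**$lp_d + (1 - $lambda) * 10**$lp_coll);
    }
    return log($sum) / log(10);               # back to log10
}

print P_t(0.1, -12.0, -8.5, -9.2, -30.0), "\n";
#+END_SRC

(With very small log10 values the 10** terms underflow in doubles; that is the log-space-sum issue recorded under the postponed items below.)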

Wow. Finally.

sanity check, more with sketch.
  • on AFP 2009 May
[#A] Some possible “look-into” data saving.
  • Starting with P_t, P_h, P_h|t.
  • Output of the result hash:
  • Debug 1: output the hash into a file, no sorting, in file order
  • Debug 2: sorted, higher values first.

PERFORMANCE WORK

  • Too many files in a dir (>10k) makes file locating very, very slow.
  • GOAL: make the per-doc “ngram” calls as fast as the “non-indexed” calls.
  • The main cause was the large number of files in one dir. Patched by using month/day subdirs.

(AS REJECTED) Index loading only once

  • THIS HAS BEEN CANCELED. (see testing)
writing
testing (on Westy)
  • It made things even SLOWER!!!! (Memory was too full to do other things :-( strange…).
  • Reverted. Maybe on the servers… then again, maybe not.

Getting list of all model files, only once

  • path recorder, as a global (same as the index); it is loaded only once, when it is still null
writing
testing
  • looks to be working well. Keep using this. (20 sec per trial? good)
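
The “load only once” pattern in question, as a tiny hypothetical sketch (variable name and glob pattern are made up):

#+BEGIN_SRC perl
# Global path recorder: populate the model-file list on the first
# call only, then reuse it for every later lookup.
our $model_files;    # stays undef until first use

sub model_file_list {
    my ($dir) = @_;
    $model_files //= [ sort glob("$dir/*/*.model") ];
    return $model_files;
}
#+END_SRC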

GZSet to use Month as dir

  • make the gzset unzipper use “months” too. This will reduce the number of files per dir.
writing
testing

Sort index hit result

  • this wasn’t useful/impactful enough, and has some side effects; won’t use it.
writing
  • this will (maybe) make it faster to process the indexed ones. (test on Gillespie, afp2010)
testing
  • Only a few dozen seconds. It helps, but not enough.

(GAVE UP) Memory Profiling

Run a Profiler (Westy, afp 2010)

BRIEF RUN of NYTProf
  • Eh, it didn’t really help with the memory issue. Maybe I should call Devel::Size on the important items.

(NOT NOW) Devel::Size on major data structures

Targets? (Hmm, do we really need this?)

change the code to reduce the memory footprint

FIND and REPLACE returning new array/hash
LIST those parts
  • P_t_multithread_index return value (hash -> hashref)
  • P_h_t_multithread_index (hash anon ref -> hashref)
  • P_d_runner, return value (hash -> hashref)

(IGNORE THIS. path-base would be bigger)
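
For reference, the kind of change those items describe, as a tiny self-contained illustration (function names hypothetical):

#+BEGIN_SRC perl
# Returning a hash copies every key/value pair out to the caller;
# returning a hashref hands back a single scalar instead.
sub result_as_hash    { my %r = (a => 1, b => 2); return %r;  }
sub result_as_hashref { my %r = (a => 1, b => 2); return \%r; }

my %copy = result_as_hash();      # full copy of the data
my $ref  = result_as_hashref();   # no copy; same underlying hash
print $ref->{a}, "\n";
#+END_SRC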

RTE Read-Eval

Make an experiment sketch code that reads and works on the RTE3 data.

open up a new code, sketch once
make caller without splitta
add splitta support

EXPERIMENTS

Need to confirm/consider

Very long sentences okay? (logprob of -200 or less)

  • pick one or two “paragraph”-level “Text”s. Test them.

MODEL preparation

[#A] See how ngram-count works on large files

1) afp 2010 (no problem)

1-b) afp 2010 per doc (no problem)

2) all afp. (Gillespie, no problem)

2-b) all afp, per doc (Gillespie, ONGOING)

  • Way too slow (no need to do, since 2010 takes 30+ min)

3) all of the gigaword?

  • Maybe we need something between 2) and 3).

Some additional ideas

some rough ideas & observations

  • A better baseline would be P(h|h) instead of P(h)? (Topical relatedness gets some credit even before starting.)
  • The “gain” (P(h|t) / P(h)) seems to (generally) increase with the length of t &amp; h.

Notes

Currently used/tested SRILM call parameters

ngram-count

  • (CURRENT) “-text” and “-lm”, plus “-write-binary-lm”; all others default

ngram

  • (CURRENT) all default: nothing other than “-ppl” (input designation) and “-lm”.
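
Spelled out, the assumed full invocations would look like this (file names are placeholders):

  • ngram-count -text corpus.txt -lm doc.model -write-binary-lm
  • ngram -ppl test.txt -lm doc.model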

Memo on efficiency

Testing on May 2009 AFP news (20k documents)

  • Running P_t sequentially currently takes about 3 min (2:48) on Moore.
  • With 6 threads on Gillespie: 58 seconds

RECORDS & POSTPONED

Past Improvements

Binary language model

add binary option as default option

collection model description (user’s own calling)

perstory_runner.pl (per document model)

  • I believe ngram automatically loads binary models, so no additional coding is needed on the model-user side.

[#A] Bug: splitta outputs the last “.” concatenated onto the last word.

TODO? [#C] [??] feature for catall.pl: “do not print files smaller than size X”

TODO? [#C] [Very hard - Possible?] Matrix-ize weighted_sum Octave code.

[#A] [Efficiency] Lambda sum in Perl space. (No octave call)

  • For each news “story” we make two calls: once to ngram (can’t reduce this) and once to octave. Starting up octave each time is probably expensive. Consider this.
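
A minimal sketch of that lambda sum in Perl (hypothetical helper; assumes the per-word log10 probs were already parsed from the ngram -debug 3 output):

#+BEGIN_SRC perl
# In-Perl replacement for the per-story octave call: interpolate two
# per-word log10 prob vectors and sum the result over the sentence.
sub lambda_sum {
    my ($lambda, $doc_lps, $coll_lps) = @_;   # array refs, same length
    my $total = 0;                            # log10 prob of the sentence
    for my $i (0 .. $#$doc_lps) {
        my $p = $lambda       * 10 ** $doc_lps->[$i]
              + (1 - $lambda) * 10 ** $coll_lps->[$i];
        $total += log($p) / log(10);
    }
    return $total;
}

print lambda_sum(0.1, [-2.1, -3.4], [-1.8, -2.9]), "\n";
#+END_SRC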

[#A] [Efficiency for response] Not using multiple threads/ngram processes

Postponed improvements: “Good to have, but not critical”

TODO? [#C] [Efficiency for throughput] Unable to call two or more instances.

  • Currently, the file to be passed to ngram -ppl is a fixed name.
  • This should be improved to a temporary random name, or something like getName{sent}?
  • Not really important, since the code already uses multithreading for P_t, and a single instance can utilize many threads.

TODO? [#C] If the log-sum is only needed as a “weighted sum” (use a not-too-small sum)

  • we may not need to do the costly log-space sums.
  • (by multiplying the weights by a suitable factor, so the values stay within octave’s normal range).
  • (using reference_weightedsum, or an improved variation, etc.)
  • Not really important: it is only computed two or three times per P(h|t). Not critical compared to other efficiency issues.
  • Well, “not needing octave anymore” would be nice, though.

====

Known problems

Discount related questions

  • When processing document models:
  • “Warning: count of count x is zero – lowering maxcount”
  • “Warning: discount coeff n is out of range: 0”

It seems both are related to sparseness. Not critical, but it has some effect (e.g., less good smoothing?).

Side notes about tools

SRILM

Interpolate call parameters

  • The “-bayes 0” mixture model is what I would expect from simple interpolation: (lambda * model-1 prob) + ((1 - lambda) * model-2 prob), at each word position. (Well, if you ask me what a non-zero -bayes means… I don’t know.)
  • so the mixture model call is something like:
  • ngram -lm doc.model -mix-lm collection.model -ppl test.txt -bayes 0 -debug 3 -lambda 0.1

Perplexity (per word), as calculated in SRILM

  • ppl = 10^(-logprob / (words - OOVs + sentences))
  • ppl1 (without </s>) = 10^(-logprob / (words - OOVs))
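  • Worked example (made-up numbers): logprob = -20 over 10 words, 0 OOVs, 1 sentence gives ppl = 10^(20 / (10 - 0 + 1)) = 10^(20/11) ≈ 66.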

Discount methods in SRILM by default

  • When no option is given, it uses Good-Turing discounting. (The warnings above come from that, e.g. when counting counts-of-counts.)

Why a different prob for all-OOV queries?

  • Q: They all share the same back-off interpolated model, so why different?
  • A: &lt;/s&gt;
  • Even an all-OOV query has at least one &lt;/s&gt;, and each model assigns a different &lt;/s&gt; probability.
  • We now have an option to exclude this &lt;/s&gt; from the calculation. (DEFAULT ON, in lamba_sumX)

Octave

Octave’s “precision” for doubles is one digit less (than SRILM’s)

  • This seems to cause the small difference in the final result. (Try octave> a = 0.00409898.)
  • Octave uses H/W floats. … hmm, no easy way around it(?)
  • Eh, no. The above example actually fits within a hardware float, but octave cuts it when printing. Probably a display-precision mechanism at work: the default “format short” prints about five significant digits, while “format long” shows the full double.
  • “Symbolic toolbox”? vpa(something)? Hmm, no need yet.

Theoretical crosspoints / decisions

THEORETICAL

[#A] Word-level model, or sentence-level model?

  • Basically, what I am trying to do is a weighted sum of probabilities. There are two ways of doing it:
  • word-level weighted sum, and sentence-level weighted sum.
  • Say the sentence probability is P(w_1, …, w_n).

Sentence level weighted-sum

  • At the sentence level, this can be calculated as weighted_mean_all_d( P_d(w_1, …, w_n) ).

Word-level weighted sum

  • At the word level, this can be calculated as
  • product { weighted_mean_all_d( P_d(w_n | w_{n-1}, w_{n-2}, w_{n-3}) ) * weighted_mean_all_d( P_d(w_{n+1} | w_n, w_{n-1}, w_{n-2}) ) * … * weighted_mean_all_d( P_d(&lt;/s&gt; | …) ) }

Not compatible

  • The problem is that the two values differ. A weighted mean at the sentence level (each whole-sentence probability computed by each document model) produces one value. The product of word-level probabilities, each obtained by a per-word weighted mean, produces another. They are generally not far apart, but not the same.
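
A tiny made-up example shows the gap. Take two documents with uniform weights (0.5 each) and a two-word sentence, where doc A assigns word probs (0.9, 0.1) and doc B assigns (0.1, 0.9):

  • Sentence level: mean(0.9 * 0.1, 0.1 * 0.9) = mean(0.09, 0.09) = 0.09.
  • Word level: mean(0.9, 0.1) * mean(0.1, 0.9) = 0.5 * 0.5 = 0.25.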

Which one should we use?

  • If we want to use per-word predictive power, we need to work at the word level. Maybe this is more powerful (and a bit slower).
  • If we are not interested in the word level: our assumption simply says the underlying document model generates a probability for each given sentence, so the sentence level is good enough.
  • Try both? Hmm.

For now?

  • Try both? No: sentence level.
  • Sentence level, following P_d(sentence) strictly.
  • Basic premise: one sentence, one probability. Each document model is independent (weakly linked via the collection model, but that is not relevant here).
  • Word level might be useful/needed for a “dynamic/better LM”.