making_models.txt
===
0. PREREQUISITES
- SRILM binaries (ngram-count and ngram) must be in PATH.
- Python 2.x is needed by the sentence splitter (Splitta, in /splitta).
- Java 1.6 or later is required for the SOLR search engine.
- Perl 5.8 or later is needed.
(We have tested it with perl-5.16 and 5.18.)
- Also, you need to install the following CPAN modules (an example
install command is shown after this list).
+ WebService::Solr (for connecting to the SOLR search server from Perl)
+ forks (for running ngram concurrently on the document models)
(Additionally, the code uses the following modules, normally shipped
with a default Perl.)
+ DB_File (for caching)
+ Benchmark (for timing)
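For example, the CPAN modules can be installed roughly like this
(using the stock cpan client; cpanm works equally well if you have it):
> cpan WebService::Solr forks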
0.b. Installation
- Clone the git repository.
- Unpack the solr-4.3.0 binary tarball into the following path:
/gigaword-lm-script/solr-4.3.0
(When freshly cloned, the path only has configurations, no SOLR binaries.)
(Unpacking can be done simply by running the following command in
/gigaword-lm-script:
> tar xvzf /path-to-downloaded-file/solr-4.3.0.tgz
The command will unpack the files into the expected path,
/gigaword-lm-script/solr-4.3.0/,
and it does not interfere with the already existing configuration
directory /gigaword-lm-script/solr-4.3.0/gigaword_indexing/.)
===
1.a. Unpack the selected GigaWord gz files, and run splitta to do
tokenization and sentence splitting. This will form your corpus.
Commands to run:
> perl gzset_runner.pl [list of GigaWord gz files]
Note: the output folder is defined in the script. For now, let's assume
it is "./models/document/" (the actual path may differ). This process
will take some time, depending on the corpus size.
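For example (the file names below are only an illustration; substitute
your own selection of GigaWord gz files):
> perl gzset_runner.pl afp_eng_201001.gz afp_eng_201002.gz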
1.b. (Optional, but you will usually need this.) Very short files (one-
or two-liner news stories) can act as noise for the "document-based"
models. We avoid this by deleting files that are far too short.
Command to run:
> perl rm_very_short_story.pl ./models/document
The script will walk over the documents and report how many files it
has removed from the document collection.
Note: How short is too short? By default, files with fewer than 4
sentences are considered too short. (Each story starts with its title,
so a removed file has a news-article body of two or fewer sentences.)
You can change this threshold via the global config variable
$DOC_MIN_NUM_SENTENCES at the top of the script. (A sketch of the kind
of check the script performs is shown below.)
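The sketch below illustrates the kind of filter this step applies. It
is not the actual rm_very_short_story.pl: it assumes one sentence per
line in each *.story file, and the directory layout is taken from the
examples above.

# filter_short_stories.pl (hypothetical name) -- sketch only
use strict;
use warnings;
use File::Find;

my $DOC_MIN_NUM_SENTENCES = 4;   # files with fewer sentences are removed
my $dir = defined $ARGV[0] ? $ARGV[0] : './models/document';
my $removed = 0;

find({ no_chdir => 1, wanted => sub {
    return unless /\.story$/;
    open my $fh, '<', $_ or die "cannot open $_: $!";
    my $sentences = 0;
    $sentences++ while <$fh>;         # assumption: one sentence per line
    close $fh;
    if ($sentences < $DOC_MIN_NUM_SENTENCES) {
        unlink $_ or warn "cannot remove $_: $!";
        $removed++;
    }
} }, $dir);

print "removed $removed very short stories\n";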
===
2. Generate "collection" LM on all of the "story" (actual news
article) files.
Commands to run:
1) > perl cat_all_stories.pl ./models/document > ./models/collection/collection.txt
2) > ngram-count -text ./models/collection/collection.txt -lm ./models/collection/collection.model -write-binary-lm
Note: 1) generates a single big collection text of news articles for
the SRILM ngram modeller. 2) runs the SRILM ngram modelling on the big
text and saves the resulting model to ./models/collection/collection.model.
Note: If you use other options for the LM (such as the n-gram order),
those must also be reflected in the next step (see the example below).
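For example, if you want an explicit n-gram order (the -order flag is a
standard SRILM option; order 3 here is only an illustration), the same
order must then be used when the per-document models are built in step 3:
> ngram-count -order 3 -text ./models/collection/collection.txt -lm ./models/collection/collection.model -write-binary-lm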
===
3. Generate a "per document" LM for each of the "story" files.
Commands to run:
> perl perstory_runner.pl ./models/document
Note: The model files (*.model) will be generated in the same
directory as the story (news article, *.story) files, with the same
base name (*.story.model), as binary SRILM ngram models.
Note: This also takes some time, depending on the corpus size.
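For orientation, this step roughly amounts to the sketch below. It is
not the actual perstory_runner.pl; it simply calls ngram-count once per
story file, and any extra LM options (e.g. -order) should match the
ones used for the collection model.

# per-story model sketch (hypothetical, not the shipped script)
use strict;
use warnings;
use File::Find;

find({ no_chdir => 1, wanted => sub {
    return unless /\.story$/;
    my $story = $_;
    my $model = "$story.model";
    # build one binary SRILM model per story file
    system('ngram-count', '-text', $story, '-lm', $model,
           '-write-binary-lm') == 0
        or warn "ngram-count failed for $story\n";
} }, './models/document');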
===
4. Now the model files are ready. The collection-wide model is at
./models/collection/collection.model, and the per-document models are
at ./models/document/[newsfile].model, alongside the news-file texts
in the ./models/document/ directories.
Time to index them for faster calculation.
First of all, start the Apache SOLR search engine. A copy of SOLR,
with a configuration specific to this task, is included in the code
distribution.
Command to run:
> cd solr-4.3.0/gigaword_indexing
(change directory into /solr-4.3.0/gigaword_indexing/)
> java -jar start.jar
(This command starts the SOLR engine with the given configuration.)
(Note that you can start SOLR in the background with "&", but for now I
recommend keeping it running in your command window, just to watch the
SOLR logs.)
If you see the following in the SOLR output, SOLR is now up and running
on port 9911: "[main] INFO org.eclipse.jetty.server.AbstractConnector –
Started SocketConnector@0.0.0.0:9911"
Now SOLR is up and running. Time to run the indexing. Open another
terminal, go to the top directory of the code, and run the following
command:
> perl indexing_solr.pl "./models/document"
(The command walks over the documents and indexes each of them with
SOLR; a rough sketch of such indexing is shown below.)
This will take some time (but much less than the per-document model
generation).
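The sketch below shows roughly what adding a document to SOLR with
WebService::Solr looks like. It is not the actual indexing_solr.pl;
the SOLR URL/core and the field names are assumptions, not taken from
the shipped configuration.

# indexing sketch (hypothetical URL, core and field names)
use strict;
use warnings;
use WebService::Solr;
use WebService::Solr::Document;

my $solr = WebService::Solr->new('http://localhost:9911/solr');
my $doc  = WebService::Solr::Document->new(
    id   => 'afp_eng_example.story',      # hypothetical document id
    text => 'Body text of the news story ...',
);
$solr->add($doc);    # queue the document for indexing
$solr->commit;       # make it searchable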
---
Once it is all done, you can inspect the indexed (GigaWord) documents
via the SOLR web interface.
1) Open your browser at: http://localhost:9911/solr/#/
2) On the SOLR interface, click the left tab "Core Admin".
3) Check the number of documents indexed, under "Index", "numDocs".
===
5. All model preparation is done. Now run a test.
> perl sketch.pl
This short script gives you a minimal running example of the calls
needed to calculate P(text1 | text2).
The default "bus accident" example should give a PPL gain of about 46
percent (0.46 ...), along with the corresponding PMI, on the AFP2010
corpus.
If the numbers are seriously different (e.g. negative values), it is
not working properly. See section A below and check your Perl
multi-thread/process support.
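For reference (this is the standard definition, not taken from
sketch.pl): PMI(text1; text2) = log [ P(text1 | text2) / P(text1) ],
so a clearly negative PMI means the document-conditioned model makes
text1 less likely than the unconditioned model does, which for the
default example indicates a problem.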
===
A. About Perl version
- It is best to use a Perl build compiled without native thread support.
- The condprob.pm module uses multiple processes via forks.pm.
In theory, a Perl build with native thread support should work with no
problem, with or without forks (either with "use forks" or with
"use threads"), but the actual thread code in Perl (especially 5.18 on
Linux) has bugs and fails to work properly.
- If such a failure occurs, the code cannot process the given number of
document models (the progress dots are missing or too few), and it
produces very strange values.
- So what to do in such a case? We recommend building and using a Perl
interpreter without native thread support: the process-based forks.pm
is actually faster, and you lose nothing with a "no thread" Perl.
- Use perlbrew to install a Perl version with the default options (the
default build does not enable usethreads). perlbrew is very convenient:
it builds and installs everything automatically under your user account
(no root required for Perl or the CPAN modules). An example sequence is
shown after this list.
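A typical perlbrew sequence looks like this (the Perl version number is
only an illustration; pick any recent 5.x release):
> perlbrew install perl-5.18.4
> perlbrew switch perl-5.18.4
> cpan WebService::Solr forks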
B. Module dependencies: external library dependencies of
WebService::Solr and DB_File
- WebService::Solr requires the Expat XML parser library. Most
unix-like systems have it by default, but if yours does not, you have
to install it.
- DB_File requires Berkeley DB, which is also normally installed on
unix systems; you additionally need the development ("include") headers
for the DB (e.g. the "db5.1-dev" package in Debian, etc.).
- Again, on a unix-like system these are present by default or very
easy to provide (see the example below). On Windows? ... I never tried
and can't tell.
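On Debian or Ubuntu, for example, the headers can typically be
installed roughly like this (exact package names vary by release):
> sudo apt-get install libexpat1-dev libdb-dev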