Termolator - Chinese Terminology Extraction and TJet

This package contains NYU's Chinese terminology extraction system and a Jet wrapper to support English terminology extraction. The system is released under the Apache License, except for the files in the sampleRDG/ and sampleBackground/ directories. See https://github.com/AdamMeyers/The_Termolator for the English version.

Files in demo/ are taken from Wikipedia and licensed under the CC-BY-SA 3.0 License.

Chinese Terminology Extraction

The binary release of the Chinese terminology extraction system can be downloaded from:

https://github.com/ivanhe/termolator/releases/download/Beta1/chinese_term_extraction.zip

To perform Chinese terminology extraction, first unzip the package and then run:

./run_cn.sh IN_DOMAIN_FILELIST OUT_OF_DOMAIN_FILELIST OUTPUT_FILE
  • IN_DOMAIN_FILELIST: List of in-domain files generated by a Chinese noun chunker, in CoNLL format. The CoNLL format we use assumes one word per line. Each line has four fields, delimited by the tab character. Field 1: Word; Field 2: Word; Field 3: Part-of-speech tag; Field 4: BIO tag for NP (B-NP, I-NP, or O).

  • OUT_OF_DOMAIN_FILELIST: List of background files generated by a Chinese noun chunker, in CoNLL format.

  • OUTPUT_FILE: Name of the output file. The output file will contain a ranked list of terms.
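Before passing files to run_cn.sh, it can help to check that they follow the four-field layout described above. The following is a minimal Python sketch, not part of the released system: the function names are hypothetical, the part-of-speech tags in the sample are only illustrative, and treating blank lines as sentence boundaries is an assumption.

```python
from typing import Iterable, List, Tuple

VALID_BIO_TAGS = {"B-NP", "I-NP", "O"}


def parse_conll_line(line: str) -> Tuple[str, str, str, str]:
    """Split one token line into its four tab-delimited fields.

    Raises ValueError if the line does not have exactly four fields
    or if the fourth field is not B-NP, I-NP, or O.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-delimited fields, got {len(fields)}: {line!r}")
    if fields[3] not in VALID_BIO_TAGS:
        raise ValueError(f"unexpected BIO tag {fields[3]!r} in line {line!r}")
    return fields[0], fields[1], fields[2], fields[3]


def extract_noun_chunks(lines: Iterable[str]) -> List[str]:
    """Join the words of each B-NP/I-NP run into one noun-chunk string."""
    chunks: List[str] = []
    current: List[str] = []
    for line in lines:
        if not line.strip():
            # Assumption: a blank line marks a sentence boundary and ends any open chunk.
            if current:
                chunks.append("".join(current))
                current = []
            continue
        word, _, _, bio = parse_conll_line(line)
        if bio == "B-NP":
            if current:
                chunks.append("".join(current))
            current = [word]
        elif bio == "I-NP" and current:
            current.append(word)
        else:
            # "O" closes the open chunk; an I-NP with no preceding B-NP is dropped.
            if current:
                chunks.append("".join(current))
            current = []
    if current:
        chunks.append("".join(current))
    return chunks


sample = [
    "拜占庭\t拜占庭\tNR\tB-NP",
    "帝国\t帝国\tNN\tI-NP",
    "是\t是\tVC\tO",
    "国家\t国家\tNN\tB-NP",
]
print(extract_noun_chunks(sample))  # ['拜占庭帝国', '国家']
```

Running a check like this over every file in a file list catches missing fields or malformed tags early.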

The in-domain corpus is the corpus from which terms are extracted; the out-of-domain corpus should be a general-domain corpus. To try the system on the bundled demo data, run:

./run_cn.sh demo.pos.filelist demo.neg.filelist demo.output

Here, the in-domain corpus consists of five documents related to the history of the Byzantine Empire, and the out-of-domain corpus consists of three random documents. demo.output will contain one extracted term: "拜占庭" (Byzantine).

Building the System

The system is built with Maven. In the FuseJet directory, run:

mvn package

The resulting jar file serves as FuseJet.jar in the Chinese system and as TJet.jar in the English system.

Using the Word Segmenter and Part-of-Speech Tagger

We provide a Chinese word segmenter and part-of-speech tagger, by courtesy of the Chinese Language Processing Group, Brandeis University. It is available at:

https://github.com/ivanhe/termolator/releases/download/Beta1/brandeis-segmenter-postagger.tgz

The Termolator license terms do NOT cover the word segmenter and part-of-speech tagger. Usage and license terms for the Brandeis tagger are given in the Readme.txt included in the downloaded archive.

We also provide a Python 3 script that converts the output of the word segmenter/POS tagger into the CoNLL format that our term extraction system requires. Usage:

./pos2conll.py POS_OUTPUT_DIR CONLL_OUTPUT_DIR CONLL_FILE_LIST

where POS_OUTPUT_DIR is the output directory of the Brandeis tagger, CONLL_OUTPUT_DIR is the directory in which the CoNLL-format output files are saved, and CONLL_FILE_LIST is an output file in which pos2conll.py lists the files it has written to CONLL_OUTPUT_DIR.

CONLL_FILE_LIST can then be used as the input file list for run_cn.sh
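If the CoNLL files come from a different chunker, a file list can also be written by hand. The sketch below is only an illustration, not the behavior of pos2conll.py: the function name is hypothetical, and it assumes that a file list is plain text with one file path per line, which should be checked against demo.pos.filelist before relying on it.

```python
from pathlib import Path


def write_filelist(conll_dir: str, filelist_path: str) -> int:
    """Write the path of every regular file in conll_dir to filelist_path.

    Assumption: the file list is one path per line. Paths are written
    exactly as built from conll_dir (relative or absolute, as given).
    Returns the number of paths written.
    """
    # Collect the paths first, so the list file itself is never included.
    paths = sorted(p for p in Path(conll_dir).iterdir() if p.is_file())
    with open(filelist_path, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(f"{path}\n")
    return len(paths)
```

For example, write_filelist("conll_out", "in_domain.filelist") would list every file found directly inside a (hypothetical) conll_out directory.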

Property File

The parameters in the Chinese property file are explained below:

# Note that the word lists contain words and their absolute frequencies in a news corpus. Whether
# a word/character is treated as a stop word/character is determined by the thresholds below
stopWordListName = data/CN.nw.wordlist.txt
endWordListName = data/CN.endlist.txt
forbiddenCharListName = data/CN.charlist.txt
# words with frequency higher than this threshold will be filtered out
stopThreshold = 50
# words containing characters with frequency higher than this threshold will be filtered out
forbiddenThreshold = 800
#
# The following 3 parameters are currently hard-coded in the system. The values in the properties file
# are not used. The hard-coded values are minAV=3 minCount=5 minDocumentCount=3
# This behavior can be changed in the constructor of ChineseTypedTermFilter
# 1) Threshold for the accessor variety (AV) statistic. Terms with AV less than this will be filtered out
# See Feng, Chen, Deng, and Zheng (2004): Accessor Variety Criteria for Chinese Word Extraction.
# Computational Linguistics 30 (1)
minAV = 5
# 2) Minimum absolute count for a term to be included in the output
minCount = 3
# 3) Terms that appear in fewer than the threshold number of documents will be filtered out
minDocumentCount = 5
# 
# Fraction of all terms to output (0.6 means that the top 60% of all unfiltered candidates will be returned)
terminologyThreshold = 0.6
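To make the effect of terminologyThreshold concrete, here is a minimal Python sketch that reads key = value settings like the ones above and keeps the top fraction of a ranked candidate list. It is not the system's own code: the function names are hypothetical, the parser handles only simple key = value lines and # comments, and rounding the cut-off down is an assumption, since the description above does not say how a fractional count is rounded.

```python
import math
from typing import Dict, List


def load_properties(text: str) -> Dict[str, str]:
    """Parse simple 'key = value' lines, skipping blank lines and '#' comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def keep_top_fraction(ranked_terms: List[str], fraction: float) -> List[str]:
    """Return the leading `fraction` of a list that is already ranked best-first.

    Assumption: the number of terms kept is rounded down.
    """
    keep = math.floor(len(ranked_terms) * fraction)
    return ranked_terms[:keep]


config_text = """
# words with frequency higher than this threshold will be filtered out
stopThreshold = 50
terminologyThreshold = 0.6
"""

props = load_properties(config_text)
threshold = float(props["terminologyThreshold"])
# Placeholder candidates standing in for ranked, unfiltered terms.
candidates = ["term1", "term2", "term3", "term4", "term5"]
print(keep_top_fraction(candidates, threshold))
```

With five ranked candidates and a threshold of 0.6, the sketch keeps the first three.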

Authors

Termolator was developed by Adam Meyers, Yifan He, Zachary Glass and Shasha Liao. The English version is available at: https://github.com/AdamMeyers/The_Termolator

The code for the Chinese terminology extractor and the English part-of-speech tagger in this git repository was developed by Yifan He and Shasha Liao.

We thank the Chinese Language Processing Group at Brandeis University and Prof. Nianwen Xue for providing the Chinese word segmenter and part-of-speech tagger.