author prasanth.kolachina <prasanth.kolachina@cse.gu.se> 2015-04-22 13:14:26 +0000
committer prasanth.kolachina <prasanth.kolachina@cse.gu.se> 2015-04-22 13:14:26 +0000
commit 57006b6296c271bff657be48962fafc5dd207c98
tree 8ae5a313f53636bb252a6c8996da6b6f593eebbd /src/runtime/python/examples/README
parent c3a626686ed66de161706edc4da62721a4c61193
README for Python translation pipeline
Diffstat (limited to 'src/runtime/python/examples/README')
-rw-r--r-- src/runtime/python/examples/README 167
1 files changed, 167 insertions, 0 deletions
diff --git a/src/runtime/python/examples/README b/src/runtime/python/examples/README
new file mode 100644
index 000000000..b6791a368
--- /dev/null
+++ b/src/runtime/python/examples/README
@@ -0,0 +1,167 @@
+~runtime/python/examples/README
+
+(c) Prasanth Kolachina, 22 April 2015
+
+======================
+TRANSLATION PIPELINE
+======================
+
+The module translation_pipeline.py is a Python replica of the
+translation pipeline used in the Wide-coverage Translation demo.
+The pipeline allows for
+ 1. simultaneous batch translation from one language into multiple languages
+ 2. K-best translations
+ 3. translation of both text files and SGM files.
+
+The module defines the standard lexer used in the pipeline and
+the callbacks used in robust parsing to partially deal with unknown
+words, proper nouns, etc.
+
+Basic example usage:
+> python translation_pipeline.py -g TranslateEngFin.pgf -s Eng -t Fin -i <input-file> -e <exp-directory>
+> python translation_pipeline.py -g TranslateEngFin.pgf -s Eng -t Fin -K 20 -i <input-file> -e <exp-directory>
+> python translation_pipeline.py -g Translate11.pgf -s Eng -t Fin Swe Ger -i <input-file> -e <exp-directory>
+> python translation_pipeline.py -g TranslateEngFin.pgf -s Eng -t Fin -f sgm -i <sgm-input-file> -e <exp-directory>
+
+The full list and description of options accepted by the translation_pipeline
+module can be seen using the -h option.
+
+> python translation_pipeline.py -h
+———
+usage: translation_pipeline.py [-h] -g PGFFILE [-s SRCLANG]
+ [-t [TGTLANGS [TGTLANGS ...]]] [-i INPUT]
+ [-e EXP_DIRECTORY] [-f {txt,sgm}]
+ [-p PROPSFILE] [-K BESTK]
+
+Run the GF translation pipeline on standard test-sets
+
+optional arguments:
+ -h, --help show this help message and exit
+ -g PGFFILE, --pgf PGFFILE
+ PGF grammar file to run the pipeline
+ -s SRCLANG, --source SRCLANG
+ Source language of input sentences
+ -t [TGTLANGS [TGTLANGS ...]], --target [TGTLANGS [TGTLANGS ...]]
+ Target languages to linearize (default is all other
+ languages)
+ -i INPUT, --input INPUT
+ input file (default will accept STDIN)
+ -e EXP_DIRECTORY, --exp EXP_DIRECTORY
+ experiement directory to write translation files
+ -f {txt,sgm}, --format {txt,sgm}
+ input file format (output files will be written in the
+ same format)
+ -p PROPSFILE, --props PROPSFILE
+ properties file for the translation pipeline (specify
+ the above arguments in a file)
+ -K BESTK K value for K-best translation
+
+
+======================
+PREREQUISITES
+======================
+In order to use the examples in this directory, the following components
+are required:
+ 1. GF C runtime (~runtime/c/)
+ 2. Python bindings to the C runtime (~runtime/python/)
+ 3. The path to the Python library must be added to the PYTHONPATH
+ environment variable
+ (Note: by default, setuptools installs the bindings to a location
+ available for everyone, so this step is only required if you have
+ done a custom installation of the Python bindings and you know what
+ you are doing)
+> export PYTHONPATH="$(echo "$GF"/src/runtime/python/build/lib.*):$PYTHONPATH"
+
+
+======================
+WEB GF PARSING
+======================
+NEW!!!
+We currently parse large web texts using GF grammars. The same
+functions described in gf_utils.py are used, but we make it faster
+by parsing in parallel. The multiprocessing module in
+Python allows for trivial parallelization, where each batch of sentences
+is parsed by a different worker in the pool.
+
+We noticed one thing during these experiments: the GF parser can
+take an unusually long time for long and ambiguous sentences. Therefore,
+to avoid resource starvation, we use a `timeout' setting to raise a
+PgfParseError if parsing a single sentence takes more than 5 minutes.
+With this simple trick, we manage to parse large corpora (Europarl
+texts) in both English and Swedish. Please contact
+prasanth.kolachina@cse.gu.se
+if you have any questions about this.
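The idea of bounding the time spent on each sentence can be sketched as follows. This is only an illustration under stated assumptions, not the code used for the experiments: it uses ThreadPool from the standard multiprocessing.pool module and simply stops waiting for a slow sentence, rather than raising PgfParseError inside the parser. The names parse_all and toy_parse are hypothetical, and toy_parse stands in for a real GF parse call.

```python
import multiprocessing
import time
from multiprocessing.pool import ThreadPool


def parse_all(parse_fn, sentences, timeout, workers=4):
    """Parse sentences in parallel, giving up on slow ones.

    Returns a list of (sentence, result) pairs. The result is None when
    the parse was not ready within `timeout` seconds of being waited on.
    """
    results = []
    with ThreadPool(workers) as pool:
        # Submit everything first so the workers run concurrently.
        jobs = [(s, pool.apply_async(parse_fn, (s,))) for s in sentences]
        for sentence, job in jobs:
            try:
                results.append((sentence, job.get(timeout=timeout)))
            except multiprocessing.TimeoutError:
                # The worker is not interrupted; we only stop waiting.
                results.append((sentence, None))
    return results


def toy_parse(sentence):
    """Hypothetical parser: slow on long input, like an ambiguous sentence."""
    if len(sentence.split()) > 5:
        time.sleep(0.5)
    return "(Phr %s)" % sentence
```

Calling parse_all(toy_parse, sentences, timeout=300) would mirror the 5 minute limit described above. Note that this sketch measures the timeout from the moment the result is waited on, so it is looser than a limit enforced inside the parser itself.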
+
+
+======================
+GENERIC GF UTILITIES
+======================
+
+The module gf_utils.py contains functions to carry out four
+basic tasks:
+1. 1-best parsing (parse)
+2. K-best parsing (kparse)
+3. 1-best linearization (linearize)
+4. K-best linearization (klinearize)
+
+> usage: gf_utils.py [-h] {parse,kparse,linearize,klinearize} ...
+
+Detailed arguments for each function can be found using the "-h" option.
+An exhaustive list of options for all the functions is given below. Options
+marked with (*) are required, the others are optional.
+
+(*) -g/--pgf <pgf-file> PGF Grammar file
+(*) -s/--src-lang <lang> Source language name, i.e. the code used in GF (e.g. TranslateEng, TranslateFin). For parsing, the option specifies the language of the input sentences.
+(*) -t/--tgt-lang <lang> Target language name. For linearization, the option specifies the language into which the abstract trees are linearized.
+(*) -K <int> Prespecified K value for K-best parsing and linearization.
+-i/--input <filename> Input file name, either raw text sentences or abstract trees for linearization.
+-o/--output <filename> Output file name.
+-p/--start-sym <sym> Start symbol used for parsing
+
+Basic example usage:
+> python gf_utils.py parse -g TranslateEng.pgf -s TranslateEng -i <input-file> -o <parse-output-file> [-p Phr]
+> python gf_utils.py kparse -g TranslateEng.pgf -s TranslateEng -K 20 -i <input-file> -o <kparse-output-file> [-p Phr]
+> python gf_utils.py linearize -g TranslateFin.pgf -t TranslateFin -i <parse-output-file> -o <output-file>
+> python gf_utils.py klinearize -g TranslateFin.pgf -t TranslateFin -i <kparse-output-file> -o <output-file>
+
+======================
+FILE I/O FORMATS
+======================
+
+1. One-sentence-per-line
+The input sentences to the parser/kparser are written one sentence
+per line. This is also the standard format used in the translation
+pipeline.
+
+2. SGM format
+The translation pipeline accepts SGM format files as both input
+and output. The format is specifically used by automatic
+evaluation metrics that measure the quality of MT systems, primarily
+in the NIST evaluations and the WMT Shared Task evaluations.
+
+3. Parser output format
+The parser writes one line per sentence, with four columns separated
+by the <tab>-character: the sentence index, the time taken by the
+parser, the tree probability value and the abstract syntax tree.
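A minimal sketch of reading one line of this four-column output is given below. The names ParseRecord and read_parse_line are hypothetical, and the sketch assumes that the time and probability columns are plain decimal numbers; the sentence index is kept as a string because its exact form is not specified here.

```python
from collections import namedtuple

# Hypothetical record type for one line of parser output.
ParseRecord = namedtuple("ParseRecord", "index seconds prob tree")


def read_parse_line(line):
    """Split one tab-separated parser output line into its four columns."""
    # Split at most three times so the tree column is kept whole.
    index, seconds, prob, tree = line.rstrip("\n").split("\t", 3)
    # Assumption: time and probability are written as decimal numbers.
    return ParseRecord(index, float(seconds), float(prob), tree)
```

A file written by the parser could then be read with a loop such as `for line in open(path): rec = read_parse_line(line)`.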
+
+4. K-best parser output format
+The k-best parser uses a representation that has come to be called
+CJ (Charniak-Johnson) format in the parsing community.
+ a. The output consists of parsed blocks for each sentence. Two blocks
+ are separated by an empty line.
+ b. The first line in the block contains two numbers: the number
+ of parses in that block, and an identifier for that sentence.
+ c. Each subsequent pair of lines contains the log probability of the
+ abstract tree in one line followed by the actual parse tree in the
+ next line.
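The block structure in (a) to (c) can be read back with a short function like the one below. This is a sketch based only on the description above: the names read_kbest_parses and _parse_block are hypothetical, and it assumes the two numbers on the header line are separated by whitespace.

```python
def _parse_block(block):
    """Turn the lines of one block into (sentence_id, [(logprob, tree), ...])."""
    # Assumption: the header holds the parse count and the sentence
    # identifier, separated by whitespace.
    count, sent_id = block[0].split(None, 1)
    body = block[1:]
    # Lines come in pairs: log probability first, then the tree.
    parses = [(float(body[i]), body[i + 1]) for i in range(0, len(body) - 1, 2)]
    if len(parses) != int(count):
        raise ValueError("block %s: expected %s parses, found %d"
                         % (sent_id, count, len(parses)))
    return sent_id, parses


def read_kbest_parses(lines):
    """Yield one (sentence_id, parses) pair per blank-line-separated block."""
    block = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            block.append(line)
        elif block:
            yield _parse_block(block)
            block = []
    if block:
        # The last block may not be followed by an empty line.
        yield _parse_block(block)
```

Because the function only needs an iterable of lines, it can be given an open file object directly.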
+
+5. K-best translations output format
+The k-best linearizer and k-best translation use the same format as
+Moses and other SMT toolkits to write K-best translation lists.
+ a. The output consists of translation blocks for each sentence.
+ b. Each block consists of several translations, one per line.
+ c. Each translation (or line) consists of four columns separated by
+ the '|||' string. The first column contains a sentence identifier,
+ the second column is the actual translation, followed by
+ word-alignment information between the input sentence and the
+ translation and the scores from statistical models used in parsing.
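As an illustration of the four-column layout described in (c), the sketch below splits each line on '|||' and groups the translations by sentence identifier. The names NBestEntry, read_nbest_line and group_nbest are hypothetical. The alignment and score columns are kept as raw strings, since their internal layout is not specified here.

```python
from collections import namedtuple

# Hypothetical record type for one line of a K-best translation list.
NBestEntry = namedtuple("NBestEntry", "sent_id translation alignment scores")


def read_nbest_line(line):
    """Split one K-best line into its four '|||'-separated columns."""
    fields = [field.strip() for field in line.rstrip("\n").split("|||")]
    if len(fields) != 4:
        raise ValueError("expected 4 columns, found %d" % len(fields))
    return NBestEntry(*fields)


def group_nbest(lines):
    """Map each sentence identifier to its list of entries, in file order."""
    grouped = {}
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between blocks
        entry = read_nbest_line(line)
        grouped.setdefault(entry.sent_id, []).append(entry)
    return grouped
```

For example, `group_nbest(open(path))["0"][0].translation` would give the first listed translation of the sentence with identifier 0, assuming identifiers are written that way in the file.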