Diffstat (limited to 'examples/phrasebook/doc-phrasebook.html')
| -rw-r--r-- | examples/phrasebook/doc-phrasebook.html | 688 |
1 files changed, 0 insertions, 688 deletions
diff --git a/examples/phrasebook/doc-phrasebook.html b/examples/phrasebook/doc-phrasebook.html deleted file mode 100644 index a6b42a255..000000000 --- a/examples/phrasebook/doc-phrasebook.html +++ /dev/null @@ -1,688 +0,0 @@ -<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> -<HTML> -<HEAD> -<META NAME="generator" CONTENT="http://txt2tags.sf.net"> -<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8"> -<TITLE>MOLTO Multilingual Phrasebook</TITLE> -</HEAD><BODY BGCOLOR="white" TEXT="black"> -<P ALIGN="center"><CENTER><H1>MOLTO Multilingual Phrasebook</H1> -<FONT SIZE="4"> -<I>Krasimir Angelov, Olga Caprotti, Ramona Enache, Thomas Hallgren, Inari Listenmaa, Aarne Ranta, Jordi Saludes, Adam Slaski</I><BR> -Showcase for project FP7-ICT-247914, Deliverable D10.2. -</FONT></CENTER> - -<P></P> -<HR NOSHADE SIZE=1> -<P></P> - <UL> - <LI><A HREF="#toc1">Purpose</A> - <LI><A HREF="#toc2">Points illustrated</A> - <UL> - <LI><A HREF="#toc3">From the user perspective</A> - <LI><A HREF="#toc4">From the programmer's perspective</A> - </UL> - <LI><A HREF="#toc5">Files</A> - <UL> - <LI><A HREF="#toc6">Grammars</A> - <LI><A HREF="#toc7">Ontology</A> - <LI><A HREF="#toc8">Run-time system and user interface</A> - </UL> - <LI><A HREF="#toc9">Effort and cost</A> - <LI><A HREF="#toc10">Example-based grammar writing prototype</A> - <LI><A HREF="#toc11">To Do</A> - <LI><A HREF="#toc12">How to contribute</A> - <LI><A HREF="#toc13">Conclusions (tentative)</A> - <LI><A HREF="#toc14">Acknowledgements</A> - </UL> - -<P></P> -<HR NOSHADE SIZE=1> -<P></P> -<P> -<HR> -<font size=-1> -</P> -<P> -History -</P> -<UL> -<LI>1 September. Version 1.1: bug fixes, some new constructions. -<LI>2 June. Version 1.0 released! -<LI>29 May. Link to Google translate with the current language pair and phrase. -<LI>27 May. Polish added. -<LI>26 May. Version 0.9: - Catalan added, mass/count noun distinction to reduce overgeneration, - improved web interface. -<LI>20 May. 
Version 0.8: - Spanish added, Bulgarian complete. -<LI>9 May. Version 0.7: - Danish and Norwegian added (preliminary versions induced from statistical models - and resource grammars). -<LI>3 May. Version 0.6: - Extended API (now final for release), Dutch added; new user interface with text - input enabled. -<LI>10 April. Some additions in API, comments in implementation; regenerated clones. -<LI>8 April. Added German. -<LI>7 April. Added the Clone script, applied to initiate the rest of MOLTO languages. -<LI>6 April. Version 0.4: weekdays, nationalities -<LI>30 March. Version 0.3: disambiguation grammar for English -<LI>28 March. Version 0.2: Swe, Ita; cat Action; small phrases. -<LI>26 March 2010. Version 0.1: Eng, Fin, Fre, Ron; dedicated minibar UI. -</UL> - -<P> -<A HREF="missing.txt">Missing constructs</A> -</P> -<P> -<A HREF="http://www.grammaticalframework.org/demos/phrasebook/">Back to the phrasebook</A> -</P> -<P> -</font> -<HR> -</P> -<A NAME="toc1"></A> -<H1>Purpose</H1> -<P> -This phrasebook is a program for translating touristic phrases -between 14 European languages included in the -<A HREF="http://www.molto-project.eu">MOLTO</A> project -(Multilingual On-Line Translation): -</P> -<UL> -<LI>Bulgarian, Catalan, Danish, Dutch, English, - Finnish, French, German, Italian, Norwegian, - Polish, Romanian, Spanish, Swedish -</UL> - -<P> -A Russian version is not yet finished but is projected later. Also other languages may be added. -</P> -<P> -The phrasebook is implemented by using the GF programming language -(<A HREF="http://grammaticalframework.org">Grammatical Framework</A>). -It is the first demo for the MOLTO project, released in the third month (by June 2010). -The first version is a very small system, but it will extended in the course of the project. 
-</P> -<P> -The phrasebook has the following requirement specification: -</P> -<UL> -<LI>high quality: reliable translations to express yourself in any of the languages -<LI>translation between all pairs of languages -<LI>runnable in web browsers -<LI>runnable on mobile phones (via web browser; Android stand-alone forthcoming) -<LI>easily extensible by new words (forthcoming: semi-automatic extensions by users) -</UL> - -<P> -The phrasebook is available as open-source software, licensed under GNU LGPL. -The source code resides in -<A HREF="http://www.grammaticalframework.org/examples/phrasebook/"><CODE>www.grammaticalframework.org/examples/phrasebook/</CODE></A> -</P> -<A NAME="toc2"></A> -<H1>Points illustrated</H1> -<A NAME="toc3"></A> -<H2>From the user perspective</H2> -<P> -Interlingua-based translation -</P> -<UL> -<LI>we translate meanings, rather than words -</UL> - -<P> -Incremental parsing -</P> -<UL> -<LI>the user is at every point guided by the list of possible next words -</UL> - -<P> -Mixed modalities -</P> -<UL> -<LI>selection of words ("fridge magnets") combined with text input -</UL> - -<P> -Quasi-incremental translation: many basic types are also used as phrases -</P> -<UL> -<LI>one can translate both words and complete sentences, and get intermediate results -</UL> - -<P> -Disambiguation, esp. 
of politeness distinctions -</P> -<UL> -<LI>if a phrase has many translations, each of them is shown and given an explanation - (currently just in English, later in any source language) -</UL> - -<P> -Fall-back to statistical translation -</P> -<UL> -<LI>currently just a link to Google translate (forthcoming: tailor-made statistical models) -</UL> - -<P> -Feed-back from users -</P> -<UL> -<LI>users are welcomed to send comments, bug reports, and better translation suggestions -</UL> - -<A NAME="toc4"></A> -<H2>From the programmer's perspective</H2> -<P> -The use of resource grammars and functors -</P> -<UL> -<LI>the translator was implemented on top of an earlier linguistic knowledge base, - the <A HREF="http://www.grammaticalframework.org/lib">GF Resource Grammar Library</A> -</UL> - -<P> -Example-based grammar writing and grammar induction from statistical models -(<A HREF="http://translate.google.com">Google translate</A>) -</P> -<UL> -<LI>many of the grammars were created semi-automatically by generalization from - examples -</UL> - -<P> -Compile-time transfer: especially, in Action in Words -</P> -<UL> -<LI>the structural differences between languages are treated at compile time, - for maximal run-time efficiency -</UL> - -<P> -The level of skills involved in grammar development -</P> -<UL> -<LI>testing different configurations (see table below) -</UL> - -<P> -Grammar testing -</P> -<UL> -<LI>use of treebanks with guided random generation for initial evaluation and regression testing -</UL> - -<A NAME="toc5"></A> -<H1>Files</H1> -<A NAME="toc6"></A> -<H2>Grammars</H2> -<P> -<CODE>Sentences</CODE>: general syntactic structures implementable in a uniform way. -Concrete syntax via the functor <CODE>SencencesI</CODE>. -</P> -<P> -<CODE>Words</CODE>: words and predicates, typically language-dependent. -Separate concrete syntaxes. -</P> -<P> -<CODE>Greetings</CODE>: idiomatic phrases, string-based. -Separate concrete syntaxes. 
-</P> -<P> -<CODE>Phrasebook</CODE>: the top module putting everything together. -Separate concrete syntaxes. -</P> -<P> -<CODE>DisambPhrasebook</CODE>: disambiguation grammars generating feedback phrases if -the input language is ambiguous. -</P> -<P> -<CODE>Numeral</CODE>: resource grammar module directly inherited from the library. -</P> -<P> -Here is the module structure as produced in GF by -</P> -<PRE> - > i -retain DisambPhrasebookEng.gf - > dg -only=Phrasebook*,Sentences*,Words*,Greetings*,Numeral,NumeralEng,DisambPhrasebookEng - > ! dot -Tpng _gfdepgraph.dot >pgraph.png -</PRE> -<P></P> -<P> -<IMG ALIGN="middle" SRC="pgraph.png" BORDER="0" ALT=""> -</P> -<A NAME="toc7"></A> -<H2>Ontology</H2> -<P> -The abstract syntax defines the <B>ontology</B> behind the phrasebook. -Some explanations can be found in the -<A HREF="Ontology.html">ontology document</A>, which is produced from the -abstract syntax files -<A HREF="http://www.grammaticalframework.org/examples/phrasebook/Sentences.gf"><CODE>Sentences.gf</CODE></A> -and -<A HREF="http://www.grammaticalframework.org/examples/phrasebook/Words.gf"><CODE>Words.gf</CODE></A> -by <CODE>make doc</CODE>. -</P> -<A NAME="toc8"></A> -<H2>Run-time system and user interface</H2> -<P> -The phrasebook uses -the -<A HREF="http://code.google.com/p/grammatical-framework/wiki/LaunchWebDemos">PGF server</A> -written in Haskell and the -<A HREF="http://www.grammaticalframework.org/demos/minibar/about.html">minibar library</A> -written in JavaScript. Since the sources of these systems are available, anyone can build the phrasebook -locally on her own computer. 
-</P> -<A NAME="toc9"></A> -<H1>Effort and cost</H1> -<TABLE BORDER="1" CELLPADDING="4"> -<TR> -<TH>Language</TH> -<TH>Grammarian's language skills</TH> -<TH>Grammarian's GF skills</TH> -<TH>Informant used for development</TH> -<TH>Informant used for testing</TH> -<TH>Use of external tools</TH> -<TH>Impact of external tools</TH> -<TH>Changes on the resource grammar</TH> -<TH COLSPAN="2">Development time</TH> -</TR> -<TR> -<TD>Bulgarian</TD> -<TD ALIGN="center">###</TD> -<TD ALIGN="center">###</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">?</TD> -<TD ALIGN="center">#</TD> -<TD ALIGN="center">##</TD> -</TR> -<TR> -<TD>Catalan</TD> -<TD ALIGN="center">###</TD> -<TD ALIGN="center">###</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">?</TD> -<TD ALIGN="center">#</TD> -<TD ALIGN="center">#</TD> -</TR> -<TR> -<TD>Danish</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">###</TD> -<TD ALIGN="center">+</TD> -<TD ALIGN="center">+</TD> -<TD ALIGN="center">+</TD> -<TD ALIGN="center">##</TD> -<TD ALIGN="center">#</TD> -<TD ALIGN="center">##</TD> -</TR> -<TR> -<TD>Dutch</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">###</TD> -<TD ALIGN="center">+</TD> -<TD ALIGN="center">+</TD> -<TD ALIGN="center">+</TD> -<TD ALIGN="center">##</TD> -<TD ALIGN="center">#</TD> -<TD ALIGN="center">##</TD> -</TR> -<TR> -<TD>English</TD> -<TD ALIGN="center">##</TD> -<TD ALIGN="center">###</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">+</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">_</TD> -<TD ALIGN="center">#</TD> -</TR> -<TR> -<TD>Finnish</TD> -<TD ALIGN="center">###</TD> -<TD ALIGN="center">###</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">?</TD> -<TD ALIGN="center">#</TD> -<TD ALIGN="center">##</TD> -</TR> -<TR> -<TD>French</TD> -<TD ALIGN="center">##</TD> -<TD 
ALIGN="center">###</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">+</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">?</TD> -<TD ALIGN="center">#</TD> -<TD ALIGN="center">#</TD> -</TR> -<TR> -<TD>German</TD> -<TD ALIGN="center">#</TD> -<TD ALIGN="center">###</TD> -<TD ALIGN="center">+</TD> -<TD ALIGN="center">+</TD> -<TD ALIGN="center">+</TD> -<TD ALIGN="center">##</TD> -<TD ALIGN="center">##</TD> -<TD ALIGN="center">###</TD> -</TR> -<TR> -<TD>Italian</TD> -<TD ALIGN="center">###</TD> -<TD ALIGN="center">#</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">?</TD> -<TD ALIGN="center">##</TD> -<TD ALIGN="center">##</TD> -</TR> -<TR> -<TD>Norwegian</TD> -<TD ALIGN="center">#</TD> -<TD ALIGN="center">###</TD> -<TD ALIGN="center">+</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">+</TD> -<TD ALIGN="center">##</TD> -<TD ALIGN="center">#</TD> -<TD ALIGN="center">##</TD> -</TR> -<TR> -<TD>Polish</TD> -<TD ALIGN="center">###</TD> -<TD ALIGN="center">###</TD> -<TD ALIGN="center">+</TD> -<TD ALIGN="center">+</TD> -<TD ALIGN="center">+</TD> -<TD ALIGN="center">#</TD> -<TD ALIGN="center">#</TD> -<TD ALIGN="center">##</TD> -</TR> -<TR> -<TD>Romanian</TD> -<TD ALIGN="center">###</TD> -<TD ALIGN="center">###</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">+</TD> -<TD ALIGN="center">#</TD> -<TD ALIGN="center">###</TD> -<TD ALIGN="center">###</TD> -</TR> -<TR> -<TD>Spanish</TD> -<TD ALIGN="center">##</TD> -<TD ALIGN="center">#</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">?</TD> -<TD ALIGN="center">_</TD> -<TD ALIGN="center">##</TD> -</TR> -<TR> -<TD>Swedish</TD> -<TD ALIGN="center">##</TD> -<TD ALIGN="center">###</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">+</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">?</TD> -<TD ALIGN="center">-</TD> -<TD ALIGN="center">##</TD> -</TR> -</TABLE> - -<P> -Explanation on scores 
-</P> -<UL> -<LI>Grammarian's language skills - <UL> - <LI>- : no skills - <LI># : passive knowledge - <LI>## : fluent non-native - <LI>### : native speaker - </UL> -</UL> - -<UL> -<LI>Grammarian's GF skills - <UL> - <LI>- : no skills - <LI># : basic skills (2-day GF tutorial) - <LI>## : medium skills (previous experience of similar task) - <LI>### : advanced skills (resource grammar writer/substantial contributor) - </UL> -</UL> - -<UL> -<LI>Informant used for development/Informant needed for testing/Use of external tools - <UL> - <LI>- : no - <LI>+ : yes - </UL> -</UL> - -<UL> -<LI>Impact of external tools - <UL> - <LI>? : not investigated - <LI>- : no effect on the Phrasebook - <LI># : small impact (literal translation, simple idioms) - <LI>## : medium effect (translation of more forms of words, contextual preposition) - <LI>### : great effect (no extra work needed, translations are correct) - </UL> -</UL> - -<UL> -<LI>Changes on the resource grammars - <UL> - <LI>- : no changes - <LI># : 1-3 minor changes - <LI>## : 4-10 minor changes, 1-3 medium changes - <LI>### : >10 changes of any kind - </UL> -</UL> - -<UL> -<LI>Overall effort (including extra work on resource grammars) - <UL> - <LI># : less than 8 person hours - <LI>## : 8-24 person hours - <LI>### : >24 person hours - </UL> -</UL> - -<A NAME="toc10"></A> -<H1>Example-based grammar writing prototype</H1> -<P> -The figure presents the process of creating a Phrasebook using an example-based -approach for the language X, where X = {Danish, Dutch, German, Norwegian}. -</P> -<P> -<IMG ALIGN="middle" SRC="picpic.jpg" BORDER="0" ALT=""> -</P> -<UL> -<LI>the first step assumes an analysis of the resource grammar and extracts the necessary - information that functions that build new lexical entries would need. - A model is built so that the proper forms of the word can be rendered, - and additional information, such as gender, can be inferred. 
The script applies - these rules to each entry that we want to translate into the target language, and - one obtains a set of constructions. -<LI>they are furthermore given to an external translator tool (Google translate) - or a native speaker for translation. One needs the configuration file even if the - translator is human, because formal knowledge of grammar is not assumed. -<LI>the translations into the target language are further more processed in order to - build the linearizations of the categories first, decoding the information received. - Furthermore, having the words in the lexicon, one can parse the translations of - functions with the GF parser and generalize from that. -<LI>the resulting grammar is tested with the aid of a script that generates - constructions covering all the functions and categories from the grammar, along - with some other constructions that proved to be problematic in some language. - The result of the script contains for each construction in the target language - its English correspondent and the abstract syntax tree. A native speaker - evaluates the results and if corrections are needed, the algorithm runs again - with the new examples. Depending on the language skills of the grammar writer, - the changes can be made directly into the GF files, and the correct examples - given by the native informant are just kept for validating the results. - The algorithm is repeated as long as corrections are needed. -</UL> - -<P> -The time needed for preparing the configuration files for a grammar will not be needed -in the future, since the files are reusable for other applications. -The time for the second step can be saved if automatic tools, like Google translate -are used. This is only possible in languages with a simpler morphology and syntax -and large corpora available. 
-Good results were obtained for German and Dutch with Google translate, but for -languages like Romanian or Polish, which are both complex and lack enough resources, -the results are discouraging. -</P> -<P> -If the statistical oracle works well, the only step where the presence of a human -translator is needed is the evaluation and feedback step. An average of 4 hours per -round and 2 rounds were needed in average for the languages for which we performed -the experiment. It is possible that more effort is needed for more complex languages. -</P> -<A NAME="toc11"></A> -<H1>To Do</H1> -<P> -Disambiguation grammars for other languages than English -</P> -<P> -Extend the abstract lexicon in <CODE>Words</CODE> by hand or (semi)automatically for -</P> -<UL> -<LI>food stuff -<LI>places -<LI>actions -</UL> - -<P> -Customizable phone distribution: make your own selection of the 2^15 language subsets -when downloading the phrasebook to a phone -</P> -<A NAME="toc12"></A> -<H1>How to contribute</H1> -<P> -The basic things "everyone" can do is -</P> -<UL> -<LI>complete <A HREF="missing.txt">missing words</A> in concrete syntaxes -<LI>add new abstract words in <CODE>Words</CODE> and greetings in <CODE>Greetings</CODE> -</UL> - -<P> -The missing concrete syntax entries are added to the <CODE>Words</CODE><I>L</I><CODE>.gf</CODE> -files for each language <I>L</I>. The -<A HREF="http://www.grammaticalframework.org/lib/doc/synopsis.html#toc78">morphological paradigms</A> -of the GF resource library should be used. Actions (prefixed with <CODE>A</CODE>, as <CODE>AWant</CODE>) are -a little more demanding, since they also require syntax constructors. Greetings (prefixed -with <CODE>G</CODE>) are pure strings. 
-</P> -<P> -Some explanations can be found in the -<A HREF="Implementation.html">implementation document</A>, which is produced from the -concrete syntax files -<A HREF="http://www.grammaticalframework.org/examples/phrasebook/SentencesI.gf"><CODE>SentencesI.gf</CODE></A> -and -<A HREF="http://www.grammaticalframework.org/examples/phrasebook/WordsEng.gf"><CODE>WordsEng.gf</CODE></A> -by <CODE>make doc</CODE>. -</P> -<P> -Here are the steps to follow for contributors: -</P> -<OL> -<LI>Make sure you have the latest sources - from <A HREF="http://www.grammaticalframework.org/doc/gf-developers.html">GF Darcs</A>, - using <CODE>darcs pull</CODE>. -<LI>Also make sure that you have compiled the library by <CODE>make present</CODE> in <CODE>gf/lib/src/</CODE>. -<LI>Work in the directory - <A HREF="http://www.grammaticalframework.org/examples/phrasebook/"><CODE>gf/examples/phrasebook/</CODE></A>. -<LI>After you've finished your contribution, recompile the phrasebook by <CODE>make pgf</CODE>. -<LI>Save your changes in <CODE>darcs record .</CODE> (in the <CODE>phrasebook</CODE> subdirectory). -<LI>Make a patch file with <CODE>darcs send -o my_phrasebook_patch</CODE>, which you can - send to GF maintainers. -<LI>(Recommended:) Test the phrasebook on your local server: - <OL> - <LI>Go to <CODE>gf/src/server/</CODE> and follow the instructions in the - <A HREF="http://code.google.com/p/grammatical-framework/wiki/LaunchWebDemos">project Wiki</A>. - <LI>Make sure that <CODE>Phrasebook.pgf</CODE> is available to you GF server (see project wiki). - <LI>Launch <CODE>lighttpd</CODE> (see project wiki). - <LI>How you can open <CODE>gf/examples/phrasebook/www/phrasebook.html</CODE> and use your phrasebook! - </OL> -</OL> - -<UL> -<LI>Don't delete anything! But you are free to correct incorrect forms. -<LI>Don't change the module structure! 
-<LI>Don't compromise quality to gain coverage: <I>non multa sed multum!</I> -</UL> - -<A NAME="toc13"></A> -<H1>Conclusions (tentative)</H1> -<P> -The grammarian need not be a native speaker of the language. -</P> -<P> -For many languages, the grammarian need not even know the language - native informants are -enough. -</P> -<P> -However, evaluation by native speakers is necessary. -</P> -<P> -Correct and idiomatic translations are possible. -</P> -<P> -A typical development time was 2-3 person working days per language. -</P> -<P> -Google translate helps in bootstrapping grammars, but must be checked. -</P> -<UL> -<LI>in particular, unreliable for morphologically rich languages -</UL> - -<P> -Resource grammars should give some more support -</P> -<UL> -<LI>higher-level access to constructions like negative expressions -<LI>large-scale morphological lexica -</UL> - -<A NAME="toc14"></A> -<H1>Acknowledgements</H1> -<P> -The Phrasebook has been built in the MOLTO project funded by the European Commission. -</P> -<P> -The authors are grateful to their native speaker informants helping to bootstrap and evaluate -the grammars: -Richard Bubel, -Grégoire Détrez, -Rise Eilert, -Karin Keijzer, -Michał Pałka, -Willard Rafnsson, -Nick Smallbone. -</P> - -<!-- html code generated by txt2tags 2.5 (http://txt2tags.sf.net) --> -<!-- cmdline: txt2tags -thtml -\-toc doc-phrasebook.txt --> -</BODY></HTML> |
