From 7215dc71ff6e229878fd370ad6e68522b212f5b7 Mon Sep 17 00:00:00 2001
From: aarne
Date: Sat, 20 Sep 2008 08:51:39 +0000
Subject: new resource-howto

---
 doc/Resource-HOWTO.html | 540 ++++++++++++++++++++++++++++++------------------
 1 file changed, 340 insertions(+), 200 deletions(-)

diff --git a/doc/Resource-HOWTO.html b/doc/Resource-HOWTO.html
index 1494e404a..74e095955 100644
--- a/doc/Resource-HOWTO.html
+++ b/doc/Resource-HOWTO.html
@@ -7,17 +7,63 @@

Resource grammar writing HOWTO

Author: Aarne Ranta <aarne (at) cs.chalmers.se>
-Last update: Tue Sep 16 09:58:01 2008 +Last update: Sat Sep 20 10:40:53 2008
+

+
+

+ + +

+
+

History

-September 2008: partly outdated - to be updated for API 1.5. +September 2008: updated for Version 1.5.

-October 2007: updated for API 1.2. +October 2007: updated for Version 1.2.

January 2006: first version. @@ -32,20 +78,31 @@ will give some hints how to extend the API. A manual for using the resource grammar is found in

-http://www.cs.chalmers.se/~aarne/GF/lib/resource-1.0/doc/synopsis.html. +www.cs.chalmers.se/Cs/Research/Language-technology/GF/lib/resource/doc/synopsis.html.

A tutorial on GF, also introducing the idea of resource grammars, is found in

-http://www.cs.chalmers.se/~aarne/GF/doc/tutorial/gf-tutorial2.html. +www.cs.chalmers.se/Cs/Research/Language-technology/GF/doc/gf-tutorial.html. +

+

+This document concerns the API v. 1.5, while the current stable release is 1.4. +You can find the code for the stable release in

-This document concerns the API v. 1.0. You can find the current code in +www.cs.chalmers.se/Cs/Research/Language-technology/GF/lib/resource/

-http://www.cs.chalmers.se/~aarne/GF/lib/resource-1.0/ +and the next release in

+

+www.cs.chalmers.se/Cs/Research/Language-technology/GF/lib/next-resource/ +

+

+It is recommended to build new grammars to match the next release. +

+

The resource grammar structure

The library is divided into a bunch of modules, whose dependencies @@ -54,8 +111,11 @@ are given in the following figure.

+

+Modules of different kinds are distinguished as follows: +

-The solid ellipses show the API as visible to the user of the library. The -dashed ellipses form the main of the implementation, on which the resource -grammar programmer has to work with. With the exception of the Paradigms -module, the visible API modules can be produced mechanically. -

-

- +Put in another way:

+ +

-Thus the API consists of a grammar and a lexicon, which is -provided for test purposes. +The dashed ellipses form the main parts of the implementation, on which the resource +grammar programmer has to work. She also has to work on the Paradigms +module. The rest of the modules can be produced mechanically from corresponding +modules for other languages, by just changing the language codes appearing in +their module headers.

The module structure is rather flat: most modules are direct parents of Grammar. The idea -is that you can concentrate on one linguistic aspect at a time, or +is that the implementors can concentrate on one linguistic aspect at a time, or also distribute the work among several authors. The module Cat defines the "glue" that ties the aspects together - a type system to which all the other modules conform, so that e.g. NP means the same thing in those modules that use NPs and those that construct them.

+ +

Library API modules

+

+For the user of the library, these modules are the most important ones. +In a typical application, it is enough to open Paradigms and Syntax. +The module Try combines these two, making it possible to experiment +with combinations of syntactic and lexical constructors by using the +cc command in the GF shell. Here are short explanations of each API module: +

+ + +

Phrase category modules

-The direct parents of the top will be called phrase category modules, +The immediate parents of Grammar will be called phrase category modules, since each of them concentrates on a particular phrase category (nouns, verbs, adjectives, sentences,...). A phrase category module tells how to construct phrases in that category. You will find out that @@ -106,9 +190,10 @@ one of a small number of different types). Thus we have

  • Conjunction: coordination of phrases
  • Phrase: construction of the major units of text and speech
  • Text: construction of texts as sequences of phrases -
  • Idiom: idiomatic phrases such as existentials +
  • Idiom: idiomatic expressions such as existentials +

    Infrastructure modules

    Expressions of each phrase category are constructed in the corresponding @@ -137,6 +222,7 @@ can skip the lincat definition of a category and use the default {s : Str} until you need to change it to something else. In English, for instance, many categories do have this linearization type.

    +

    Lexical modules

    What is lexical and what is syntactic is not as clearcut in GF as in @@ -162,41 +248,42 @@ samples than complete lists. There are two such modules:

    The module Structural aims for completeness, and is likely to be extended in future releases of the resource. The module Lexicon -gives a "random" list of words, which enable interesting testing of syntax, -and also a check list for morphology, since those words are likely to include +gives a "random" list of words, which enables testing the syntax. +It also provides a check list for morphology, since those words are likely to include most morphological patterns of the language.

    In the case of Lexicon it may come out clearer than anywhere else in the API that it is impossible to give exact translation equivalents in -different languages on the level of a resource grammar. In other words, -application grammars are likely to use the resource in different ways for +different languages on the level of a resource grammar. This is no problem, +since application grammars can use the resource in different ways for different languages.

    +

    Language-dependent syntax modules

    In addition to the common API, there is room for language-dependent extensions -of the resource. The top level of each languages looks as follows (with English as example): +of the resource. The top level of each language looks as follows (with German +as example):

    -    abstract English = Grammar, ExtraEngAbs, DictEngAbs
    +    abstract AllGerAbs = Lang, ExtraGerAbs, IrregGerAbs
     

    -where ExtraEngAbs is a collection of syntactic structures specific to English, -and DictEngAbs is an English dictionary -(at the moment, it consists of IrregEngAbs, -the irregular verbs of English). Each of these language-specific grammars has +where ExtraGerAbs is a collection of syntactic structures specific to German, +and IrregGerAbs is a dictionary of irregular words of German +(at the moment, just verbs). Each of these language-specific grammars has the potential to grow into a full-scale grammar of the language. These grammars can also be used as libraries, but the possibility of using functors is lost.

    To give a better overview of language-specific structures, -modules like ExtraEngAbs +modules like ExtraGerAbs are built from a language-independent module ExtraAbs by restricted inheritance:

    -    abstract ExtraEngAbs = Extra [f,g,...]
    +    abstract ExtraGerAbs = Extra [f,g,...]
     

    Thus any category and function in Extra may be shared by a subset of all @@ -210,42 +297,15 @@ In a minimal resource grammar implementation, the language-dependent extensions are just empty modules, but it is good to provide them for the sake of uniformity.

    -

    The core of the syntax

    -

    -Among all categories and functions, a handful are -most important and distinct ones, of which the others are can be -seen as variations. The categories are -

    -
    -    Cl ; VP ; V2 ; NP ; CN ; Det ; AP ;
    -
    + +

    The present-tense fragment

    -The functions are +Some lines in the resource library are suffixed with the comment

    -    PredVP  : NP  -> VP -> Cl ;  -- predication
    -    ComplV2 : V2  -> NP -> VP ;  -- complementization
    -    DetCN   : Det -> CN -> NP ;  -- determination
    -    ModCN   : AP  -> CN -> CN ;  -- modification
    +    --# notpresent
     

    -This toy Latin grammar shows in a nutshell how these -rules relate the categories to each other. It is intended to be a -first approximation when designing the parameter system of a new -language. -

    -

    Another reduced API

    -

    -If you want to experiment with a small subset of the resource API first, -try out the module -Syntax -explained in the -GF Tutorial. -

    -

    The present-tense fragment

    -

    -Some lines in the resource library are suffixed with the comment -```--# notpresent which is used by a preprocessor to exclude those lines from a reduced version of the full resource. This present-tense-only version is useful for applications in most technical text, since @@ -254,10 +314,14 @@ be useful to exclude those lines in a first version of resource implementation. To compile a grammar with present-tense-only, use

    -    i -preproc=GF/lib/resource-1.0/mkPresent LangGer.gf
    +    make Present
     
    -

    +

    +with resource/Makefile. +
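To make the effect of the marker concrete, here is a minimal sketch of such a filter. It is not the preprocessor that the library actually ships; it only assumes that a line is dropped whenever it carries the `--# notpresent` comment. The file `Demo.gf` and its two `oper` lines are hypothetical stand-ins for real resource code.

```shell
#!/bin/sh
# Illustration only: NOT the actual preprocessor of the resource library.
# Assumption: a line is excluded if it contains the comment "--# notpresent".
workdir=$(mktemp -d)

# Hypothetical GF fragment; the second line is marked as not part of
# the present-tense-only version.
cat > "$workdir/Demo.gf" <<'EOF'
oper present : Str = "walks" ;
oper past : Str = "walked" ; --# notpresent
EOF

# Keep every line that does not carry the marker.
# -F treats the pattern as a fixed string; -e is needed because it starts with "--".
grep -v -F -e '--# notpresent' "$workdir/Demo.gf" > "$workdir/DemoPresent.gf"

cat "$workdir/DemoPresent.gf"
```

Only the unmarked line survives, which is why marking past-tense rules this way is enough to obtain a reduced grammar without maintaining a second copy of the source.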

    +

    Phases of the work

    +

    Putting up a directory

    Unless you are writing an instance of a parametrized implementation @@ -265,7 +329,8 @@ Unless you are writing an instance of a parametrized implementation simplest way is to follow roughly the following procedure. Assume you are building a grammar for the German language. Here are the first steps, which we actually followed ourselves when building the German implementation -of resource v. 1.0. +of resource v. 1.0 on Ubuntu Linux. We have slightly modified them to +match resource v. 1.5 and GF v. 3.0.

    1. Create a sister directory for GF/lib/resource/english, named @@ -279,6 +344,8 @@ of resource v. 1.0.
    2. Check out the [ISO 639 3-letter language code http://www.w3.org/WAI/ER/IG/ert/iso639.htm] for German: both Ger and Deu are given, and we pick Ger. + (We use the 3-letter codes rather than the more common 2-letter codes, + since they will suffice for many more languages!)

    3. Copy the *Eng.gf files from english to german, and rename them: @@ -286,7 +353,10 @@ of resource v. 1.0. cp ../english/*Eng.gf . rename 's/Eng/Ger/' *Eng.gf -

      + If you don't have the rename command, you can use a bash script with mv. +
    + +
    1. Change the Eng module references to Ger references in all files:
      @@ -294,7 +364,8 @@ of resource v. 1.0.
                sed -i 's/Eng/Ger/g' *Ger.gf
       
      The first line prevents changing the word English, which appears - here and there in comments, to Gerlish. + here and there in comments, to Gerlish. The sed command syntax + may vary depending on your operating system.

    2. This may of course change unwanted occurrences of the string Eng - verify this by @@ -327,10 +398,10 @@ of resource v. 1.0. You will get lots of warnings on missing rules, but the grammar will compile.

      -
    3. At all following steps you will now have a valid, but incomplete +
    4. At all the following steps you will now have a valid, but incomplete GF grammar. The GF command
      -         pg -printer=missing
      +         pg -missing
       
      tells you what exactly is missing.
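For systems without the `rename` command, or with a `sed` that differs from the GNU one, the copy-and-rename steps above can be sketched as follows. This is an illustration under stated assumptions, not the procedure the authors ran: it works in a scratch directory, `VerbEng.gf` is a hypothetical stand-in for the real files, and the `@KEEP@` placeholder is our own way of protecting the word English rather than the command used in the original steps.

```shell
#!/bin/sh
# Sketch only: runs in a scratch directory with a stand-in file
# instead of the real GF/lib/resource tree.
workdir=$(mktemp -d)
mkdir "$workdir/english" "$workdir/german"

# Hypothetical stand-in for one of the *Eng.gf files.
cat > "$workdir/english/VerbEng.gf" <<'EOF'
-- Verb phrases of English
concrete VerbEng of Verb = CatEng ** open ResEng in {
}
EOF

cd "$workdir/german"
cp ../english/*Eng.gf .

# Substitute for `rename 's/Eng/Ger/' *Eng.gf`:
# strip the trailing "Eng.gf" from each name and append "Ger.gf".
for f in *Eng.gf; do
  mv "$f" "${f%Eng.gf}Ger.gf"
done

# `sed -i.bak` (suffix attached directly to -i) is accepted by both GNU and BSD sed.
# "English" is parked under a placeholder so that it does not turn into "Gerlish".
sed -i.bak -e 's/English/@KEEP@/g' -e 's/Eng/Ger/g' -e 's/@KEEP@/English/g' *Ger.gf
rm -f ./*.bak

cat VerbGer.gf
```

Unlike the `rename` expression, the loop only rewrites the `Eng.gf` suffix, so a file whose name contains `Eng` elsewhere would need separate handling. The module references inside the files should still be checked by hand afterwards.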
    @@ -338,14 +409,15 @@ of resource v. 1.0.

    Here is the module structure of LangGer. It has been simplified by leaving out the majority of the phrase category modules. Each of them has the same dependencies -as e.g. VerbGer. +as VerbGer, whose complete dependencies are shown as an example.

    +

    Direction of work

    -The real work starts now. There are many ways to proceed, the main ones being +The real work starts now. There are many ways to proceed, the most obvious ones being

    -In this chapter, we will look at an example: adding Italian to -the Romance family (to be completed). Here is a set of +Here is a set of slides on the topic.

    -

    Parametrizing a resource grammar implementation

    + +

    Parametrizing a resource grammar implementation

    This is the most demanding form of resource grammar writing. We do not recommend the method of parametrizing from the @@ -817,11 +908,60 @@ same family by parametrization. This means that the copy and paste method is still used, but at this time the differences are put into an interface module.

    + +

    Character encoding and transliterations

    +

    +This section is relevant for languages using a non-ASCII character set. +

    + +

    Coding conventions in GF

    +

    +From version 3.0, GF follows a simple encoding convention: +

    + + +

    +Most current resource grammars use isolatin-1 in the source, but this does +not affect their use in parallel with grammars written in other encodings. +In fact, a grammar can be assembled from modules using different codings. +

    +

    +Warning. While string literals may contain any characters, identifiers +must be isolatin-1 letters (or digits, underscores, or dashes). This has to +do with the restrictions of the lexer tool that is used. +

    + +

    Transliterations

    +

    +While UTF-8 is well supported by most web browsers, its use in terminals and +text editors may cause disappointment. Many grammarians therefore prefer to +use ASCII transliterations. GF 3.0beta2 provides the following built-in +transliterations: +

    + +

    -This chapter will work out an example of how an Estonian grammar -is constructed from the Finnish grammar through parametrization. +New transliterations can be defined in the GF source file +GF/Text/Transliterations.hs. +This file also gives instructions on how new ones are added.
