| author | aarne <aarne@cs.chalmers.se> | 2007-12-21 15:10:38 +0000 |
|---|---|---|
| committer | aarne <aarne@cs.chalmers.se> | 2007-12-21 15:10:38 +0000 |
| commit | 5ee1714fd23e974d1cf2511fa398b6ce310a9807 | |
| tree | 7a82f85d4f4681086430fdefd7903e4a26015c3f | |
| parent | c5017f28aad7702838b9861aa3f6cbf7b3bacca5 | |
new tutorial and reference manual
| -rw-r--r-- | doc/10lang-small.png | bin | 0 -> 66840 bytes |
| -rw-r--r-- | doc/German.png | bin | 0 -> 21000 bytes |
| -rw-r--r-- | doc/Syntax.png | bin | 0 -> 14804 bytes |
| -rw-r--r-- | doc/categories.png | bin | 0 -> 4241 bytes |
| -rw-r--r-- | doc/food-translet.png | bin | 0 -> 22916 bytes |
| -rw-r--r-- | doc/food1.png | bin | 0 -> 22805 bytes |
| -rw-r--r-- | doc/food2.png | bin | 0 -> 31506 bytes |
| -rw-r--r-- | doc/foodmarket.png | bin | 0 -> 2099 bytes |
| -rw-r--r-- | doc/gf-refman.html | 4545 |
| -rw-r--r-- | doc/gf-tutorial.html | 7952 |
| -rw-r--r-- | doc/mytree.png | bin | 0 -> 2230 bytes |
11 files changed, 12497 insertions, 0 deletions
diff --git a/doc/10lang-small.png b/doc/10lang-small.png Binary files differnew file mode 100644 index 000000000..49a3d0a98 --- /dev/null +++ b/doc/10lang-small.png diff --git a/doc/German.png b/doc/German.png Binary files differnew file mode 100644 index 000000000..7c6303897 --- /dev/null +++ b/doc/German.png diff --git a/doc/Syntax.png b/doc/Syntax.png Binary files differnew file mode 100644 index 000000000..1cc8161b1 --- /dev/null +++ b/doc/Syntax.png diff --git a/doc/categories.png b/doc/categories.png Binary files differnew file mode 100644 index 000000000..afc5873c5 --- /dev/null +++ b/doc/categories.png diff --git a/doc/food-translet.png b/doc/food-translet.png Binary files differnew file mode 100644 index 000000000..dd622a4bf --- /dev/null +++ b/doc/food-translet.png diff --git a/doc/food1.png b/doc/food1.png Binary files differnew file mode 100644 index 000000000..767069dab --- /dev/null +++ b/doc/food1.png diff --git a/doc/food2.png b/doc/food2.png Binary files differnew file mode 100644 index 000000000..b36a01b22 --- /dev/null +++ b/doc/food2.png diff --git a/doc/foodmarket.png b/doc/foodmarket.png Binary files differnew file mode 100644 index 000000000..6b0e3fbd7 --- /dev/null +++ b/doc/foodmarket.png diff --git a/doc/gf-refman.html b/doc/gf-refman.html new file mode 100644 index 000000000..b84079ecf --- /dev/null +++ b/doc/gf-refman.html @@ -0,0 +1,4545 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> +<HTML> +<HEAD> +<META NAME="generator" CONTENT="http://txt2tags.sf.net"> +<TITLE>GF Language Reference Manual</TITLE> +</HEAD><BODY BGCOLOR="white" TEXT="black"> +<P ALIGN="center"><CENTER><H1>GF Language Reference Manual</H1> +<FONT SIZE="4"> +<I>Aarne Ranta</I><BR> +</FONT></CENTER> + +<P></P> +<HR NOSHADE SIZE=1> +<P></P> + <UL> + <LI><A HREF="#toc1">Overview of GF</A> + <LI><A HREF="#toc2">The module system</A> + <UL> + <LI><A HREF="#toc3">Top-level and supplementary module structure</A> + <LI><A HREF="#toc4">Compilation units</A> + 
<LI><A HREF="#toc5">Names</A> + <LI><A HREF="#toc6">The structure of a module</A> + <LI><A HREF="#toc7">Module types, headers, and bodies</A> + <LI><A HREF="#toc8">Digression: the logic of module types</A> + <LI><A HREF="#toc9">Inheritance</A> + <LI><A HREF="#toc10">Opening</A> + <LI><A HREF="#toc11">Name resolution</A> + <LI><A HREF="#toc12">Functor instantiations</A> + <LI><A HREF="#toc13">Completeness</A> + </UL> + <LI><A HREF="#toc14">Judgements</A> + <UL> + <LI><A HREF="#toc15">Overview of the forms of judgement</A> + <LI><A HREF="#toc16">Category declarations, cat</A> + <LI><A HREF="#toc17">Hypotheses and contexts</A> + <LI><A HREF="#toc18">Function declarations, fun</A> + <LI><A HREF="#toc19">Function definitions, def</A> + <LI><A HREF="#toc20">Data constructor definitions, data</A> + <LI><A HREF="#toc21">The semantic status of an abstract syntax function</A> + <LI><A HREF="#toc22">Linearization type definitions, lincat</A> + <LI><A HREF="#toc23">Linearization definitions, lin</A> + <LI><A HREF="#toc24">Linearization default definitions, lindef</A> + <LI><A HREF="#toc25">Printname definitions, printname cat and printname fun</A> + <LI><A HREF="#toc26">Parameter type definitions, param</A> + <LI><A HREF="#toc27">Parameter values</A> + <LI><A HREF="#toc28">Operation definitions, oper</A> + <LI><A HREF="#toc29">Operation overloading</A> + <LI><A HREF="#toc30">Flag definitions, flags</A> + </UL> + <LI><A HREF="#toc31">Types and expressions</A> + <UL> + <LI><A HREF="#toc32">Overview of expression forms</A> + <LI><A HREF="#toc33">The functional fragment: expressions in abstract syntax</A> + <LI><A HREF="#toc34">Conversions</A> + <LI><A HREF="#toc35">Syntax trees</A> + <LI><A HREF="#toc36">Predefined types in abstract syntax</A> + <LI><A HREF="#toc37">Overview of expressions in concrete syntax</A> + <LI><A HREF="#toc38">Values, canonical forms, and run-time variables</A> + <LI><A HREF="#toc39">Token lists, tokens, and strings</A> + <LI><A HREF="#toc40">Records and 
record types</A> + <LI><A HREF="#toc41">Subtyping</A> + <LI><A HREF="#toc42">Tables and table types</A> + <LI><A HREF="#toc43">Pattern matching</A> + <LI><A HREF="#toc44">Free variation</A> + <LI><A HREF="#toc45">Local definitions</A> + <LI><A HREF="#toc46">Function applications in concrete syntax</A> + <LI><A HREF="#toc47">Reusing top-level grammars as resources</A> + <LI><A HREF="#toc48">Predefined concrete syntax types</A> + <LI><A HREF="#toc49">Predefined concrete syntax operations</A> + </UL> + <LI><A HREF="#toc50">Flags and pragmas</A> + <UL> + <LI><A HREF="#toc51">Some flags and their values</A> + <LI><A HREF="#toc52">Compiler pragmas</A> + </UL> + <LI><A HREF="#toc53">Alternative grammar input formats</A> + <UL> + <LI><A HREF="#toc54">Old GF without modules</A> + <LI><A HREF="#toc55">Context-free grammars</A> + <LI><A HREF="#toc56">Extended BNF grammars</A> + <LI><A HREF="#toc57">Example-based grammars</A> + </UL> + <LI><A HREF="#toc58">The grammar of GF</A> + <LI><A HREF="#toc59">The lexical structure of GF</A> + <UL> + <LI><A HREF="#toc60">Identifiers</A> + <LI><A HREF="#toc61">Literals</A> + <LI><A HREF="#toc62">Reserved words and symbols</A> + <LI><A HREF="#toc63">Comments</A> + </UL> + <LI><A HREF="#toc64">The syntactic structure of GF</A> + </UL> + +<P></P> +<HR NOSHADE SIZE=1> +<P></P> +<P> + +</P> +<P> +This document is a reference manual for the GF programming language. +GF, Grammatical Framework, is a special-purpose programming language, +designed to support definitions of grammars. +</P> +<P> +This document is not an introduction to GF; such an introduction can be +found in the GF tutorial, available online on the GF web page, +</P> +<P> +<A HREF="http://digitalgrammars.com/gf"><CODE>digitalgrammars.com/gf</CODE></A> +</P> +<P> +This manual covers only the language, not the GF compiler or +interactive system. 
We will however make some references to different +compiler versions, if they involve changes of behaviour having to +do with the language specification. +</P> +<P> +This manual is meant to be fully compatible with GF version 3.0 +(forthcoming). Main discrepancies with version 2.8 are indicated, +as well as with the reference article on GF, +</P> +<P> +A. Ranta, "Grammatical Framework. A Type Theoretical Grammar Formalism", +<I>The Journal of Functional Programming</I> 14(2), 2004, pp. 145-189. +</P> +<P> +This article will be referred to as "the JFP article". +</P> +<P> +As metalinguistic notation, we will use the symbols +</P> +<UL> +<LI><I>a</I> === <I>b</I> to say that <I>a</I> is syntactic sugar for <I>b</I> +<LI><I>a</I> ==> <I>b</I> to say that <I>a</I> is computed (or compiled) to <I>b</I> +</UL> + +<A NAME="toc1"></A> +<H2>Overview of GF</H2> +<P> +GF is a typed functional language, +borrowing many of its constructs from ML and Haskell: algebraic datatypes, +higher-order functions, pattern matching. The module system bears resemblance +to ML (functors) but also to object-oriented languages (inheritance). +The type theory used in the abstract syntax part of GF is inherited from +logical frameworks, in particular ALF ("Another Logical Framework"; in a +sense, GF is Yet Another ALF). From ALF comes also the use of dependent +types, including the use of explicit type variables instead of +Hindley-Milner polymorphism. +</P> +<P> +The look and feel of GF is close to Java and +C, due to the use of curly brackets and semicolons in structuring the code; +the expression syntax, however, follows Haskell in using juxtaposition for +function application and parentheses only for grouping. +</P> +<P> +To understand the constructs of GF, and especially their limitations in comparison +to general-purpose programming languages, it is essential to keep in mind that +GF is a special-purpose and non-Turing-complete language. 
Every GF program is +ultimately compiled to a <B>multilingual grammar</B>, which consists of an +<B>abstract syntax</B> and a set of <B>concrete syntaxes</B>. The abstract syntax +defines a system of <B>syntax trees</B>, and each concrete syntax defines a +mapping from those syntax trees to <B>nested tuples</B> of strings and integers. +This mapping is <B>compositional</B>, i.e. <B>homomorphic</B>, and moreover +<B>reversible</B>: given a nested tuple, there exists an effective way of finding +the set of syntax trees that map to this tuple. The procedure of applying the +mapping to a tree to produce a tuple is called <B>linearization</B>, and the +reverse search procedure is called <B>parsing</B>. It is ultimately the requirement +of reversibility that restricts GF to be less than Turing-complete. This is +reflected in restrictions to recursion in concrete syntax. Tree formation in +abstract syntax, however, is fully recursive. +</P> +<P> +Even though run-time GF grammars manipulate just nested tuples, at compile +time these are represented by the more fine-grained labelled records +and finite functions over algebraic datatypes. This enables the programmer +to write on a higher abstraction level, and also adds type distinctions +and hence raises the level of checking of programs. +</P> +<A NAME="toc2"></A> +<H2>The module system</H2> +<A NAME="toc3"></A> +<H3>Top-level and supplementary module structure</H3> +<P> +The big picture of GF as a programming language for multilingual grammars +explains its principal module structure. Any GF grammar must have an +abstract syntax module; it can in addition have any number of concrete +syntax modules matching that abstract syntax. Before going to details, +we give a simple example: a module defining the <B>category</B> <CODE>A</CODE> +of adjectives and one adjective-forming <B>function</B>, the zero-place function +<CODE>Even</CODE>. We give the module the name <CODE>Adj</CODE>. 
The GF code for the +module looks as follows: +</P> +<PRE> + abstract Adj = { + cat A ; + fun Even : A ; + } +</PRE> +<P> +Here are two concrete syntax modules, one intended for mapping the trees +to English, the other to Swedish. The mapping is defined by +<CODE>lincat</CODE> definitions assigning a <B>linearization type</B> to each category, +and <CODE>lin</CODE> definitions assigning a <B>linearization</B> to each function. +</P> +<PRE> + concrete AdjEng of Adj = { + lincat A = {s : Str} ; + lin Even = {s = "even"} ; + } + + concrete AdjSwe of Adj = { + lincat A = {s : AForm => Str} ; + lin Even = {s = table { + ASg Utr => "jämn" ; + ASg Neutr => "jämnt" ; + APl => "jämna" + } + } ; + param AForm = ASg Gender | APl ; + param Gender = Utr | Neutr ; + } +</PRE> +<P> +These examples illustrate the main ideas of multilingual grammars: +</P> +<UL> +<LI>the concrete syntax must match the abstract syntax: + <UL> + <LI>every <CODE>cat</CODE> is given a <CODE>lincat</CODE> + <LI>every <CODE>fun</CODE> is given a <CODE>lin</CODE> + </UL> +</UL> + +<UL> +<LI>the concrete syntax is internally coherent: + <UL> + <LI>the <CODE>lin</CODE> rules respect the types defined by <CODE>lincat</CODE> rules + </UL> +</UL> + +<UL> +<LI>concrete syntaxes are independent of each other + <UL> + <LI>they can use different <CODE>lincat</CODE> and <CODE>lin</CODE> definitions + <LI>they can define their own <B>parameter types</B> (<CODE>param</CODE>) + </UL> +</UL> + +<P> +The first two ideas form the core of the <B>static checking</B> of GF +grammars, eliminating the possibility of run-time errors in +linearization and parsing. The third idea gives GF the expressive +power needed to map abstract syntax to vastly different languages. +</P> +<P> +Abstract and concrete modules are called <B>top-level grammar modules</B>, +since they are the ones that remain in grammar systems at run time. 
+However, in order to support <B>modular grammar engineering</B>, GF provides +much more module structure than strictly required in top-level grammars. +</P> +<P> +<B>Inheritance</B>, also known as <B>extension</B>, means that a module can inherit the +contents of one or more other modules, to which new judgements are added, +e.g. +</P> +<PRE> + abstract MoreAdj = Adj ** { + fun Odd : A ; + } +</PRE> +<P> +<B>Resource modules</B> define parameter types and <B>operations</B> usable +in several concrete syntaxes, +</P> +<PRE> + resource MorphoFre = { + param Number = Sg | Pl ; + param Gender = Masc | Fem ; + oper regA : Str -> {s : Gender => Number => Str} = + \fin -> { + s = table { + Masc => table {Sg => fin ; Pl => fin + "s"} ; + Fem => table {Sg => fin + "e" ; Pl => fin + "es"} + } + } ; + } +</PRE> +<P> +By <B>opening</B>, a module can use the contents of a resource module +without inheriting them, e.g. +</P> +<PRE> + concrete AdjFre of Adj = open MorphoFre in { + lincat A = {s : Gender => Number => Str} ; + lin Even = regA "pair" ; + } +</PRE> +<P> +<B>Interfaces</B> and <B>instances</B> separate the contents of a resource module +into type signatures and definitions, in a way analogous to abstract vs. concrete +modules, e.g. +</P> +<PRE> + interface Lexicon = { + oper Adjective : Type ; + oper even_A : Adjective ; + } + + instance LexiconEng of Lexicon = { + oper Adjective = {s : Str} ; + oper even_A = {s = "even"} ; + } +</PRE> +<P> +<B>Functors</B>, i.e. <B>parametrized modules</B> or <B>incomplete modules</B>, define +a concrete syntax in terms of an interface. +</P> +<PRE> + incomplete concrete AdjI of Adj = open Lexicon in { + lincat A = Adjective ; + lin Even = even_A ; + } +</PRE> +<P> +A functor can be <B>instantiated</B> by providing instances of its open interfaces. 
+</P> +<PRE> + concrete AdjEng of Adj = AdjI with (Lexicon = LexiconEng) ; +</PRE> +<P></P> +<A NAME="toc4"></A> +<H3>Compilation units</H3> +<P> +The compilation unit of GF source code is a file that contains a module. +Judgements outside modules are supported only for backward compatibility, +as explained <a href="#oldgf">here</a>. +Every source file, suffixed <CODE>.gf</CODE>, is compiled to a "GF object file", +suffixed <CODE>.gfo</CODE> (as of GF Version 3.0 and later). For runtime grammar objects +used for parsing and linearization, a set of <CODE>.gfo</CODE> files is linked to +a single file suffixed <CODE>.gfcc</CODE>. While <CODE>.gf</CODE> and <CODE>.gfo</CODE> files may contain +modules of any kind, a <CODE>.gfcc</CODE> file always contains a multilingual grammar +with one abstract and a set of concrete syntaxes. +</P> +<P> +The following diagram summarizes the files involved in the compilation process. +<center> +<CODE>module1.gf module2.gf ... modulen.gf</CODE> +</P> +<P> +==> +</P> +<P> +<CODE>module1.gfo module2.gfo ... modulen.gfo</CODE> +</P> +<P> +==> +</P> +<P> +grammar.gfcc +</center> +Both <CODE>.gf</CODE> and <CODE>.gfo</CODE> files are written in the GF source language; +<CODE>.gfcc</CODE> files are written in a lower-level format. The process of translating +<CODE>.gf</CODE> to <CODE>.gfo</CODE> consists of <B>name resolution</B>, <B>type annotation</B>, +<B>partial evaluation</B>, and <B>optimization</B>. +It is a great advantage that this can be done +separately for each GF module, with the result saved in <CODE>.gfo</CODE> files. The partial +evaluation phase, in particular, is time and memory consuming, and GF libraries +are therefore distributed in <CODE>.gfo</CODE> to make their use less arduous. 
+</P> +<P> +<I>In GF before version 3.0, the object files are in a format called <CODE>.gfc</CODE>,</I> +<I>and the multilingual runtime grammar is in a format called <CODE>.gfcm</CODE>.</I> +</P> +<P> +The standard compiler has a built-in <B>make facility</B>, which finds out what +other modules are needed when compiling an explicitly given module. +This facility builds a dependency graph and decides which of the involved +modules need recompilation (from <CODE>.gf</CODE> to <CODE>.gfo</CODE>), and for which the +GF object can be used directly. +</P> +<A NAME="toc5"></A> +<H3>Names</H3> +<P> +Each module <I>M</I> defines a set of <B>names</B>, which are visible in <I>M</I> +itself, in all modules extending <I>M</I> (unless excluded, as explained +<a href="#restrictedinheritance">here</a>), and +all modules opening <I>M</I>. These names can stand for abstract syntax +categories and functions, parameter types and parameter constructors, +and operations. All these names live in the same <B>name space</B>, which +means that a name entering a module more than once due to inheritance or +opening can lead to a <B>conflict</B>. It is specified +<a href="#renaming">here</a> how these +conflicts are resolved. +</P> +<P> +The names of modules live in a name space separate from the other names. +Even here, all names must be distinct in a set of files compiled to a +multilingual grammar. In particular, even files residing in different directories +must have different names, since GF has no notion of hierarchic +module names. +</P> +<P> +Lexically, names belong to the class of <B>identifiers</B>. An identifier is +a letter followed by any number of letters, digits, underscores (<CODE>_</CODE>) and +primes (<CODE>'</CODE>). Upper- and lower-case letters are treated as distinct. 
+Nothing dictates the choice of upper or lower-case initials, but +the standard libraries follow conventions similar to Haskell: +</P> +<UL> +<LI>upper case is used for modules, abstract syntax categories and functions, + parameter types and constructors, and type synonyms +<LI>lower case is used for non-type-valued operations and for variables +</UL> + +<P> +<a name="identifiers"></a> +</P> +<P> +"Letters" as mentioned in the identifier syntax include all 7-bit ASCII +letters. Iso-latin-1 and Unicode letters are supported in varying degrees +by different tools and platforms, and are hence not recommended in identifiers. +</P> +<A NAME="toc6"></A> +<H3>The structure of a module</H3> +<P> +Modules of all types have the following structure: +<center> +<I>moduletype</I> <I>name</I> <CODE>=</CODE> <I>extends</I> <I>opens</I> <I>body</I> +</center> +The part of the module preceding the body is its <B>header</B>. The header +defines the type of the module and tells what other modules it inherits +and opens. The body consists of the judgements that introduce all the new +names defined by the module. +</P> +<P> +Any of the parts <I>extends</I>, <I>opens</I>, and <I>body</I> may be empty. 
+If they are all filled, delimiters and keywords separate the parts in the +following way: +<center> +<I>moduletype</I> <I>name</I> <CODE>=</CODE> + <I>extends</I> <CODE>**</CODE> <CODE>open</CODE> <I>opens</I> <CODE>in</CODE> <CODE>{</CODE> <I>body</I> <CODE>}</CODE> +</center> +The part <I>moduletype</I> <I>name</I> looks slightly different if the +type is <CODE>concrete</CODE> or <CODE>instance</CODE>: the <I>name</I> intrudes between +the type keyword and the name of the module being implemented, which +really belongs to the type of the module: +<center> + <CODE>concrete</CODE> <I>name</I> <CODE>of</CODE> <I>abstractname</I> +</center> +The only exception to the schema of functor syntax +is functor instantiations: the instantiation +list is given in a special way between <I>extends</I> and <I>opens</I>: +<center> +<CODE>incomplete concrete</CODE> <I>name</I> <CODE>of</CODE> <I>abstractname</I> <CODE>=</CODE> + <I>extends</I> <CODE>**</CODE> <I>functorname</I> <CODE>with</CODE> <I>instantiations</I> <CODE>**</CODE> + <CODE>open</CODE> <I>opens</I> <CODE>in</CODE> <CODE>{</CODE> <I>body</I> <CODE>}</CODE> +</center> +Logically, the part "<I>functorname</I> <CODE>with</CODE> <I>instantiations</I>" should +really be one of the <I>extends</I>. This is also shown by the fact that +it can have restricted inheritance (concept defined <a href="#restrictedinheritance">here</a>). +</P> +<A NAME="toc7"></A> +<H3>Module types, headers, and bodies</H3> +<P> +The <I>extends</I> and <I>opens</I> parts of a module header are lists of +module names (with possible qualifications, as defined <a href="#qualifiednames">here</a>). +The first step of type checking a module consists of verifying that +these names stand for modules of appropriate module types. 
As a rule +of thumb, +</P> +<UL> +<LI>the <I>extends</I> of a module must have the same <I>moduletype</I> +<LI>the <I>opens</I> of a module must be of type <CODE>resource</CODE> +</UL> + +<P> +However, the precise rules are a little more fine-grained, because +of the presence of interfaces and their instances, and the possibility +of reusing abstract and concrete modules as resources. The following table +gives, for all module types, the possible module types of their <I>extends</I> +and <I>opens</I>, as well as the forms of judgement legal in that module type. +</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>module type</TH> +<TH>extends</TH> +<TH>opens</TH> +<TH COLSPAN="2">body</TH> +</TR> +<TR> +<TD><CODE>abstract</CODE></TD> +<TD>abstract</TD> +<TD>-</TD> +<TD><CODE>cat, fun, def, data</CODE></TD> +</TR> +<TR> +<TD><CODE>concrete of</CODE> <I>abstract</I></TD> +<TD>concrete</TD> +<TD>resource*</TD> +<TD><CODE>lincat, lin, oper, param</CODE></TD> +</TR> +<TR> +<TD><CODE>resource</CODE></TD> +<TD>resource*</TD> +<TD>resource*</TD> +<TD><CODE>oper, param</CODE></TD> +</TR> +<TR> +<TD><CODE>interface</CODE></TD> +<TD>resource+</TD> +<TD>resource*</TD> +<TD><CODE>oper, param</CODE></TD> +</TR> +<TR> +<TD><CODE>instance of</CODE> <I>interface</I></TD> +<TD>resource*</TD> +<TD>resource*</TD> +<TD><CODE>oper, param</CODE></TD> +</TR> +<TR> +<TD><CODE>incomplete</CODE> concrete</TD> +<TD>concrete+</TD> +<TD>resource+</TD> +<TD><CODE>lincat, lin, oper, param</CODE></TD> +</TR> +</TABLE> + +<P></P> +<P> +The table uses the following shorthands for lists of module types: +</P> +<UL> +<LI>resource*: resource, instance, concrete +<LI>resource+: resource*, interface, abstract +<LI>concrete+: concrete, incomplete concrete +</UL> + +<P> +The legality of judgements in the body is checked before the judgements +themselves are checked. +</P> +<P> +The forms of judgement are explained <a href="#judgementforms">here</a>. 
+</P> +<A NAME="toc8"></A> +<H3>Digression: the logic of module types</H3> +<P> +Why are the legality conditions of opens and extends so complicated? The best way +to grasp them is probably to consider a simplified logical model of the module +system, replacing modules by types and functions. This model could actually +be developed towards treating modules in GF as first-class objects; so far, +however, this step has not been motivated by any practical needs. +</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>module</TH> +<TH COLSPAN="2">object and type</TH> +</TR> +<TR> +<TD>abstract A = B</TD> +<TD>A = B : type</TD> +</TR> +<TR> +<TD>concrete C of A = B</TD> +<TD>C = B : A -> S</TD> +</TR> +<TR> +<TD>interface I = B</TD> +<TD>I = B : type</TD> +</TR> +<TR> +<TD>instance J of I = B</TD> +<TD>J = B : I</TD> +</TR> +<TR> +<TD>incomplete concrete C of A = open I in B</TD> +<TD>C = B : I -> A -> S</TD> +</TR> +<TR> +<TD>concrete K of A = C with (I=J)</TD> +<TD>K = B(J) : A -> S</TD> +</TR> +<TR> +<TD>resource R = B</TD> +<TD>R = B : I</TD> +</TR> +<TR> +<TD>concrete C of A = open R in B</TD> +<TD>C = B(R) : A -> S</TD> +</TR> +</TABLE> + +<P></P> +<P> +A further step of defining modules as first-class objects would use +GADTs and record types: +</P> +<UL> +<LI>an abstract syntax is a Generalized Algebraic Datatype (GADT) +<LI>the target type <CODE>S</CODE> of concrete syntax is the type of nested + tuples over strings and integers +<LI>an interface is a labelled record type +<LI>an instance is a record of the type defined by the interface +<LI>a functor, with a module body opening an interface, is a function + on its instances +<LI>the instantiation of a functor is an application of the function to + some instance +<LI>a resource is a typed labelled record, putting together an interface and + an instance of it +<LI>the body of a module opening a resource is a function on the interface + implicit in the resource; this function is immediately applied 
to the instance + defined in the resource +</UL> + +<P> +Slightly unexpectedly, interfaces and instances are easier to understand +in this way than resources - a resource is, indeed, more complex, since +it fuses together an interface and an instance. +</P> +<P> +<a name="openabstract"></a> +</P> +<P> +When an abstract is used as an interface and a concrete as its instance, they +are actually reinterpreted so that they match the model. Then the abstract is +no longer a GADT, but a system of <I>abstract</I> datatypes, with a record field +of type <CODE>Type</CODE> for each category, and a function among these types for each +abstract syntax function. A concrete syntax instantiates this record with +linearization types and linearizations. +</P> +<A NAME="toc9"></A> +<H3>Inheritance</H3> +<P> +After checking that the <I>extends</I> of a module are of appropriate +module types, the compiler adds the inherited judgements to the +judgements included in the body. The inherited judgements are +not copied entirely; only their names are added, with links to the inherited module. +Conflicts may arise in this process: a name can have two definitions in the combined +pool of inherited and added judgements. Such a conflict is always an +error: GF provides no way to redefine an inherited constant. +</P> +<P> +Simple as the definition of a conflict may sound, it has to take care of the +inheritance hierarchy. A very common pattern of inheritance is the +<B>diamond</B>: inheritance from two modules which themselves inherit a common +base module. Assume that the base module defines a name <CODE>f</CODE>: +</P> +<PRE> + N + / \ + M1 M2 + \ / + Base {f} +</PRE> +<P> +Now, <CODE>N</CODE> inherits <CODE>f</CODE> from both <CODE>M1</CODE> and <CODE>M2</CODE>, so is there a +conflict? The answer in GF is <I>no</I>, because the "two" <CODE>f</CODE>'s are in the +end the same: the one defined in <CODE>Base</CODE>. 
The situation is thus simpler +than in <B>multiple inheritance</B> in languages like C++, because definitions in +GF are <B>immutable</B>: neither <CODE>M1</CODE> nor <CODE>M2</CODE> can possibly have changed +the definition of <CODE>f</CODE> given in <CODE>Base</CODE>. In practice, the compiler manages +inheritance through hierarchy in a very simple way, by just always creating +a link not to the immediate parent, but to the original ancestor; this ancestor +can be read from the link provided by the immediate parent. Here is how +links are created from source modules by the compiler: +</P> +<PRE> + Base {f} + M1 {m1} ===> M1 {Base.f, m1} + M2 {m2} ===> M2 {Base.f, m2} + N {n} ===> N {Base.f, M1.m1, M2.m2, n} +</PRE> +<P></P> +<P> +<a name="restrictedinheritance"></a> +</P> +<P> +Inheritance can be <B>restricted</B>. This means that a module can be specified +as inheriting <I>only</I> explicitly listed constants, or all constants +<I>except</I> ones explicitly listed. The syntax uses constant names in brackets, +prefixed by a minus sign in the case of an exclusion list. In the following +configuration, N inherits <CODE>a,b,c</CODE> from <CODE>M1</CODE>, and all names but <CODE>d</CODE> +from <CODE>M2</CODE>. +</P> +<PRE> + N = M1 [a,b,c], M2-[d] +</PRE> +<P> +Restrictions are performed as a part of inheritance linking, module by module: +the link is created for a constant if and only if it is both +included in the module and compatible with the restriction. Thus, +for instance, an inadvertent usage can exclude a constant from one module +but inherit it from another one. In the following +configuration, <CODE>f</CODE> is inherited via <CODE>M1</CODE>, if <CODE>M1</CODE> inherits it. +</P> +<PRE> + N = M1 [a,b,c], M2-[f] +</PRE> +<P> +Unintended inheritance may cause problems later in compilation, in the +judgement-level dependency analysis phase. 
For instance, suppose a function +<CODE>f</CODE> has category <CODE>C</CODE> as its type in <CODE>M</CODE>, and we only include <CODE>f</CODE>. The +restriction has the effect of creating an ill-formed module: +</P> +<PRE> + abstract M = {cat C ; fun f : C ;} + M [f] ===> {fun f : C ;} +</PRE> +<P> +One might expect inheritance restriction to be transitive: if an included +constant <I>b</I> depends on some other constant <I>a</I>, then <I>a</I> should be +included automatically. However, this rule would lead to hard-to-detect +inheritances. And it could only be applied later in the compilation phase, +when the compiler has not only collected the names defined, but also +resolved the names used in definitions. +</P> +<P> +Yet another pitfall with restricted inheritance is that it must be stated +for each module separately. For instance, a concrete syntax of an abstract +must exclude all those names that the abstract does, and a functor instantiation +must replicate all restrictions of the functor. +</P> +<A NAME="toc10"></A> +<H3>Opening</H3> +<P> +Opening makes constants from other modules usable in judgements, without +inheriting them. This means that, unlike inheritance, opening is not +transitive. +</P> +<P> +<a name="qualifiednames"></a> +</P> +<P> +Opening cannot be restricted as inheritance can, but it can be <B>qualified</B>. +This means that the names from the opened modules cannot be used as such, but +only as prefixed by a qualifier and a dot (<CODE>.</CODE>). The qualifier can be any +identifier, including the name of the module. 
Here is an example of +an <I>opens</I> list: +</P> +<PRE> + open A, (X = XSLTS), (Y = XSLTS), B +</PRE> +<P> +If <CODE>A</CODE> defines the constant <CODE>a</CODE>, it can be accessed by the names +</P> +<PRE> + a A.a +</PRE> +<P> +If <CODE>XSLTS</CODE> defines the constant <CODE>x</CODE>, it can be accessed by the names +</P> +<PRE> + X.x Y.x XSLTS.x +</PRE> +<P> +Thus qualification by real module name is always possible, and one and the same +module can be qualified in different ways at the same time (the latter can +be useful if you want to be able to change the implementations of some +constants to a different resource later). Since the qualification with real +module name is always possible, it is not possible to "swap" the names of +modules locally: +</P> +<PRE> + open (A=B), (B=A) -- NOT POSSIBLE! +</PRE> +<P> +The list of qualifier names and module names in a module header may +thus not contain any duplicates. +</P> +<A NAME="toc11"></A> +<H3>Name resolution</H3> +<P> +<a name="renaming"></a> +</P> +<P> +<B>Name resolution</B> is the compiler phase taking place after inheritance +linking. It qualifies all names occurring in the definition parts of judgements +(that is, just excluding the defined names themselves) with the names of +the modules they come from. If a name can come from different modules (that is, +not from their common ancestor), a conflict is reported; this decision is +hence not dependent on e.g. types, which are known only at a later phase. +</P> +<P> +Qualification of names is the main device for avoiding conflicts in +name resolution. No other information is used, such as priorities between +modules. However, if a name is defined in different opened modules +but never used in the module body, +a conflict does not arise: conflicts arise only +when names are used. Also in this respect, opening is thus different from +inheritance, where conflicts are checked independently of use. 
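+</P>
+<P>
+A minimal sketch of this rule, assuming two hypothetical resource modules
+<CODE>R1</CODE> and <CODE>R2</CODE> (illustrative names, not part of any GF library) that both
+define an operation <CODE>f</CODE>, and a hypothetical concrete syntax <CODE>AdjQual</CODE> of the
+<CODE>Adj</CODE> module introduced earlier. Each module would reside in its own file.
+Opening both resources is harmless as long as <CODE>f</CODE> is used only in qualified form:
+</P>
+<PRE>
+  resource R1 = { oper f : Str = "one" ; }
+  resource R2 = { oper f : Str = "two" ; }
+
+  concrete AdjQual of Adj = open R1, R2 in {
+    lincat A = {s : Str} ;
+    lin Even = {s = R1.f} ;   -- qualified name: resolved to R1, no conflict
+    -- lin Even = {s = f} ;   -- unqualified f could come from R1 or R2: conflict
+  }
+</PRE>
+<P>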
+</P> +<P> +As usual, inner scope has priority in name resolution. This means that +if an identifier is in scope as a bound variable, it will not be +interpreted as a constant, unless qualified by a module name +(variable bindings are explained <a href="#variablebinding">here</a>). +</P> +<A NAME="toc12"></A> +<H3>Functor instantiations</H3> +<P> +We have dealt with the principles of module headers, inheritance, and +names in a general way that applies to all module types. The exception +is functor instantiations, which have an extra part, the instantiating +equations, assigning an instance to every interface. Here is a typical +example, displaying the full generality: +</P> +<PRE> + concrete FoodsEng of Foods = PhrasesEng ** + FoodsI-[Pizza] with + (Syntax = SyntaxEng), + (LexFoods = LexFoodsEng) ** + open SyntaxEng, ParadigmsEng in { + lin Pizza = mkCN (mkA "Italian") (mkN "pie") ; + } +</PRE> +<P> +(The example is modified from Section 5.9 in the GF Tutorial.) +</P> +<P> +The instantiation syntax is similar to qualified <I>opens</I>. The left-hand-side +names must be interfaces, the right-hand-side names their instances. (Recall +that <CODE>abstract</CODE> can be used as <CODE>interface</CODE> and <CODE>concrete</CODE> as its +<CODE>instance</CODE>.) Inheritance from the functor can be restricted, typically +for the purpose of defining some excluded functions in language-specific +ways in the module body. +</P> +<A NAME="toc13"></A> +<H3>Completeness</H3> +<P> +<a name="completeness"></a> +</P> +<P> +(This section refers to the forms of judgement introduced <a href="#judgementforms">here</a>.) +</P> +<P> +A <CODE>concrete</CODE> is complete with respect to an <CODE>abstract</CODE>, if it +contains a <CODE>lincat</CODE> definition for every <CODE>cat</CODE> declaration, and +a <CODE>lin</CODE> definition for every <CODE>fun</CODE> declaration. +</P> +<P> +The same completeness criterion applies to functor instantiations. 
+It is not possible to use a partial functor instantiation, leading +to another functor. +</P> +<P> +Functors do not need to be complete in the sense that concrete modules do. +The missing definitions can then be provided in the body of each +functor instantiation. +</P> +<P> +A <CODE>resource</CODE> is complete, if all its <CODE>oper</CODE> and <CODE>param</CODE> judgements +have a definition part. While a <CODE>resource</CODE> must be complete, an +<CODE>interface</CODE> need not. For an <CODE>interface</CODE>, the definition +parts of judgements are optional. +</P> +<P> +An <CODE>instance</CODE> is complete with respect to an <CODE>interface</CODE>, if it +gives the definition parts of all <CODE>oper</CODE> and <CODE>param</CODE> judgements +that are omitted in the <CODE>interface</CODE>. Giving definitions to judgements +that have already been defined in the <CODE>interface</CODE> is illegal. +Type signatures, on the other hand, can be repeated if the same types +are used. +</P> +<P> +In addition to completing the definitions in an <CODE>interface</CODE>, +its instance may contain other judgements, but these must all +be complete with definitions. 
+</P> +<P> +Here is an example of an instance and its interface showing the +above variations: +</P> +<PRE> + interface Pos = { + param Case ; -- no definition + param Number = Sg | Pl ; -- definition given + oper Noun : Type = { -- relative definition given + s : Number => Case => Str + } ; + oper regNoun : Str -> Noun ; -- no definition + } + + instance PosEng of Pos = { + param Case = Nom | Gen ; -- definition of Case + -- Number and Noun inherited + oper regNoun = \dog -> { -- type of regNoun inherited + s = table { -- definition of regNoun + Sg => table { + Nom => dog + -- etc + } + -- etc + } + } ; + oper house_N : Noun = -- new definition + regNoun "house" ; + } +</PRE> +<P></P> +<A NAME="toc14"></A> +<H2>Judgements</H2> +<A NAME="toc15"></A> +<H3>Overview of the forms of judgement</H3> +<P> +<a name="judgementforms"></a> +</P> +<P> +A module body in GF is a set of <B>judgements</B>. Judgements are +definitions or declarations, sometimes combinations of the two; the +common feature is that every judgement introduces a name, which is +available in the module and whenever the module is extended or opened. +</P> +<P> +There are several different <B>forms of judgement</B>, identified by different +<B>judgement keywords</B>. Here is a list of all these forms, together +with syntax descriptions and the types of modules in which each form can occur. +The table moreover indicates whether the judgement has a default value, and +whether it contributes to the <B>name base</B>, i.e. introduces a new +name to the scope. 
+</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>judgement</TH> +<TH>where</TH> +<TH>module</TH> +<TH>default</TH> +<TH COLSPAN="2">base</TH> +</TR> +<TR> +<TD><CODE>cat</CODE> C G</TD> +<TD>G context</TD> +<TD>abstract</TD> +<TD>N/A</TD> +<TD>yes</TD> +</TR> +<TR> +<TD><CODE>fun</CODE> f : A</TD> +<TD>A type</TD> +<TD>abstract</TD> +<TD>N/A</TD> +<TD>yes</TD> +</TR> +<TR> +<TD><CODE>def</CODE> f ps = t</TD> +<TD>f fun, ps patterns, t term</TD> +<TD>abstract</TD> +<TD>yes</TD> +<TD>no</TD> +</TR> +<TR> +<TD><CODE>data</CODE> C = f <CODE>|</CODE> ... <CODE>|</CODE> g</TD> +<TD>C cat, f...g fun</TD> +<TD>abstract</TD> +<TD>yes</TD> +<TD>no</TD> +</TR> +<TR> +<TD><CODE>lincat</CODE> C = T</TD> +<TD>C cat, T type</TD> +<TD>concrete*</TD> +<TD>yes</TD> +<TD>yes</TD> +</TR> +<TR> +<TD><CODE>lin</CODE> f = t</TD> +<TD>f fun, t term</TD> +<TD>concrete*</TD> +<TD>no</TD> +<TD>yes</TD> +</TR> +<TR> +<TD><CODE>lindef</CODE> C = t</TD> +<TD>C cat, t term</TD> +<TD>concrete*</TD> +<TD>yes</TD> +<TD>no</TD> +</TR> +<TR> +<TD><CODE>printname cat</CODE> C = t</TD> +<TD>C cat, t term</TD> +<TD>concrete*</TD> +<TD>yes</TD> +<TD>no</TD> +</TR> +<TR> +<TD><CODE>printname fun</CODE> f = t</TD> +<TD>f fun, t term</TD> +<TD>concrete*</TD> +<TD>yes</TD> +<TD>no</TD> +</TR> +<TR> +<TD><CODE>param</CODE> P = C <CODE>|</CODE> ... <CODE>|</CODE> D</TD> +<TD>C...D constructors</TD> +<TD>resource*</TD> +<TD>N/A</TD> +<TD>yes</TD> +</TR> +<TR> +<TD><CODE>oper</CODE> f : T = t</TD> +<TD>T type, t term</TD> +<TD>resource*</TD> +<TD>N/A</TD> +<TD>yes</TD> +</TR> +<TR> +<TD><CODE>flags</CODE> o = v</TD> +<TD>o flag, v value</TD> +<TD>all</TD> +<TD>yes</TD> +<TD>N/A</TD> +</TR> +</TABLE> + +<P></P> +<P> +Judgements that have default values are rarely used, except <CODE>lincat</CODE> and +<CODE>flags</CODE>, which often need values different from the defaults. +</P> +<P> +Introducing a name twice in the same module is an error. 
In other words, +all judgements that have a "yes" in the name base column must +have distinct identifiers on their left-hand sides. +</P> +<P> +All judgements end with semicolons (<CODE>;</CODE>). +</P> +<P> +In addition to the syntax given in the table, many of the forms have +syntactic sugar. This sugar will be explained below in connection with +each form. There are moreover two kinds of syntactic sugar common to all forms: +</P> +<UL> +<LI>the judgement keyword is shared between consecutive judgements + until a new keyword appears: +<center> +<CODE>keyw J ; K ;</CODE> === <CODE>keyw J ; keyw K ;</CODE> +</center> +<LI>the right-hand sides of colon (<CODE>:</CODE>) and equality (<CODE>=</CODE>) + can be shared, by using comma (<CODE>,</CODE>) as separator of left-hand sides, which + must consist of identifiers +<center> +<CODE>c,d : T</CODE> === <CODE>c : T ; d : T ;</CODE> +<P></P> +<CODE>c,d = t</CODE> === <CODE>c = t ; d = t ;</CODE> +</center> +</UL> + +<P> +These conventions, like all syntactic sugar, are performed at an +early compilation phase, directly after parsing. This means that e.g. +</P> +<PRE> + lin f,g = \x -> x ; +</PRE> +<P> +can be correct even though <CODE>f</CODE> and <CODE>g</CODE> require different +function types. +</P> +<P> +Within a module, judgements can occur in any order. In particular, +a name can be used before it is introduced. +</P> +<P> +The explanations of judgement forms refer to the notions +of <B>type</B> and <B>term</B> (the latter also called <B>expression</B>). +These notions will be explained in detail <a href="#expressions">here</a>. +</P> +<A NAME="toc16"></A> +<H3>Category declarations, cat</H3> +<P> +<a name="catjudgements"></a> +</P> +<P> +Category declarations +<center> +<CODE>cat</CODE> <I>C</I> <I>G</I> +</center> +define the <B>basic types</B> of abstract syntax. +A basic type is formed from a category by giving values to all variables +in the <B>context</B> <I>G</I>. 
If the context is empty, the +basic type looks the same as the category itself. Otherwise, application +syntax is used: +<center> +<I>C</I> <i>a</i><sub>1</sub>...<i>a</i><sub>n</sub> +</center> +</P> +<A NAME="toc17"></A> +<H3>Hypotheses and contexts</H3> +<P> +<a name="contexts"></a> +</P> +<P> +A context is a sequence of <B>hypotheses</B>, i.e. variable-type pairs. +A hypothesis is written +<center> +<CODE>(</CODE> <I>x</I> <CODE>:</CODE> <I>T</I> <CODE>)</CODE> +</center> +and a sequence does not have any separator symbols. As syntactic sugar, +</P> +<UL> +<LI>variables can share a type, +<center> +<CODE>(</CODE> <I>x,y</I> <CODE>:</CODE> <I>T</I> <CODE>)</CODE> === <CODE>(</CODE> <I>x</I> <CODE>:</CODE> <I>T</I> <CODE>)</CODE> <CODE>(</CODE> <I>y</I> <CODE>:</CODE> <I>T</I> <CODE>)</CODE> +</center> +<LI>a <B>wildcard</B> can be used for a variable not occurring in types + later in the context, +<center> +<CODE>(</CODE> <CODE>_</CODE> <CODE>:</CODE> <I>T</I> <CODE>)</CODE> === <CODE>(</CODE> <I>x</I> <CODE>:</CODE> <I>T</I> <CODE>)</CODE> +</center> +<LI>if the variable does not occur later, it can be omitted altogether, and + parentheses are not used, +<center> + <I>T</I> === <CODE>(</CODE> <I>x</I> <CODE>:</CODE> <I>T</I> <CODE>)</CODE> +</center> + But if <I>T</I> is more complex than an identifier, it needs parentheses to + be separated from the rest of the context. +</UL> + +<P> +An abstract syntax has <B>dependent types</B>, if any of its categories has +a non-empty context. +</P> +<A NAME="toc18"></A> +<H3>Function declarations, fun</H3> +<P> +Function declarations, +<center> + <CODE>fun</CODE> <I>f</I> <CODE>:</CODE> <I>T</I> +</center> +define the <B>syntactic constructors</B> of abstract +syntax. The type <I>T</I> of <I>f</I> +is built from basic types (formed from categories) by using +the function type constructor <CODE>-></CODE>. Thus its form is +<center> + (<i>x</i><sub>1</sub> <CODE>:</CODE> <i>A</i><sub>1</sub>) <CODE>-></CODE> ... 
<CODE>-></CODE> (<i>x</i><sub>n</sub> <CODE>:</CODE> <i>A</i><sub>n</sub>) <CODE>-></CODE> <I>B</I> +</center> +where <i>A</i><sub>i</sub> are types, called the <B>argument types</B>, and <I>B</I> is a +basic type, called the <B>value type</B> of <I>f</I>. The <B>value category</B> of +<I>f</I> is the category that forms the type <I>B</I>. +</P> +<P> +A <B>syntax tree</B> is formed from <I>f</I> by applying it to a full list of +arguments, so that the result is of a basic type. +</P> +<P> +A <B>higher-order function</B> is one that has a function type as an +argument. The concrete syntax of GF does not support displaying the +bound variables of functions of higher than second order, but they are +legal in abstract syntax. +</P> +<P> +An abstract syntax is <B>context-free</B>, if it has neither dependent +types nor higher-order functions. Grammars with context-free abstract +syntax are an important subclass of GF, with more limited complexity +than full GF. Whether the <I>concrete</I> syntax is context-free in the sense +of the Chomsky hierarchy is independent of the context-freeness of +the abstract syntax. +</P> +<A NAME="toc19"></A> +<H3>Function definitions, def</H3> +<P> +Function definitions, +<center> + <CODE>def</CODE> <I>f</I> <i>p</i><sub>1</sub> ... <i>p</i><sub>n</sub> <CODE>=</CODE> <I>t</I> +</center> +where <I>f</I> is a <CODE>fun</CODE> function and <i>p</i><sub>i</sub> are patterns, +impose a relation of <B>definitional equality</B> on abstract syntax +trees. They form the basis of <B>computation</B>, which is used +when comparing whether two types are equal; this notion is relevant +only if the types are dependent. Computation can also be used for +the <B>normalization</B> of syntax trees, which applies even in +context-free abstract syntax. +</P> +<P> +The set of <CODE>def</CODE> definitions for <I>f</I> can be scattered around +the module in which <I>f</I> is introduced as a function. 
The compiler +builds the set of pattern equations in the order in which the +equations appear; this order is significant in the case of +overlapping patterns. All equations must appear in the same module in +which <I>f</I> is itself declared. +</P> +<P> +The syntax of patterns will be specified <a href="#patternmatching">here</a>, commonly for +abstract and concrete syntax. In abstract +syntax, <B>constructor patterns</B> are those of the form +<center> + <I>C</I> <i>p</i><sub>1</sub> ... <i>p</i><sub>n</sub> +</center> +where <I>C</I> is declared as <CODE>data</CODE> for some abstract syntax category +(see next section). A <B>variable pattern</B> is either an identifier or +a wildcard. +</P> +<P> +A common pitfall is to forget to declare a constructor as data, which +causes it to be interpreted as a variable pattern in definitions. +</P> +<P> +Computation is performed by applying definitions and beta conversions, +and in general by using <B>pattern matching</B>. Computation and pattern matching +are explained commonly for abstract and concrete syntax <a href="#patternmatching">here</a>. +</P> +<P> +In contrast to concrete syntax, abstract syntax computation is +completely <B>symbolic</B>: it does not produce a value, but just another +term. Hence it is not an error to have incomplete systems of +pattern equations for a function. In addition, the definitions +can be <B>recursive</B>, which means that computation can fail to terminate; +this can never happen in concrete syntax. +</P> +<A NAME="toc20"></A> +<H3>Data constructor definitions, data</H3> +<P> +A data constructor definition, +<center> + <CODE>data</CODE> <I>C</I> <CODE>=</CODE> <i>f</i><sub>1</sub> <CODE>|</CODE> ... <CODE>|</CODE> <i>f</i><sub>n</sub> +</center> +defines the functions <i>f</i><sub>1</sub>...<i>f</i><sub>n</sub> to be <B>constructors</B> +of the category <I>C</I>. This means that they are recognized as constructor +patterns when used in function definitions. 
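+</P>
+<P>
+The interplay of <CODE>fun</CODE>, <CODE>data</CODE>, and <CODE>def</CODE> can be sketched
+with a small abstract syntax of natural numbers. The module name <CODE>Arith</CODE> and
+all names in it are hypothetical, chosen only for illustration. Because <CODE>Zero</CODE>
+and <CODE>Succ</CODE> are declared as constructors of <CODE>Nat</CODE>, the two equations
+for <CODE>plus</CODE> match on them as constructor patterns:
+</P>
+<PRE>
+  abstract Arith = {
+    cat Nat ;
+    fun Zero : Nat ;
+        Succ : Nat -> Nat ;
+        plus : Nat -> Nat -> Nat ;
+    data Nat = Zero | Succ ;              -- Zero and Succ are constructors of Nat
+    def plus x Zero = x ;                 -- Zero is a constructor pattern, x a variable
+        plus x (Succ y) = Succ (plus x y) ;
+  }
+</PRE>
+<P>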
+</P> +<P> +In order for the data constructor definition to be correct, +<i>f</i><sub>1</sub>...<i>f</i><sub>n</sub> must be functions with <I>C</I> as their value category. +</P> +<P> +The complete set of constructors for a category <I>C</I> is the union of +all its data constructor definitions. Thus a category can be "extended" +by new constructors afterwards. However, all these constructor definitions +must appear in the same module in which the category is itself defined. +</P> +<P> +There is syntactic sugar for declaring a function as a constructor at +the same time as introducing it: +<center> +<CODE>data</CODE> <I>f</I> : <i>A</i><sub>1</sub> <CODE>-></CODE> ... <CODE>-></CODE> <i>A</i><sub>n</sub> <CODE>-></CODE> <I>C</I> <i>t</i><sub>1</sub> ... <i>t</i><sub>m</sub> +</P> +<P> + === +</P> +<P> +<CODE>fun</CODE> <I>f</I> : <i>A</i><sub>1</sub> <CODE>-></CODE> ... <CODE>-></CODE> <i>A</i><sub>n</sub> <CODE>-></CODE> <I>C</I> <i>t</i><sub>1</sub> ... <i>t</i><sub>m</sub> ; + <CODE>data</CODE> <I>C</I> = <I>f</I> +</center> +</P> +<A NAME="toc21"></A> +<H3>The semantic status of an abstract syntax function</H3> +<P> +There are three possible statuses for a function declared in a <CODE>fun</CODE> judgement: +</P> +<UL> +<LI>primitive notion: the default status +<LI>constructor: the function appears on the right-hand side of a <CODE>data</CODE> judgement +<LI>defined: the function has a <CODE>def</CODE> definition +</UL> + +<P> +The "constructor" and "defined" statuses are in contradiction with each other, +whereas the primitive notion status is overridden by either of the two others. +</P> +<P> +This distinction is relevant for the semantics of abstract syntax, not +for concrete syntax. It shows in the way patterns are treated in +equations in <CODE>def</CODE> definitions: a constructor +in a pattern matches only itself, whereas +any other name is treated as a variable pattern, which matches +anything. 
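+</P>
+<P>
+A small sketch of the difference, using a hypothetical module <CODE>Colours</CODE>
+(all names are invented for illustration). Here <CODE>Red</CODE> is only a primitive
+notion, so in the equation it behaves as a variable pattern and the equation applies
+to every argument:
+</P>
+<PRE>
+  abstract Colours = {
+    cat Colour ;
+    fun Red, Blue : Colour ;
+    fun opposite : Colour -> Colour ;
+    def opposite Red = Blue ;    -- Red is not declared as data: variable pattern
+    -- adding  data Colour = Red | Blue ;  would make Red match only itself
+  }
+</PRE>
+<P>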
+</P> +<A NAME="toc22"></A> +<H3>Linearization type definitions, lincat</H3> +<P> +A linearization type definition, +<center> + <CODE>lincat</CODE> <I>C</I> <CODE>=</CODE> <I>T</I> +</center> +defines the type of linearizations of trees whose type has category <I>C</I>. +Type dependences have no effect on the linearization type. +</P> +<P> +The type <I>T</I> must be a <B>legal linearization type</B>, which means that it +is a <I>record type</I> whose fields have either parameter types, the type Str +of strings, or table or record types of these. In particular, function types +may not appear in <I>T</I>. A detailed explanation of types in concrete syntax +will be given <a href="#cnctypes">here</a>. +</P> +<P> +If <I>K</I> is the concrete syntax of an abstract syntax <I>A</I>, then <I>K</I> must +define the linearization type of all categories declared in <I>A</I>. However, +the definition can be omitted from the source code, in which case the default +type <CODE>{s : Str}</CODE> is used. +</P> +<A NAME="toc23"></A> +<H3>Linearization definitions, lin</H3> +<P> +A linearization definition, +<center> + <CODE>lin</CODE> <I>f</I> <CODE>=</CODE> <I>t</I> +</center> +defines the linearization function of function <I>f</I>, i.e. the function +used for linearizing trees formed by <I>f</I>. +</P> +<P> +The type of <I>t</I> must be the homomorphic image of the type of <I>f</I>. +In other words, if +<center> + <CODE>fun</CODE> <I>f</I> <CODE>:</CODE> <i>A</i><sub>1</sub> <CODE>-></CODE> ... <CODE>-></CODE> <i>A</i><sub>n</sub> <CODE>-></CODE> <I>A</I> +</center> +then +<center> + <CODE>lin</CODE> <I>f</I> <CODE>:</CODE> <i>A</i><sub>1</sub>* <CODE>-></CODE> ... <CODE>-></CODE> <i>A</i><sub>n</sub>* <CODE>-></CODE> <I>A</I>* +</center> +where the type <I>T</I>* is defined as follows depending on <I>T</I>: +</P> +<UL> +<LI>(<I>C</I> <i>t</i><sub>1</sub> ... 
<i>t</i><sub>n</sub>)* = <I>T</I>, if <CODE>lincat</CODE> <I>C</I> <CODE>=</CODE> <I>T</I> +<LI>(<i>B</i><sub>1</sub> <CODE>-></CODE> ... <CODE>-></CODE> <i>B</i><sub>m</sub> <CODE>-></CODE> <I>B</I>)* = <I>B</I>* <CODE>** {$0,...,$m : Str}</CODE> +</UL> + +<P> +The second case is relevant for higher-order functions only. It says that +the linearization type of the value type is extended by adding a string field +for each argument type; these fields store the variable symbol used for +the binding of each variable. +</P> +<P> +<a name="HOAS"></a> +</P> +<P> +Since the arguments of a function argument are treated as bare strings, +orders higher than the second are irrelevant for concrete syntax. +</P> +<P> +There is syntactic sugar for binding the variables of the linearization +of a function on the left-hand side: +<center> + <CODE>lin</CODE> <I>f</I> <I>p</I> <CODE>=</CODE> <I>t</I> === <CODE>lin</CODE> <I>f</I> <CODE>= \</CODE><I>p</I> <CODE>-></CODE> <I>t</I> +</center> +The pattern <I>p</I> must be either a variable or a wildcard (<CODE>_</CODE>); this is +what the syntax of lambda abstracts (<CODE>\p -> t</CODE>) requires. +</P> +<A NAME="toc24"></A> +<H3>Linearization default definitions, lindef</H3> +<P> +<a name="lindefjudgements"></a> +</P> +<P> +A linearization default definition, +<center> + <CODE>lindef</CODE> <I>C</I> <CODE>=</CODE> <I>t</I> +</center> +defines the default linearization of category <I>C</I>, i.e. the function +applicable to a string to make it into an object of the linearization +type of <I>C</I>. +</P> +<P> +Linearization defaults are invoked when linearizing variable bindings +in higher-order abstract syntax. A variable symbol is then presented +as a string, which must be converted to the correct type in order for +linearization not to fail with an error. +</P> +<P> +The defaults can also be used for linearizing metavariables +in an interactive syntax editor. 
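+</P>
+<P>
+As a hedged sketch, a hand-written default for a hypothetical category <CODE>CN</CODE>
+with a number-dependent linearization type could look as follows. The three judgements
+are assumed to occur in the body of a concrete module; <CODE>CN</CODE> and
+<CODE>Number</CODE> are names invented for illustration:
+</P>
+<PRE>
+  param Number = Sg | Pl ;
+  lincat CN = {s : Number => Str} ;
+  lindef CN = \str -> {s = \\_ => str} ;   -- the same string for every Number value
+</PRE>
+<P>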
+</P> +<P> +Usually, linearization defaults are generated by using the default +rule that "uses the symbol itself for every string, and the +first value of the parameter type for every parameter". The precise +definition is by structural recursion on the type: +</P> +<UL> +<LI>default(Str,s) = s +<LI>default(P,s) = #1(P) +<LI>default(P => T,s) = <CODE>\\_ =></CODE> default(T,s) +<LI>default(<CODE>{</CODE>... ; r : R ; ...<CODE>}</CODE>,s) = <CODE>{</CODE>... ; r = default(R,s) ; ...<CODE>}</CODE> +</UL> + +<P> +The notion of the first value of a parameter type (#1(P)) is defined +<a href="#paramvalues">here</a>. +</P> +<A NAME="toc25"></A> +<H3>Printname definitions, printname cat and printname fun</H3> +<P> +A category printname definition, +<center> + <CODE>printname cat</CODE> <I>C</I> <CODE>=</CODE> <I>s</I> +</center> +defines the printname of category <I>C</I>, i.e. the name used +in some abstract syntax information shown to the user. +</P> +<P> +Likewise, a function printname definition, +<center> + <CODE>printname fun</CODE> <I>f</I> <CODE>=</CODE> <I>s</I> +</center> +defines the printname of function <I>f</I>, i.e. the name used +in some abstract syntax information shown to the user. +</P> +<P> +The most common use of printnames is in the interactive syntax +editor, where printnames are displayed in menus. It is possible +e.g. to adapt them to each language, or to embed HTML tooltips +in them (as is used in some HTML-based editor GUIs). +</P> +<P> +Usually, printnames are generated automatically from the symbol +and/or concrete syntax information. +</P> +<A NAME="toc26"></A> +<H3>Parameter type definitions, param</H3> +<P> +<a name="paramjudgements"></a> +</P> +<P> +A parameter type definition, +<center> + <CODE>param</CODE> <I>P</I> <CODE>=</CODE> <i>C</i><sub>1</sub> <i>G</i><sub>1</sub> <CODE>|</CODE> ... 
<CODE>|</CODE> <i>C</i><sub>n</sub> <i>G</i><sub>n</sub> +</center> +defines a parameter type <I>P</I> with the <B>parameter constructors</B> +<i>C</i><sub>1</sub>...<i>C</i><sub>n</sub>, with their respective contexts <i>G</i><sub>1</sub>...<i>G</i><sub>n</sub>. +</P> +<P> +<a name="paramtypes"></a> +</P> +<P> +Contexts have the same syntax as in <CODE>cat</CODE> judgements, explained +<a href="#catjudgements">here</a>. Since dependent types are not available in +parameter type definitions, the use of variables is never +necessary. The types in the context must themselves be <B>parameter types</B>, +which are defined as follows: +</P> +<UL> +<LI>Given the judgement <CODE>param</CODE> <I>P</I> ..., <I>P</I> is a parameter type. +<LI>A record type of parameter types is a parameter type. +<LI><CODE>Ints</CODE> <I>n</I> (an initial segment of integers) is a parameter type. +</UL> + +<P> +The names defined by a parameter type definition include both the +type name <I>P</I> and the constructor names <i>C</i><sub>i</sub>. Therefore all these +names must be distinct in a module. +</P> +<P> +A parameter type may not be recursive, i.e. <I>P</I> itself may not occur in +the contexts of its constructors. This restriction extends to mutual +recursion: we say that <I>P</I> <B>depends</B> on the types that occur +in the contexts of its constructors and on all types that those types +depend on, and state that <I>P</I> may not depend on itself. +</P> +<P> +In an <CODE>interface module</CODE>, it is possible to declare a parameter type +without defining it, +<center> + <CODE>param</CODE> <I>P</I> <CODE>;</CODE> +</center> +</P> +<A NAME="toc27"></A> +<H3>Parameter values</H3> +<P> +<a name="paramvalues"></a> +</P> +<P> +All parameter types are finite, and the GF compiler will internally +compute them to <B>lists of parameter values</B>. These lists are formed by +traversing the <CODE>param</CODE> definitions, usually respecting the +order of constructors in the source code. 
For records, lexicographical +sorting is applied. However, both the order of traversal of <CODE>param</CODE> +definitions and the order of fields in a record are specified +in a compiler-internal way, which means that the programmer should not +rely on any particular order. +</P> +<P> +The order of the list of parameter values can affect the program in two +cases: +</P> +<UL> +<LI>in the default <CODE>lindef</CODE> definition (<a href="#lindefjudgements">here</a>), + the first value is chosen +<LI>in course-of-values tables (<a href="#tables">here</a>), the compiler-internal order is + followed +</UL> + +<P> +The first usage implies that, if <CODE>lindef</CODE> definitions are essential for +the application, they should be given manually. The second usage implies that +course-of-values tables should be avoided in hand-written GF code. +</P> +<P> +In run-time grammar generation, all parameter values are translated to +integers denoting positions in these parameter lists. +</P> +<A NAME="toc28"></A> +<H3>Operation definitions, oper</H3> +<P> +An operation definition, +<center> + <CODE>oper</CODE> <I>h</I> <CODE>:</CODE> <I>T</I> <CODE>=</CODE> <I>t</I> +</center> +defines an <B>operation</B> <I>h</I> of type <I>T</I>, with the computation rule +<center> + <I>h</I> ==> <I>t</I> +</center> +The type <I>T</I> can be any concrete syntax type, including function +types of any order. The term <I>t</I> must have the type <I>T</I>, as +defined <a href="#expressions">here</a>. 
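+</P>
+<P>
+For illustration, here is a minimal sketch of a resource module with two operation
+definitions, one defining a type and one defining a function. The module name
+<CODE>ResDemo</CODE> and the simplified plural rule are assumptions made for the example:
+</P>
+<PRE>
+  resource ResDemo = {
+    param Number = Sg | Pl ;
+    oper Noun : Type = {s : Number => Str} ;     -- an operation defining a type
+    oper regNoun : Str -> Noun = \dog -> {       -- an operation defining a function
+      s = table {Sg => dog ; Pl => dog + "s"}
+    } ;
+  }
+</PRE>
+<P>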
+</P> +<P> +As syntactic sugar, the type can be omitted, +<center> + <CODE>oper</CODE> <I>h</I> <CODE>=</CODE> <I>t</I> +</center> +which works in two cases +</P> +<UL> +<LI>the type can be inferred from <I>t</I> (compiler-dependent) +<LI>the definition occurs in an <CODE>instance</CODE> and the type is given in + the <CODE>interface</CODE> +</UL> + +<P> +It is also possible to give the type and the definition separately: +<center> +<CODE>oper</CODE> <I>h</I> <CODE>:</CODE> <I>T</I> ; <CODE>oper</CODE> <I>h</I> <CODE>=</CODE> <I>t</I> === + <CODE>oper</CODE> <I>h</I> <CODE>:</CODE> <I>T</I> <CODE>=</CODE> <I>t</I> +</center> +The order of the type part and the definition part is free, and there +can be other judgements in between. However, they must occur in the +same <CODE>resource</CODE> module for it to be complete (as defined <a href="#completeness">here</a>). +In an <CODE>interface</CODE> module, it is enough to give the type. +</P> +<P> +When only the definition is given, it is possible to use a shorthand +similar to <CODE>lin</CODE> judgements: +<center> +<CODE>oper</CODE> <I>h</I> <I>p</I> <CODE>=</CODE> <I>t</I> === <CODE>oper</CODE> <I>h</I> <CODE>=</CODE> <CODE>\</CODE><I>p</I> <CODE>-></CODE> <I>t</I> +</center> +The pattern <I>p</I> is either a variable or a wildcard (<CODE>_</CODE>). +</P> +<P> +Operation definitions may not be recursive, not even mutually recursive. +This condition ensures that functions can in the end be eliminated from +concrete syntax code (as explained <a href="#functionelimination">here</a>). +</P> +<A NAME="toc29"></A> +<H3>Operation overloading</H3> +<P> +<a name="overloading"></a> +</P> +<P> +One and the same operation name <I>h</I> can be used for different operations, +which have to have different types. For each call of <I>h</I>, the type checker +selects one of these operations depending on what type is expected in the +context of the call. 
The syntax of overloaded operation definitions is +<center> +<CODE>oper</CODE> <I>h</I> + <CODE>= overload {</CODE><I>h</I> : <i>T</i><sub>1</sub> = <i>t</i><sub>1</sub> ; ... ; <I>h</I> : <i>T</i><sub>n</sub> = <i>t</i><sub>n</sub><CODE>}</CODE> +</center> +Notice that <I>h</I> must be the same in all cases. +This format can be used to give the complete implementation; to give just +the types, e.g. in an interface, one can use the form +<center> +<CODE>oper</CODE> <I>h</I> + <CODE>: overload {</CODE><I>h</I> : <i>T</i><sub>1</sub> ; ... ; <I>h</I> : <i>T</i><sub>n</sub><CODE>}</CODE> +</center> +The implementation of this operation typing is given by a judgement of +the first form. The order of branches need not be the same. +</P> +<A NAME="toc30"></A> +<H3>Flag definitions, flags</H3> +<P> +A flag definition, +<center> + <CODE>flags</CODE> <I>o</I> <CODE>=</CODE> <I>v</I> +</center> +sets the value of the flag <I>o</I>, to be used when compiling or using +the module. +</P> +<P> +The flag <I>o</I> is an identifier, and the value <I>v</I> is either an identifier +or a quoted string. +</P> +<P> +Flags are a kind of metadata, which do not strictly belong to the GF +language. For instance, compilers do not necessarily check the +consistency of flags, or the meaningfulness of their values. +The inheritance of flags is not well-defined; the only certain rule +is that flags set in the module body override the settings from +inherited modules. +</P> +<P> +Here are some flags commonly included in grammars. 
+</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>flag</TH> +<TH>value</TH> +<TH>description</TH> +<TH COLSPAN="2">module</TH> +</TR> +<TR> +<TD><CODE>coding</CODE></TD> +<TD>character encoding</TD> +<TD>encoding used in string literals</TD> +<TD>concrete</TD> +</TR> +<TR> +<TD><CODE>lexer</CODE></TD> +<TD>predefined lexer</TD> +<TD>lexer before parsing</TD> +<TD>concrete</TD> +</TR> +<TR> +<TD><CODE>startcat</CODE></TD> +<TD>category</TD> +<TD>default target of parsing</TD> +<TD>abstract</TD> +</TR> +<TR> +<TD><CODE>unlexer</CODE></TD> +<TD>predefined unlexer</TD> +<TD>unlexer after linearization</TD> +<TD>concrete</TD> +</TR> +</TABLE> + +<P></P> +<P> +The possible values of these flags are specified <a href="#flagvalues">here</a>. +</P> +<A NAME="toc31"></A> +<H2>Types and expressions</H2> +<A NAME="toc32"></A> +<H3>Overview of expression forms</H3> +<P> +<a name="expressions"></a> +</P> +<P> +Like many dependently typed languages, GF makes no syntactic distinction +between expressions and types. An illegal use of a type as an expression or +vice versa comes out as a type error. Whether a variable, for instance, +stands for a type or an expression value, can only be resolved from its +context of use. +</P> +<P> +One practical consequence of the common syntax is that global and local definitions +(<CODE>oper</CODE> judgements and <CODE>let</CODE> expressions, respectively) work in the same way +for types and expressions. Thus it is possible to abbreviate a type +occurring in a type expression: +</P> +<PRE> + let A = {s : Str ; b : Bool} in A -> A -> A +</PRE> +<P> +Type and other expressions have a system of <B>precedences</B>. The following table +summarizes all expression forms, from the highest to the lowest precedence. +Some expressions are moreover left- or right-associative. 
+</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>prec</TH> +<TH>expression example</TH> +<TH COLSPAN="2">explanation</TH> +</TR> +<TR> +<TD>7</TD> +<TD><CODE>c</CODE></TD> +<TD>constant or variable</TD> +</TR> +<TR> +<TD>7</TD> +<TD><CODE>Type</CODE></TD> +<TD>the type of types</TD> +</TR> +<TR> +<TD>7</TD> +<TD><CODE>PType</CODE></TD> +<TD>the type of parameter types</TD> +</TR> +<TR> +<TD>7</TD> +<TD><CODE>Str</CODE></TD> +<TD>the type of strings/token lists</TD> +</TR> +<TR> +<TD>7</TD> +<TD><CODE>"foo"</CODE></TD> +<TD>string literal</TD> +</TR> +<TR> +<TD>7</TD> +<TD><CODE>123</CODE></TD> +<TD>integer literal</TD> +</TR> +<TR> +<TD>7</TD> +<TD><CODE>0.123</CODE></TD> +<TD>floating point literal</TD> +</TR> +<TR> +<TD>7</TD> +<TD><CODE>?</CODE></TD> +<TD>metavariable</TD> +</TR> +<TR> +<TD>7</TD> +<TD><CODE>[]</CODE></TD> +<TD>empty token list</TD> +</TR> +<TR> +<TD>7</TD> +<TD><CODE>[C a b]</CODE></TD> +<TD>list category</TD> +</TR> +<TR> +<TD>7</TD> +<TD><CODE>["foo bar"]</CODE></TD> +<TD>token list</TD> +</TR> +<TR> +<TD>7</TD> +<TD><CODE>{s : Str ; n : Num}</CODE></TD> +<TD>record type</TD> +</TR> +<TR> +<TD>7</TD> +<TD><CODE>{s = "foo" ; n = Sg}</CODE></TD> +<TD>record</TD> +</TR> +<TR> +<TD>7</TD> +<TD><CODE><Sg,Fem,Gen></CODE></TD> +<TD>tuple</TD> +</TR> +<TR> +<TD>7</TD> +<TD><CODE><n : Num></CODE></TD> +<TD>type-annotated expression</TD> +</TR> +<TR> +<TD>6 left</TD> +<TD><CODE>t.r</CODE></TD> +<TD>projection or qualification</TD> +</TR> +<TR> +<TD>5 left</TD> +<TD><CODE>f a</CODE></TD> +<TD>function application</TD> +</TR> +<TR> +<TD>5</TD> +<TD><CODE>table {Sg => [] ; _ => "xs"}</CODE></TD> +<TD>table</TD> +</TR> +<TR> +<TD>5</TD> +<TD><CODE>table P [a ; b ; c]</CODE></TD> +<TD>course-of-values table</TD> +</TR> +<TR> +<TD>5</TD> +<TD><CODE>case n of {Sg => [] ; _ => "xs"}</CODE></TD> +<TD>case expression</TD> +</TR> +<TR> +<TD>5</TD> +<TD><CODE>variants {"color" ; "colour"}</CODE></TD> +<TD>free variation</TD> +</TR> +<TR> 
+<TD>5</TD> +<TD><CODE>pre {"a" ; "an"/vowel}</CODE></TD> +<TD>prefix-dependent choice</TD> +</TR> +<TR> +<TD>4 left</TD> +<TD><CODE>t ! v</CODE></TD> +<TD>table selection</TD> +</TR> +<TR> +<TD>4 left</TD> +<TD><CODE>A * B</CODE></TD> +<TD>tuple type</TD> +</TR> +<TR> +<TD>4 left</TD> +<TD><CODE>R ** {b : Bool}</CODE></TD> +<TD>record (type) extension</TD> +</TR> +<TR> +<TD>3 left</TD> +<TD><CODE>t + s</CODE></TD> +<TD>token gluing</TD> +</TR> +<TR> +<TD>2 left</TD> +<TD><CODE>t ++ s</CODE></TD> +<TD>token list concatenation</TD> +</TR> +<TR> +<TD>1 right</TD> +<TD><CODE>\x,y -> t</CODE></TD> +<TD>function abstraction ("lambda")</TD> +</TR> +<TR> +<TD>1 right</TD> +<TD><CODE>\\x,y => t</CODE></TD> +<TD>table abstraction</TD> +</TR> +<TR> +<TD>1 right</TD> +<TD><CODE>(x : A) -> B</CODE></TD> +<TD>dependent function type</TD> +</TR> +<TR> +<TD>1 right</TD> +<TD><CODE>A -> B</CODE></TD> +<TD>function type</TD> +</TR> +<TR> +<TD>1 right</TD> +<TD><CODE>P => T</CODE></TD> +<TD>table type</TD> +</TR> +<TR> +<TD>1 right</TD> +<TD><CODE>let x = v in t</CODE></TD> +<TD>local definition</TD> +</TR> +<TR> +<TD>1</TD> +<TD><CODE>t where {x = v}</CODE></TD> +<TD>local definition</TD> +</TR> +<TR> +<TD>1</TD> +<TD><CODE>in M.C "foo"</CODE></TD> +<TD>rule by example</TD> +</TR> +</TABLE> + +<P></P> +<P> +Any expression in parentheses (<CODE>(</CODE><I>exp</I><CODE>)</CODE>) is in the highest +precedence class. +</P> +<A NAME="toc33"></A> +<H3>The functional fragment: expressions in abstract syntax</H3> +<P> +<a name="functiontype"></a> +</P> +<P> +The expression syntax is the same in abstract and concrete syntax, although +only a part of the syntax is actually usable in well-typed expressions in +abstract syntax. An abstract syntax is essentially used for defining a set +of types and a set of functions between those types. Therefore it needs +essentially the <B>functional fragment</B> +of the syntax. 
This fragment comprises two kinds of types: +</P> +<UL> +<LI><B>basic types</B>, of form <I>C a1...an</I> where + <UL> + <LI><CODE>cat</CODE> <I>C</I> (<i>x</i><sub>1</sub> : <i>A</i><sub>1</sub>)...(<i>x</i><sub>n</sub> : <i>A</i><sub>n</sub>), including the predefined + categories <CODE>Int</CODE>, <CODE>Float</CODE>, and <CODE>String</CODE> explained <a href="#predefabs">here</a> + <LI><i>a</i><sub>1</sub> : <i>A</i><sub>1</sub>,...,<i>a</i><sub>n</sub> : <i>A</i><sub>n</sub>{<i>x</i><sub>1</sub> = <i>a</i><sub>1</sub>,...,<i>x</i><sub>n-1</sub>=<i>a</i><sub>n-1</sub>} + </UL> +</UL> + +<UL> +<LI><B>function types</B>, of form (<I>x</I> : <I>A</I>) <CODE>-></CODE> <I>B</I>, where + <UL> + <LI><I>A</I> is a type + <LI><I>B</I> is a type possibly depending on <I>x</I> : <I>A</I> + </UL> +</UL> + +<P> +When defining basic types, we used the notation +<I>t</I>{<i>x</i><sub>1</sub> = <i>t</i><sub>1</sub>,...,<i>x</i><sub>n</sub>=<i>t</i><sub>n</sub>} +for the <B>substitution</B> of values to variables. This is a metalevel notation, +which denotes a term that is formed by replacing the free occurrences of +each variable <i>x</i><sub>i</sub> by <i>t</i><sub>i</sub>. 
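+</P>
+<P>
+As a small illustration of basic types with arguments and of the substitution
+notation, consider the following sketch. The categories and functions in it
+(<CODE>Nat</CODE>, <CODE>Vec</CODE>, <CODE>Zero</CODE>, <CODE>Succ</CODE>, <CODE>Nil</CODE>, <CODE>Cons</CODE>) are hypothetical
+names chosen for this example; they are not predefined in GF.
+</P>
+<PRE>
+  cat
+    Nat ;             -- a basic type without arguments
+    Vec (n : Nat) ;   -- basic types of the form Vec a, where a : Nat
+  fun
+    Zero : Nat ;
+    Succ : Nat -> Nat ;
+    Nil  : Vec Zero ;
+    Cons : (n : Nat) -> Nat -> Vec n -> Vec (Succ n) ;
+</PRE>
+<P>
+Under these assumptions, the value type of <CODE>Cons Zero Zero Nil</CODE> is
+<CODE>Vec (Succ n)</CODE>{<I>n</I> = <CODE>Zero</CODE>}, that is, <CODE>Vec (Succ Zero)</CODE>.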
+</P> +<P> +These types have six kinds of expressions: +</P> +<UL> +<LI><B>constants</B>, <I>f</I> : <I>A</I> where + <UL> + <LI><CODE>fun</CODE> <I>f</I> : <I>A</I> + </UL> +</UL> + +<UL> +<LI><B>literals</B> for integers, floats, and strings (defined in <a href="#predefabs">here</a>) +</UL> + +<UL> +<LI><B>variables</B>, <I>x</I> : <I>A</I> where + <UL> + <LI><I>x</I> has been introduced by a binding + </UL> +</UL> + +<UL> +<LI><B>applications</B>, <I>f a</I> : <I>B</I>{<I>x</I>=<I>a</I>}, where + <UL> + <LI><I>f</I> : (<I>x</I> : <I>A</I>) <CODE>-></CODE> <I>B</I> + <LI><I>a</I> : <I>A</I> + </UL> +</UL> + +<UL> +<LI><B>abstractions</B>, <CODE>\</CODE><I>x</I> <CODE>-></CODE> <I>b</I> : (<I>x</I> : <I>A</I>) <CODE>-></CODE> <I>B</I>, where + <UL> + <LI><I>b</I> : <I>B</I> possibly depending on <I>x</I> : <I>A</I> + </UL> +</UL> + +<UL> +<LI><B>metavariables</B>, <CODE>?</CODE>, as introduced in intermediate phases of + incremental type checking; metavariables are not permitted + in GF source code +</UL> + +<P> +<a name="variablebinding"></a> +</P> +<P> +The notion of <B>binding</B> is defined for occurrences of variables in +subexpressions as follows: +</P> +<UL> +<LI>in (<I>x</I> : <I>A</I>) <CODE>-></CODE> <I>B</I>, <I>x</I> is bound in <I>B</I> +<LI>in <CODE>\</CODE><I>x</I> <CODE>-></CODE> <I>b</I>, <I>x</I> is bound in <I>b</I> +<LI>in <CODE>def</CODE> <I>f</I> <i>p</i><sub>1</sub> ... 
<i>p</i><sub>n</sub> = <I>t</I>, any pattern variable introduced in + any <i>p</i><sub>i</sub> is bound in <I>t</I> (as defined <a href="#patternmatching">here</a>) +</UL> + +<P> +As syntactic sugar, function types have sharing of types and +suppression of variables, in the same way as contexts +(defined <a href="#contexts">here</a>): +</P> +<UL> +<LI>variables can share a type, +<center> +<CODE>(</CODE> <I>x,y</I> <CODE>:</CODE> <I>A</I> <CODE>)</CODE> <CODE>-></CODE> <I>B</I> === + <CODE>(</CODE> <I>x</I> <CODE>:</CODE> <I>A</I> <CODE>) -> (</CODE> <I>y</I> <CODE>:</CODE> <I>A</I> <CODE>) -></CODE> <I>B</I> +</center> +<LI>a <B>wildcard</B> can be used for a variable not occurring later in the type, +<center> +<CODE>(</CODE> <CODE>_</CODE> <CODE>:</CODE> <I>A</I> <CODE>) -></CODE> <I>B</I> === + <CODE>(</CODE> <I>x</I> <CODE>:</CODE> <I>A</I> <CODE>) -></CODE> <I>B</I> +</center> +<LI>if the variable does not occur later, it can be omitted altogether, and + parentheses are not used, +<center> + <I>A</I> <CODE>-></CODE> <I>B</I> === <CODE>(</CODE> <I>_</I> <CODE>:</CODE> <I>A</I> <CODE>) -></CODE> <I>B</I> +</center> +</UL> + +<P> +There is analogous syntactic sugar for constant functions, +<center> +<CODE>\</CODE><I>_</I> <CODE>-></CODE> <I>t</I> === <CODE>\</CODE><I>x</I> <CODE>-></CODE> <I>t</I> +</center> +where <I>x</I> does not occur in <I>t</I>, and for multiple lambda abstractions: +<center> +<CODE>\</CODE><I>p,q</I> <CODE>-></CODE> <I>t</I> === <CODE>\</CODE><I>p</I> <CODE>-></CODE> <CODE>\</CODE><I>q</I> <CODE>-></CODE> <I>t</I> +</center> +where <I>p</I> and <I>q</I> are variables or wildcards (<CODE>_</CODE>). 
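+</P>
+<P>
+The sugar described above can be sketched as follows. All names are hypothetical,
+a category <CODE>Nat</CODE> is assumed to be declared, and the <CODE>fun</CODE> and <CODE>oper</CODE>
+judgements would belong to an abstract and a concrete (or resource) module, respectively.
+</P>
+<PRE>
+  fun
+    Plus1 : (x,y : Nat) -> Nat ;             -- variables sharing a type
+    Plus2 : (x : Nat) -> (y : Nat) -> Nat ;  -- the same type, unfolded
+    Plus3 : (_ : Nat) -> (_ : Nat) -> Nat ;  -- wildcards: neither variable is used later
+    Plus4 : Nat -> Nat -> Nat ;              -- variables omitted altogether
+
+  oper
+    first  : Str -> Str -> Str = \s,_ -> s ;      -- multiple abstraction with a wildcard
+    first' : Str -> Str -> Str = \s -> \_ -> s ;  -- the same function, unfolded
+</PRE>
+<P>
+The four <CODE>fun</CODE> judgements assign one and the same type to their constants,
+and the two <CODE>oper</CODE> definitions denote the same function.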
+</P> +<A NAME="toc34"></A> +<H3>Conversions</H3> +<P> +<a name="conversions"></a> +</P> +<P> +Among expressions, there is a relation of <B>definitional equality</B> defined +by four <B>conversion rules</B>: +</P> +<UL> +<LI><B>alpha conversion</B>: + <CODE>\</CODE><I>x</I> <CODE>-></CODE> <I>b</I> = <CODE>\</CODE><I>y</I> <CODE>-></CODE> <I>b</I>{<I>x</I>=<I>y</I>} +</UL> + +<UL> +<LI><B>beta conversion</B>: (<CODE>\</CODE><I>x</I> <CODE>-></CODE> <I>b</I>) <I>a</I> = <I>b</I>{<I>x</I>=<I>a</I>} +</UL> + +<UL> +<LI><B>delta conversion</B>: <I>f</I> <i>a</i><sub>1</sub> ... <i>a</i><sub>n</sub> = <I>tg</I>, if + <UL> + <LI>there is a definition <CODE>def</CODE> <I>f</I> <i>p</i><sub>1</sub> ... <i>p</i><sub>n</sub> = <I>t</I> + <LI>this definition is the first for <I>f</I> that matches the sequence + <i>a</i><sub>1</sub> ... <i>a</i><sub>n</sub>, with the substitution <I>g</I> + </UL> +</UL> + +<UL> +<LI><B>eta conversion</B>: <I>c</I> = <CODE>\</CODE><I>x</I> <CODE>-></CODE> <I>c x</I>, + if <I>c</I> : (<I>x</I> : <I>A</I>) <CODE>-></CODE> <I>B</I> +</UL> + +<P> +Pattern matching substitution used in delta conversion +is defined <a href="#patternmatching">here</a>. +</P> +<P> +An expression is in <B>beta-eta-normal form</B> if +</P> +<UL> +<LI>it has no subexpressions to which beta conversion applies (beta normality) +<LI>each constant or variable whose type is a function type is + <B>eta-expanded</B>, i.e. made into an abstraction equal to it by eta conversion + (eta normality) +</UL> + +<P> +Notice that the iteration of eta expansion would lead to an expression not +in beta-normal form. +</P> +<A NAME="toc35"></A> +<H3>Syntax trees</H3> +<P> +<a name="syntaxtrees"></a> +</P> +<P> +The <B>syntax trees</B> defined by an abstract syntax are well-typed +expressions of basic types in beta-eta-normal form. +Linearization defined in concrete +syntax applies to all and only these expressions. 
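+</P>
+<P>
+To make delta conversion and its relation to syntax trees more concrete, here is
+a minimal sketch with hypothetical names. It assumes that <CODE>Zero</CODE> and <CODE>Succ</CODE>
+are accepted as constructors in <CODE>def</CODE> patterns; depending on the GF version,
+this may require declaring them with a <CODE>data</CODE> judgement.
+</P>
+<PRE>
+  cat Nat ;
+  fun
+    Zero : Nat ;
+    Succ : Nat -> Nat ;
+    plus : Nat -> Nat -> Nat ;
+  def
+    plus x Zero     = x ;                  -- tried first
+    plus x (Succ y) = Succ (plus x y) ;    -- tried if the first equation does not match
+</PRE>
+<P>
+With these definitions, <CODE>plus (Succ Zero) (Succ Zero)</CODE> is definitionally equal to
+<CODE>Succ (plus (Succ Zero) Zero)</CODE> by the second equation, and further to
+<CODE>Succ (Succ Zero)</CODE> by the first one. Both <CODE>plus (Succ Zero) (Succ Zero)</CODE> and
+<CODE>Succ (Succ Zero)</CODE> are well-typed expressions of the basic type <CODE>Nat</CODE>.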
+</P> +<P> +There is also a direct definition of syntax trees, which does not +refer to beta and eta conversions: keeping in mind that a type always has +the form +<center> +(<i>x</i><sub>1</sub> : <i>A</i><sub>1</sub>) <CODE>-></CODE> ... <CODE>-></CODE> (<i>x</i><sub>n</sub> : <i>A</i><sub>n</sub>) <CODE>-></CODE> <I>B</I> +</center> +where <I>Ai</I> are types and <I>B</I> is a basic type, a syntax tree is an expression +<center> +<I>b</I> <i>t</i><sub>1</sub> ... <i>t</i><sub>n</sub> : <I>B'</I> +</center> +where +</P> +<UL> +<LI><I>B'</I> is the basic type <I>B</I>{<i>x</i><sub>1</sub> = <i>t</i><sub>1</sub>,...,<i>x</i><sub>n</sub> = <i>t</i><sub>n</sub>} +<LI><CODE>fun</CODE> <I>b</I> : (<i>x</i><sub>1</sub> : <i>A</i><sub>1</sub>) <CODE>-></CODE> ... <CODE>-></CODE> (<i>x</i><sub>n</sub> : <i>A</i><sub>n</sub>) <CODE>-></CODE> <I>B</I> +<LI>each <i>t</i><sub>i</sub> has the form <CODE>\</CODE><i>z</i><sub>1</sub>,...,<i>z</i><sub>m</sub> <CODE>-></CODE> <I>c</I> where <i>A</i><sub>i</sub> is +<center> +(<i>y</i><sub>1</sub> : <i>B</i><sub>1</sub>) <CODE>-></CODE> ... <CODE>-></CODE> (<i>y</i><sub>m</sub> : <i>B</i><sub>m</sub>) <CODE>-></CODE> <I>B</I> +</center> +</UL> + +<A NAME="toc36"></A> +<H3>Predefined types in abstract syntax</H3> +<P> +<a name="predefabs"></a> +</P> +<P> +GF provides three predefined categories for abstract syntax, with predefined +expressions: +</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>category</TH> +<TH COLSPAN="2">expressions</TH> +</TR> +<TR> +<TD ALIGN="center"><CODE>Int</CODE></TD> +<TD>integer literals, e.g. <CODE>123</CODE></TD> +</TR> +<TR> +<TD ALIGN="center"><CODE>Float</CODE></TD> +<TD>floating point literals, e.g. <CODE>12.34</CODE></TD> +</TR> +<TR> +<TD ALIGN="center"><CODE>String</CODE></TD> +<TD>string literals, e.g. 
<CODE>"foo"</CODE></TD> +</TR> +</TABLE> + +<P></P> +<P> +These categories take no arguments, and they can be used as basic +types in the same way as if they were introduced in <CODE>cat</CODE> judgements. +However, it is not legal to define <CODE>fun</CODE> functions that have any +of these types as value type: their only well-typed expressions are +literals as defined in the above table. +</P> +<A NAME="toc37"></A> +<H3>Overview of expressions in concrete syntax</H3> +<P> +<a name="cnctypes"></a> +</P> +<P> +Concrete syntax is about defining mappings from abstract syntax trees +to <B>concrete syntax objects</B>. These objects comprise +</P> +<UL> +<LI>records +<LI>tables +<LI>strings +<LI>parameter values +</UL> + +<P> +Thus functions are not concrete syntax objects; however, the +mappings themselves are expressed as functions, and the source code +of a concrete syntax can use functions under the condition that +they can be eliminated from the final compiled grammar (which they +can; this is one of the fundamental properties of compilation, as +explained in more detail in the <I>JFP</I> article). +</P> +<P> +Concrete syntax thus has the same function types and expression forms as +abstract syntax, specified <a href="#functiontype">here</a>. The basic types defined +by categories (<CODE>cat</CODE> judgements) are available via grammar reuse +explained <a href="#reuse">here</a>; this also comprises the +predefined categories <CODE>Float</CODE> and <CODE>String</CODE>. +</P> +<A NAME="toc38"></A> +<H3>Values, canonical forms, and run-time variables</H3> +<P> +In abstract syntax, the conversion rules given <a href="#conversions">here</a> +define a computational relation +among expressions, but there is no separate notion of a <B>value</B> of +computation: the value (the end point) of a computation chain is +simply an expression to which no more conversions apply. 
In general, +we are interested in expressions that satisfy the conditions of being +syntax trees (as defined <a href="#syntaxtrees">here</a>), but there can be many computationally +equivalent syntax trees which nonetheless are distinct syntax trees +and hence have different linearizations. The main use of computation +in abstract syntax is to compare types in dependent type checking. +</P> +<P> +In concrete syntax, the notion of values is central. At run time, +we want to compute the values of linearizations; at compile time, we want +to perform <B>partial evaluation</B>, which computes expressions as far as +possible. +To specify what happens +in computation we therefore have to distinguish between <B>canonical forms</B> +and other forms of expressions. The canonical forms are defined separately +for each form of type, whereas the other forms may usually produce expressions +of any type. +</P> +<P> +<a name="linexpansion"></a> +<a name="runtimevariables"></a> +</P> +<P> +What is done at compile time is the elimination of any noncanonical forms, +except for those depending on <B>run-time variables</B>. Run-time variables are +the same as the <B>argument variables</B> of linearization rules, i.e. the +variables <i>x</i><sub>1</sub>,...,<i>x</i><sub>n</sub> in +<center> +<CODE>lin</CODE> <I>f</I> <CODE>= \</CODE> <i>x</i><sub>1</sub>,...,<i>x</i><sub>n</sub> <CODE>-></CODE> <I>t</I> +</center> +where +<center> +<CODE>fun</CODE> <I>f</I> <CODE>:</CODE> +(<i>x</i><sub>1</sub> : <i>A</i><sub>1</sub>) <CODE>-></CODE> ... <CODE>-></CODE> (<i>x</i><sub>n</sub> : <i>A</i><sub>n</sub>) <CODE>-></CODE> <I>B</I> +</center> +Notice that this definition refers to the <B>eta-expanded</B> linearization term, +which has one abstracted variable for each argument type of <I>f</I>. These variables +are not necessarily explicit in GF source code, but introduced by the compiler. 
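+</P>
+<P>
+The relation between a linearization rule as written in source code and its
+eta-expanded form can be sketched as follows. The categories, the function
+<CODE>Pred</CODE>, and the linearization types are hypothetical choices made for this example.
+</P>
+<PRE>
+  -- abstract syntax
+  cat S ; NP ; VP ;
+  fun Pred : NP -> VP -> S ;
+
+  -- concrete syntax
+  lincat S = {s : Str} ; NP = {s : Str} ; VP = {s : Str} ;
+  lin Pred np vp = {s = np.s ++ vp.s} ;
+
+  -- the same rule with the argument variables abstracted explicitly:
+  -- lin Pred = \np,vp -> {s = np.s ++ vp.s} ;
+</PRE>
+<P>
+Here <CODE>np</CODE> and <CODE>vp</CODE> are the run-time variables of the rule. The projections
+<CODE>np.s</CODE> and <CODE>vp.s</CODE> depend on them, so the concatenation cannot be computed
+further at compile time.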
+</P> +<P> +Since certain expression forms should be eliminated in compilation but +cannot be eliminated if run-time variables appear in them, errors can +appear late in compilation. This is an issue with the following +expression forms: +</P> +<UL> +<LI>gluing (<CODE>s + t</CODE>), defined <a href="#gluing">here</a> +<LI>pattern matching on strings, defined <a href="#patternmatching">here</a> +<LI>predefined string operations, defined <a href="#predefcnc">here</a> (those taking + <CODE>Str</CODE> arguments) +</UL> + +<A NAME="toc39"></A> +<H3>Token lists, tokens, and strings</H3> +<P> +<a name="strtype"></a> +</P> +<P> +The most prominent basic type is <CODE>Str</CODE>, the type of <B>token lists</B>. +This type is often sloppily referred to as the type of <B>strings</B>; +but it should be kept in mind that the objects of <CODE>Str</CODE> are +<I>lists</I> of strings rather than single strings. +</P> +<P> +Expressions of type <CODE>Str</CODE> have the following canonical forms: +</P> +<UL> +<LI><B>tokens</B>, i.e. <B>string literals</B>, in double quotes, e.g. <CODE>"foo"</CODE> +<LI><B>the empty token list</B>, <CODE>[]</CODE> +<LI><B>concatenation</B>, <I>s</I> <CODE>++</CODE> <I>t</I>, where <I>s,t</I> : <CODE>Str</CODE> +<LI><B>prefix-dependent choice</B>, + <CODE>pre {</CODE> <I>s</I> ; <i>s</i><sub>1</sub> <CODE>/</CODE> <i>p</i><sub>1</sub> ; ... ; <i>s</i><sub>n</sub> <CODE>/</CODE> <i>p</i><sub>n</sub>}, where + <UL> + <LI><I>s</I>, <i>s</i><sub>1</sub>,...,<i>s</i><sub>n</sub>, <i>p</i><sub>1</sub>,...,<i>p</i><sub>n</sub> : <CODE>Str</CODE> + </UL> +</UL> + +<P> +For convenience, the notation is overloaded so that tokens are identified +with singleton token lists, and there is no separate type of tokens +(this is a change from the <I>JFP</I> article). +The notion of a token +is still important for compilation: all tokens introduced by +the grammar must be known at compile time. 
This, in turn, is +required by the parsing algorithms used for parsing with GF grammars. +</P> +<P> +In addition to string literals, tokens can be formed by a specific +non-canonical operator: +</P> +<UL> +<LI><B>gluing</B>, <I>s</I> <CODE>+</CODE> <I>t</I>, where <I>s,t</I> : <CODE>Str</CODE> +</UL> + +<P> +<a name="gluing"></a> +</P> +<P> +Being noncanonical, gluing is equipped with a computation rule: +string literals are glued by forming a new string literal, and +empty token lists can be ignored: +</P> +<UL> +<LI><CODE>"foo" + "bar"</CODE> ==> <CODE>"foobar"</CODE> +<LI><I>t</I> <CODE>+ []</CODE> ==> <I>t</I> +<LI><CODE>[] +</CODE> <I>t</I> ==> <I>t</I> +</UL> + +<P> +Since tokens must be known at compile time, +the operands of gluing may not depend on run-time variables, +as defined <a href="#runtimevariables">here</a>. +</P> +<P> +As syntactic sugar, token lists can be given as bracketed string literals, where +spaces separate tokens: +</P> +<UL> +<LI><B>token lists</B>, <CODE>["one two three"]</CODE> === <CODE>"one" ++ "two" ++ "three"</CODE> +</UL> + +<P> +Notice that there are no empty tokens, but the expression <CODE>[]</CODE> +can be used in a context requiring a token, in particular in gluing expression +below. Since <CODE>[]</CODE> denotes an empty token list, the following computation laws +are valid: +</P> +<UL> +<LI><I>t</I> <CODE>++ []</CODE> ==> <I>t</I> +<LI><CODE>[] ++</CODE> <I>t</I> ==> <I>t</I> +</UL> + +<P> +Moreover, concatenation and gluing are associative: +</P> +<UL> +<LI>s <CODE>+</CODE> (t <CODE>+</CODE> u) ==> s <CODE>+</CODE> t <CODE>+</CODE> u +<LI>s <CODE>++</CODE> (t <CODE>++</CODE> u) ==> s <CODE>++</CODE> t <CODE>++</CODE> u +</UL> + +<P> +For the programmer, associativity and the empty token laws mean +that the compiler can use them to simplify string expressions. +It also means that these laws are respected in pattern matching +on strings. 
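+</P>
+<P>
+The restriction on gluing can be illustrated by a small sketch. The parameter type,
+the operation <CODE>regNoun</CODE>, and the functions <CODE>Dog</CODE> and <CODE>Plur</CODE> are
+hypothetical; an abstract syntax with <CODE>cat N</CODE>, <CODE>fun Dog : N</CODE> and
+<CODE>fun Plur : N -> N</CODE> is assumed.
+</P>
+<PRE>
+  param Number = Sg | Pl ;
+  lincat N = {s : Number => Str} ;
+  oper
+    regNoun : Str -> {s : Number => Str} = \dog ->
+      {s = table {Sg => dog ; Pl => dog + "s"}} ;
+  lin
+    Dog = regNoun "dog" ;   -- accepted: "dog" + "s" computes to "dogs" at compile time
+
+    -- not accepted: an operand of + depends on the run-time variable x
+    -- Plur x = {s = \\_ => x.s ! Sg + "s"} ;
+</PRE>
+<P>
+In <CODE>regNoun</CODE>, the variable <CODE>dog</CODE> is an ordinary lambda-bound variable that
+disappears when the operation is applied to a string literal, whereas <CODE>x</CODE> in the
+commented-out rule stands for an argument that is known only at run time.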
+</P> +<P> +A prime example of the prefix-dependent choice operation is the following +approximate expression for the English indefinite article: +</P> +<PRE> + pre {"a" ; "an" / variants {"a" ; "e" ; "i" ; "o"}} +</PRE> +<P> +This expression can be computed in the context of a subsequent token: +</P> +<UL> +<LI><CODE>pre {</CODE> <I>s</I> ; <i>s</i><sub>1</sub> <CODE>/</CODE> <i>p</i><sub>1</sub> ; ... ; <i>s</i><sub>n</sub> <CODE>/</CODE> <i>p</i><sub>n</sub><CODE>} ++</CODE> <I>t</I> + ==> + <UL> + <LI><i>s</i><sub>i</sub> for the first <I>i</I> such that the prefix <i>p</i><sub>i</sub> + matches <I>t</I>, if it exists + <LI><I>s</I> otherwise + </UL> +</UL> + +<P> +The <B>matching prefix</B> is defined by comparing the string with the prefix of +the token. If the prefix is a variant list of strings, then it matches +the token if any of the strings in the list matches it. +</P> +<P> +The computation rule can sometimes be applied at compile time, but in general, +prefix-dependent choices need to be passed to the run-time grammar, because +they are not given a subsequent token to compare with, or because the +subsequent token depends on a run-time variable. +</P> +<P> +The prefix-dependent choice expression itself may not depend on run-time +variables. +</P> +<P> +<I>In GF prior to 3.0, a specific type</I> <CODE>Strs</CODE> +<I>is used for defining prefixes,</I> +<I>instead of just</I> <CODE>variants</CODE> <I>of</I> <CODE>Str</CODE>. +</P> +<A NAME="toc40"></A> +<H3>Records and record types</H3> +<P> +A <B>record</B> is a collection of objects of possibly different types, +accessible by <B>projections</B> from the record with <B>labels</B> pointing +to these objects. A record is also itself an object, whose type is +a <B>record type</B>. Record types have the form +<center> + <CODE>{</CODE> <i>r</i><sub>1</sub> : <i>A</i><sub>1</sub> <CODE>;</CODE> ... 
<CODE>;</CODE> <i>r</i><sub>n</sub> : <i>A</i><sub>n</sub> <CODE>}</CODE> +</center> +where <I>n</I> >= 0, each <i>A</i><sub>i</sub> is a type, and the labels <i>r</i><sub>i</sub> are +distinct. A record of this type has the form +<center> + <CODE>{</CODE> <i>r</i><sub>1</sub> = <i>a</i><sub>1</sub> <CODE>;</CODE> ... <CODE>;</CODE> <i>r</i><sub>n</sub> = <i>a</i><sub>n</sub> <CODE>}</CODE> +</center> +where each <i>a</i><sub>i</sub> : <i>A</i><sub>i</sub>. A limiting case is the <B>empty record type</B> +<CODE>{}</CODE>, which has the object <CODE>{}</CODE>, the <B>empty record</B>. +</P> +<P> +The <B>fields</B> of a record type are its parts of the form <I>r</I> : <I>A</I>, +also called <B>typings</B>. The <B>fields</B> of a record are of the form +<I>r</I> = <I>a</I>, also called <B>value assignments</B>. Value assignments +may optionally indicate the type, as in <I>r</I> : <I>A</I> = <I>a</I>. +</P> +<P> +The order of fields in record types and records is insignificant: two record +types (or records) are equal if they have the same fields, in any order, and a +record is an object of a record type, if it has type-correct value assignments +for all fields of the record type. +The latter definition implies the even stronger +principle of <B>record subtyping</B>: a record can have any type that has some +subset of its fields. This principle is explained further +<a href="#subtyping">here</a>. +</P> +<P> +All fields in a record must have distinct labels. Thus it is not possible +e.g. to "redefine" a field "later" in a record. +</P> +<P> +Lexically, labels are identifiers (defined <a href="#identifiers">here</a>). +The exception is +the labels selecting bound variables in the linearization of higher-order +abstract syntax, which have the form <CODE>$</CODE><I>i</I> for an integer <I>i</I>, +as specified <a href="#HOAS">here</a>. +In source code, these labels should not appear in record fields, +but only in selections. 
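+</P>
+<P>
+The following sketch shows a record type and some records of that type. The
+parameter type <CODE>Gender</CODE> and all operation names are hypothetical and serve
+only as an illustration.
+</P>
+<PRE>
+  param Gender = Masc | Fem | Neutr ;
+  oper
+    Noun  : Type = {s : Str ; g : Gender} ;           -- a record type with two fields
+    haus  : Noun = {s = "Haus" ; g = Neutr} ;
+    haus' : Noun = {g = Neutr ; s = "Haus"} ;         -- the same record: field order is insignificant
+    haus2 : Noun = {s : Str = "Haus" ; g = Neutr} ;   -- a value assignment indicating its type
+    str   : {s : Str} = haus ;                        -- record subtyping: the field g is superfluous here
+</PRE>
+<P>
+The last definition relies on the principle that a record can have any type
+that has some subset of its fields.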
+</P> +<P> +Labels occur only in syntactic positions where they cannot be confused with +constants or variables. Therefore it is safe to write, as in <CODE>Prelude</CODE>, +</P> +<PRE> + ss : Str -> {s : Str} = \s -> {s = s} ; +</PRE> +<P> +A <B>projection</B> is an expression of the form +<center> + <I>t</I>.<I>r</I> +</center> +where <I>t</I> must be a record and <I>r</I> must be a label defined in it. +The type of the projection is the type of that field. +The computation rule for projection returns the value assigned to that field: +<center> +<CODE>{</CODE> ... <CODE>;</CODE> <I>r</I> = <I>a</I> <CODE>;</CODE> ... <CODE>}.</CODE><I>r</I> ==> <I>a</I> +</center> +Notice that the dot notation <I>t</I>.<I>r</I> is also used for qualified names +as specified <a href="#qualifiednames">here</a>. +This ambiguity follows tradition and convenience. It is +resolved by the following rules (before type checking): +</P> +<OL> +<LI>if <I>t</I> is a bound variable or a constant in scope, + <I>t</I>.<I>r</I> is type-checked as a projection +<LI>otherwise, <I>t</I>.<I>r</I> is type-checked as a qualified name +</OL> + +<P> +As syntactic sugar, types and values can be shared: +</P> +<UL> +<LI><CODE>{</CODE> ... <CODE>;</CODE> <I>r,s</I> : <I>A</I> <CODE>;</CODE> ... <CODE>}</CODE> === + <CODE>{</CODE> ... <CODE>;</CODE> <I>r</I> : <I>A</I> <CODE>;</CODE> <I>s</I> : <I>A</I> <CODE>;</CODE> ... <CODE>}</CODE> +<LI><CODE>{</CODE> ... <CODE>;</CODE> <I>r,s</I> = <I>a</I> <CODE>;</CODE> ... <CODE>}</CODE> === + <CODE>{</CODE> ... <CODE>;</CODE> <I>r</I> = <I>a</I> <CODE>;</CODE> <I>s</I> = <I>a</I> <CODE>;</CODE> ... <CODE>}</CODE> +</UL> + +<P> +Another syntactic sugar are <B>tuple types</B> and <B>tuples</B>, which are translated +by endowing their unlabelled fields by the labels <CODE>p1</CODE>, <CODE>p2</CODE>,... in the +order of appearance of the fields: +</P> +<UL> +<LI><i>A</i><sub>1</sub> <CODE>*</CODE> ... 
<CODE>*</CODE> <i>A</i><sub>n</sub> === + <CODE>{</CODE> <CODE>p1</CODE> : <i>A</i><sub>1</sub> <CODE>;</CODE> ... <CODE>;</CODE> <CODE>pn</CODE> : <i>A</i><sub>n</sub> <CODE>}</CODE> +<LI><CODE><</CODE><i>a</i><sub>1</sub> <CODE>,</CODE> ... <CODE>,</CODE> <i>a</i><sub>n</sub> <CODE>></CODE> === + <CODE>{</CODE> <CODE>p1</CODE> = <i>a</i><sub>1</sub><CODE>;</CODE> ... <CODE>;</CODE> <CODE>pn</CODE> = <i>a</i><sub>n</sub> <CODE>}</CODE> +</UL> + +<P> +A <B>record extension</B> is formed by adding fields to a record or a record type. +The general syntax involves two expressions, +<center> + <I>R</I> <CODE>**</CODE> <I>S</I> +</center> +The result is a record type or a record with the union of the fields of <I>R</I> and +<I>S</I>. It is therefore well-formed if +</P> +<UL> +<LI><I>R</I> and <I>S</I> are either both records or both record types +<LI><I>R</I> and <I>S</I> have no labels in common +</UL> + +<A NAME="toc41"></A> +<H3>Subtyping</H3> +<P> +<a name="subtyping"></a> +</P> +<P> +The possibility of having superfluous fields in a record forms the basis of +the <B>subtyping</B> relation. +That <I>A</I> is a subtype of <I>B</I> means that <I>a : A</I> implies <I>a : B</I>. 
+This is clearly satisfied for records with superfluous fields: +</P> +<UL> +<LI>if <I>R</I> is a record type without the label <I>r</I>, + then <I>R</I> <CODE>** {</CODE> <I>r</I> : <I>A</I> <CODE>}</CODE> is a subtype of <I>R</I> +</UL> + +<P> +The GF grammar compiler extends subtyping to function types by <B>covariance</B> +and <B>contravariance</B>: +</P> +<UL> +<LI>covariance: if <I>A</I> is a subtype of <I>B</I>, + then <I>C</I> <CODE>-></CODE> <I>A</I> is a subtype of <I>C</I> <CODE>-></CODE> <I>B</I> +<LI>contravariance: if <I>A</I> is a subtype of <I>B</I>, + then <I>B</I> <CODE>-></CODE> <I>C</I> is a subtype of <I>A</I> <CODE>-></CODE> <I>C</I> +</UL> + +<P> +The logic of these rules is natural: if a function returns a value +in a subtype, then this value is <I>a fortiori</I> in the supertype. +If a function is defined for some type, then it is <I>a fortiori</I> defined +for any subtype. +</P> +<P> +In addition to the well-known principles of record subtyping and co- and +contravariance, GF implements subtyping for initial segments of integers: +</P> +<UL> +<LI>if <I>m</I> < <I>n</I>, then <CODE>Ints</CODE> <I>m</I> is a subtype of <CODE>Ints</CODE> <I>n</I> +<LI><CODE>Ints</CODE> <I>n</I> is a subtype of <CODE>Integer</CODE> +</UL> + +<P> +As the last rule, subtyping is transitive: +</P> +<UL> +<LI>if <I>A</I> is a subtype of <I>B</I> and <I>B</I> is a subtype of <I>C</I>, then + <I>A</I> is a subtype of <I>C</I>. +</UL> + +<A NAME="toc42"></A> +<H3>Tables and table types</H3> +<P> +<a name="tables"></a> +</P> +<P> +One of the most characteristic constructs of GF is <B>tables</B>, also called +<B>finite functions</B>. That these functions are finite means that it +is possible to finitely enumerate all argument-value pairs; this, in +turn, is possible because the argument types are finite. 
+</P> +<P> +A <B>table type</B> has the form +<center> +<I>P</I> <CODE>=></CODE> <I>T</I> +</center> +where <I>P</I> must be a parameter type in the sense defined <a href="#paramtypes">here</a>, whereas +<I>T</I> can be any type. +</P> +<P> +Canonical expressions of table types are <B>tables</B>, of the form +<center> +<CODE>table</CODE> <CODE>{</CODE> <i>V</i><sub>1</sub> <CODE>=></CODE> <i>t</i><sub>1</sub> ; ... ; <i>V</i><sub>n</sub> <CODE>=></CODE> <i>t</i><sub>n</sub> <CODE>}</CODE> +</center> +where <i>V</i><sub>1</sub>,...,<i>V</i><sub>n</sub> is the complete list of the parameter values of +the argument type <I>P</I> (defined <a href="#paramvalues">here</a>), and each <i>t</i><sub>i</sub> is +an expression of the value type <I>T</I>. +</P> +<P> +In addition to explicit enumerations, +tables can be given by <B>pattern matching</B>, +<center> +<CODE>table</CODE> <CODE>{</CODE><i>p</i><sub>1</sub> <CODE>=></CODE> <i>t</i><sub>1</sub> ; ... ; <i>p</i><sub>m</sub> <CODE>=></CODE> <i>t</i><sub>m</sub><CODE>}</CODE> +</center> +where <i>p</i><sub>1</sub>,....,<i>p</i><sub>m</sub> is a list of patterns that covers all values of type <I>P</I>. +Each pattern <i>p</i><sub>i</sub> may bind some variables, on which the expression <i>t</i><sub>i</sub> +may depend. A complete account of patterns and pattern matching is given +<a href="#patternmatching">here</a>. +</P> +<P> +A <B>course-of-values table</B> omits the patterns and just lists all +values. It uses the enumeration of all values of the argument type <I>P</I> +to pair the values with arguments: +<center> +<CODE>table</CODE> <I>P</I> <CODE>[</CODE><i>t</i><sub>1</sub> ; ... ; <i>t</i><sub>n</sub><CODE>]</CODE> +</center> +This format is not recommended for GF source code, since the +ordering of parameter values is not specified and therefore a +compiler-internal decision. 
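+</P>
+<P>
+The three ways of writing a table described above can be compared in a small
+sketch. The parameter type <CODE>Number</CODE> and the operation names are hypothetical.
+</P>
+<PRE>
+  param Number = Sg | Pl ;
+  oper
+    mouse : Number => Str = table {Sg => "mouse" ; Pl => "mice"} ;  -- explicit enumeration
+    sheep : Number => Str = table {_ => "sheep"} ;                  -- pattern matching with a wildcard
+    louse : Number => Str = table Number ["louse" ; "lice"] ;       -- course-of-values table
+</PRE>
+<P>
+The last definition gives the intended result only on the assumption that the
+compiler enumerates <CODE>Sg</CODE> before <CODE>Pl</CODE>, which is why the first two formats
+are preferable in source code.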
+</P> +<P> +The argument type can be indicated in ordinary tables as well, which is +sometimes helpful for type inference: +<center> +<CODE>table</CODE> <I>P</I> <CODE>{</CODE> ... <CODE>}</CODE> +</center> +</P> +<P> +The <B>selection</B> operator <CODE>!</CODE>, applied to a table <I>t</I> and to an expression +<I>v</I> of its argument type +<center> +<I>t</I> <CODE>!</CODE> <I>v</I> +</center> +returns the first pattern matching result from <I>t</I> with <I>v</I>, as defined +<a href="#patternmatching">here</a>. The order of patterns is thus significant as long as the +patterns contain variables or wildcards. When the compiler reorders the +patterns following the enumeration of all values of the argument type, +this order no longer matters, because no overlap remains between patterns. +</P> +<P> +The GF compiler performs <B>table expansion</B>, i.e. an analogue of +eta expansion defined <a href="#conversions">here</a>, where a table is applied to all +values to its argument type: +<center> +<I>t</I> : <I>P</I> <CODE>=></CODE> <I>T</I> ==> +<CODE>table</CODE> <I>P</I> <CODE>[</CODE><I>t</I> <CODE>!</CODE> <i>V</i><sub>1</sub> ; ... ; <I>t</I> <CODE>!</CODE> <i>V</i><sub>n</sub><CODE>]</CODE> +</center> +As syntactic sugar, one-branch tables can be written in a way similar to +lambda abstractions: +<center> +<CODE>\\</CODE><I>p</I> <CODE>=></CODE> <I>t</I> === <CODE>table {</CODE><I>p</I> <CODE>=></CODE> <I>t</I> <CODE>}</CODE> +</center> +where <I>p</I> is either a variable or a wildcard (<CODE>_</CODE>). 
Multiple bindings +can be abbreviated: +<center> +<CODE>\\</CODE><I>p,q</I> <CODE>=></CODE> <I>t</I> === <CODE>\\</CODE><I>p</I> <CODE>=></CODE> <CODE>\\</CODE><I>q</I> <CODE>=></CODE> <I>t</I> +</center> +<B>Case expressions</B> are syntactic sugar for selections: +<center> +<CODE>case</CODE> <I>e</I> <CODE>of {</CODE>...<CODE>}</CODE> === <CODE>table {</CODE>...<CODE>} !</CODE> <I>e</I> +</center> +</P> +<A NAME="toc43"></A> +<H3>Pattern matching</H3> +<P> +<a name="patternmatching"></a> +</P> +<P> +We will list all forms of patterns that can be used in table branches. +We define their <B>variable bindings</B> and <B>matching substitutions</B>. +</P> +<P> +We start with the patterns available for all parameter types, as well +as for the types <CODE>Integer</CODE> and <CODE>Str</CODE>. +</P> +<UL> +<LI>A constructor pattern <I>C</I> <i>p</i><sub>1</sub>...<i>p</i><sub>n</sub> + binds the union of all variables bound in the subpatterns + <i>p</i><sub>1</sub>,...,<i>p</i><sub>n</sub>. + It matches any value + <I>C</I> <i>V</i><sub>1</sub>...<i>V</i><sub>n</sub> where each <i>p</i><sub>i</sub> matches <i>V</i><sub>i</sub>, + and the matching substitution is the union of these substitutions. +<LI>A record pattern + <CODE>{</CODE> <i>r</i><sub>1</sub> <CODE>=</CODE> <i>p</i><sub>1</sub> <CODE>;</CODE> ... <CODE>;</CODE> <i>r</i><sub>n</sub> <CODE>=</CODE> <i>p</i><sub>n</sub> <CODE>}</CODE> + binds the union of all variables bound in the subpatterns + <i>p</i><sub>1</sub>,...,<i>p</i><sub>n</sub>. + It matches any value + <CODE>{</CODE> <i>r</i><sub>1</sub> <CODE>=</CODE> <i>V</i><sub>1</sub> <CODE>;</CODE> ... <CODE>;</CODE> <i>r</i><sub>n</sub> <CODE>=</CODE> <i>V</i><sub>n</sub> <CODE>;</CODE> ...<CODE>}</CODE> + where each <i>p</i><sub>i</sub> matches <i>V</i><sub>i</sub>, + and the matching substitution is the union of these substitutions. +<LI>A variable pattern <I>x</I> + (identifier other than parameter constructor) + binds the variable <I>x</I>. 
+ It matches any value <I>V</I>, with the substitution {<I>x</I> = <I>V</I>}. +<LI>The wildcard <CODE>_</CODE> binds no variables. + It matches any value, with the empty substitution. +<LI>A disjunctive pattern <I>p</I> <CODE>|</CODE> <I>q</I> binds the intersection of + the variables bound by <I>p</I> and <I>q</I>. + It matches anything that + either <I>p</I> or <I>q</I> matches, with the substitution of the first + of the two patterns that matches (trying <I>p</I> first), from which those + variables that are not bound by both patterns are removed. +<LI>A negative pattern <CODE>-</CODE> <I>p</I> binds no variables. + It matches anything that <I>p</I> does <I>not</I> match, with the empty + substitution. +<LI>An alias pattern <I>x</I> <CODE>@</CODE> <I>p</I> binds <I>x</I> and all the variables + bound by <I>p</I>. It matches any value <I>V</I> that <I>p</I> matches, with + the same substitution extended by {<I>x</I> = <I>V</I>}. +</UL> + +<P> +The following patterns are only available for the type <CODE>Str</CODE>: +</P> +<UL> +<LI>A string literal pattern, e.g. <CODE>"s"</CODE>, binds no variables. + It matches the same string, with the empty substitution. +<LI>A concatenation pattern, <I>p</I> <CODE>+</CODE> <I>q</I>, + binds the union of variables bound by <I>p</I> and <I>q</I>. + It matches any string that consists + of a prefix matching <I>p</I> and a suffix matching <I>q</I>, + with the union of substitutions corresponding to the first match (see below). +<LI>A repetition pattern <I>p</I><CODE>*</CODE> binds no variables. + It matches any string that can be decomposed + into strings that match <I>p</I>, with the empty substitution. +</UL> + +<P> +The following pattern is only available for the types <CODE>Integer</CODE> +and <CODE>Ints</CODE> <I>n</I>: +</P> +<UL> +<LI>An integer literal pattern, e.g. <CODE>214</CODE>, binds no variables. + It matches the same integer, with + the empty substitution. 
+</UL> + +<P> +All patterns must be <B>linear</B>: the same pattern variable may occur +only once in them. This is what makes it straightforward to speak +about unions of binding sets and substitutions. +</P> +<P> +Pattern matching is performed in the order in which the branches +appear in the source code: the branch of the first matching pattern is followed. +In concrete syntax, the type checker rejects sets of patterns that are +not exhaustive, and warns about completely overshadowed patterns. +It also checks the type correctness of patterns with respect to the +argument type. In abstract syntax, only type correctness is checked, +not exhaustiveness or overshadowing. +</P> +<P> +It follows from the definition of record pattern matching +that it can utilize partial records: the branch +</P> +<PRE> + {g = Fem} => t +</PRE> +<P> +in a table of type <CODE>{g : Gender ; n : Number} => T</CODE> means the same as +</P> +<PRE> + {g = Fem ; n = _} => t +</PRE> +<P> +Variables in regular expression patterns +are always bound to the <B>first match</B>, which is the first +in the sequence of binding lists. For example: +</P> +<UL> +<LI><CODE>x + "e" + y</CODE> matches <CODE>"peter"</CODE> with <CODE>x = "p", y = "ter"</CODE> +<LI><CODE>x + "er"*</CODE> matches <CODE>"burgerer"</CODE> with <CODE>x = "burg"</CODE> +</UL> + +<A NAME="toc44"></A> +<H3>Free variation</H3> +<P> +An expression of the form +<center> +<CODE>variants</CODE> <CODE>{</CODE><i>t</i><sub>1</sub> ; ... ; <i>t</i><sub>n</sub><CODE>}</CODE> +</center> +where all <i>t</i><sub>i</sub> are of the same type <I>T</I>, has itself type <I>T</I>. +This expression presents <i>t</i><sub>1</sub>,...,<i>t</i><sub>n</sub> as being in <B>free variation</B>: +the choice between them is not determined by semantics or parameters. +A limiting case is +<center> +<CODE>variants {}</CODE> +</center> +which encodes a rule saying that there is no way to express a certain +thing, e.g. that a certain inflectional form does not exist.
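+</P>
+<P>
+As a hedged illustration of this limiting case, a table for a noun that
+lacks a plural form could be sketched as follows. The parameter type
+<CODE>Number</CODE> with the constructors <CODE>Sg</CODE> and <CODE>Pl</CODE>,
+and the operation name <CODE>noPlural</CODE>, are hypothetical and only
+serve the example.
+</P>
+<PRE>
+  param Number = Sg | Pl ;
+
+  -- the plural branch states that no such form exists
+  oper noPlural : Str -> (Number => Str) = \s ->
+    table {Sg => s ; Pl => variants {}} ;
+</PRE>
+<P>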
+</P> +<P> +A common wisdom in linguistics is that "there is no free variation", which +refers to the situation where <I>all</I> aspects are taken into account. For +instance, the English negation contraction could be expressed as free variation, +</P> +<PRE> + variants {"don't" ; "do" ++ "not"} +</PRE> +<P> +if only semantics is taken into account, but if stylistic aspects are included, +then the proper formulation might be with a parameter distinguishing between +informal and formal style: +</P> +<PRE> + case style of {Informal => "don't" ; Formal => "do" ++ "not"} +</PRE> +<P> +Since there is no way to choose a particular element from a <CODE>variants</CODE> list, +free variation is normally not adequate in libraries, nor in grammars meant for +natural language generation. In application grammars +meant to parse user input, free variation is a way to avoid cluttering the +abstract syntax with semantically insignificant distinctions and even to +tolerate some grammatical errors. +</P> +<P> +Permitting <CODE>variants</CODE> in all types involves a major modification of the +semantics of GF expressions. All computation rules have to be lifted to +deal with lists of expressions and values. For instance, +<center> +<I>t</I> <CODE>!</CODE> <CODE>variants</CODE> <CODE>{</CODE><i>t</i><sub>1</sub> ; ... ; <i>t</i><sub>n</sub><CODE>}</CODE> ==> +<CODE>variants</CODE> <CODE>{</CODE><I>t</I> <CODE>!</CODE> <i>t</i><sub>1</sub> ; ... ; <I>t</I> <CODE>!</CODE> <i>t</i><sub>n</sub><CODE>}</CODE> +</center> +This is done in such a way that +variation does not distribute to records (or other product-like structures).
+For instance, variants of records, +</P> +<PRE> + variants {{s = "Auto" ; g = Neutr} ; {s = "Wagen" ; g = Masc}} +</PRE> +<P> +is <I>not</I> the same as a record of variants, +</P> +<PRE> + {s = variants {"Auto" ; "Wagen"} ; g = variants {Neutr ; Masc}} +</PRE> +<P> +Variants of variants are flattened, +<center> +<CODE>variants</CODE> <CODE>{</CODE>...; <CODE>variants</CODE> <CODE>{</CODE><i>t</i><sub>1</sub> ;...; <i>t</i><sub>n</sub><CODE>}</CODE> ;...<CODE>}</CODE> +==> +<CODE>variants</CODE> <CODE>{</CODE>...; <i>t</i><sub>1</sub> ;...; <i>t</i><sub>n</sub> ;...<CODE>}</CODE> +</center> +and singleton variants are eliminated, +<center> +<CODE>variants</CODE> <CODE>{</CODE><I>t</I><CODE>}</CODE> ==> <I>t</I> +</center> +</P> +<A NAME="toc45"></A> +<H3>Local definitions</H3> +<P> +A <B>local definition</B>, i.e. a <B>let expression</B> has the form +<center> +<CODE>let</CODE> <I>x</I> : <I>T</I> = <I>t</I> <CODE>in</CODE> <I>e</I> +</center> +The type of <I>x</I> must be <I>T</I>, which also has to be the type of <I>t</I>. 
+Computation is performed by substituting <I>t</I> for <I>x</I> in <I>e</I>: +<center> +<CODE>let</CODE> <I>x</I> : <I>T</I> = <I>t</I> <CODE>in</CODE> <I>e</I> ==> <I>e</I> {<I>x</I> = <I>t</I>} +</center> +As syntactic sugar, the type can be omitted if the type checker is +able to infer it: +<center> +<CODE>let</CODE> <I>x</I> = <I>t</I> <CODE>in</CODE> <I>e</I> +</center> +It is possible to compress several local definitions into one block: +<center> +<CODE>let</CODE> <I>x</I> : <I>T</I> = <I>t</I> <CODE>;</CODE> <I>y</I> : <I>U</I> = <I>u</I> <CODE>in</CODE> <I>e</I> +=== +<CODE>let</CODE> <I>x</I> : <I>T</I> = <I>t</I> <CODE>in</CODE> <CODE>let</CODE> <I>y</I> : <I>U</I> = <I>u</I> <CODE>in</CODE> <I>e</I> +</center> +Another notational variant is a definition block appearing after the main +expression: +<center> +<I>e</I> <CODE>where</CODE> <CODE>{</CODE>...<CODE>}</CODE> === <CODE>let</CODE> <CODE>{</CODE>...<CODE>}</CODE> <CODE>in</CODE> <I>e</I> +</center> +Curly brackets are obligatory in the <CODE>where</CODE> form, and can +also be optionally used in the <CODE>let</CODE> form. +</P> +<P> +Since a block of definitions is treated as syntactic sugar +for a nested <CODE>let</CODE> expression, a constant must be defined before it +is used: the scope is not mutual, as in a module body. +Furthermore, unlike in <CODE>lin</CODE> and <CODE>oper</CODE> definitions, it is <I>not</I> possible +to bind variables on the left of the equality sign. +</P> +<A NAME="toc46"></A> +<H3>Function applications in concrete syntax</H3> +<P> +<a name="functionelimination"></a> +</P> +<P> +Fully compiled concrete syntax may not include expressions of function types +except on the outermost level of <CODE>lin</CODE> rules, as defined <a href="#linexpansion">here</a>. +However, +in the source code, and especially in <CODE>oper</CODE> definitions, functions +are the main vehicle of code reuse and abstraction. 
Thus function types and +functions follow the same rules as in abstract syntax, as specified +<a href="#functiontype">here</a>. In +particular, the application of a lambda abstract is computed by beta conversion. +</P> +<P> +To ensure the elimination of functions, GF uses a special computation rule +for pushing function applications inside tables, since otherwise run-time +variables could block their applications: +<center> +(<CODE>table</CODE> <CODE>{</CODE><i>p</i><sub>1</sub> <CODE>=></CODE> <i>f</i><sub>1</sub> ; ... ; + <i>p</i><sub>n</sub> <CODE>=></CODE> <i>f</i><sub>n</sub> <CODE>}</CODE> <CODE>!</CODE> <I>e</I>) <I>a</I> + ==> + <CODE>table</CODE> <CODE>{</CODE><i>p</i><sub>1</sub> <CODE>=></CODE> <i>f</i><sub>1</sub> <I>a</I> ; ... ; + <i>p</i><sub>n</sub> <CODE>=></CODE> <i>f</i><sub>n</sub> <I>a</I><CODE>}</CODE> <CODE>!</CODE> <I>e</I> +</center> +Also parameter constructors with non-empty contexts, as defined +<a href="#paramjudgements">here</a>, +result in expressions in application form. These expressions are never +a problem if their arguments are just constructors, because they can then +be translated to integers corresponding to the position of the expression +in the enumeration of the values of its type. +However, a constructor +applied to a run-time variable may need to be converted as follows: +<center> +<I>C</I>...<I>x</I>... ==> <CODE>case</CODE> <I>x</I> <CODE>of</CODE> <CODE>{_ =></CODE> <I>C</I>...<I>x</I><CODE>}</CODE> +</center> +The resulting expression, when processed by table expansion as explained +<a href="#tables">here</a>, +results in <I>C</I> being applied to just values of the type of <I>x</I>, and the +application thereby disappears.
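+</P>
+<P>
+To illustrate the rule for pushing function applications inside tables,
+here is a small worked instance. It assumes a hypothetical parameter type
+with the constructors <CODE>Sg</CODE> and <CODE>Pl</CODE> and a run-time
+variable <CODE>n</CODE> of that type; none of these names come from the
+text above.
+</P>
+<PRE>
+  (table {Sg => \x -> x ; Pl => \x -> x + "s"} ! n) "car"
+    ==>  table {Sg => (\x -> x) "car" ; Pl => (\x -> x + "s") "car"} ! n
+    ==>  table {Sg => "car" ; Pl => "car" + "s"} ! n
+</PRE>
+<P>
+The second step is ordinary beta conversion in each branch, after which no
+expression of a function type remains under the selection.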
+</P> +<A NAME="toc47"></A> +<H3>Reusing top-level grammars as resources</H3> +<P> +<a name="reuse"></a> +</P> +<P> +<I>This section is valid for GF 3.0, which abandons the "lock field"</I> +<I>discipline of GF 2.8.</I> +</P> +<P> +As explained <a href="#openabstract">here</a>, +abstract syntax modules can be opened as interfaces +and concrete syntaxes as their instances. This means that judgements are, +as it were, translated in the following way: +</P> +<UL> +<LI><CODE>cat</CODE> <I>C</I> <I>G</I> ===> <CODE>oper</CODE> <I>C</I> : <CODE>Type</CODE> +<LI><CODE>fun</CODE> <I>f</I> : <I>T</I> ===> <CODE>oper</CODE> <I>f</I> : <I>T</I> +<LI><CODE>lincat</CODE> <I>C</I> = <I>T</I> ===> <CODE>oper</CODE> <I>C</I> : <CODE>Type</CODE> = <I>C</I> +<LI><CODE>lin</CODE> <I>f</I> = <I>t</I> ===> <CODE>oper</CODE> <I>f</I> = <I>t</I> +</UL> + +<P> +Notice that the value <I>T</I> of <CODE>lincat</CODE> definitions is not disclosed +in the translation. This means that the type <I>C</I> remains abstract: the +only ways of building an object of type <I>C</I> are the operations <I>f</I> +obtained from <I>fun</I> and <I>lin</I> rules. +</P> +<P> +The purpose of keeping linearization types abstract is to enforce +<B>grammar checking via type checking</B>. This means that any well-typed +operation application is also well-typed in the sense of the original +grammar. If the types were disclosed, then we could for instance easily +confuse all categories that have the linearization +type <CODE>{s : Str}</CODE>. Yet another reason is that revealing the types +makes it impossible for the library programmers to change their type +definitions afterwards. +</P> +<P> +Library writers may occasionally want to have access to the values of +linearization types. 
The way to make it possible is to add an extra +construction operation to a module in which the linearization type +is available: +</P> +<PRE> + oper MkC : T -> C = \x -> x +</PRE> +<P> +In object-oriented terms, the type <I>C</I> itself is <B>protected</B>, whereas +<I>MkC</I> is a <B>public constructor</B> of <I>C</I>. Of course, it is possible to +make these constructors overloaded (concept explained <a href="#overloading">here</a>), +to enable easy access to special cases. +</P> +<A NAME="toc48"></A> +<H3>Predefined concrete syntax types</H3> +<P> +<a name="predefcnc"></a> +</P> +<P> +The following concrete syntax types are predefined: +</P> +<UL> +<LI><CODE>Str</CODE>, the type of tokens and token lists (defined <a href="#strtype">here</a>) +<LI><CODE>Integer</CODE>, the type of nonnegative integers +<LI><CODE>Ints</CODE> <I>n</I>, the type of integers from <I>0</I> to <I>n</I> +<LI><CODE>Type</CODE>, the type of (concrete syntax) types +<LI><CODE>PType</CODE>, the type of parameter types +</UL> + +<P> +The last two types are, in a way, extended by user-written grammars, +since new parameter types can be defined in the way shown <a href="#paramjudgements">here</a>, +and every parameter type is also a type. From the point of view of the values +of expressions, however, a <CODE>param</CODE> declaration does not extend +<CODE>PType</CODE>, since all parameter types get compiled to initial +segments of integers. +</P> +<P> +Notice the difference between the concrete syntax types +<CODE>Str</CODE> and <CODE>Integer</CODE> on the one hand, and the abstract +syntax categories <CODE>String</CODE> and <CODE>Int</CODE>, on the other. +As <I>concrete syntax</I> types, the latter are treated in +the same way as any reused categories: their objects +can be formed by using syntax trees (string and integer +literals).
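+</P>
+<P>
+As a minimal sketch of how <CODE>Ints</CODE> <I>n</I> might be used, the
+following operation takes it as the argument type of a table and matches on
+integer literal patterns, as described in the section on pattern matching.
+The operation name <CODE>ordinal</CODE> is hypothetical, and the sketch
+assumes that the compiler accepts <CODE>Ints 2</CODE> as a table argument type.
+</P>
+<PRE>
+  -- the three branches cover all values of Ints 2, i.e. 0, 1 and 2
+  oper ordinal : Ints 2 => Str =
+    table {0 => "first" ; 1 => "second" ; 2 => "third"} ;
+</PRE>
+<P>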
+</P> +<P> +<I>The type name</I> <CODE>Integer</CODE> <I>replaces in GF 3.0 the name</I> <CODE>Int</CODE>, +<I>to avoid confusion with the abstract syntax type and to be analogous</I> +<I>with the</I> <CODE>Str</CODE> <I>vs.</I> <CODE>String</CODE> <I>distinction.</I> +</P> +<A NAME="toc49"></A> +<H3>Predefined concrete syntax operations</H3> +<P> +The following predefined operations are defined in the resource module +<CODE>prelude/Predefined.gf</CODE>. Their implementations are defined as +a part of the GF grammar compiler. +</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>operation</TH> +<TH>type</TH> +<TH COLSPAN="2">explanation</TH> +</TR> +<TR> +<TD><CODE>PBool</CODE></TD> +<TD><CODE>PType</CODE></TD> +<TD><CODE>PTrue | PFalse</CODE></TD> +</TR> +<TR> +<TD><CODE>Error</CODE></TD> +<TD><CODE>Type</CODE></TD> +<TD>the empty type</TD> +</TR> +<TR> +<TD><CODE>Int</CODE></TD> +<TD><CODE>Type</CODE></TD> +<TD>the type of integers</TD> +</TR> +<TR> +<TD><CODE>Ints</CODE></TD> +<TD><CODE>Integer -> Type</CODE></TD> +<TD>the type of integers from 0 to n</TD> +</TR> +<TR> +<TD><CODE>error</CODE></TD> +<TD><CODE>Str -> Error</CODE></TD> +<TD>forms error message</TD> +</TR> +<TR> +<TD><CODE>length</CODE></TD> +<TD><CODE>Str -> Int</CODE></TD> +<TD>length of string</TD> +</TR> +<TR> +<TD><CODE>drop</CODE></TD> +<TD><CODE>Integer -> Str -> Str</CODE></TD> +<TD>drop prefix of length</TD> +</TR> +<TR> +<TD><CODE>take</CODE></TD> +<TD><CODE>Integer -> Str -> Str</CODE></TD> +<TD>take prefix of length</TD> +</TR> +<TR> +<TD><CODE>tk</CODE></TD> +<TD><CODE>Integer -> Str -> Str</CODE></TD> +<TD>drop suffix of length</TD> +</TR> +<TR> +<TD><CODE>dp</CODE></TD> +<TD><CODE>Integer -> Str -> Str</CODE></TD> +<TD>take suffix of length</TD> +</TR> +<TR> +<TD><CODE>eqInt</CODE></TD> +<TD><CODE>Integer -> Integer -> PBool</CODE></TD> +<TD>test if equal integers</TD> +</TR> +<TR> +<TD><CODE>lessInt</CODE></TD> +<TD><CODE>Integer -> Integer -> PBool</CODE></TD> 
+<TD>test order of integers</TD> +</TR> +<TR> +<TD><CODE>plus</CODE></TD> +<TD><CODE>Integer -> Integer -> Integer</CODE></TD> +<TD>add integers</TD> +</TR> +<TR> +<TD><CODE>eqStr</CODE></TD> +<TD><CODE>Str -> Str -> PBool</CODE></TD> +<TD>test if equal strings</TD> +</TR> +<TR> +<TD><CODE>occur</CODE></TD> +<TD><CODE>Str -> Str -> PBool</CODE></TD> +<TD>test if occurs as substring</TD> +</TR> +<TR> +<TD><CODE>occurs</CODE></TD> +<TD><CODE>Str -> Str -> PBool</CODE></TD> +<TD>test if any char occurs</TD> +</TR> +<TR> +<TD><CODE>show</CODE></TD> +<TD><CODE>(P : Type) -> P -> Str</CODE></TD> +<TD>convert param to string</TD> +</TR> +<TR> +<TD><CODE>read</CODE></TD> +<TD><CODE>(P : Type) -> Str -> P</CODE></TD> +<TD>convert string to param</TD> +</TR> +<TR> +<TD><CODE>toStr</CODE></TD> +<TD><CODE>(L : Type) -> L -> Str</CODE></TD> +<TD>find the "first" string</TD> +</TR> +</TABLE> + +<P></P> +<P> +Compilation eliminates these operations, and they may therefore not +take arguments that depend on run-time variables. +</P> +<P> +The module <CODE>Predef</CODE> is included in the <I>opens</I> list of all +modules, and therefore does not need to be opened explicitly. +</P> +<A NAME="toc50"></A> +<H2>Flags and pragmas</H2> +<A NAME="toc51"></A> +<H3>Some flags and their values</H3> +<P> +<a name="flagvalues"></a> +</P> +<P> +The flag <CODE>coding</CODE> in concrete syntax sets the <B>character encoding</B> +used in the grammar. Internally, GF uses unicode, and <CODE>.gfcc</CODE> files +are always written in UTF8 encoding. The presence of the flag +<CODE>coding=utf8</CODE> prevents GF from encoding an already encoded +file. +</P> +<P> +The flag <CODE>lexer</CODE> in concrete syntax sets the lexer, +i.e. the processor that turns +strings into token lists sent to the parser. Some GF implementations +support the following lexers. 
+</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>lexer</TH> +<TH COLSPAN="2">description</TH> +</TR> +<TR> +<TD><CODE>words</CODE></TD> +<TD>(default) tokens are separated by spaces or newlines</TD> +</TR> +<TR> +<TD><CODE>literals</CODE></TD> +<TD>like words, but integer and string literals recognized</TD> +</TR> +<TR> +<TD><CODE>chars</CODE></TD> +<TD>each character is a token</TD> +</TR> +<TR> +<TD><CODE>code</CODE></TD> +<TD>program code conventions (uses Haskell's lex)</TD> +</TR> +<TR> +<TD><CODE>text</CODE></TD> +<TD>with conventions on punctuation and capital letters</TD> +</TR> +<TR> +<TD><CODE>codelit</CODE></TD> +<TD>like code, but recognize literals (unknown words as strings)</TD> +</TR> +<TR> +<TD><CODE>textlit</CODE></TD> +<TD>like text, but recognize literals (unknown words as strings)</TD> +</TR> +</TABLE> + +<P></P> +<P> +The flag <CODE>startcat</CODE> in abstract syntax sets the default start category for +parsing, random generation, and any other grammar operation that depends +on category. Its legal values are the categories defined or inherited in +the abstract syntax. +</P> +<P> +The flag <CODE>unlexer</CODE> in concrete syntax sets the unlexer, +i.e. the processor that turns +token lists obtained from the linearizer into strings. Some GF implementations +support the following unlexers.
+</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>unlexer</TH> +<TH COLSPAN="2">description</TH> +</TR> +<TR> +<TD><CODE>unwords</CODE></TD> +<TD>(default) space-separated token list</TD> +</TR> +<TR> +<TD><CODE>text</CODE></TD> +<TD>format as text: punctuation, capitals, paragraph <p></TD> +</TR> +<TR> +<TD><CODE>code</CODE></TD> +<TD>format as code (spacing, indentation)</TD> +</TR> +<TR> +<TD><CODE>textlit</CODE></TD> +<TD>like text, but remove string literal quotes</TD> +</TR> +<TR> +<TD><CODE>codelit</CODE></TD> +<TD>like code, but remove string literal quotes</TD> +</TR> +<TR> +<TD><CODE>concat</CODE></TD> +<TD>remove all spaces</TD> +</TR> +</TABLE> + +<P></P> +<A NAME="toc52"></A> +<H3>Compiler pragmas</H3> +<P> +<B>Compiler pragmas</B> are a special form of comments prefixed with <CODE>--#</CODE>. +Currently GF interprets the following pragmas. +</P> +<TABLE CELLPADDING="4" BORDER="1"> +<TR> +<TH>pragma</TH> +<TH COLSPAN="2">explanation</TH> +</TR> +<TR> +<TD><CODE>-path=</CODE>PATH</TD> +<TD>path list for searching modules</TD> +</TR> +</TABLE> + +<P></P> +<P> +For instance, the line +</P> +<PRE> + --# -path=.:present:prelude:/home/aarne/GF/tmp +</PRE> +<P> +in the top of <CODE>FILE.gf</CODE> causes the GF compiler, when invoked on <CODE>FILE.gf</CODE>, +to search through the current directory (<CODE>.</CODE>) and the directories +<CODE>present</CODE>, <CODE>prelude</CODE>, and <CODE>/home/aarne/GF/tmp</CODE>, in this order. +If a directory <CODE>DIR</CODE> is not found relative to the working directory, +also <CODE>$(GF_LIB_PATH)/DIR</CODE> is searched. +</P> +<A NAME="toc53"></A> +<H2>Alternative grammar input formats</H2> +<P> +While the GF language as specified in this document is the most versatile +and powerful way of writing GF grammars, there are several other formats +that a GF compiler may make available for users, either to get started +with small grammars or to semiautomatically convert grammars from other +formats to GF. 
Here are the ones supported by GF 2.8 and 3.0. +</P> +<A NAME="toc54"></A> +<H3>Old GF without modules</H3> +<P> +<a name="oldgf"></a> +</P> +<P> +Before GF compiler version 2.0, there was no module system, and +all kinds of judgement could be written in all files, without +any headers. This format is still available, and the compiler +(version 2.8) detects automatically if a file is in the current +or the old format. However, the old format is not recommended +because of poor modularity and missing separate compilation, +and also because libraries are not available, since the old +and the new format cannot be mixed. With version 2.8, grammars +in the old format can be converted to modular grammars with the +command +</P> +<PRE> + > import -o FILE.gf +</PRE> +<P> +which rewrites the grammar divided into three files: +an abstract, a concrete, and a resource module. +</P> +<A NAME="toc55"></A> +<H3>Context-free grammars</H3> +<P> +A quick way to write a GF grammar is to use the context-free format, +also known as BNF. Files of this form are recognized by the suffix +<CODE>.cf</CODE>. Rules in these files have the form +<center> +<I>Label</I> <CODE>.</CODE> <I>Cat</I> <CODE>::=</CODE> (<I>String</I> | <I>Cat</I>)* <CODE>;</CODE> +</center> +where <I>Label</I> and <I>Cat</I> are identifiers and <I>String</I> is a quoted string. +</P> +<P> +There is a shortcut form generating labels automatically, +<center> +<I>Cat</I> <CODE>::=</CODE> (<I>String</I> | <I>Cat</I>)* <CODE>;</CODE> +</center> +In the shortcut form, vertical bars (<CODE>|</CODE>) can be used to give +several right-hand sides at a time. An empty right-hand side +means the singleton of an empty sequence, and not an empty union. +</P> +<P> +Just like old-style GF files (previous section), context-free grammar +files can be converted to modular GF by using the <CODE>-o</CODE> option to +the compiler in GF 2.8.
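+</P>
+<P>
+As an illustration of the rule format described in this section, a small
+<CODE>.cf</CODE> file could look as follows. The labels, categories and
+words are made up for the example.
+</P>
+<PRE>
+  Pred.  S  ::= NP VP ;
+  Compl. VP ::= V NP ;
+  John.  NP ::= "John" ;
+  Mary.  NP ::= "Mary" ;
+  Love.  V  ::= "loves" ;
+</PRE>
+<P>
+In the shortcut form, the two <CODE>NP</CODE> rules could be collapsed into
+the single unlabelled rule <CODE>NP ::= "John" | "Mary" ;</CODE>.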
+</P> +<A NAME="toc56"></A> +<H3>Extended BNF grammars</H3> +<P> +Extended BNF (<CODE>FILE.ebnf</CODE>) +goes one step further than the shortcut notation of the previous section. +The rules have the form +<center> +<I>Cat</I> <CODE>::=</CODE> <I>RHS</I> <CODE>;</CODE> +</center> +where an <I>RHS</I> can be any regular expression +built from quoted strings and category symbols, in the following ways: +</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>RHS item</TH> +<TH COLSPAN="2">explanation</TH> +</TR> +<TR> +<TD><I>Cat</I></TD> +<TD>nonterminal</TD> +</TR> +<TR> +<TD><I>String</I></TD> +<TD>terminal</TD> +</TR> +<TR> +<TD><I>RHS</I> <I>RHS</I></TD> +<TD>sequence</TD> +</TR> +<TR> +<TD><I>RHS</I> <CODE>|</CODE> <I>RHS</I></TD> +<TD>alternatives</TD> +</TR> +<TR> +<TD><I>RHS</I> <CODE>?</CODE></TD> +<TD>optional</TD> +</TR> +<TR> +<TD><I>RHS</I> <CODE>*</CODE></TD> +<TD>repetition</TD> +</TR> +<TR> +<TD><I>RHS</I> <CODE>+</CODE></TD> +<TD>non-empty repetition</TD> +</TR> +</TABLE> + +<P></P> +<P> +Parentheses are used to override standard precedences, where +<CODE>|</CODE> binds weaker than sequencing, which binds weaker than the unary operations. +</P> +<P> +The compiler generates not only labels, but also new categories corresponding +to the regular expression combinations actually in use. +</P> +<P> +Just like <CODE>.cf</CODE> files (previous section), <CODE>.ebnf</CODE> +files can be converted to modular GF by using the <CODE>-o</CODE> option to +the compiler in GF 2.8. +</P> +<A NAME="toc57"></A> +<H3>Example-based grammars</H3> +<P> +<B>Example-based grammars</B> (<CODE>.gfe</CODE>) provide a way to use +resource grammar libraries without having to know the names +of functions in them. The compiler works as a preprocessor, +saving the result in a (<CODE>.gf</CODE>) file, which can be compiled +as usual. +</P> +<P> +If a library is implemented as an abstract and concrete syntax, +it can be used for parsing.
Calls of library functions can therefore +be formed by parsing strings in the library. GF has an expression +format for this, +<center> +<CODE>in</CODE> <I>C</I> <I>String</I> +</center> +where <I>C</I> is the category in which to parse (it can be qualified by +the module name) and the string is the input to the parser. Expressions +of this form are replaced by the syntax trees that result. These +trees are always type-correct. If several parses are found, all but +the first one are given in comments. +</P> +<P> +Here is an example, from <CODE>GF/examples/animal/</CODE>: +</P> +<PRE> + --# -resource=../../lib/present/LangEng.gfc + --# -path=.:present:prelude + + incomplete concrete QuestionsI of Questions = open Lang in { + lincat + Phrase = Phr ; + Entity = N ; + Action = V2 ; + lin + Who love_V2 man_N = in Phr "who loves men" ; + Whom man_N love_V2 = in Phr "whom does the man love" ; + Answer woman_N love_V2 man_N = in Phr "the woman loves men" ; + } +</PRE> +<P> +The <CODE>resource</CODE> pragma shows the grammar that is used for parsing +the examples. +</P> +<P> +Notice that the variables <CODE>love_V2</CODE>, <CODE>man_N</CODE>, etc., are +actually constants in the library. In the resulting rules, such as +</P> +<PRE> + lin Whom = \man_N -> \love_V2 -> + PhrUtt NoPConj (UttQS (UseQCl TPres ASimul PPos + (QuestSlash whoPl_IP (SlashV2 (DetCN (DetSg (SgQuant + DefArt)NoOrd)(UseN man_N)) love_V2)))) NoVoc ; +</PRE> +<P> +those constants are nonetheless treated as variables, following +the normal binding conventions, as stated <a href="#renaming">here</a>. +</P> +<A NAME="toc58"></A> +<H2>The grammar of GF</H2> +<P> +The following grammar is actually used in the parser of GF, although we have +omitted +some obsolete rules still included in the parser for backward compatibility +reasons. +</P> +<P> +This document was automatically generated by the <I>BNF-Converter</I>.
It was generated together with the lexer, the parser, and the abstract syntax module, which guarantees that the document matches the implementation of the language (provided no hand-hacking has taken place). +</P> +<A NAME="toc59"></A> +<H2>The lexical structure of GF</H2> +<A NAME="toc60"></A> +<H3>Identifiers</H3> +<P> +Identifiers <I>Ident</I> are unquoted strings beginning with a letter, +followed by any combination of letters, digits, and the characters <CODE>_ '</CODE>, +reserved words excluded. +</P> +<A NAME="toc61"></A> +<H3>Literals</H3> +<P> +Integer literals <I>Integer</I> are nonempty sequences of digits. +</P> +<P> +String literals <I>String</I> have the form +<CODE>"</CODE><I>x</I><CODE>"</CODE>, where <I>x</I> is any sequence of any characters +except <CODE>"</CODE> unless preceded by <CODE>\</CODE>. +</P> +<P> +Double-precision float literals <I>Double</I> have the structure +indicated by the regular expression <CODE>digit+ '.' digit+ ('e' ('-')? digit+)?</CODE>, i.e. +two sequences of digits separated by a decimal point, optionally +followed by an unsigned or negative exponent. +</P> +<A NAME="toc62"></A> +<H3>Reserved words and symbols</H3> +<P> +The set of reserved words is the set of terminals appearing in the grammar. Those reserved words that consist of non-letter characters are called symbols, and they are treated in a different way from those that are similar to identifiers. The lexer follows rules familiar from languages like Haskell, C, and Java, including longest match and spacing conventions.
+</P> +<P> +The reserved words used in GF are the following: +</P> +<TABLE ALIGN="center" CELLPADDING="4"> +<TR> +<TD><CODE>PType</CODE></TD> +<TD><CODE>Str</CODE></TD> +<TD><CODE>Strs</CODE></TD> +<TD><CODE>Type</CODE></TD> +</TR> +<TR> +<TD><CODE>abstract</CODE></TD> +<TD><CODE>case</CODE></TD> +<TD><CODE>cat</CODE></TD> +<TD><CODE>concrete</CODE></TD> +</TR> +<TR> +<TD><CODE>data</CODE></TD> +<TD><CODE>def</CODE></TD> +<TD><CODE>flags</CODE></TD> +<TD><CODE>fun</CODE></TD> +</TR> +<TR> +<TD><CODE>in</CODE></TD> +<TD><CODE>incomplete</CODE></TD> +<TD><CODE>instance</CODE></TD> +<TD><CODE>interface</CODE></TD> +</TR> +<TR> +<TD><CODE>let</CODE></TD> +<TD><CODE>lin</CODE></TD> +<TD><CODE>lincat</CODE></TD> +<TD><CODE>lindef</CODE></TD> +</TR> +<TR> +<TD><CODE>of</CODE></TD> +<TD><CODE>open</CODE></TD> +<TD><CODE>oper</CODE></TD> +<TD><CODE>param</CODE></TD> +</TR> +<TR> +<TD><CODE>pre</CODE></TD> +<TD><CODE>printname</CODE></TD> +<TD><CODE>resource</CODE></TD> +<TD><CODE>strs</CODE></TD> +</TR> +<TR> +<TD><CODE>table</CODE></TD> +<TD><CODE>transfer</CODE></TD> +<TD><CODE>variants</CODE></TD> +<TD><CODE>where</CODE></TD> +</TR> +<TR> +<TD><CODE>with</CODE></TD> +<TD></TD> +<TD></TD> +</TR> +</TABLE> + +<P></P> +<P> +The symbols used in GF are the following: +</P> +<TABLE ALIGN="center" CELLPADDING="4"> +<TR> +<TD>;</TD> +<TD>=</TD> +<TD>:</TD> +<TD>-></TD> +</TR> +<TR> +<TD>{</TD> +<TD>}</TD> +<TD>**</TD> +<TD>,</TD> +</TR> +<TR> +<TD>(</TD> +<TD>)</TD> +<TD>[</TD> +<TD>]</TD> +</TR> +<TR> +<TD>-</TD> +<TD>.</TD> +<TD>|</TD> +<TD>?</TD> +</TR> +<TR> +<TD><</TD> +<TD>></TD> +<TD>@</TD> +<TD>!</TD> +</TR> +<TR> +<TD>*</TD> +<TD>+</TD> +<TD>++</TD> +<TD>\</TD> +</TR> +<TR> +<TD>=></TD> +<TD>_</TD> +<TD>$</TD> +<TD>/</TD> +</TR> +</TABLE> + +<P></P> +<A NAME="toc63"></A> +<H3>Comments</H3> +<P> +Single-line comments begin with <CODE>--</CODE>. Multiple-line comments are enclosed between <CODE>{-</CODE> and <CODE>-}</CODE>.
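+</P>
+<P>
+For example, both comment forms could appear in a grammar file as sketched
+below. The <CODE>cat</CODE> judgement is only a placeholder.
+</P>
+<PRE>
+  -- a single-line comment extends to the end of the line
+  cat S ;  -- it can also follow code on the same line
+  {- a multiple-line comment
+     can span several lines -}
+</PRE>
+<P>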
+</P> +<A NAME="toc64"></A> +<H2>The syntactic structure of GF</H2> +<P> +Non-terminals are enclosed between < and >. +The symbols -> (production), <B>|</B> (union) +and <B>eps</B> (empty rule) belong to the BNF notation. +All other symbols are terminals. +</P> +<TABLE ALIGN="center" CELLPADDING="4"> +<TR> +<TD><I>Grammar</I></TD> +<TD>-></TD> +<TD><I>[ModDef]</I></TD> +</TR> +<TR> +<TD><I>[ModDef]</I></TD> +<TD>-></TD> +<TD><B>eps</B></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>ModDef</I> <I>[ModDef]</I></TD> +</TR> +<TR> +<TD><I>ModDef</I></TD> +<TD>-></TD> +<TD><I>ModDef</I> <CODE>;</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>ComplMod</I> <I>ModType</I> <CODE>=</CODE> <I>ModBody</I></TD> +</TR> +<TR> +<TD><I>ModType</I></TD> +<TD>-></TD> +<TD><CODE>abstract</CODE> <I>Ident</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>resource</CODE> <I>Ident</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>interface</CODE> <I>Ident</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>concrete</CODE> <I>Ident</I> <CODE>of</CODE> <I>Ident</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>instance</CODE> <I>Ident</I> <CODE>of</CODE> <I>Ident</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>transfer</CODE> <I>Ident</I> <CODE>:</CODE> <I>Open</I> <CODE>-></CODE> <I>Open</I></TD> +</TR> +<TR> +<TD><I>ModBody</I></TD> +<TD>-></TD> +<TD><I>Extend</I> <I>Opens</I> <CODE>{</CODE> <I>[TopDef]</I> <CODE>}</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>[Included]</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Included</I> <CODE>with</CODE> <I>[Open]</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Included</I> <CODE>with</CODE> <I>[Open]</I> <CODE>**</CODE> <I>Opens</I> <CODE>{</CODE> <I>[TopDef]</I> 
<CODE>}</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>[Included]</I> <CODE>**</CODE> <I>Included</I> <CODE>with</CODE> <I>[Open]</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>[Included]</I> <CODE>**</CODE> <I>Included</I> <CODE>with</CODE> <I>[Open]</I> <CODE>**</CODE> <I>Opens</I> <CODE>{</CODE> <I>[TopDef]</I> <CODE>}</CODE></TD> +</TR> +<TR> +<TD><I>[TopDef]</I></TD> +<TD>-></TD> +<TD><B>eps</B></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>TopDef</I> <I>[TopDef]</I></TD> +</TR> +<TR> +<TD><I>Extend</I></TD> +<TD>-></TD> +<TD><I>[Included]</I> <CODE>**</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><B>eps</B></TD> +</TR> +<TR> +<TD><I>[Open]</I></TD> +<TD>-></TD> +<TD><B>eps</B></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Open</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Open</I> <CODE>,</CODE> <I>[Open]</I></TD> +</TR> +<TR> +<TD><I>Opens</I></TD> +<TD>-></TD> +<TD><B>eps</B></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>open</CODE> <I>[Open]</I> <CODE>in</CODE></TD> +</TR> +<TR> +<TD><I>Open</I></TD> +<TD>-></TD> +<TD><I>Ident</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>(</CODE> <I>QualOpen</I> <I>Ident</I> <CODE>)</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>(</CODE> <I>QualOpen</I> <I>Ident</I> <CODE>=</CODE> <I>Ident</I> <CODE>)</CODE></TD> +</TR> +<TR> +<TD><I>ComplMod</I></TD> +<TD>-></TD> +<TD><B>eps</B></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>incomplete</CODE></TD> +</TR> +<TR> +<TD><I>QualOpen</I></TD> +<TD>-></TD> +<TD><B>eps</B></TD> +</TR> +<TR> +<TD><I>[Included]</I></TD> +<TD>-></TD> +<TD><B>eps</B></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Included</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> 
+<TD><I>Included</I> <CODE>,</CODE> <I>[Included]</I></TD> +</TR> +<TR> +<TD><I>Included</I></TD> +<TD>-></TD> +<TD><I>Ident</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Ident</I> <CODE>[</CODE> <I>[Ident]</I> <CODE>]</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Ident</I> <CODE>-</CODE> <CODE>[</CODE> <I>[Ident]</I> <CODE>]</CODE></TD> +</TR> +<TR> +<TD><I>Def</I></TD> +<TD>-></TD> +<TD><I>[Name]</I> <CODE>:</CODE> <I>Exp</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>[Name]</I> <CODE>=</CODE> <I>Exp</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Name</I> <I>[Patt]</I> <CODE>=</CODE> <I>Exp</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>[Name]</I> <CODE>:</CODE> <I>Exp</I> <CODE>=</CODE> <I>Exp</I></TD> +</TR> +<TR> +<TD><I>TopDef</I></TD> +<TD>-></TD> +<TD><CODE>cat</CODE> <I>[CatDef]</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>fun</CODE> <I>[FunDef]</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>data</CODE> <I>[FunDef]</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>def</CODE> <I>[Def]</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>data</CODE> <I>[DataDef]</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>param</CODE> <I>[ParDef]</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>oper</CODE> <I>[Def]</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>lincat</CODE> <I>[PrintDef]</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>lindef</CODE> <I>[Def]</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>lin</CODE> <I>[Def]</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>printname</CODE> <CODE>cat</CODE> <I>[PrintDef]</I></TD> +</TR> 
+<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>printname</CODE> <CODE>fun</CODE> <I>[PrintDef]</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>flags</CODE> <I>[FlagDef]</I></TD> +</TR> +<TR> +<TD><I>CatDef</I></TD> +<TD>-></TD> +<TD><I>Ident</I> <I>[DDecl]</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>[</CODE> <I>Ident</I> <I>[DDecl]</I> <CODE>]</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>[</CODE> <I>Ident</I> <I>[DDecl]</I> <CODE>]</CODE> <CODE>{</CODE> <I>Integer</I> <CODE>}</CODE></TD> +</TR> +<TR> +<TD><I>FunDef</I></TD> +<TD>-></TD> +<TD><I>[Ident]</I> <CODE>:</CODE> <I>Exp</I></TD> +</TR> +<TR> +<TD><I>DataDef</I></TD> +<TD>-></TD> +<TD><I>Ident</I> <CODE>=</CODE> <I>[DataConstr]</I></TD> +</TR> +<TR> +<TD><I>DataConstr</I></TD> +<TD>-></TD> +<TD><I>Ident</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Ident</I> <CODE>.</CODE> <I>Ident</I></TD> +</TR> +<TR> +<TD><I>[DataConstr]</I></TD> +<TD>-></TD> +<TD><B>eps</B></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>DataConstr</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>DataConstr</I> <CODE>|</CODE> <I>[DataConstr]</I></TD> +</TR> +<TR> +<TD><I>ParDef</I></TD> +<TD>-></TD> +<TD><I>Ident</I> <CODE>=</CODE> <I>[ParConstr]</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Ident</I> <CODE>=</CODE> <CODE>(</CODE> <CODE>in</CODE> <I>Ident</I> <CODE>)</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Ident</I></TD> +</TR> +<TR> +<TD><I>ParConstr</I></TD> +<TD>-></TD> +<TD><I>Ident</I> <I>[DDecl]</I></TD> +</TR> +<TR> +<TD><I>PrintDef</I></TD> +<TD>-></TD> +<TD><I>[Name]</I> <CODE>=</CODE> <I>Exp</I></TD> +</TR> +<TR> +<TD><I>FlagDef</I></TD> +<TD>-></TD> +<TD><I>Ident</I> <CODE>=</CODE> <I>Ident</I></TD> +</TR> +<TR> +<TD><I>[Def]</I></TD> +<TD>-></TD> +<TD><I>Def</I> 
<CODE>;</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Def</I> <CODE>;</CODE> <I>[Def]</I></TD> +</TR> +<TR> +<TD><I>[CatDef]</I></TD> +<TD>-></TD> +<TD><I>CatDef</I> <CODE>;</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>CatDef</I> <CODE>;</CODE> <I>[CatDef]</I></TD> +</TR> +<TR> +<TD><I>[FunDef]</I></TD> +<TD>-></TD> +<TD><I>FunDef</I> <CODE>;</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>FunDef</I> <CODE>;</CODE> <I>[FunDef]</I></TD> +</TR> +<TR> +<TD><I>[DataDef]</I></TD> +<TD>-></TD> +<TD><I>DataDef</I> <CODE>;</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>DataDef</I> <CODE>;</CODE> <I>[DataDef]</I></TD> +</TR> +<TR> +<TD><I>[ParDef]</I></TD> +<TD>-></TD> +<TD><I>ParDef</I> <CODE>;</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>ParDef</I> <CODE>;</CODE> <I>[ParDef]</I></TD> +</TR> +<TR> +<TD><I>[PrintDef]</I></TD> +<TD>-></TD> +<TD><I>PrintDef</I> <CODE>;</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>PrintDef</I> <CODE>;</CODE> <I>[PrintDef]</I></TD> +</TR> +<TR> +<TD><I>[FlagDef]</I></TD> +<TD>-></TD> +<TD><I>FlagDef</I> <CODE>;</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>FlagDef</I> <CODE>;</CODE> <I>[FlagDef]</I></TD> +</TR> +<TR> +<TD><I>[ParConstr]</I></TD> +<TD>-></TD> +<TD><B>eps</B></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>ParConstr</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>ParConstr</I> <CODE>|</CODE> <I>[ParConstr]</I></TD> +</TR> +<TR> +<TD><I>[Ident]</I></TD> +<TD>-></TD> +<TD><I>Ident</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Ident</I> <CODE>,</CODE> <I>[Ident]</I></TD> +</TR> +<TR> +<TD><I>Name</I></TD> +<TD>-></TD> +<TD><I>Ident</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>[</CODE> <I>Ident</I> 
<CODE>]</CODE></TD> +</TR> +<TR> +<TD><I>[Name]</I></TD> +<TD>-></TD> +<TD><I>Name</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Name</I> <CODE>,</CODE> <I>[Name]</I></TD> +</TR> +<TR> +<TD><I>LocDef</I></TD> +<TD>-></TD> +<TD><I>[Ident]</I> <CODE>:</CODE> <I>Exp</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>[Ident]</I> <CODE>=</CODE> <I>Exp</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>[Ident]</I> <CODE>:</CODE> <I>Exp</I> <CODE>=</CODE> <I>Exp</I></TD> +</TR> +<TR> +<TD><I>[LocDef]</I></TD> +<TD>-></TD> +<TD><B>eps</B></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>LocDef</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>LocDef</I> <CODE>;</CODE> <I>[LocDef]</I></TD> +</TR> +<TR> +<TD><I>Exp6</I></TD> +<TD>-></TD> +<TD><I>Ident</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Sort</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>String</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Integer</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Double</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>?</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>[</CODE> <CODE>]</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>data</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>[</CODE> <I>Ident</I> <I>Exps</I> <CODE>]</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>[</CODE> <I>String</I> <CODE>]</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>{</CODE> <I>[LocDef]</I> <CODE>}</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE><</CODE> <I>[TupleComp]</I> <CODE>></CODE></TD> +</TR> +<TR> +<TD></TD> +<TD 
ALIGN="center"><B>|</B></TD> +<TD><CODE><</CODE> <I>Exp</I> <CODE>:</CODE> <I>Exp</I> <CODE>></CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>(</CODE> <I>Exp</I> <CODE>)</CODE></TD> +</TR> +<TR> +<TD><I>Exp5</I></TD> +<TD>-></TD> +<TD><I>Exp5</I> <CODE>.</CODE> <I>Label</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Exp6</I></TD> +</TR> +<TR> +<TD><I>Exp4</I></TD> +<TD>-></TD> +<TD><I>Exp4</I> <I>Exp5</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>table</CODE> <CODE>{</CODE> <I>[Case]</I> <CODE>}</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>table</CODE> <I>Exp6</I> <CODE>{</CODE> <I>[Case]</I> <CODE>}</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>table</CODE> <I>Exp6</I> <CODE>[</CODE> <I>[Exp]</I> <CODE>]</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>case</CODE> <I>Exp</I> <CODE>of</CODE> <CODE>{</CODE> <I>[Case]</I> <CODE>}</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>variants</CODE> <CODE>{</CODE> <I>[Exp]</I> <CODE>}</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>pre</CODE> <CODE>{</CODE> <I>Exp</I> <CODE>;</CODE> <I>[Altern]</I> <CODE>}</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>strs</CODE> <CODE>{</CODE> <I>[Exp]</I> <CODE>}</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Ident</I> <CODE>@</CODE> <I>Exp6</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Exp5</I></TD> +</TR> +<TR> +<TD><I>Exp3</I></TD> +<TD>-></TD> +<TD><I>Exp3</I> <CODE>!</CODE> <I>Exp4</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Exp3</I> <CODE>*</CODE> <I>Exp4</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Exp3</I> <CODE>**</CODE> <I>Exp4</I></TD> +</TR> +<TR> +<TD></TD> +<TD 
ALIGN="center"><B>|</B></TD> +<TD><I>Exp4</I></TD> +</TR> +<TR> +<TD><I>Exp1</I></TD> +<TD>-></TD> +<TD><I>Exp2</I> <CODE>+</CODE> <I>Exp1</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Exp2</I></TD> +</TR> +<TR> +<TD><I>Exp</I></TD> +<TD>-></TD> +<TD><I>Exp1</I> <CODE>++</CODE> <I>Exp</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>\</CODE> <I>[Bind]</I> <CODE>-></CODE> <I>Exp</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>\</CODE> <CODE>\</CODE> <I>[Bind]</I> <CODE>=></CODE> <I>Exp</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Decl</I> <CODE>-></CODE> <I>Exp</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Exp3</I> <CODE>=></CODE> <I>Exp</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>let</CODE> <CODE>{</CODE> <I>[LocDef]</I> <CODE>}</CODE> <CODE>in</CODE> <I>Exp</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>let</CODE> <I>[LocDef]</I> <CODE>in</CODE> <I>Exp</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Exp3</I> <CODE>where</CODE> <CODE>{</CODE> <I>[LocDef]</I> <CODE>}</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>in</CODE> <I>Exp5</I> <I>String</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Exp1</I></TD> +</TR> +<TR> +<TD><I>Exp2</I></TD> +<TD>-></TD> +<TD><I>Exp3</I></TD> +</TR> +<TR> +<TD><I>[Exp]</I></TD> +<TD>-></TD> +<TD><B>eps</B></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Exp</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Exp</I> <CODE>;</CODE> <I>[Exp]</I></TD> +</TR> +<TR> +<TD><I>Exps</I></TD> +<TD>-></TD> +<TD><B>eps</B></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Exp6</I> <I>Exps</I></TD> +</TR> +<TR> +<TD><I>Patt2</I></TD> +<TD>-></TD> +<TD><CODE>_</CODE></TD> +</TR> +<TR> 
+<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Ident</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Ident</I> <CODE>.</CODE> <I>Ident</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Integer</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Double</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>String</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>{</CODE> <I>[PattAss]</I> <CODE>}</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE><</CODE> <I>[PattTupleComp]</I> <CODE>></CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>(</CODE> <I>Patt</I> <CODE>)</CODE></TD> +</TR> +<TR> +<TD><I>Patt1</I></TD> +<TD>-></TD> +<TD><I>Ident</I> <I>[Patt]</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Ident</I> <CODE>.</CODE> <I>Ident</I> <I>[Patt]</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Patt2</I> <CODE>*</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Ident</I> <CODE>@</CODE> <I>Patt2</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>-</CODE> <I>Patt2</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Patt2</I></TD> +</TR> +<TR> +<TD><I>Patt</I></TD> +<TD>-></TD> +<TD><I>Patt</I> <CODE>|</CODE> <I>Patt1</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Patt</I> <CODE>+</CODE> <I>Patt1</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Patt1</I></TD> +</TR> +<TR> +<TD><I>PattAss</I></TD> +<TD>-></TD> +<TD><I>[Ident]</I> <CODE>=</CODE> <I>Patt</I></TD> +</TR> +<TR> +<TD><I>Label</I></TD> +<TD>-></TD> +<TD><I>Ident</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>$</CODE> <I>Integer</I></TD> +</TR> +<TR> +<TD><I>Sort</I></TD> +<TD>-></TD> +<TD><CODE>Type</CODE></TD> 
+</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>PType</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>Str</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>Strs</CODE></TD> +</TR> +<TR> +<TD><I>[PattAss]</I></TD> +<TD>-></TD> +<TD><B>eps</B></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>PattAss</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>PattAss</I> <CODE>;</CODE> <I>[PattAss]</I></TD> +</TR> +<TR> +<TD><I>[Patt]</I></TD> +<TD>-></TD> +<TD><I>Patt2</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Patt2</I> <I>[Patt]</I></TD> +</TR> +<TR> +<TD><I>Bind</I></TD> +<TD>-></TD> +<TD><I>Ident</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><CODE>_</CODE></TD> +</TR> +<TR> +<TD><I>[Bind]</I></TD> +<TD>-></TD> +<TD><B>eps</B></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Bind</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Bind</I> <CODE>,</CODE> <I>[Bind]</I></TD> +</TR> +<TR> +<TD><I>Decl</I></TD> +<TD>-></TD> +<TD><CODE>(</CODE> <I>[Bind]</I> <CODE>:</CODE> <I>Exp</I> <CODE>)</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Exp4</I></TD> +</TR> +<TR> +<TD><I>TupleComp</I></TD> +<TD>-></TD> +<TD><I>Exp</I></TD> +</TR> +<TR> +<TD><I>PattTupleComp</I></TD> +<TD>-></TD> +<TD><I>Patt</I></TD> +</TR> +<TR> +<TD><I>[TupleComp]</I></TD> +<TD>-></TD> +<TD><B>eps</B></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>TupleComp</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>TupleComp</I> <CODE>,</CODE> <I>[TupleComp]</I></TD> +</TR> +<TR> +<TD><I>[PattTupleComp]</I></TD> +<TD>-></TD> +<TD><B>eps</B></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>PattTupleComp</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> 
+<TD><I>PattTupleComp</I> <CODE>,</CODE> <I>[PattTupleComp]</I></TD> +</TR> +<TR> +<TD><I>Case</I></TD> +<TD>-></TD> +<TD><I>Patt</I> <CODE>=></CODE> <I>Exp</I></TD> +</TR> +<TR> +<TD><I>[Case]</I></TD> +<TD>-></TD> +<TD><I>Case</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Case</I> <CODE>;</CODE> <I>[Case]</I></TD> +</TR> +<TR> +<TD><I>Altern</I></TD> +<TD>-></TD> +<TD><I>Exp</I> <CODE>/</CODE> <I>Exp</I></TD> +</TR> +<TR> +<TD><I>[Altern]</I></TD> +<TD>-></TD> +<TD><B>eps</B></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Altern</I></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Altern</I> <CODE>;</CODE> <I>[Altern]</I></TD> +</TR> +<TR> +<TD><I>DDecl</I></TD> +<TD>-></TD> +<TD><CODE>(</CODE> <I>[Bind]</I> <CODE>:</CODE> <I>Exp</I> <CODE>)</CODE></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>Exp6</I></TD> +</TR> +<TR> +<TD><I>[DDecl]</I></TD> +<TD>-></TD> +<TD><B>eps</B></TD> +</TR> +<TR> +<TD></TD> +<TD ALIGN="center"><B>|</B></TD> +<TD><I>DDecl</I> <I>[DDecl]</I></TD> +</TR> +</TABLE> + +<P></P> + +<!-- html code generated by txt2tags 2.3 (http://txt2tags.sf.net) --> +<!-- cmdline: txt2tags -thtml -\-toc gf-refman.txt --> +</BODY></HTML> diff --git a/doc/gf-tutorial.html b/doc/gf-tutorial.html new file mode 100644 index 000000000..1e6d961b8 --- /dev/null +++ b/doc/gf-tutorial.html @@ -0,0 +1,7952 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> +<HTML> +<HEAD> +<META NAME="generator" CONTENT="http://txt2tags.sf.net"> +<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> +<TITLE>Grammatical Framework Tutorial</TITLE> +</HEAD><BODY BGCOLOR="white" TEXT="black"> +<P ALIGN="center"><CENTER><H1>Grammatical Framework Tutorial</H1> +<FONT SIZE="4"> +<I>Aarne Ranta</I><BR> +Draft, November 2007 +</FONT></CENTER> + +<P></P> +<HR NOSHADE SIZE=1> +<P></P> + <UL> + <LI><A HREF="#toc1">Getting started with GF</A> + <UL> + <LI><A 
HREF="#toc2">What GF is</A> + <LI><A HREF="#toc3">Getting the GF system</A> + <LI><A HREF="#toc4">Running the GF system</A> + <LI><A HREF="#toc5">A "Hello World" grammar</A> + <UL> + <LI><A HREF="#toc6">The program: abstract syntax and concrete syntaxes</A> + <LI><A HREF="#toc7">Using the grammar in the GF system</A> + </UL> + <LI><A HREF="#toc8">Using grammars from outside GF</A> + <LI><A HREF="#toc9">What else can be done with the grammar</A> + <LI><A HREF="#toc10">Summary of GF language features</A> + <UL> + <LI><A HREF="#toc11">Modules</A> + <LI><A HREF="#toc12">Judgements</A> + <LI><A HREF="#toc13">Types and terms</A> + <LI><A HREF="#toc14">Type checking</A> + </UL> + </UL> + <LI><A HREF="#toc15">Designing a grammar for complex phrases</A> + <UL> + <LI><A HREF="#toc16">The abstract syntax Food</A> + <LI><A HREF="#toc17">The concrete syntax FoodEng</A> + <LI><A HREF="#toc18">Commands for testing grammars</A> + <UL> + <LI><A HREF="#toc19">Generating trees and strings</A> + <LI><A HREF="#toc20">More on pipes; tracing</A> + <LI><A HREF="#toc21">Writing and reading files</A> + <LI><A HREF="#toc22">Visualizing trees</A> + <LI><A HREF="#toc23">System commands</A> + </UL> + <LI><A HREF="#toc24">An Italian concrete syntax</A> + <LI><A HREF="#toc25">Free variation</A> + <LI><A HREF="#toc26">More application of multilingual grammars</A> + <UL> + <LI><A HREF="#toc27">Multilingual treebanks</A> + <LI><A HREF="#toc28">Translation session</A> + <LI><A HREF="#toc29">Translation quiz</A> + <LI><A HREF="#toc30">Multilingual syntax editing</A> + </UL> + <LI><A HREF="#toc31">Context-free grammars and GF</A> + <UL> + <LI><A HREF="#toc32">The "cf" grammar format</A> + <LI><A HREF="#toc33">Restrictions of context-free grammars</A> + </UL> + <LI><A HREF="#toc34">Modules and files</A> + <LI><A HREF="#toc35">Using operations and resource modules</A> + <UL> + <LI><A HREF="#toc36">The golden rule of functional programming</A> + <LI><A HREF="#toc37">Operation definitions</A> + <LI><A 
HREF="#toc38">The ``resource`` module type</A> + <LI><A HREF="#toc39">Opening a resource</A> + <LI><A HREF="#toc40">Partial application</A> + <LI><A HREF="#toc41">Testing resource modules</A> + </UL> + <LI><A HREF="#toc42">Grammar architecture</A> + <UL> + <LI><A HREF="#toc43">Extending a grammar</A> + <LI><A HREF="#toc44">Multiple inheritance</A> + <LI><A HREF="#toc45">Visualizing module structure</A> + </UL> + <LI><A HREF="#toc46">Summary of GF language features</A> + <UL> + <LI><A HREF="#toc47">Modules</A> + <LI><A HREF="#toc48">Judgements</A> + <LI><A HREF="#toc49">Free variation</A> + <LI><A HREF="#toc50">The context-free grammar format</A> + <LI><A HREF="#toc51">Character encoding</A> + </UL> + </UL> + <LI><A HREF="#toc52">Grammars with parameters</A> + <UL> + <LI><A HREF="#toc53">The problem: words have to be inflected</A> + <LI><A HREF="#toc54">Parameters and tables</A> + <LI><A HREF="#toc55">Inflection tables and paradigms</A> + <LI><A HREF="#toc56">Using parameters in concrete syntax</A> + <UL> + <LI><A HREF="#toc57">Agreement</A> + <LI><A HREF="#toc58">Determiners</A> + <LI><A HREF="#toc59">Parametric vs. inherent features</A> + </UL> + <LI><A HREF="#toc60">An English concrete syntax for Foods with parameters</A> + <LI><A HREF="#toc61">More on inflection paradigms</A> + <UL> + <LI><A HREF="#toc62">Worst-case functions</A> + <LI><A HREF="#toc63">Intelligent paradigms</A> + <LI><A HREF="#toc64">Function types with variables</A> + <LI><A HREF="#toc65">Separating operation types and definitions</A> + <LI><A HREF="#toc66">Overloading of operations</A> + <LI><A HREF="#toc67">Morphological analysis and morphology quiz</A> + </UL> + <LI><A HREF="#toc68">The Italian Foods grammar</A> + <LI><A HREF="#toc69">Discontinuous constituents</A> + <LI><A HREF="#toc70">Strings at compile time vs. 
run time</A> + <LI><A HREF="#toc71">Summary of GF language features</A> + <UL> + <LI><A HREF="#toc72">Parameter and table types</A> + <LI><A HREF="#toc73">Pattern matching</A> + <LI><A HREF="#toc74">Overloading</A> + <LI><A HREF="#toc75">Local definitions</A> + <LI><A HREF="#toc76">Supplementary constructs</A> + </UL> + </UL> + <LI><A HREF="#toc77">Using the resource grammar library</A> + <UL> + <LI><A HREF="#toc78">The coverage of the library</A> + <LI><A HREF="#toc79">The structure of the library</A> + <UL> + <LI><A HREF="#toc80">Lexical vs. phrasal rules</A> + <LI><A HREF="#toc81">Lexical categories</A> + <LI><A HREF="#toc82">Lexical rules</A> + <LI><A HREF="#toc83">Phrasal categories</A> + </UL> + <LI><A HREF="#toc84">The resource API</A> + <LI><A HREF="#toc85">Example: English</A> + <LI><A HREF="#toc86">Functor implementation of multilingual grammars</A> + <LI><A HREF="#toc87">Interfaces and instances</A> + <LI><A HREF="#toc88">Adding languages to a functor implementation</A> + <LI><A HREF="#toc89">Division of labour revisited</A> + <LI><A HREF="#toc90">Restricted inheritance</A> + <LI><A HREF="#toc91">Grammar reuse</A> + <LI><A HREF="#toc92">Browsing the resource with GF commands</A> + <LI><A HREF="#toc93">An extended Foods grammar</A> + <UL> + <LI><A HREF="#toc94">Abstract syntax</A> + <LI><A HREF="#toc95">Linearization types</A> + <LI><A HREF="#toc96">Linearization rules</A> + </UL> + <LI><A HREF="#toc97">Tenses</A> + <LI><A HREF="#toc98">Summary of GF language features</A> + <UL> + <LI><A HREF="#toc99">Interfaces and instances</A> + <LI><A HREF="#toc100">Grammar reuse</A> + <LI><A HREF="#toc101">Functors</A> + <LI><A HREF="#toc102">Restricted inheritance</A> + </UL> + </UL> + <LI><A HREF="#toc103">Refining semantics in abstract syntax</A> + <UL> + <LI><A HREF="#toc104">GF as a logical framework</A> + <LI><A HREF="#toc105">Dependent types</A> + <LI><A HREF="#toc106">Polymorphism</A> + <UL> + <LI><A HREF="#toc107">Digression: dependent types in concrete 
syntax</A> + </UL> + <LI><A HREF="#toc108">Proof objects</A> + <UL> + <LI><A HREF="#toc109">Proof-carrying documents</A> + </UL> + <LI><A HREF="#toc110">Restricted polymorphism</A> + <LI><A HREF="#toc111">Variable bindings</A> + <LI><A HREF="#toc112">Semantic definitions</A> + <LI><A HREF="#toc113">Summary of GF language features</A> + <UL> + <LI><A HREF="#toc114">Judgements</A> + <LI><A HREF="#toc115">Dependent function types</A> + </UL> + </UL> + <LI><A HREF="#toc116">Grammars of formal languages</A> + <UL> + <LI><A HREF="#toc117">Arithmetic expressions</A> + <UL> + <LI><A HREF="#toc118">Abstract syntax</A> + <LI><A HREF="#toc119">Concrete syntax: a simple approach</A> + </UL> + <LI><A HREF="#toc120">Lexing and unlexing</A> + <LI><A HREF="#toc121">Precedence and fixity</A> + <LI><A HREF="#toc122">Code generation as linearization</A> + <LI><A HREF="#toc123">Speaking aloud arithmetic expressions</A> + <LI><A HREF="#toc124">Programs with variables</A> + <UL> + <LI><A HREF="#toc125">The concrete syntax of assignments</A> + <LI><A HREF="#toc126">A liberal syntax of variables</A> + </UL> + <LI><A HREF="#toc127">Conclusion</A> + <LI><A HREF="#toc128">Summary of GF language constructs</A> + <UL> + <LI><A HREF="#toc129">Lexers and unlexers</A> + <LI><A HREF="#toc130">Built-in abstract syntax types</A> + </UL> + </UL> + <LI><A HREF="#toc131">Embedded grammars</A> + <UL> + <LI><A HREF="#toc132">The portable grammar format</A> + <LI><A HREF="#toc133">The embedded interpreter and its API</A> + <LI><A HREF="#toc134">Embedded GF applications in Haskell</A> + <UL> + <LI><A HREF="#toc135">The EmbedAPI module</A> + <LI><A HREF="#toc136">First application: a translator</A> + <LI><A HREF="#toc137">A looping translator</A> + <LI><A HREF="#toc138">A question-answer system</A> + <LI><A HREF="#toc139">Exporting GF datatypes</A> + <LI><A HREF="#toc140">Putting it all together</A> + </UL> + <LI><A HREF="#toc141">Embedded GF applications in Java</A> + <UL> + <LI><A 
HREF="#toc142">Translets</A> + <LI><A HREF="#toc143">Dialogue systems</A> + </UL> + <LI><A HREF="#toc144">Language models for speech recognition</A> + <LI><A HREF="#toc145">Dependent types and spoken language models</A> + <UL> + <LI><A HREF="#toc146">Statistical language models</A> + </UL> + </UL> + </UL> + +<P></P> +<HR NOSHADE SIZE=1> +<P></P> +<P> +<h2>Overview</h2> +</P> +<P> +This tutorial gives a hands-on introduction to grammar writing in GF. +It has been written for all programmers +who want to learn to write grammars in GF. +It will go through the programming concepts of GF, and also +explain, without presupposing them, the main ingredients of GF: +linguistics, functional programming, and type theory. +This knowledge will be introduced as a part of grammar writing +practice. +Thus the tutorial should be accessible to anyone who has some +previous experience with any programming language; the basics +of using computers are also presupposed, e.g. the use of +text editors and the management of files. +</P> +<P> +We start in <a href="#chaptwo">the second chapter</a> +by building a "Hello World" grammar, which covers greetings +in three languages: English (<I>hello world</I>), +Finnish (<I>terve maailma</I>), and Italian (<I>ciao mondo</I>). +This <B>multilingual grammar</B> is based on the most central idea of GF: +the distinction between <B>abstract syntax</B> +(the logical structure) and <B>concrete syntax</B> (the +sequence of words). +</P> +<P> +From the "Hello World" example, we proceed +in <a href="#chapthree">the third chapter</a> +to a larger grammar for the domain of food. +In this grammar, you can say things like +<center> +<I>this Italian cheese is delicious</I> +</center> +in English and Italian. This grammar illustrates how translation is +more than just replacement of words.
For instance, the order of +words may have to be changed: +<center> +<I>Italian cheese</I> +</P> +<P> +<I>formaggio italiano</I> +</center> +Moreover, words can have different forms, and which forms +they have varies from language to language. For instance, +Italian adjectives usually have four forms where English +has just one: +<center> +<I>delicious</I> (<I>wine, wines, pizza, pizzas</I>) +</P> +<P> +<I>vino delizioso, vini deliziosi, pizza deliziosa, pizze deliziose</I> +</center> +The <B>morphology</B> of a language describes the +forms of its words, and the basics of implementing morphology and +integrating it with syntax are covered in <a href="#chaptwo">the fourth chapter</a>. +</P> +<P> +In GF, the complete description of morphology and syntax in natural +languages is preferably left to the <B>resource grammar library</B>. +Its use is therefore an important part of GF programming, and +it is covered in <a href="#chapfive">the fifth chapter</a>. How to contribute to resource +grammars as an author will only be covered in Part III; +however, the tutorial does go through all the +programming concepts of GF, including those involved in +resource grammars. +</P> +<P> +In addition to multilinguality, <B>semantics</B> is an important aspect of GF +grammars. The "purely linguistic" aspects (morphology and syntax) belong to +the concrete syntax part of GF, whereas semantics is expressed in the abstract +syntax. After the presentation of concrete syntax constructs, we proceed +in <a href="#chapsix">the sixth chapter</a> to the enrichment of abstract syntax with <B>dependent types</B>, +<B>variable bindings</B>, and <B>semantic definitions</B>. +<a href="#chapseven">The seventh chapter</a> concludes the tutorial with technical tips for implementing formal +languages. It will also illustrate the close relation between GF grammars +and compilers by actually implementing a small compiler from C-like statements +and expressions to machine code similar to that of the Java Virtual Machine.
+</P> +<P> +English and Italian are used as example languages in many grammars. +Of course, we will not presuppose that the reader knows any Italian. +We have chosen Italian because it has a rich structure +that illustrates the capabilities of GF very well. +Moreover, even those readers who don't know Italian will find many of +its words familiar, due to the Latin heritage. +The exercises will encourage the reader to +port the examples to other languages as well; in particular, +it should be instructive for the reader to look at her +own native language from the point of view of writing a grammar +implementation. +</P> +<P> +Learning how to write GF grammars is not the only goal of +this tutorial. We will also explain the most important +commands of the GF system, mostly in passing. With these commands, +simple application programs, such as translation and +quiz systems, can be built simply by writing scripts for the +GF system. More complicated applications, such as natural-language +interfaces and dialogue systems, also require programming in +some general-purpose language; such applications are covered in <a href="#chapeight">the eighth chapter</a>. +</P> +<A NAME="toc1"></A> +<H1>Getting started with GF</H1> +<P> +<a name="chaptwo"></a> +</P> +<P> +In this chapter, we will introduce the GF system and write the first GF grammar, +a "Hello World" grammar. While extremely small, this grammar already illustrates +how GF can be used for the tasks of translation and multilingual +generation.
+</P> +<A NAME="toc2"></A> +<H2>What GF is</H2> +<P> +We use the term GF for three different things: +</P> +<UL> +<LI>a <B>system</B> (computer program) used for working with grammars +<LI>a <B>programming language</B> in which grammars can be written +<LI>a <B>theory</B> about grammars and languages +</UL> + +<P> +The relation between these things is obvious: the GF system is an implementation +of the GF programming language, which in turn is built on the ideas of the +GF theory. The main focus of this book is on the GF programming language. +We learn how grammars are written in this language. At the same time, we learn +the way of thinking in the GF theory. To make this all useful and fun, and +to encourage experimenting, we make the grammars run on a computer by +using the GF system. +</P> +<P> +A GF program is called a <B>grammar</B>. A grammar is, traditionally, a +definition of a language. From this definition, different language +processing components can be derived: +</P> +<UL> +<LI><B>parsing</B>: to analyse the language +<LI><B>linearization</B>: to generate the language +<LI><B>translation</B>: to analyse one language and generate another +</UL> + +<P> +A GF grammar is thus a declarative program from which these +procedures can be automatically derived. In general, a GF grammar +is <B>multilingual</B>: it defines many languages and translations between them. +</P> +<A NAME="toc3"></A> +<H2>Getting the GF system</H2> +<P> +The GF system is open-source free software, which can be downloaded via the +GF Homepage: +<center> +<CODE>gf.digitalgrammars.com</CODE> +</center> +There you can download +</P> +<UL> +<LI>binaries for Linux, Mac OS X, and Windows +<LI>source code and documentation +<LI>grammar libraries and examples +</UL> + +<P> +In particular, many of the examples in this book are included in the +subdirectory <CODE>examples/tutorial</CODE> of the source distribution package. 
+This directory is also available +<A HREF="http://digitalgrammars.com/gf/examples/tutorial">online</A>. +</P> +<P> +If you want to compile GF from source, you need a Haskell compiler. +To compile the interactive editor, you also need a Java compiler. +But normally you don't have to compile anything yourself, and you definitely +don't need to know Haskell or Java to use GF. +</P> +<P> +We are assuming the availability of a Unix shell. Linux and Mac OS X users +have it automatically, the latter under the name "terminal". +Windows users are advised to install Cygwin, the free Unix shell for Windows. +</P> +<A NAME="toc4"></A> +<H2>Running the GF system</H2> +<P> +To start the GF system, assuming you have installed it, just type +<CODE>gf</CODE> in the Unix (or Cygwin) shell: +</P> +<PRE> + % gf +</PRE> +<P> +You will see GF's welcome message and the prompt <CODE>></CODE>. +The command +</P> +<PRE> + > help +</PRE> +<P> +will give you a list of available commands. +</P> +<P> +As a common convention in this book, we will use +</P> +<UL> +<LI><CODE>%</CODE> as a prompt that marks system commands +<LI><CODE>></CODE> as a prompt that marks GF commands +</UL> + +<P> +Thus you should not type these prompts, but only the characters that +follow them. +</P> +<A NAME="toc5"></A> +<H2>A "Hello World" grammar</H2> +<P> +The tradition in programming language tutorials is to start with a +program that prints "Hello World" on the terminal. GF should be no +exception. But our program has features that distinguish it from +most "Hello World" programs: +</P> +<UL> +<LI><B>Multilinguality</B>: the message is printed in many languages. +<LI><B>Reversibility</B>: in addition to printing, you can <B>parse</B> the + message and <B>translate</B> it to other languages. +</UL> + +<A NAME="toc6"></A> +<H3>The program: abstract syntax and concrete syntaxes</H3> +<P> +A GF program, in general, is a <B>multilingual grammar</B>.
Its main parts +are +</P> +<UL> +<LI>an <B>abstract syntax</B> +<LI>one or more <B>concrete syntaxes</B> +</UL> + +<P> +The abstract syntax defines, in a language-independent way, what <B>meanings</B> +can be expressed in the grammar. In the "Hello World" grammar we want +to express <I>Greetings</I>, where we greet a <I>Recipient</I>, which can be +<I>World</I> or <I>Mum</I> or <I>Friends</I>. Here is the entire +GF code for the abstract syntax: +</P> +<PRE> + -- a "Hello World" grammar + abstract Hello = { + + flags startcat = Greeting ; + + cat Greeting ; Recipient ; + + fun + Hello : Recipient -> Greeting ; + World, Mum, Friends : Recipient ; + } +</PRE> +<P> +The code has the following parts: +</P> +<UL> +<LI>a <B>comment</B> (optional), saying what the module is doing +<LI>a <B>module header</B> indicating that it is an abstract syntax + module named <CODE>Hello</CODE> +<LI>a <B>module body</B> in braces, consisting of + <UL> + <LI>a <B>startcat flag declaration</B> stating that <CODE>Greeting</CODE> is the + main category, i.e. the one in which parsing and generation are + performed by default + <LI><B>category declarations</B> stating that <CODE>Greeting</CODE> and <CODE>Recipient</CODE> + are categories, i.e. types of meanings + <LI><B>function declarations</B> stating what meaning-building functions there + are; these are the function <CODE>Hello</CODE> constructing a greeting from a recipient, + as well as the three possible recipients + </UL> +</UL> + +<P> +A concrete syntax defines a mapping from the abstract meanings to their +expressions in a language. 
We first give an English concrete syntax: +</P> +<PRE> + concrete HelloEng of Hello = { + + lincat Greeting, Recipient = {s : Str} ; + + lin + Hello recip = {s = "hello" ++ recip.s} ; + World = {s = "world"} ; + Mum = {s = "mum"} ; + Friends = {s = "friends"} ; + } +</PRE> +<P> +The major parts of this code are: +</P> +<UL> +<LI>a module header indicating that it is a concrete syntax of the abstract syntax + <CODE>Hello</CODE>, itself named <CODE>HelloEng</CODE> +<LI>a module body in curly brackets, consisting of + <UL> + <LI><B>linearization type definitions</B> stating that + <CODE>Greeting</CODE> and <CODE>Recipient</CODE> are <B>records</B> with a <B>string</B> <CODE>s</CODE> + <LI><B>linearization definitions</B> telling what records are assigned to + each of the meanings defined in the abstract syntax; the recipients are + linearized to records containing single words, whereas the <CODE>Hello</CODE> greeting + has a function telling that the word <CODE>hello</CODE> is prefixed to the string + <CODE>s</CODE> contained in the record <CODE>recip</CODE> + </UL> +</UL> + +<P> +To make the grammar truly multilingual, we add a Finnish and an Italian concrete +syntax: +</P> +<PRE> + concrete HelloFin of Hello = { + lincat Greeting, Recipient = {s : Str} ; + lin + Hello recip = {s = "terve" ++ recip.s} ; + World = {s = "maailma"} ; + Mum = {s = "äiti"} ; + Friends = {s = "ystävät"} ; + } + + concrete HelloIta of Hello = { + lincat Greeting, Recipient = {s : Str} ; + lin + Hello recip = {s = "ciao" ++ recip.s} ; + World = {s = "mondo"} ; + Mum = {s = "mamma"} ; + Friends = {s = "amici"} ; + } +</PRE> +<P> +Now we have a trilingual grammar usable for translation and +many other tasks, which we will now start experimenting with. 
+</P> +<A NAME="toc7"></A> +<H3>Using the grammar in the GF system</H3> +<P> +In order to compile the grammar in GF, each of the four modules +has to be put into a file named <I>Modulename</I><CODE>.gf</CODE>: +</P> +<PRE> + Hello.gf HelloEng.gf HelloFin.gf HelloIta.gf +</PRE> +<P> +The first GF command needed when using a grammar is to <B>import</B> it. +The command has a long name, <CODE>import</CODE>, and a short name, <CODE>i</CODE>. +When you have started GF (by the shell command <CODE>gf</CODE>), you can thus type either +</P> +<PRE> + > import HelloEng.gf +</PRE> +<P> +or +</P> +<PRE> + > i HelloEng.gf +</PRE> +<P> +to get the same effect. In general, all GF commands have a long and a short name; +short names are convenient when typing commands by hand, whereas long command +names are more readable in scripts, i.e. files that include sequences of commands. +</P> +<P> +The effect of <CODE>import</CODE> is that the GF system <B>compiles</B> your grammar +into an internal representation, and shows a new prompt when it is ready. +It will also show how much CPU time was consumed: +</P> +<PRE> + > i HelloEng.gf + - compiling Hello.gf... wrote file Hello.gfc 8 msec + - compiling HelloEng.gf... wrote file HelloEng.gfc 12 msec + + 12 msec + > +</PRE> +<P> +You can now use GF for <B>parsing</B>: +</P> +<PRE> + > parse "hello world" + Hello World +</PRE> +<P> +The <CODE>parse</CODE> (= <CODE>p</CODE>) command takes a <B>string</B> +(in double quotes) and returns an <B>abstract syntax tree</B> --- the meaning +of the string as defined in the abstract syntax. +A tree is, in general, something easier than a string +for a machine to understand and to process further, although this +is not so obvious in this simple grammar. The syntax for trees is that +of <B>function application</B>, which in GF is written +</P> +<PRE> + function argument1 ... argumentn +</PRE> +<P> +Parentheses are only needed for grouping. 
For instance, <CODE>f (a b)</CODE> is +<CODE>f</CODE> applied to the application of <CODE>a</CODE> to <CODE>b</CODE>. This is different +from <CODE>f a b</CODE>, which is <CODE>f</CODE> applied to <CODE>a</CODE> and <CODE>b</CODE>. +</P> +<P> +Strings that return a tree when parsed do so by virtue of the grammar +you imported. If you try to parse something that is not in the grammar, parsing fails: +</P> +<PRE> + > parse "hello dad" + Unknown words: dad + + > parse "world hello" + no tree found +</PRE> +<P> +In the first example, the failure is caused by an unknown word. +In the second example, the combination of words is ungrammatical. +</P> +<P> +In addition to parsing, you can also use GF for <B>linearization</B> +(<CODE>linearize = l</CODE>). This is the inverse of +parsing, taking trees into strings: +</P> +<PRE> + > linearize Hello World + hello world +</PRE> +<P> +What is the use of this? Typically, you do not type in a tree at +the GF prompt. The utility of linearization comes from the fact that +you can obtain a tree from somewhere else --- for instance, from +a parser. A prime example of this is <B>translation</B>: you parse +with one concrete syntax and linearize with another. Let us +now do this by first importing the Italian grammar: +</P> +<PRE> + > import HelloIta.gf +</PRE> +<P> +We can now parse with <CODE>HelloEng</CODE> and <B>pipe</B> the result +into linearizing with <CODE>HelloIta</CODE>: +</P> +<PRE> + > parse -lang=HelloEng "hello mum" | linearize -lang=HelloIta + ciao mamma +</PRE> +<P> +Notice that, since there are now two concrete syntaxes read into the +system, the commands use a <B>language flag</B> to indicate +which concrete syntax is used in each operation. If no language flag is +given, the last-imported language is applied. 
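+</P> +<P> +As a small sketch of how the language flags work, the translation direction can be +reversed by swapping the two flag values. The example assumes that both +<CODE>HelloEng.gf</CODE> and <CODE>HelloIta.gf</CODE> have been imported as shown above: +</P> +<PRE> + > parse -lang=HelloIta "ciao mondo" | linearize -lang=HelloEng + hello world +</PRE> +<P> +Giving both flags explicitly keeps the command independent of the order in which +the grammars were imported. 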
+</P> +<P> +To conclude the translation exercise, we import the Finnish grammar +and pipe English parsing into <B>multilingual generation</B>: +</P> +<PRE> + > parse -lang=HelloEng "hello friends" | linearize -multi + terve ystävät + ciao amici + hello friends +</PRE> +<P></P> +<P> +<B>Exercise</B>. Test the parsing and translation examples shown above, as well as +some other examples, in different combinations of languages. +</P> +<P> +<B>Exercise</B>. Extend the grammar <CODE>Hello.gf</CODE> and some of the +concrete syntaxes by five new recipients and one new greeting +form. +</P> +<P> +<B>Exercise</B>. Add a concrete syntax for some other +language you might know. +</P> +<P> +<B>Exercise</B>. Add a pair of greetings that are expressed in one and the same way in +one language and in two different ways in another. For instance, <I>good morning</I> +and <I>good afternoon</I> in English are both expressed as <I>buongiorno</I> in Italian. +Test what happens when you translate <I>buongiorno</I> to English in GF. +</P> +<P> +<B>Exercise</B>. Inject errors into the <CODE>Hello</CODE> grammars: for example, leave out +some line, omit a variable in a <CODE>lin</CODE> rule, or change the name in one occurrence +of a variable. Inspect the error messages generated by GF. +</P> +<A NAME="toc8"></A> +<H2>Using grammars from outside GF</H2> +<P> +A normal "hello world" program written in C is executable from the +Unix shell and prints its output on the terminal. This is possible in GF +as well, by using the <CODE>gf</CODE> program in a Unix pipe. The <CODE>gf</CODE> program +can be invoked with grammar names as arguments, +</P> +<PRE> + % gf HelloEng.gf HelloFin.gf HelloIta.gf +</PRE> +<P> +which has the same effect as opening <CODE>gf</CODE> and then importing the +grammars.
A command can be sent to this <CODE>gf</CODE> state by piping it from +Unix's <CODE>echo</CODE> command: +</P> +<PRE> + % echo "l -multi Hello World" | gf HelloEng.gf HelloFin.gf HelloIta.gf +</PRE> +<P> +which will execute the command and then quit. Alternatively, one can write +a <B>script</B>, a file containing the lines +</P> +<PRE> + import HelloEng.gf + import HelloFin.gf + import HelloIta.gf + linearize -multi Hello World +</PRE> +<P> +If we name this script <CODE>hello.gfs</CODE>, we can run +</P> +<PRE> + % gf -batch -s <hello.gfs + + ciao mondo + terve maailma + hello world +</PRE> +<P> +The options <CODE>-batch</CODE> and <CODE>-s</CODE> ("silent") remove prompts, CPU time, +and other messages. Writing GF scripts and Unix shell scripts that call +GF is the simplest way to build application programs that use GF grammars. +In <a href="#chapeight">the eighth chapter</a>, we will see how to build stand-alone programs that don't need +the GF system to run. +</P> +<P> +<B>Exercise</B>. (For Unix hackers.) Write a GF application that reads +an English string from the standard input and writes an Italian +translation to the output. +</P> +<A NAME="toc9"></A> +<H2>What else can be done with the grammar</H2> +<P> +Now we have built our first multilingual grammar and seen the basic +functionalities of GF: parsing and linearization. We have tested +these functionalities inside the GF program. In the forthcoming +chapters, we will build larger grammars and can then get more out of +these functionalities.
But we will also introduce new ones: +</P> +<UL> +<LI><B>morphological analysis</B>: find out the possible inflection forms of words +<LI><B>morphological synthesis</B>: generate all inflection forms of words +<LI><B>random generation</B>: generate random expressions +<LI><B>corpus generation</B>: generate all expressions +<LI><B>treebank generation</B>: generate a list of trees with their linearizations +<LI><B>teaching quizzes</B>: train morphology and translation +<LI><B>multilingual authoring</B>: create a document in many languages simultaneously +<LI><B>speech input</B>: optimize a speech recognition system for a grammar +</UL> + +<P> +The usefulness of GF would be quite limited if grammars were +usable only inside the GF system. In <a href="#chapeight">the eighth chapter</a>, +we will see other ways of using grammars: +</P> +<UL> +<LI>compile them to new formats, such as speech recognition grammars +<LI>embed them in Java and Haskell programs +<LI>build applications using compilation and embedding: + <UL> + <LI>voice commands + <LI>spoken language translators + <LI>dialogue systems + <LI>user interfaces + <LI>localization: parametrize the messages printed by a program + to support different languages + </UL> +</UL> + +<P> +All GF functionalities, both those inside the GF program and those +ported to other environments, +are of course already applicable to the simplest of grammars, +such as the <CODE>Hello</CODE> grammars presented above. But the main focus +of this tutorial will be on grammar writing. Thus we will show +how larger and more expressive grammars can be built by using +the constructs of the GF programming language, before entering the +applications. +</P> +<A NAME="toc10"></A> +<H2>Summary of GF language features</H2> +<P> +As the last section of each chapter, we will give a summary of the GF language +features covered in the chapter. The presentation is rather technical and intended +as a reference for later use, rather than to be read at once. 
The summaries +may cover some new features, which complement the discussion in the main chapter. +</P> +<A NAME="toc11"></A> +<H3>Modules</H3> +<P> +A GF grammar consists of <B>modules</B>, +into which judgements are grouped. The most important +module forms are +</P> +<UL> +<LI><CODE>abstract</CODE> A <CODE>= {...}</CODE> , abstract syntax A with judgements in + the <B>module body</B> <CODE>{...}</CODE>. +<LI><CODE>concrete</CODE> C <CODE>of</CODE> A <CODE>= {...}</CODE>, concrete syntax C of the + abstract syntax A, with judgements in the module body <CODE>{...}</CODE>. +</UL> + +<P> +Each module is written in a file named <I>Modulename</I><CODE>.gf</CODE>. +</P> +<A NAME="toc12"></A> +<H3>Judgements</H3> +<P> +<a name="secjment"></a> +</P> +<P> +Rules in a module body are called <B>judgements</B>. Keywords such as +<CODE>fun</CODE> and <CODE>lin</CODE> are used for distinguishing between +<B>judgement forms</B>. Here is a summary of the most important +judgement forms, which we have considered by now: +</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>form</TH> +<TH>reading</TH> +<TH COLSPAN="2">module type</TH> +</TR> +<TR> +<TD><CODE>cat</CODE> <I>C</I></TD> +<TD><I>C</I> is a category</TD> +<TD>abstract</TD> +</TR> +<TR> +<TD><CODE>fun</CODE> <I>f</I> <CODE>:</CODE> <I>A</I></TD> +<TD><I>f</I> is a function of type <I>A</I></TD> +<TD>abstract</TD> +</TR> +<TR> +<TD><CODE>lincat</CODE> <I>C</I> <CODE>=</CODE> <I>T</I></TD> +<TD>category <I>C</I> has linearization type <I>T</I></TD> +<TD>concrete</TD> +</TR> +<TR> +<TD><CODE>lin</CODE> <I>f <i>x</i><sub>1</sub> ... 
<i>x</i><sub>n</sub></I> <CODE>=</CODE> <I>t</I></TD> +<TD>function <I>f</I> has linearization <I>t</I></TD> +<TD>concrete</TD> +</TR> +<TR> +<TD><CODE>flags</CODE> <I>p</I> <CODE>=</CODE> <I>v</I></TD> +<TD>flag <I>p</I> has value <I>v</I></TD> +<TD>any</TD> +</TR> +</TABLE> + +<P></P> +<P> +Both abstract and concrete modules may moreover contain <B>comments</B> of the forms +</P> +<UL> +<LI><CODE>--</CODE> <I>anything until a newline</I> +<LI><CODE>{-</CODE> <I>anything except hyphen followed by closing brace</I> <CODE>-}</CODE> +</UL> + +<P> +Judgements are terminated by semicolons. Shorthands permit the sharing of +the keyword in subsequent judgements, +</P> +<PRE> + cat C ; D ; === cat C ; cat D ; +</PRE> +<P> +and of the right-hand side in subsequent judgements of the same form +</P> +<PRE> + fun f, g : A ; === fun f : A ; g : A ; +</PRE> +<P> +We will use the symbol <CODE>===</CODE> to indicate <B>syntactic sugar</B> when +speaking about GF. Thus it is not a symbol of the GF language. +</P> +<P> +Each judgement declares a <B>name</B>, which is an <B>identifier</B>. +An identifier is a letter followed by a sequence of letters, digits, and +characters <CODE>'</CODE> or <CODE>_</CODE>. Each identifier can only be +defined once in the same module (that is, in the position next to the judgement keyword; +local variables such as those in <CODE>lin</CODE> judgements can be +reused in other judgements). +</P> +<P> +Names are in <B>scope</B> in the rest of the module, i.e. usable in the other +judgements of the module (subject to type restrictions, of course). The name +of the module is also an identifier in scope. +</P> +<P> +The order of judgements in a module is free. In particular, an identifier +need not be declared before it is used. +</P> +<A NAME="toc13"></A> +<H3>Types and terms</H3> +<P> +A <B>type</B> in an abstract syntax is either a <B>basic type</B>, +i.e.
one introduced in a <CODE>cat</CODE> judgement, or a +<B>function type</B> of the form +</P> +<PRE> + A1 -> ... -> An -> A +</PRE> +<P> +where each of <CODE>A1, ..., An, A</CODE> is a basic type. +The last type in the arrow-separated sequence +is the <B>value type</B> of the function type, and the earlier types are +its <B>argument types</B>. +</P> +<P> +In a concrete syntax, the available types include +</P> +<UL> +<LI>the type of <B>token lists</B>, <CODE>Str</CODE> +<LI><B>record types</B> of the form <CODE>{</CODE> r1 : T1 ; ... ; rn : Tn <CODE>}</CODE> +</UL> + +<P> +Token lists are often briefly called <B>strings</B>. +</P> +<P> +Each semicolon-separated part in a record type is called a +<B>field</B>. The identifier introduced by the left-hand side of a field +is called a <B>label</B>. +</P> +<P> +A <B>term</B> in abstract syntax is a <B>function application</B> of the form +</P> +<PRE> + f a1 ... an +</PRE> +<P> +where <CODE>f</CODE> is a function declared in a <CODE>fun</CODE> judgement and <CODE>a1 ... an</CODE> +are terms. These terms are also called <B>abstract syntax trees</B>, or just +<B>trees</B>. +The tree above is well-typed and has the type A, if +</P> +<PRE> + f : A1 -> ... -> An -> A +</PRE> +<P> +and each <CODE>ai</CODE> has type <CODE>Ai</CODE>. +</P> +<P> +A term used in concrete syntax has one of the forms +</P> +<UL> +<LI>quoted string: <CODE>"foo"</CODE>, of type <CODE>Str</CODE> +<LI>concatenation of strings: <CODE>"foo" ++ "bar"</CODE>, of type <CODE>Str</CODE> +<LI>record: <CODE>{</CODE> r1 = t1 ; ... ; rn = tn <CODE>}</CODE>, + of type <CODE>{</CODE> r1 : R1 ; ...
; rn : Rn <CODE>}</CODE> +<LI>projection <CODE>t.r</CODE> of a term <CODE>t</CODE> that has a record type, + with the record label <CODE>r</CODE>; the projection has the corresponding record + field type +<LI>argument variable <CODE>x</CODE> bound by the left-hand side of a <CODE>lin</CODE> rule, + of the corresponding linearization type +</UL> + +<P> +Each quoted string is treated as one <B>token</B>, and strings concatenated by +<CODE>++</CODE> are treated as separate tokens. Tokens are, by default, written with +a space in between. This behaviour can be changed by <CODE>lexer</CODE> and <CODE>unlexer</CODE> +flags, as will be explained later. Because tokens are separated by spaces, it is usually +not correct to have a space in a token. Writing +</P> +<PRE> + "hello world" +</PRE> +<P> +in a grammar would give the parser the task of finding a token with a space +in it, rather than two tokens <CODE>"hello"</CODE> and <CODE>"world"</CODE>. If the latter is +what is meant, it is possible to use the shorthand +</P> +<PRE> + ["hello world"] === "hello" ++ "world" +</PRE> +<P> +The <B>empty string</B> is denoted by <CODE>[]</CODE> or, equivalently, <CODE>""</CODE> or <CODE>[""]</CODE>. +</P> +<A NAME="toc14"></A> +<H3>Type checking</H3> +<P> +An important functionality of the GF system is <B>static type checking</B>. +This means that the grammars are checked to be well-formed, so that all +run-time errors are eliminated.
The main type checking principles are the +following: +</P> +<UL> +<LI>a concrete syntax must define the <CODE>lincat</CODE> of each <CODE>cat</CODE> and a <CODE>lin</CODE> + for each <CODE>fun</CODE> in the abstract syntax that it is "<CODE>of</CODE>" +<LI><CODE>lin</CODE> rules are type checked with respect to the <CODE>lincat</CODE> and <CODE>fun</CODE> + rules +<LI>terms have types as defined in the previous section +</UL> + +<A NAME="toc15"></A> +<H1>Designing a grammar for complex phrases</H1> +<P> +<a name="chapthree"></a> +</P> +<P> +In this chapter, we will write a grammar that has much more structure than +the <CODE>Hello</CODE> grammar. We will look at how the abstract syntax +is divided into suitable categories, and how infinitely many +phrases can be generated by using recursive rules. We will also +introduce modularity by showing how a grammar can be +divided into modules, and how functional programming +can be used to share code in and among modules. +</P> +<A NAME="toc16"></A> +<H2>The abstract syntax Food</H2> +<P> +We will write a grammar that +defines a set of phrases usable for speaking about food: +</P> +<UL> +<LI>the start category is <CODE>Phrase</CODE> +<LI>a <CODE>Phrase</CODE> can be built by assigning a <CODE>Quality</CODE> to an <CODE>Item</CODE> + (e.g. <I>this cheese is Italian</I>) +<LI>an <CODE>Item</CODE> is built from a <CODE>Kind</CODE> by prefixing <I>this</I> or <I>that</I> + (e.g. <I>this wine</I>) +<LI>a <CODE>Kind</CODE> is either <B>atomic</B> (e.g. <I>cheese</I>), or formed by + qualifying a given <CODE>Kind</CODE> with a <CODE>Quality</CODE> (e.g. <I>Italian cheese</I>) +<LI>a <CODE>Quality</CODE> is either atomic (e.g. <I>Italian</I>), + or built by modifying a given <CODE>Quality</CODE> with the word <I>very</I> (e.g.
<I>very warm</I>) +</UL> + +<P> +These verbal descriptions can be expressed as the following abstract syntax: +</P> +<PRE> + abstract Food = { + + flags startcat = Phrase ; + + cat + Phrase ; Item ; Kind ; Quality ; + + fun + Is : Item -> Quality -> Phrase ; + This, That : Kind -> Item ; + QKind : Quality -> Kind -> Kind ; + Wine, Cheese, Fish : Kind ; + Very : Quality -> Quality ; + Fresh, Warm, Italian, Expensive, Delicious, Boring : Quality ; + } +</PRE> +<P> +In this abstract syntax, we can build <CODE>Phrase</CODE>s such as +</P> +<PRE> + Is (This (QKind Delicious (QKind Italian Wine))) (Very (Very Expensive)) +</PRE> +<P> +In the English concrete syntax, we will want to linearize this into +</P> +<PRE> + this delicious Italian wine is very very expensive +</PRE> +<P></P> +<A NAME="toc17"></A> +<H2>The concrete syntax FoodEng</H2> +<P> +The English concrete syntax gives no surprises: +</P> +<PRE> + concrete FoodEng of Food = { + + lincat + Phrase, Item, Kind, Quality = {s : Str} ; + + lin + Is item quality = {s = item.s ++ "is" ++ quality.s} ; + This kind = {s = "this" ++ kind.s} ; + That kind = {s = "that" ++ kind.s} ; + QKind quality kind = {s = quality.s ++ kind.s} ; + Wine = {s = "wine"} ; + Cheese = {s = "cheese"} ; + Fish = {s = "fish"} ; + Very quality = {s = "very" ++ quality.s} ; + Fresh = {s = "fresh"} ; + Warm = {s = "warm"} ; + Italian = {s = "Italian"} ; + Expensive = {s = "expensive"} ; + Delicious = {s = "delicious"} ; + Boring = {s = "boring"} ; + } +</PRE> +<P> +Let us test how the grammar works in parsing: +</P> +<PRE> + > import FoodEng.gf + > parse "this delicious wine is very very Italian" + Is (This (QKind Delicious Wine)) (Very (Very Italian)) +</PRE> +<P> +We can also try parsing in other categories than the <CODE>startcat</CODE>, +by setting the command-line <CODE>cat</CODE> flag: +</P> +<PRE> + p -cat=Kind "very Italian wine" + QKind (Very Italian) Wine +</PRE> +<P></P> +<P> +<B>Exercise</B>. 
Extend the <CODE>Food</CODE> grammar by ten new food kinds and +qualities, and run the parser with new kinds of examples. +</P> +<P> +<B>Exercise</B>. Add a rule that enables question phrases of the form +<I>is this cheese Italian</I>. +</P> +<P> +<B>Exercise</B>. Enable the optional prefixing of +phrases with the words "excuse me but". Do this in such a way that +the prefix can occur at most once. +</P> +<A NAME="toc18"></A> +<H2>Commands for testing grammars</H2> +<A NAME="toc19"></A> +<H3>Generating trees and strings</H3> +<P> +When we have a grammar above a trivial size, especially a recursive +one, we need more efficient ways of testing it than just by parsing +sentences that happen to come to our minds. One way to do this is +based on automatic generation, which can be either +<B>random generation</B> or <B>exhaustive generation</B>. +</P> +<P> +Random generation (<CODE>generate_random = gr</CODE>) is an operation that +builds a random tree in accordance with an abstract syntax: +</P> +<PRE> + > generate_random + Is (This (QKind Italian Fish)) Fresh +</PRE> +<P> +By using a pipe, random generation can be fed into linearization: +</P> +<PRE> + > generate_random | linearize + this Italian fish is fresh +</PRE> +<P> +Random generation is a good way to test a grammar. It can also give results +that are surprising, which shows how fast we lose intuition +when we write complex grammars. +</P> +<P> +By using the <CODE>number</CODE> flag, several trees can be generated +in one command: +</P> +<PRE> + > gr -number=10 | l + that wine is boring + that fresh cheese is fresh + that cheese is very boring + this cheese is Italian + that expensive cheese is expensive + that fish is fresh + that wine is very Italian + this wine is Italian + this cheese is boring + this fish is boring +</PRE> +<P> +To generate <I>all</I> phrases that a grammar can produce, +GF provides the command <CODE>generate_trees = gt</CODE>. 
+</P> +<PRE> + > generate_trees | l + that cheese is very Italian + that cheese is very boring + that cheese is very delicious + that cheese is very expensive + that cheese is very fresh + ... + this wine is expensive + this wine is fresh + this wine is warm + +</PRE> +<P> +We get quite a few trees but not all of them: only up to a given +<B>depth</B> of trees. The default depth is 3; the depth can be +set by using the <CODE>depth</CODE> flag: +</P> +<PRE> + > generate_trees -depth=5 | l +</PRE> +<P> +Other options to the generation commands (like all commands) can be seen +by GF's <CODE>help = h</CODE> command: +</P> +<PRE> + > help gr + > help gt +</PRE> +<P></P> +<P> +<B>Exercise</B>. If the command <CODE>gt</CODE> generated all +trees in your grammar, it would never terminate. Why? +</P> +<P> +<B>Exercise</B>. Measure how many trees the grammar gives with depths 4 and 5, +respectively. <B>Hint</B>. You can +use the Unix <B>word count</B> command <CODE>wc</CODE> to count lines. +</P> +<A NAME="toc20"></A> +<H3>More on pipes; tracing</H3> +<P> +A pipe of GF commands can have any length, but the "output type" +(either string or tree) of one command must always match the "input type" +of the next command, in order for the result to make sense. +</P> +<P> +The intermediate results in a pipe can be observed by putting the +<B>tracing</B> option <CODE>-tr</CODE> to each command whose output you +want to see: +</P> +<PRE> + > gr -tr | l -tr | p + + Is (This Cheese) Boring + this cheese is boring + Is (This Cheese) Boring +</PRE> +<P> +This facility is useful for test purposes: the pipe above can show +if a grammar is <B>ambiguous</B>, i.e. +contains strings that can be parsed in more than one way. +</P> +<P> +<B>Exercise</B>. Extend the <CODE>Food</CODE> grammar so that it produces ambiguous +strings, and try out the ambiguity test. 
+</P> +<A NAME="toc21"></A> +<H3>Writing and reading files</H3> +<P> +To save the output of GF commands into a file, you can +pipe it to the <CODE>write_file = wf</CODE> command, +</P> +<PRE> + > gr -number=10 | linearize | write_file exx.tmp +</PRE> +<P> +You can read the file back to GF with the +<CODE>read_file = rf</CODE> command, +</P> +<PRE> + > read_file exx.tmp | parse -lines +</PRE> +<P> +Notice the flag <CODE>-lines</CODE> given to the parsing +command. This flag tells GF to parse each line of +the file separately. Without the flag, the grammar could +not recognize the string in the file, because it is not +a sentence but a sequence of ten sentences. +</P> +<P> +Files with examples can be used for <B>regression testing</B> +of grammars. The most systematic way to do this is by +generating treebanks; see <a href="#sectreebank">here</a>. +</P> +<A NAME="toc22"></A> +<H3>Visualizing trees</H3> +<P> +The parenthesized code returned by the parser does not +look like a tree. Why is it called one? From the abstract mathematical +point of view, trees are a data structure that +represents <B>nesting</B>: trees are branching entities, and the branches +are themselves trees. Parentheses give a linear representation of trees, +useful for the computer. But the human eye may prefer to see a visualization; +for this purpose, GF provides the command <CODE>visualize_tree = vt</CODE>, to which +parsing (and any other tree-producing command) can be piped: +</P> +<PRE> + > parse "this delicious cheese is very Italian" | visualize_tree +</PRE> +<P></P> +<P> +<IMG ALIGN="middle" SRC="mytree.png" BORDER="0" ALT=""> +</P> +<P> +This command uses the programs Graphviz and Ghostview, which you +might not have, but which are freely available on the web. +</P> +<P> +Alternatively, you can print the tree into a file, +e.g. a <CODE>.png</CODE> file that +can be viewed with e.g. an HTML browser and also included in an +HTML document.
You can do this +by saving the file <CODE>grphtmp.dot</CODE>, which the command <CODE>vt</CODE> +produces. Then you can process this file with the <CODE>dot</CODE> +program (from the Graphviz package). +</P> +<PRE> + % dot -Tpng grphtmp.dot > mytree.png +</PRE> +<P></P> +<A NAME="toc23"></A> +<H3>System commands</H3> +<P> +If you don't have Ghostview, or want to view graphs in some other way, +you can call <CODE>dot</CODE> and a suitable +viewer (e.g. <CODE>open</CODE> on Mac OS X) without leaving GF, by using +a <B>system command</B>: <CODE>!</CODE> followed by a Unix command, +</P> +<PRE> + > ! dot -Tpng grphtmp.dot > mytree.png + > ! open mytree.png +</PRE> +<P> +Another form of system command is one that receives its input from +a GF pipe. The escape symbol +is then <CODE>?</CODE>. +</P> +<PRE> + > generate_trees | ? wc +</PRE> +<P></P> +<P> +<B>Exercise</B>. (Exercise from 3.3.1 revisited.) +Measure how many trees the grammar <CODE>FoodEng</CODE> gives with depths 4 and 5, +respectively. Use the Unix <B>word count</B> command <CODE>wc</CODE> to count lines, and +a pipe from a GF command into a Unix command. 
+</P> +<A NAME="toc24"></A> +<H2>An Italian concrete syntax</H2> +<P> +<a name="secanitalian"></a> +</P> +<P> +We write the Italian grammar in a straightforward way, by replacing +English words with their dictionary equivalents: +</P> +<PRE> + concrete FoodIta of Food = { + + lincat + Phrase, Item, Kind, Quality = {s : Str} ; + + lin + Is item quality = {s = item.s ++ "è" ++ quality.s} ; + This kind = {s = "questo" ++ kind.s} ; + That kind = {s = "quello" ++ kind.s} ; + QKind quality kind = {s = kind.s ++ quality.s} ; + Wine = {s = "vino"} ; + Cheese = {s = "formaggio"} ; + Fish = {s = "pesce"} ; + Very quality = {s = "molto" ++ quality.s} ; + Fresh = {s = "fresco"} ; + Warm = {s = "caldo"} ; + Italian = {s = "italiano"} ; + Expensive = {s = "caro"} ; + Delicious = {s = "delizioso"} ; + Boring = {s = "noioso"} ; + } +</PRE> +<P> +An alert reader, or one who already knows Italian, may notice one point in +which the change is more substantial than just replacement of words: the order of +a quality and the kind it modifies in +</P> +<PRE> + QKind quality kind = {s = kind.s ++ quality.s} ; +</PRE> +<P> +Thus Italian says <CODE>vino italiano</CODE> for <CODE>Italian wine</CODE>. (Some Italian adjectives +are put before the noun. This distinction can be controlled by parameters, which +are introduced in <a href="#chaptwo">the fourth chapter</a>.) +</P> +<P> +<B>Exercise</B>. Write a concrete syntax of <CODE>Food</CODE> for some other language. +You will probably end up with grammatically incorrect linearizations --- but don't +worry about this yet. +</P> +<P> +<B>Exercise</B>. If you have written <CODE>Food</CODE> for German, Swedish, or some +other language, test with random or exhaustive generation what constructs +come out incorrect, and prepare a list of those ones that cannot be helped +with the currently available fragment of GF. You can return to your list +after having worked out <a href="#chaptwo">the fourth chapter</a>. 
+</P> +<A NAME="toc25"></A> +<H2>Free variation</H2> +<P> +Sometimes there are alternative ways to define a concrete syntax. +For instance, if we use the <CODE>Food</CODE> grammars in a restaurant phrase +book, we may want to accept different words for expressing the quality +"delicious" --- and different languages can differ in how many +such words they have. Then we don't want to put the distinctions into +the abstract syntax, but into concrete syntaxes. Such semantically +neutral distinctions are known as <B>free variation</B> in linguistics. +</P> +<P> +The <CODE>variants</CODE> construct of GF expresses free variation. For example, +</P> +<PRE> + lin Delicious = {s = variants {"delicious" ; "exquisite" ; "tasty"}} ; +</PRE> +<P> +says that <CODE>Delicious</CODE> can be linearized to any of <I>delicious</I>, +<I>exquisite</I>, and <I>tasty</I>. As a consequence, all three words result in the +tree <CODE>Delicious</CODE> when parsed. By default, the <CODE>linearize</CODE> command +shows only the first variant from each <CODE>variants</CODE> list; to see them +all, the option <CODE>-all</CODE> can be used: +</P> +<PRE> + > p "this exquisite wine is delicious" | l -all + this delicious wine is delicious + this delicious wine is exquisite + ... +</PRE> +<P> +In linguistics, it is well known that free variation is almost +non-existent if all aspects of expressions are taken into account, including style. +Therefore, free variation should not be used in grammars that are meant as +libraries for other grammars, as in <a href="#chapfive">the fifth chapter</a>. However, in a specific +application, free variation is an excellent way to scale up the parser to +variations in user input that make no difference in the semantic +treatment. +</P> +<P> +An example that clearly illustrates these points is +English negation. 
If we added to the <CODE>Food</CODE> grammar the negation +of a quality, we could accept both contracted and uncontracted <I>not</I>: +</P> +<PRE> + fun IsNot : Item -> Quality -> Phrase ; + lin IsNot item qual = + {s = item.s ++ variants {"isn't" ; ["is not"]} ++ qual.s} ; +</PRE> +<P> +Both forms are likely to occur in user input. Since there is no +corresponding contrast in Italian, we do not want to put the distinction +in the abstract syntax. Yet there is a stylistic difference between +these two forms. In particular, if we are doing generation rather +than parsing, we will want to choose one or +the other depending on the kind of language we want to generate. +</P> +<P> +A limiting case of free variation is an empty variant list: +</P> +<PRE> + variants {} +</PRE> +<P> +It can be used e.g. if a word lacks a certain inflection form. +</P> +<P> +Free variation works for all types in concrete syntax; all terms in +a <CODE>variants</CODE> list must be of the same type. +</P> +<P> +<B>Exercise</B>. Modify <CODE>FoodIta</CODE> in such a way that a quality can +be assigned to an item by using two different word orders, exemplified +by <I>questo vino è delizioso</I> and <I>è delizioso questo vino</I> +(a real variation in Italian), +and that it is impossible to say that something is boring +(a rather contrived example). +</P> +<A NAME="toc26"></A> +<H2>More applications of multilingual grammars</H2> +<A NAME="toc27"></A> +<H3>Multilingual treebanks</H3> +<P> +<a name="sectreebank"></a> +</P> +<P> +A <B>multilingual treebank</B> is a set of trees with their +translations in different languages: +</P> +<PRE> + > gr -number=2 | tree_bank + + Is (That Cheese) (Very Boring) + quello formaggio è molto noioso + that cheese is very boring + + Is (That Cheese) Fresh + quello formaggio è fresco + that cheese is fresh +</PRE> +<P> +There is also an XML format for treebanks and a set of commands +suitable for regression testing; see <CODE>help tb</CODE> for more details. 
+</P> +<A NAME="toc28"></A> +<H3>Translation session</H3> +<P> +If translation is what you want to do with a set of grammars, a convenient +way to do it is to open a <CODE>translation_session = ts</CODE>. In this session, +you can translate between all the languages that are in scope. +A dot <CODE>.</CODE> terminates the translation session. +</P> +<PRE> + > ts + + trans> that very warm cheese is boring + quello formaggio molto caldo è noioso + that very warm cheese is boring + + trans> questo vino molto italiano è molto delizioso + questo vino molto italiano è molto delizioso + this very Italian wine is very delicious + + trans> . + > +</PRE> +<P></P> +<A NAME="toc29"></A> +<H3>Translation quiz</H3> +<P> +This is a simple language exercise that can be automatically +generated from a multilingual grammar. The system generates a set of +random sentences, displays them in one language, and checks the user's +answer given in another language. The command <CODE>translation_quiz = tq</CODE> +runs the quiz in a subshell of GF. +</P> +<PRE> + > translation_quiz FoodEng FoodIta + + Welcome to GF Translation Quiz. + The quiz is over when you have done at least 10 examples + with at least 75 % success. + You can interrupt the quiz by entering a line consisting of a dot ('.'). + + this fish is warm + questo pesce è caldo + > Yes. + Score 1/1 + + this cheese is Italian + questo formaggio è noioso + > No, not questo formaggio è noioso, but + questo formaggio è italiano + + Score 1/2 + this fish is expensive +</PRE> +<P> +You can also generate a list of translation exercises and save it in a +file for later use, by the command <CODE>translation_list = tl</CODE>: +</P> +<PRE> + > translation_list -number=25 FoodEng FoodIta | write_file transl.txt +</PRE> +<P> +The <CODE>number</CODE> flag gives the number of sentences generated. 
+</P> +<A NAME="toc30"></A> +<H3>Multilingual syntax editing</H3> +<P> +<a name="secediting"></a> +</P> +<P> +Any multilingual grammar can be used in the graphical syntax editor, which is +opened by the shell +command <CODE>gfeditor</CODE> followed by the names of the grammar files. +Thus +</P> +<PRE> + % gfeditor FoodEng.gf FoodIta.gf +</PRE> +<P> +opens the editor for the two <CODE>Food</CODE> grammars. +</P> +<P> +The editor supports commands for manipulating an abstract syntax tree. +The process is started by choosing a category from the "New" menu. +Choosing <CODE>Phrase</CODE> creates a new tree of type <CODE>Phrase</CODE>. A new tree +is in general completely unknown: it consists of a <B>metavariable</B> +<CODE>?1</CODE>. However, since the category <CODE>Phrase</CODE> in <CODE>Food</CODE> has +only one possible constructor, <CODE>Is</CODE>, the tree is readily +given the form <CODE>Is ?1 ?2</CODE>. Here is what the editor looks like at +this stage: +</P> +<P> + <IMG ALIGN="right" SRC="food1.png" BORDER="0" ALT=""> +</P> +<P> +Editing goes on by <B>refinements</B>, i.e. choices of constructors from +the menu, until no metavariables remain. Here is a tree resulting from the +current editing session: +</P> +<P> + <IMG ALIGN="right" SRC="food2.png" BORDER="0" ALT=""> +</P> +<P> +Editing can be continued even when the tree is finished. The user can shift +the <B>focus</B> to any subtree by clicking on it or on the corresponding +part of a linearization. In the picture, the focus is on "fish". +Since there are no metavariables, +the menu shows no refinements, but some other possible actions: +</P> +<UL> +<LI>to <B>change</B> "fish" to "cheese" or "wine" +<LI>to <B>delete</B> "fish", i.e. change it to a metavariable +<LI>to <B>wrap</B> "fish" in a qualification, i.e. change it to + <CODE>QKind ? 
Fish</CODE>, where the quality can be given in a later refinement +</UL> + +<P> +In addition to menu-based editing, the tool supports refinement by parsing, +which is accessible by middle-clicking in the tree or in the linearization field. +</P> +<P> +<B>Exercise</B>. Construct the sentence +<I>this very expensive cheese is very very delicious</I> +and its Italian translation by using <CODE>gfeditor</CODE>. +</P> +<A NAME="toc31"></A> +<H2>Context-free grammars and GF</H2> +<P> +Readers not familiar with context-free grammars, also known as BNF grammars, can +skip this section. Those who are familiar with them will find here the exact +relation between GF and context-free grammars. We will moreover show how +the BNF format can be used as input to the GF program; it is often more +concise than GF proper, but also more restricted in expressive power. +</P> +<A NAME="toc32"></A> +<H3>The "cf" grammar format</H3> +<P> +The grammar <CODE>FoodEng</CODE> could be written in a BNF format as follows: +</P> +<PRE> + Is. Phrase ::= Item "is" Quality ; + That. Item ::= "that" Kind ; + This. Item ::= "this" Kind ; + QKind. Kind ::= Quality Kind ; + Cheese. Kind ::= "cheese" ; + Fish. Kind ::= "fish" ; + Wine. Kind ::= "wine" ; + Italian. Quality ::= "Italian" ; + Boring. Quality ::= "boring" ; + Delicious. Quality ::= "delicious" ; + Expensive. Quality ::= "expensive" ; + Fresh. Quality ::= "fresh" ; + Very. Quality ::= "very" Quality ; + Warm. Quality ::= "warm" ; +</PRE> +<P> +In this format, each rule is prefixed by a <B>label</B> that names +the constructor function that GF declares in its <CODE>fun</CODE> rules. 
In fact, +each context-free rule is a fusion of a <CODE>fun</CODE> and a <CODE>lin</CODE> rule: +it states simultaneously that +</P> +<UL> +<LI>the label is a function from the nonterminal categories + on the right-hand side to the category on the left-hand side; + the first rule gives +<PRE> + fun Is : Item -> Quality -> Phrase +</PRE> +<LI>trees built by the label are linearized in the way indicated + by the right-hand side; + the first rule gives +<PRE> + lin Is item quality = {s = item.s ++ "is" ++ quality.s} +</PRE> +</UL> + +<P> +The translation from BNF to GF described above is in fact used in +the GF system to convert BNF grammars into GF. BNF files are recognized +by the file name suffix <CODE>.cf</CODE>; thus the grammar above can be +put into a file named <CODE>food.cf</CODE> and read into GF by +</P> +<PRE> + > import food.cf +</PRE> +<P></P> +<A NAME="toc33"></A> +<H3>Restrictions of context-free grammars</H3> +<P> +Even though we managed to write <CODE>FoodEng</CODE> in the context-free format, +we cannot do this for GF grammars in general. It is enough to try this +with <CODE>FoodIta</CODE> at the same time as <CODE>FoodEng</CODE>: +we lose an important aspect of multilinguality, namely +that the order of constituents is defined only in concrete syntax. +Thus we could not use context-free <CODE>FoodEng</CODE> and <CODE>FoodIta</CODE> in a multilingual +grammar that supports translation via common abstract syntax: the +qualification function <CODE>QKind</CODE> has different types in the two +grammars. 
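+</P>
+<P>
+To make the mismatch concrete, here is a sketch of the two qualification rules
+side by side. The English rule is the one from <CODE>food.cf</CODE> above; the Italian
+rule is our own illustration of what a context-free <CODE>FoodIta</CODE> would need,
+not part of any existing grammar file. The <CODE>--</CODE> remarks are annotations for the
+reader and are not claimed to be valid syntax in <CODE>.cf</CODE> files.
+</P>
+<PRE>
+  -- English: quality before kind
+  QKind. Kind ::= Quality Kind ;    -- gives  fun QKind : Quality -> Kind -> Kind
+
+  -- Italian (hypothetical rule): kind before quality
+  QKind. Kind ::= Kind Quality ;    -- gives  fun QKind : Kind -> Quality -> Kind
+</PRE>
+<P>
+Since the argument order of the label follows the word order of the right-hand side,
+the two rules cannot share one <CODE>fun</CODE> judgement.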
+</P> +<P> +In general terms, the separation of concrete and abstract syntax allows +three deviations from context-free grammar: +</P> +<UL> +<LI><B>permutation</B>: changing the order of constituents +<LI><B>suppression</B>: omitting constituents +<LI><B>reduplication</B>: repeating constituents +</UL> + +<P> +The third property is the one that definitely shows that GF is +stronger than context-free: GF can define the <B>copy language</B> +<CODE>{x x | x <- (a|b)*}</CODE>, which is known not to be context-free. +The other properties have more to do with the kind of trees that +the grammar can associate with strings: permutation is important +in multilingual grammars, and suppression is exploited in grammars +where trees carry some hidden semantic information (see <a href="#chapsix">the sixth chapter</a> +below). +</P> +<P> +Of course, context-free grammars are also restricted from the +grammar engineering point of view. They give no support to +modules, functions, and parameters, which are so central +for the productivity of GF. Despite the initial conciseness +of context-free grammars, GF can easily produce grammars where +30 lines of GF code would need hundreds of lines of +context-free grammar code to produce; see exercises +<a href="#secitalian">here</a> and <a href="#sectense">here</a>. +</P> +<P> +<B>Exercise</B>. GF can also interpret unlabelled BNF grammars, by +creating labels automatically. The right-hand sides of BNF rules +can moreover be disjunctions, e.g. +</P> +<PRE> + Quality ::= "fresh" | "Italian" | "very" Quality ; +</PRE> +<P> +Experiment with this format in GF, possibly with a grammar that +you import from some other source, such as a programming language +document. +</P> +<P> +<B>Exercise</B>. Define the copy language <CODE>{x x | x <- (a|b)*}</CODE> in GF. +</P> +<A NAME="toc34"></A> +<H2>Modules and files</H2> +<P> +GF uses suffixes to recognize different file formats. 
The most +important ones are: +</P> +<UL> +<LI>Source files: <I>Modulename</I><CODE>.gf</CODE> +<LI>Target files: <I>Modulename</I><CODE>.gfc</CODE> +</UL> + +<P> +When you import <CODE>FoodEng.gf</CODE>, you see the target files being +generated: +</P> +<PRE> + > i FoodEng.gf + - compiling Food.gf... wrote file Food.gfc 16 msec + - compiling FoodEng.gf... wrote file FoodEng.gfc 20 msec +</PRE> +<P> +You also see that the GF program does not only read the file +<CODE>FoodEng.gf</CODE>, but also all other files that it +depends on --- in this case, <CODE>Food.gf</CODE>. +</P> +<P> +For each file that is compiled, a <CODE>.gfc</CODE> file +is generated. The GFC format (="GF Canonical") is the +"machine code" of GF, which is faster to process than +GF source files. When reading a module, GF decides whether +to use an existing <CODE>.gfc</CODE> file or to generate +a new one, by looking at modification times. +</P> +<P> +<I>In GF version 3, the</I> <CODE>gfc</CODE> <I>format is replaced by the format suffixed</I> +<CODE>gfo</CODE>, <I>"GF object"</I>. +</P> +<P> +<B>Exercise</B>. What happens when you import <CODE>FoodEng.gf</CODE> for +a second time? Try this in different situations: +</P> +<UL> +<LI>Right after importing it the first time (the modules are kept in + the memory of GF and need no reloading). +<LI>After issuing the command <CODE>empty</CODE> (<CODE>e</CODE>), which clears the memory + of GF. +<LI>After making a small change in <CODE>FoodEng.gf</CODE>, be it only an added space. +<LI>After making a change in <CODE>Food.gf</CODE>. +</UL> + +<A NAME="toc35"></A> +<H2>Using operations and resource modules</H2> +<A NAME="toc36"></A> +<H3>The golden rule of functional programming</H3> +<P> +When writing a grammar, you have to type lots of +characters. You have probably +done this by the copy-and-paste method, which is a universally +available way to avoid repeating work. 
+</P> +<P> +However, there is a more elegant way to avoid repeating work than +the copy-and-paste +method. The <B>golden rule of functional programming</B> says that +</P> +<UL> +<LI>whenever you find yourself programming by copy-and-paste, + write a function instead. +</UL> + +<P> +A function separates the shared parts of different computations from the +changing parts, its <B>arguments</B>, or <B>parameters</B>. +In functional programming languages, such as +Haskell, it is possible to share much more +code with functions than in languages such as C and Java, because +of higher-order functions (functions that take functions as arguments). +</P> +<A NAME="toc37"></A> +<H3>Operation definitions</H3> +<P> +GF is a functional programming language, not only in the sense that +the abstract syntax is a system of functions (<CODE>fun</CODE>), but also because +functional programming can be used when defining concrete syntax. This is +done by using a new form of judgement, with the keyword <CODE>oper</CODE> (for +<B>operation</B>), distinct from <CODE>fun</CODE> for the sake of clarity. +Here is a simple example of an operation: +</P> +<PRE> + oper ss : Str -> {s : Str} = \x -> {s = x} ; +</PRE> +<P> +The operation can be <B>applied</B> to an argument, and GF will +<B>compute</B> the application into a value. For instance, +</P> +<PRE> + ss "boy" ===> {s = "boy"} +</PRE> +<P> +We use the symbol <CODE>===></CODE> to indicate how an expression is +computed into a value; this symbol is not a part of GF. +</P> +<P> +Thus an <CODE>oper</CODE> judgement includes the name of the defined operation, +its type, and an expression defining it. As for the syntax of the defining +expression, notice the <B>lambda abstraction</B> form <CODE>\</CODE><I>x</I> <CODE>-></CODE> <I>t</I> of +the function. It reads: function with variable <I>x</I> and <B>function body</B> +<I>t</I>. Any occurrence of <I>x</I> in <I>t</I> is said to be <B>bound</B> in <I>t</I>. 
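+</P>
+<P>
+As a further small illustration of these three ingredients, here is a sketch of
+an operation whose body does slightly more than wrap its argument. The name
+<CODE>exclaim</CODE> is hypothetical and chosen only for this example; the arrow in the
+second line is the computation notation used above, not GF syntax.
+</P>
+<PRE>
+  oper exclaim : Str -> {s : Str} = \x -> {s = x ++ "!"} ;
+
+  exclaim "hello"  ===>  {s = "hello" ++ "!"}
+</PRE>
+<P>
+Here <CODE>exclaim</CODE> is the name, <CODE>Str -> {s : Str}</CODE> the type, and the lambda
+abstraction binds <CODE>x</CODE> in the body <CODE>{s = x ++ "!"}</CODE>.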
+</P> +<P> +For lambda abstraction with multiple arguments, we have the shorthand +</P> +<PRE> + \x,y -> t === \x -> \y -> t +</PRE> +<P> +The notation we have used for linearization rules, where +variables are bound on the left-hand side, is actually syntactic +sugar for abstraction: +</P> +<PRE> + lin f x = t === lin f = \x -> t +</PRE> +<P></P> +<A NAME="toc38"></A> +<H3>The <CODE>resource</CODE> module type</H3> +<P> +Operation definitions can be included in a concrete syntax. +But they are usually not really tied to a particular +set of linearization rules. +They should rather be seen as <B>resources</B> +usable in many concrete syntaxes. +</P> +<P> +The <CODE>resource</CODE> module type is used to package +<CODE>oper</CODE> definitions into reusable resources. Here is +an example, with a handful of operations to manipulate +strings and records. +</P> +<PRE> + resource StringOper = { + oper + SS : Type = {s : Str} ; + ss : Str -> SS = \x -> {s = x} ; + cc : SS -> SS -> SS = \x,y -> ss (x.s ++ y.s) ; + prefix : Str -> SS -> SS = \p,x -> ss (p ++ x.s) ; + } +</PRE> +<P></P> +<A NAME="toc39"></A> +<H3>Opening a resource</H3> +<P> +Any number of <CODE>resource</CODE> modules can be +<B>open</B>ed in a <CODE>concrete</CODE> syntax, which +makes definitions contained +in the resource usable in the concrete syntax. Here is +an example, where the resource <CODE>StringOper</CODE> is +opened in a new version of <CODE>FoodEng</CODE>. +</P> +<PRE> + concrete FoodEng of Food = open StringOper in { + + lincat + Phrase, Item, Kind, Quality = SS ; + + lin + Is item quality = cc item (prefix "is" quality) ; + This k = prefix "this" k ; + That k = prefix "that" k ; + QKind q k = cc q k ; + Wine = ss "wine" ; + Cheese = ss "cheese" ; + Fish = ss "fish" ; + Very = prefix "very" ; + Fresh = ss "fresh" ; + Warm = ss "warm" ; + Italian = ss "Italian" ; + Expensive = ss "expensive" ; + Delicious = ss "delicious" ; + Boring = ss "boring" ; + } +</PRE> +<P></P> +<P> +<B>Exercise</B>. 
Use the same string operations to write <CODE>FoodIta</CODE> +more concisely. +</P> +<A NAME="toc40"></A> +<H3>Partial application</H3> +<P> +<a name="secpartapp"></a> +</P> +<P> +GF, like Haskell, permits <B>partial application</B> of +functions. An example of this is the rule +</P> +<PRE> + lin This k = prefix "this" k ; +</PRE> +<P> +which can be written more concisely +</P> +<PRE> + lin This = prefix "this" ; +</PRE> +<P> +The first form is perhaps more intuitive to write +but, once you get used to partial application, you will appreciate its +conciseness and elegance. The logic of partial application +is known as <B>currying</B>, with a reference to Haskell B. Curry. +The idea is that any <I>n</I>-place function can be seen as a 1-place +function whose value is an <I>n-</I>1 -place function. Thus +</P> +<PRE> + oper prefix : Str -> SS -> SS ; +</PRE> +<P> +can be used as a 1-place function that takes a <CODE>Str</CODE> into a +function <CODE>SS -> SS</CODE>. The expected linearization of <CODE>This</CODE> is exactly +a function of such a type, operating on an argument of type <CODE>Kind</CODE> +whose linearization is of type <CODE>SS</CODE>. Thus we can define the +linearization directly as <CODE>prefix "this"</CODE>. +</P> +<P> +An important part of the art of functional programming is to decide the order +of arguments in a function, so that partial application can be used as much +as possible. For instance, of the operation <CODE>prefix</CODE> we know that it +will be typically applied to linearization variables with constant strings. +This is the reason to put the <CODE>Str</CODE> argument before the <CODE>SS</CODE> argument --- not +the prefixity. A <CODE>postfix</CODE> function would have exactly the same order of arguments. +</P> +<P> +<B>Exercise</B>. 
Define an operation <CODE>infix</CODE> analogous to <CODE>prefix</CODE>, +such that it allows you to write +</P> +<PRE> + lin Is = infix "is" ; +</PRE> +<P></P> +<A NAME="toc41"></A> +<H3>Testing resource modules</H3> +<P> +To test a <CODE>resource</CODE> module independently, you must import it +with the flag <CODE>-retain</CODE>, which tells GF to retain <CODE>oper</CODE> definitions +in memory; the usual behaviour is that <CODE>oper</CODE> definitions +are just applied to compile linearization rules +(this is called <B>inlining</B>) and then thrown away. +</P> +<PRE> + > import -retain StringOper.gf +</PRE> +<P> +The command <CODE>compute_concrete = cc</CODE> computes any expression +formed by operations and other GF constructs. For example, +</P> +<PRE> + > compute_concrete prefix "in" (ss "addition") + { + s : Str = "in" ++ "addition" + } +</PRE> +<P></P> +<A NAME="toc42"></A> +<H2>Grammar architecture</H2> +<P> +<a name="secarchitecture"></a> +</P> +<A NAME="toc43"></A> +<H3>Extending a grammar</H3> +<P> +The module system of GF makes it possible to write a new module that <B>extend</B>s +an old one. The syntax of extension is +shown by the following example. We extend <CODE>Food</CODE> into <CODE>Morefood</CODE> by +adding a category of questions and two new functions. +</P> +<PRE> + abstract Morefood = Food ** { + cat + Question ; + fun + QIs : Item -> Quality -> Question ; + Pizza : Kind ; + + } +</PRE> +<P> +Parallel to the abstract syntax, extensions can +be built for concrete syntaxes: +</P> +<PRE> + concrete MorefoodEng of Morefood = FoodEng ** { + lincat + Question = {s : Str} ; + lin + QIs item quality = {s = "is" ++ item.s ++ quality.s} ; + Pizza = {s = "pizza"} ; + } +</PRE> +<P> +The effect of extension is that all of the contents of the extended +and extending module are put together. We also say that the new +module <B>inherits</B> the contents of the old module. 
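+</P>
+<P>
+To see what "put together" amounts to, the following sketch spells out a single
+module with roughly the same contents as <CODE>Morefood</CODE>, written without
+extension. The module name <CODE>MorefoodFlat</CODE> is hypothetical, and the types of the
+<CODE>Food</CODE> functions are the ones that can be read off from the linearization rules
+and the BNF version shown earlier.
+</P>
+<PRE>
+  -- MorefoodFlat is a hypothetical name, used only for this illustration
+  abstract MorefoodFlat = {
+    cat
+      Phrase ; Item ; Kind ; Quality ;       -- as in Food
+      Question ;                             -- added by Morefood
+    fun
+      Is : Item -> Quality -> Phrase ;       -- as in Food
+      This, That : Kind -> Item ;
+      QKind : Quality -> Kind -> Kind ;
+      Wine, Cheese, Fish : Kind ;
+      Very : Quality -> Quality ;
+      Fresh, Warm, Italian, Expensive, Delicious, Boring : Quality ;
+      QIs : Item -> Quality -> Question ;    -- added by Morefood
+      Pizza : Kind ;
+  }
+</PRE>
+<P>
+The extension form avoids repeating the first group of judgements by hand.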
+</P> +<P> +At the same time as extending a module of the same type, a concrete +syntax module may open resources. Since <CODE>open</CODE> takes effect in +the module body and not in the extended module, its logical place +in the module header is after the extension part: +</P> +<PRE> + concrete MorefoodIta of Morefood = FoodIta ** open StringOper in { + lincat + Question = SS ; + lin + QIs item quality = ss (item.s ++ "è" ++ quality.s) ; + Pizza = ss "pizza" ; + } +</PRE> +<P> +Resource modules can extend other resource modules, in the +same way as modules of other types can extend modules of the +same type. Thus it is possible to build resource hierarchies. +</P> +<A NAME="toc44"></A> +<H3>Multiple inheritance</H3> +<P> +Specialized vocabularies can be represented as small grammars that +only do "one thing" each. For instance, the following are grammars +for fruit and mushrooms: +</P> +<PRE> + abstract Fruit = { + cat Fruit ; + fun Apple, Peach : Fruit ; + } + + abstract Mushroom = { + cat Mushroom ; + fun Cep, Agaric : Mushroom ; + } +</PRE> +<P> +They can afterwards be combined into bigger grammars by using +<B>multiple inheritance</B>, i.e. extension of several grammars at the +same time: +</P> +<PRE> + abstract Foodmarket = Food, Fruit, Mushroom ** { + fun + FruitKind : Fruit -> Kind ; + MushroomKind : Mushroom -> Kind ; + } +</PRE> +<P> +The main advantages of splitting a grammar into modules are +<B>reusability</B>, <B>separate compilation</B>, and <B>division of labour</B>. +Reusability means +that one and the same module can be put to different uses; for instance, +a module with mushroom names might be used in a mycological information system +as well as in a restaurant phrasebook. Separate compilation means that a module +once compiled into <CODE>.gfc</CODE> need not be compiled again unless changes have +taken place. +Division of labour means simply that programmers who are experts in +special areas can work on modules belonging to those areas. 
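+</P>
+<P>
+The same combination can be made on the concrete side. The following is a
+minimal sketch, assuming that each module is stored in its own file and that
+<CODE>FoodEng</CODE> uses the linearization type <CODE>{s : Str}</CODE> for <CODE>Kind</CODE>.
+The module names <CODE>FruitEng</CODE>, <CODE>MushroomEng</CODE>, and <CODE>FoodmarketEng</CODE>
+and the English words are our own choices for illustration.
+</P>
+<PRE>
+  concrete FruitEng of Fruit = {
+    lincat Fruit = {s : Str} ;
+    lin
+      Apple = {s = "apple"} ;
+      Peach = {s = "peach"} ;
+  }
+
+  concrete MushroomEng of Mushroom = {
+    lincat Mushroom = {s : Str} ;
+    lin
+      Cep = {s = "cep"} ;
+      Agaric = {s = "agaric"} ;
+  }
+
+  -- multiple inheritance in concrete syntax, parallel to Foodmarket
+  concrete FoodmarketEng of Foodmarket = FoodEng, FruitEng, MushroomEng ** {
+    lin
+      FruitKind fruit = fruit ;
+      MushroomKind mushroom = mushroom ;
+  }
+</PRE>
+<P>
+The two coercion functions can return their argument unchanged because all three
+categories share the same linearization type in this sketch.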
+</P> +<P> +<B>Exercise</B>. Refactor <CODE>Food</CODE> by separating <CODE>Wine</CODE> into a special +<CODE>Drink</CODE> module. +</P> +<A NAME="toc45"></A> +<H3>Visualizing module structure</H3> +<P> +When you have created all the abstract syntaxes and +one set of concrete syntaxes needed for <CODE>Foodmarket</CODE>, +your grammar consists of eight GF modules. To see what their +dependencies look like, you can use the command +<CODE>visualize_graph = vg</CODE>, +</P> +<PRE> + > visualize_graph +</PRE> +<P> +and the graph will pop up in a separate window: +</P> +<P> +<IMG ALIGN="middle" SRC="foodmarket.png" BORDER="0" ALT=""> +</P> +<P> +The graph uses +</P> +<UL> +<LI>oval boxes for abstract modules +<LI>square boxes for concrete modules +<LI>black-headed arrows for inheritance +<LI>white-headed arrows for the concrete-of-abstract relation +</UL> + +<P> +Just as with the <CODE>visualize_tree = vt</CODE> command, the freely available tools +Ghostview and Graphviz are needed. As an alternative, you can again print +the graph into a <CODE>.dot</CODE> file by using the command <CODE>print_multi = pm</CODE>: +</P> +<PRE> + > print_multi -printer=graph | write_file Foodmarket.dot + > ! dot -Tpng Foodmarket.dot > Foodmarket.png +</PRE> +<P></P> +<A NAME="toc46"></A> +<H2>Summary of GF language features</H2> +<A NAME="toc47"></A> +<H3>Modules</H3> +<P> +The general form of a module is +<center> + <I>Moduletype</I> <I>M</I> <I>Of</I> <CODE>=</CODE> (<I>Extends</I> <CODE>**</CODE>)? (<CODE>open</CODE> <I>Opens</I> <CODE>in</CODE>)? <I>Body</I> +</center> +where <I>Moduletype</I> is one of <CODE>abstract</CODE>, <CODE>concrete</CODE>, and <CODE>resource</CODE>. +</P> +<P> +If <I>Moduletype</I> is <CODE>concrete</CODE>, the <I>Of</I>-part has the form <CODE>of</CODE> <I>A</I>, +where <I>A</I> is the name of an abstract module. Otherwise it is empty. +</P> +<P> +The name of the module is given by the identifier <I>M</I>. 
+</P> +<P> +The optional <I>Extends</I> part is a comma-separated +list of module names, which have to be modules of +the same <I>Moduletype</I>. The contents of these modules are <B>inherited</B> by +<I>M</I>. This means that they are both usable in <I>Body</I> and exported by <I>M</I>, +i.e. inherited when <I>M</I> is inherited and available when <I>M</I> is opened. +(Exception: <CODE>oper</CODE> and <CODE>param</CODE> judgements are not inherited from +<CODE>concrete</CODE> modules.) +</P> +<P> +The optional <I>Opens</I> part is a comma-separated +list of resource module names. The contents of these +modules are usable in the <I>Body</I>, but they are not exported. +</P> +<P> +Opening can be <B>qualified</B>, e.g. +</P> +<PRE> + concrete C of A = open (P = Prelude) in ... +</PRE> +<P> +This means that the names from <CODE>Prelude</CODE> are only available in the form +<CODE>P.name</CODE>. This form of qualifying a name is always possible, and it can +be used to resolve <B>name conflicts</B>, which result when the same name is +declared in more than one module that is in scope. +</P> +<A NAME="toc48"></A> +<H3>Judgements</H3> +<P> +The <I>Body</I> part consists of judgements. The table of judgement forms +is extended with the following forms: +</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>form</TH> +<TH>reading</TH> +<TH>module type</TH> +</TR> +<TR> +<TD ALIGN="center"><CODE>oper</CODE> <I>h</I> <CODE>:</CODE> <I>T</I> <CODE>=</CODE> <I>t</I></TD> +<TD>operation <I>h</I> of type <I>T</I> is defined as <I>t</I></TD> +<TD>resource, concrete</TD> +</TR> +<TR> +<TD ALIGN="center"><CODE>param</CODE> <I>P</I> <CODE>=</CODE> <I>C1</I> <CODE>|</CODE> ... <CODE>|</CODE> <I>Cn</I></TD> +<TD>parameter type <I>P</I> has constructors <I>C1...Cn</I></TD> +<TD>resource, concrete</TD> +</TR> +</TABLE> + +<P></P> +<P> +The <CODE>param</CODE> judgement will be explained in the next chapter. 
+</P> +<P> +The type part of an <CODE>oper</CODE> judgement can be omitted, if the type can be inferred +by the GF compiler. +</P> +<PRE> + oper hello = "hello" ++ "world" ; +</PRE> +<P> +As a rule, type inference works for all terms except lambda abstracts. +</P> +<P> +<B>Lambda abstracts</B> are expressions of the form <CODE>\</CODE><I>x</I> <CODE>-></CODE> <I>t</I>, +where <I>x</I> is a variable <B>bound</B> in the expression <I>t</I>, which is the +<B>body</B> of the lambda abstract. The type of the lambda abstract is +<I>A</I> <CODE>-></CODE><I>B</I>, where <I>A</I> is the type of the variable <CODE>x</CODE> and +<I>B</I> the type of the body <I>t</I>. +</P> +<P> +For multiple lambda abstractions, there is a shorthand +</P> +<PRE> + \x,y -> t === \x -> \y -> t +</PRE> +<P> +For <CODE>lin</CODE> judgements, there is the shorthand +</P> +<PRE> + lin f x = t === lin f = \x -> t +</PRE> +<P></P> +<A NAME="toc49"></A> +<H3>Free variation</H3> +<P> +The <CODE>variants</CODE> construct of GF can be used to give a list of +concrete syntax terms, of the same type, in free variation. For example, +</P> +<PRE> + variants {["does not"] ; "doesn't"} +</PRE> +<P> +A limiting case is the empty variant list <CODE>variants {}</CODE>. +</P> +<A NAME="toc50"></A> +<H3>The context-free grammar format</H3> +<P> +The <CODE>.cf</CODE> file format is used for <B>context-free grammars</B>, which are +always interpretable as GF grammars. Files of this format consist of +rules of the form +<center> + (<I>Label</I> <CODE>.</CODE>)? <I>Cat</I> <CODE>::=</CODE> <I>RHS</I> <CODE>;</CODE> +</center> +where the <I>RHS</I> is a sequence of terminals (quoted strings) and +nonterminals (identifiers). The optional <I>Label</I> gives the abstract +syntax function created. If it is omitted, a function name is generated +automatically. Then it is also possible to have more than one <I>RHS</I>, +separated by <I>|</I>. 
An empty <I>RHS</I> is interpreted as an empty sequence +of terminals, not as an empty disjunction. +</P> +<P> +The <B>Extended BNF</B> format (<B>EBNF</B>) can also be used, in files suffixed <CODE>.ebnf</CODE>. +This format does not allow user-written labels. The right-hand side of a rule +can contain everything that is possible in the <CODE>.cf</CODE> format, but also +optional parts (<CODE>p ?</CODE>), sequences (<CODE>p *</CODE>) and non-empty sequences (<CODE>p +</CODE>). +For example, the phrases in <CODE>FoodEng</CODE> could be recognized with the following +EBNF grammar: +</P> +<PRE> + Phrase ::= + ("this" | "that") Quality* ("wine" | "cheese" | "fish") "is" Quality ; + Quality ::= + ("very"* ("fresh" | "warm" | "boring" | "Italian" | "expensive" | "delicious")) ; +</PRE> +<P></P> +<A NAME="toc51"></A> +<H3>Character encoding</H3> +<P> +The default encoding is iso-latin-1. UTF-8 can be set by the flag <CODE>coding=utf8</CODE> +in the grammar. The resource grammar libraries are in iso-latin-1, except Russian +and Arabic, which are in UTF-8. The resources may be changed to UTF-8 in the future. +Letters in identifiers must currently be iso-latin-1. +</P> +<A NAME="toc52"></A> +<H1>Grammars with parameters</H1> +<P> +<a name="chapfour"></a> +</P> +<P> +In this chapter, we will introduce the techniques needed for +describing the inflection of words, as well as the rules by +which correct word forms are selected in syntactic combinations. +These techniques are already needed in a very slight extension +of the Food grammar of the previous chapter. While explaining +how the linguistic problems are solved for English and Italian, +we also cover all the language constructs GF has for +defining concrete syntax. +</P> +<P> +It is in principle possible to skip this chapter and go directly +to the next, since the use of the GF Resource Grammar library +makes it unnecessary to use any more constructs of GF than we +have already covered: parameters could be left to library implementors. 
+</P> +<A NAME="toc53"></A> +<H2>The problem: words have to be inflected</H2> +<P> +Suppose we want to say, with the vocabulary included in +<CODE>Food.gf</CODE>, things like +<center> +<I>these Italian wines are delicious</I> +</center> +What we need in addition are the plural forms +of nouns and verbs (<I>wines, are</I>), as opposed to their +singular forms. +</P> +<P> +The introduction of plural forms requires two things: +</P> +<UL> +<LI>the <B>inflection</B> of nouns and verbs in singular and plural +<LI>the <B>agreement</B> of the verb with its subject: + the verb must have the same number as the subject +</UL> + +<P> +Different languages have different types of inflection and agreement. +For instance, Italian also has agreement in gender (masculine vs. feminine). +In a multilingual grammar, +we want to express such differences between languages in the +concrete syntax while ignoring them in the abstract syntax. +</P> +<P> +To be able to do all this, we need one new judgement form +and some new expression forms. +We also need to generalize linearization types +from strings to more complex types. +</P> +<P> +<B>Exercise</B>. Make a list of the possible forms that nouns, +adjectives, and verbs can have in some languages that you know. +</P> +<A NAME="toc54"></A> +<H2>Parameters and tables</H2> +<P> +We define the <B>parameter type</B> of number in English by +using a new form of judgement: +</P> +<PRE> + param Number = Sg | Pl ; +</PRE> +<P> +This judgement defines the parameter type <CODE>Number</CODE> by listing +its two <B>constructors</B>, <CODE>Sg</CODE> and <CODE>Pl</CODE> (common shorthands for +singular and plural). 
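+</P>
+<P>
+Other parameter types are defined in the same way. As a sketch, the Italian
+gender distinction mentioned above (masculine vs. feminine) could be declared
+as follows; the names <CODE>Gender</CODE>, <CODE>Masc</CODE>, and <CODE>Fem</CODE> are our own
+choices for this illustration.
+</P>
+<PRE>
+  param Gender = Masc | Fem ;
+</PRE>
+<P>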
+</P> +<P> +To state that <CODE>Kind</CODE> expressions in English have a linearization +depending on number, we replace the linearization type <CODE>{s : Str}</CODE> +with a type where the <CODE>s</CODE> field is a <B>table</B> depending on number: +</P> +<PRE> + lincat Kind = {s : Number => Str} ; +</PRE> +<P> +The <B>table type</B> <CODE>Number => Str</CODE> is in many respects similar to +a function type (<CODE>Number -> Str</CODE>). The main difference is that the +argument type of a table type must always be a parameter type. This means +that the argument-value pairs can be listed in a finite table. The following +example shows such a table: +</P> +<PRE> + lin Cheese = { + s = table { + Sg => "cheese" ; + Pl => "cheeses" + } + } ; +</PRE> +<P> +The table consists of <B>branches</B>, where a <B>pattern</B> on the +left of the arrow <CODE>=></CODE> is assigned a <B>value</B> on the right. +</P> +<P> +The application of a table to a parameter is done by the <B>selection</B> +operator <CODE>!</CODE>, which is computed by <B>pattern matching</B>: it returns +the value from the first branch whose pattern matches the +selection argument. For instance, +</P> +<PRE> + table {Sg => "cheese" ; Pl => "cheeses"} ! Pl + ===> "cheeses" +</PRE> +<P> +As syntactic sugar for table selections, we can define the +<B>case expressions</B>, which are common in functional programming and also +handy to use in GF. +</P> +<PRE> + case e of {...} === table {...} ! e +</PRE> +<P></P> +<P> +A parameter type can have any number of constructors, and these can +also take arguments from other parameter types. For instance, an accurate +type system for English verbs (except <I>be</I>) is +</P> +<PRE> + param VerbForm = VPresent Number | VPast | VPastPart | VPresPart ; +</PRE> +<P> +This system expresses accurately the fact that only the present tense has +number variation. 
(Agreement also requires variation in person, but +this can be defined in syntax rules, by picking the singular form for third person +singular subjects and the plural forms for all others). As an example of +a table, here are the forms of the verb <I>drink</I>: +</P> +<PRE> + table { + VPresent Sg => "drinks" ; + VPresent Pl => "drink" ; + VPast => "drank" ; + VPastPart => "drunk" ; + VPresPart => "drinking" + } +</PRE> +<P></P> +<P> +<B>Exercise</B>. In an earlier exercise (previous section), +you made a list of the possible +forms that nouns, adjectives, and verbs can have in some languages that +you know. Now take some of the results and implement them by +using parameter type definitions and tables. Write them into a <CODE>resource</CODE> +module, which you can test by using the command <CODE>compute_concrete</CODE>. +</P> +<A NAME="toc55"></A> +<H2>Inflection tables and paradigms</H2> +<P> +All English common nouns are inflected for number, most of them in the +same way: the plural form is obtained from the singular by adding the +ending <I>s</I>. This rule is an example of +a <B>paradigm</B> --- a formula telling how a class of words is inflected. +</P> +<P> +From the GF point of view, a paradigm is a function that takes +a <B>lemma</B> --- also known as a <B>dictionary form</B> or a <B>citation form</B> --- and +returns an inflection +table of desired type. Paradigms are not functions in the sense of the +<CODE>fun</CODE> judgements of abstract syntax (which operate on trees and not +on strings), but operations defined in <CODE>oper</CODE> judgements. +The following operation defines the regular noun paradigm of English: +</P> +<PRE> + oper regNoun : Str -> {s : Number => Str} = \dog -> { + s = table { + Sg => dog ; + Pl => dog + "s" + } + } ; +</PRE> +<P> +The <B>gluing</B> operator <CODE>+</CODE> tells that +the string held in the variable <CODE>dog</CODE> and the ending <CODE>"s"</CODE> +are written together to form one <B>token</B>. 
Thus, for instance, +</P> +<PRE> + (regNoun "cheese").s ! Pl ===> "cheese" + "s" ===> "cheeses" +</PRE> +<P> +A more complex example is that of regular verbs: +</P> +<PRE> + oper regVerb : Str -> {s : VerbForm => Str} = \talk -> { + s = table { + VPresent Sg => talk + "s" ; + VPresent Pl => talk ; + VPresPart => talk + "ing" ; + _ => talk + "ed" + } + } ; +</PRE> +<P> +Notice how a catch-all case for the past tense and the past participle +is expressed by using a <B>wild card</B> pattern <CODE>_</CODE>. Here again, pattern matching +tries all patterns in order until it finds a matching pattern; +and it is the wild card that is the first match for both <CODE>VPast</CODE> and +<CODE>VPastPart</CODE>. +</P> +<P> +<B>Exercise</B>. Identify cases in which the <CODE>regNoun</CODE> paradigm does not +apply in English, and implement some alternative paradigms. +</P> +<P> +<B>Exercise</B>. Implement some regular paradigms for other languages you have +considered in earlier exercises. +</P> +<A NAME="toc56"></A> +<H2>Using parameters in concrete syntax</H2> +<P> +We can now enrich the concrete syntax definitions to +comprise morphology. This will permit a more radical +variation between languages (e.g. English and Italian) +than just the use of different words. In general, +parameters and linearization types are different in +different languages --- but this does not prevent using a +common abstract syntax. +</P> +<P> +We consider a grammar <CODE>Foods</CODE>, which is similar to +<CODE>Food</CODE>, with the addition of two rules for forming plural items: +</P> +<PRE> + fun These, Those : Kind -> Item ; +</PRE> +<P> +We also add a noun which in Italian has the feminine gender; all nouns in +<CODE>Food</CODE> were carefully chosen to be masculine! +</P> +<PRE> + fun Pizza : Kind ; +</PRE> +<P> +This noun will force us to deal with gender in the Italian grammar, +which is what we need for the grammar to scale up for larger applications. 
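+</P>
+<P>
+For orientation, the complete abstract syntax could look roughly as follows.
+This is a sketch reconstructed from the function types and linearization rules
+shown in this chapter; the <CODE>startcat</CODE> flag is an assumption.
+</P>
+<PRE>
+  abstract Foods = {
+    flags startcat = Phrase ;   -- assumed start category
+    cat
+      Phrase ; Item ; Kind ; Quality ;
+    fun
+      Is : Item -> Quality -> Phrase ;
+      This, That, These, Those : Kind -> Item ;
+      QKind : Quality -> Kind -> Kind ;
+      Wine, Cheese, Fish, Pizza : Kind ;
+      Very : Quality -> Quality ;
+      Fresh, Warm, Italian, Expensive, Delicious, Boring : Quality ;
+  }
+</PRE>
+<P>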
+</P> +<A NAME="toc57"></A> +<H3>Agreement</H3> +<P> +In the English <CODE>Foods</CODE> grammar, we need just one type of parameters: +<CODE>Number</CODE> as defined above. The phrase-forming rule +</P> +<PRE> + fun Is : Item -> Quality -> Phrase ; +</PRE> +<P> +is affected by the number because of <B>subject-verb agreement</B>. +In English, agreement says that the verb of a sentence +must be inflected in the number of the subject. Thus we will linearize +</P> +<PRE> + Is (This Pizza) Warm ===> "this pizza is warm" + Is (These Pizza) Warm ===> "these pizzas are warm" +</PRE> +<P> +Here it is the <B>copula</B>, i.e. the verb <I>be</I> that is affected. We define +the copula as the operation +</P> +<PRE> + oper copula : Number -> Str = \n -> + case n of { + Sg => "is" ; + Pl => "are" + } ; +</PRE> +<P> +We don't need to inflect the copula for person and tense in this grammar. +</P> +<P> +The form of the copula in a sentence depends on the +<B>subject</B> of the sentence, i.e. the item +that is qualified. This means that an <CODE>Item</CODE> must have such a number to provide. +The obvious way to guarantee this is by including a number field in +the linearization type: +</P> +<PRE> + lincat Item = {s : Str ; n : Number} ; +</PRE> +<P> +Now we can write precisely the <CODE>Is</CODE> rule that expresses agreement: +</P> +<PRE> + lin Is item qual = {s = item.s ++ copula item.n ++ qual.s} ; +</PRE> +<P> +The copula receives the number that it needs from the subject item. +</P> +<A NAME="toc58"></A> +<H3>Determiners</H3> +<P> +Let us turn to <CODE>Item</CODE> subjects and see how they receive their +numbers. The two rules +</P> +<PRE> + fun This, These : Kind -> Item ; +</PRE> +<P> +form <CODE>Item</CODE>s from <CODE>Kind</CODE>s by adding <B>determiners</B>, either +<I>this</I> or <I>these</I>. 
The determiners +require different numbers of their <CODE>Kind</CODE> arguments: <CODE>This</CODE> +requires the singular (<I>this pizza</I>) and <CODE>These</CODE> the plural +(<I>these pizzas</I>). The <CODE>Kind</CODE> is the same in both cases: <CODE>Pizza</CODE>. +Thus a <CODE>Kind</CODE> must have both singular and plural forms. +The obvious way to express this is by using a table: +</P> +<PRE> + lincat Kind = {s : Number => Str} ; +</PRE> +<P> +The linearization rules for <CODE>This</CODE> and <CODE>These</CODE> can now be written +</P> +<PRE> + lin This kind = { + s = "this" ++ kind.s ! Sg ; + n = Sg + } ; + + lin These kind = { + s = "these" ++ kind.s ! Pl ; + n = Pl + } ; +</PRE> +<P> +The grammatical relation between the determiner and the noun is similar to +agreement, but due to some differences into which we don't go here +it is often called <B>government</B>. +</P> +<P> +Since the same pattern for determination is used four times in +the <CODE>FoodsEng</CODE> grammar, we codify it as an operation, +</P> +<PRE> + oper det : + Number -> Str -> {s : Number => Str} -> {s : Str ; n : Number} = + \n,det,kind -> { + s = det ++ kind.s ! n ; + n = n + } ; +</PRE> +<P> +Now we can write, for instance, +</P> +<PRE> + lin This = det Sg "this" ; + lin These = det Pl "these" ; +</PRE> +<P> +Notice the order of arguments that permits partial +application (<a href="#secpartapp">here</a>). +</P> +<P> +In a more <B>lexicalized</B> grammar, determiners would be made into a +category of their own and given an inherent number: +</P> +<PRE> + lincat Det = {s : Str ; n : Number} ; + fun Det : Det -> Kind -> Item ; + lin Det det kind = { + s = det.s ++ kind.s ! det.n ; + n = det.n + } ; +</PRE> +<P> +Linguistically motivated grammars, such as the GF resource grammars, +usually favour lexicalized treatments of words; see <a href="#seclexical">here</a> below. 
+Notice that the fields of the record in <CODE>Det</CODE> are precisely the two +arguments needed in the <CODE>det</CODE> operation. +</P> +<A NAME="toc59"></A> +<H3>Parametric vs. inherent features</H3> +<P> +<CODE>Kind</CODE>s, as in general <B>common nouns</B> in English, have both singular +and plural forms; what form is chosen is determined by the construction +in which the noun is used. We say that the number is a +<B>parametric feature</B> of nouns. In GF, parametric features +appear as argument types of tables in linearization types. +</P> +<PRE> + lincat Kind = {s : Number => Str} ; +</PRE> +<P> +<CODE>Item</CODE>s, as in general <B>noun phrases</B> in English, don't +have variation in number. The number is instead an <B>inherent feature</B>, +which the noun phrase passes to the verb. In GF, inherent features +appear as record fields in linearization types. +</P> +<PRE> + lincat Item = {s : Str ; n : Number} ; +</PRE> +<P> +A category can have both parametric and inherent features. As we will see +in the Italian <CODE>Foods</CODE> grammar, nouns have parametric number and +inherent gender: +</P> +<PRE> + lincat Kind = {s : Number => Str ; g : Gender} ; +</PRE> +<P> +Nothing prevents the same parameter type from appearing both +as parametric and inherent feature, or the appearance of several inherent +features of the same type, etc. Determining the linearization types +of categories is one of the most crucial steps in the design of a GF +grammar. These two conditions must be in balance: +</P> +<UL> +<LI>existence: what forms are possible to build by morphological and + other means? +<LI>need: what features are expected via agreement or government? +</UL> + +<P> +Grammar books and dictionaries give good advice on existence; for instance, +an Italian dictionary has entries such as +<center> +<B>uomo</B>, pl. <I>uomini</I>, n.m. "man" +</center> +which tells that <I>uomo</I> is a masculine noun with the plural form <I>uomini</I>. 
+From this alone, or with a couple more examples, we can generalize to the type +for all nouns in Italian: they have both singular and plural forms and thus +a parametric number, and they have an inherent gender. +</P> +<P> +The distinction between parametric and inherent features can be stated in +object-oriented programming terms: a linearization type is like a <B>class</B>, +which has a <B>method</B> for linearization and also some <B>attributes</B>. +In this class, the parametric features appear as arguments to the +linearization method, whereas the inherent features appear as attributes. +</P> +<P> +For words, inherent features are usually given <I>ad hoc</I> as lexical information. +For combinations, they are typically <I>inherited</I> from some part of the construction. +For instance, qualified noun constructs in Italian inherit their gender from noun part +(called the <B>head</B> of the construction in linguistics): +</P> +<PRE> + lin QKind qual kind = + let gen = kind.g in { + s = table {n => kind.s ! n ++ qual.s ! gen ! n} ; + g = gen + } ; +</PRE> +<P> +This rule uses a <B>local definition</B> (also known as a <B>let expression</B>) to +avoid computing <CODE>kind.g</CODE> twice, and also to express the linguistic +generalization that it is the same gender that is both passed to +the adjective and inherited by the construct. +The parametric number feature is in this rule passed to both the noun and +the adjective. In the table, a <B>variable pattern</B> is used to match +any possible number. Variables introduced in patterns are in scope in +the right-hand sides of corresponding branches. Again, it is good to +use a variable to express the linguistic generalization that the number +is passed to the parts, rather than expand the table into <CODE>Sg</CODE> and <CODE>Pl</CODE> +branches. +</P> +<P> +Sometimes the puzzle of making agreement and government work in a grammar has +several solutions. 
For instance, <B>precedence</B> in programming languages can +be equivalently described by a parametric or an inherent feature +(see <a href="#secprecedence">here</a> below). +</P> +<P> +In natural language applications that use the resource grammar library, +all parameters are hidden from the user, who therefore does not need to bother +with them. The only thing that she has to think about is what linguistic +categories are given as linearization types to each semantic category. +</P> +<P> +For instance, the GF resource grammar library has a category <CODE>NP</CODE> of +noun phrases, <CODE>AP</CODE> of adjectival phrases, and <CODE>Cl</CODE> of sentence-like clauses. +In the implementation of <CODE>Foods</CODE> <a href="#secenglish">here</a>, we will define +</P> +<PRE> + lincat Phrase = Cl ; Item = NP ; Quality = AP ; +</PRE> +<P> +To express that an item has a quality, we will use a resource function +</P> +<PRE> + mkCl : NP -> AP -> Cl ; +</PRE> +<P> +in the linearization rule: +</P> +<PRE> + lin Is = mkCl ; +</PRE> +<P> +In this way, we have no need to think about parameters and agreement. +<a href="#chapfive">The fifth chapter</a> will show a complete implementation of <CODE>Foods</CODE> by the +resource grammar, port it to many new languages, and extend it with +many new constructs. +</P> +<A NAME="toc60"></A> +<H2>An English concrete syntax for Foods with parameters</H2> +<P> +We repeat some of the rules above by showing the entire +module <CODE>FoodsEng</CODE>, equipped with parameters. The parameters and +operations are, for the sake of brevity, included in the same module +and not in a separate <CODE>resource</CODE>. However, some string operations +from the library <CODE>Prelude</CODE> are used. 
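+</P>
+<P>
+The Prelude operations used in the module below are <CODE>SS</CODE>, <CODE>ss</CODE>, and <CODE>prefixSS</CODE>.
+As a rough guide, they behave as if defined as follows; this is a sketch, and
+the actual library source may differ in detail:
+</P>
+<PRE>
+  oper
+    SS : Type = {s : Str} ;                               -- record with one string field
+    ss : Str -> SS = \s -> {s = s} ;                      -- wrap a string in such a record
+    prefixSS : Str -> SS -> SS = \p,x -> ss (p ++ x.s) ;  -- put a fixed word in front
+</PRE>
+<P>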
+</P> +<PRE> + --# -path=.:prelude + + concrete FoodsEng of Foods = open Prelude in { + + lincat + Phrase, Quality = SS ; + Kind = {s : Number => Str} ; + Item = {s : Str ; n : Number} ; + + lin + Is item quality = ss (item.s ++ copula item.n ++ quality.s) ; + This = det Sg "this" ; + That = det Sg "that" ; + These = det Pl "these" ; + Those = det Pl "those" ; + QKind quality kind = {s = table {n => quality.s ++ kind.s ! n}} ; + Wine = regNoun "wine" ; + Cheese = regNoun "cheese" ; + Fish = noun "fish" "fish" ; + Pizza = regNoun "pizza" ; + Very = prefixSS "very" ; + Fresh = ss "fresh" ; + Warm = ss "warm" ; + Italian = ss "Italian" ; + Expensive = ss "expensive" ; + Delicious = ss "delicious" ; + Boring = ss "boring" ; + + param + Number = Sg | Pl ; + + oper + det : Number -> Str -> {s : Number => Str} -> {s : Str ; n : Number} = + \n,d,cn -> { + s = d ++ cn.s ! n ; + n = n + } ; + noun : Str -> Str -> {s : Number => Str} = + \man,men -> {s = table { + Sg => man ; + Pl => men + } + } ; + regNoun : Str -> {s : Number => Str} = + \car -> noun car (car + "s") ; + copula : Number -> Str = + \n -> case n of { + Sg => "is" ; + Pl => "are" + } ; + } +</PRE> +<P> +To find the Prelude library --- or, in general, +GF files located in other directories --- a <B>path directive</B> is needed +either on the command line or as the first line of +the topmost file compiled. +The paths in the path list are separated by colons (<CODE>:</CODE>), and every item +is interpreted primarily relative to the current directory and, secondarily, +to the value of <CODE>GF_LIB_PATH</CODE> (<B>GF library path</B>). Hence it is a +good idea to make <CODE>GF_LIB_PATH</CODE> point to your <CODE>GF/lib/</CODE> whenever +you start working in GF. 
For instance, in the Bash shell this is done by +</P> +<PRE> + % export GF_LIB_PATH=<the location of GF/lib in your file system> +</PRE> +<P></P> +<A NAME="toc61"></A> +<H2>More on inflection paradigms</H2> +<P> +<a name="secinflection"></a> +</P> +<P> +Let us try to extend the English noun paradigms so that we can +deal with all nouns, not just the regular ones. The goal is to +provide a morphology module that is as easy as possible to use when +words are added to the lexicon. In fact, we can think of a +division of labour where a linguistically trained grammarian +writes a morphology and hands it over to the lexicon writer, +who knows much less about the rules of inflection. +</P> +<P> +In passing, we will introduce some new GF constructs: local definitions, +regular expression patterns, and operation overloading. +</P> +<A NAME="toc62"></A> +<H3>Worst-case functions</H3> +<P> +To start with, it is useful to perform <B>data abstraction</B> from the type +of nouns by writing a constructor operation, a <B>worst-case function</B>: +</P> +<PRE> + oper mkNoun : Str -> Str -> Noun = \x,y -> { + s = table { + Sg => x ; + Pl => y + } + } ; +</PRE> +<P> +This presupposes that we have defined +</P> +<PRE> + oper Noun : Type = {s : Number => Str} ; +</PRE> +<P> +Using <CODE>mkNoun</CODE>, we can define +</P> +<PRE> + lin Mouse = mkNoun "mouse" "mice" ; +</PRE> +<P> +and +</P> +<PRE> + oper regNoun : Str -> Noun = \x -> mkNoun x (x + "s") ; +</PRE> +<P> +instead of writing the inflection tables explicitly. +</P> +<P> +Nouns like <I>mouse</I>-<I>mice</I> are so irregular that +it hardly makes sense to see them as instances of a +paradigm that forms the plural from the singular form. +But in general, as we will see, there can be different +regular patterns in a language. 
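+</P>
+<P>
+As a sketch of how these definitions compute (using the <CODE>===></CODE> notation
+from above, and assuming the two-form <CODE>Noun</CODE> type just given):
+</P>
+<PRE>
+  (regNoun "dog").s ! Pl
+    ===> (mkNoun "dog" ("dog" + "s")).s ! Pl
+    ===> table {Sg => "dog" ; Pl => "dogs"} ! Pl
+    ===> "dogs"
+
+  (mkNoun "mouse" "mice").s ! Pl
+    ===> "mice"
+</PRE>
+<P>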
+</P> +<P> +The grammar engineering advantage of worst-case functions is that +the author of the resource module may change the definitions of +<CODE>Noun</CODE> and <CODE>mkNoun</CODE>, and still retain the +interface (i.e. the system of type signatures) that makes it +correct to use these functions in concrete modules. In programming +terms, <CODE>Noun</CODE> is then treated as an <B>abstract datatype</B>: +its definition is not available, but only an indirect way of constructing +its objects. +</P> +<P> +A case where a change of the <CODE>Noun</CODE> type could +actually happen is if we introduce <B>case</B> (nominative or +genitive) in the noun inflection: +</P> +<PRE> + param Case = Nom | Gen ; + + oper Noun : Type = {s : Number => Case => Str} ; +</PRE> +<P> +Now we have to redefine the worst-case function +</P> +<PRE> + oper mkNoun : Str -> Str -> Noun = \x,y -> { + s = table { + Sg => table { + Nom => x ; + Gen => x + "'s" + } ; + Pl => table { + Nom => y ; + Gen => y + case last y of { + "s" => "'" ; + _ => "'s" + } + } + } + } ; +</PRE> +<P> +From this level upwards, however, we can retain the old definitions +</P> +<PRE> + lin Mouse = mkNoun "mouse" "mice" ; + oper regNoun : Str -> Noun = \x -> mkNoun x (x + "s") ; +</PRE> +<P> +which will just compute to different values now. +</P> +<P> +In the last definition of <CODE>mkNoun</CODE>, we used a case expression +on the last character of the plural form to decide if the genitive +should be formed with an <CODE>'</CODE> (as in <I>dogs</I>-<I>dogs'</I>) or with +<CODE>'s</CODE> (as in <I>mice</I>-<I>mice's</I>). The expression <CODE>last y</CODE> +uses the <CODE>Prelude</CODE> operation +</P> +<PRE> + last : Str -> Str ; +</PRE> +<P> +The case expression uses <B>pattern matching over strings</B>, which +is supported in GF alongside pattern matching over +parameters. 
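+</P>
+<P>
+As a sketch, the plural genitives mentioned above would be computed along
+these lines with the case-aware <CODE>mkNoun</CODE>:
+</P>
+<PRE>
+  (mkNoun "dog" "dogs").s ! Pl ! Gen
+    ===> "dogs" + case last "dogs" of {"s" => "'" ; _ => "'s"}
+    ===> "dogs" + "'"
+    ===> "dogs'"
+
+  (mkNoun "mouse" "mice").s ! Pl ! Gen
+    ===> "mice" + "'s"
+    ===> "mice's"
+</PRE>
+<P>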
+</P> +<A NAME="toc63"></A> +<H3>Intelligent paradigms</H3> +<P> +Between the completely regular <I>dog</I>-<I>dogs</I> and the completely +irregular <I>mouse</I>-<I>mice</I>, there are some +predictable variations: +</P> +<UL> +<LI>nouns ending with an <I>y</I>: <I>fly</I>-<I>flies</I>, except if + a vowel precedes the <I>y</I>: <I>boy</I>-<I>boys</I> +<LI>nouns ending with <I>s</I>, <I>ch</I>, and a number of + other endings: <I>bus</I>-<I>buses</I>, <I>leech</I>-<I>leeches</I> +</UL> + +<P> +One way to deal with them would be to provide alternative paradigms: +</P> +<PRE> + noun_y : Str -> Noun = \fly -> mkNoun fly (init fly + "ies") ; + noun_s : Str -> Noun = \bus -> mkNoun bus (bus + "es") ; +</PRE> +<P> +The Prelude function <CODE>init</CODE> drops the last character of a token. +But this solution has some drawbacks: +</P> +<UL> +<LI>it can be difficult to select the correct paradigm +<LI>it can be difficult to remember the names of the different paradigms +</UL> + +<P> +To help the lexicon builder in this task, the morphology programmer +can put some intelligence in the regular noun paradigm. The easiest +way to express this in GF is by the use of <B>regular expression patterns</B>: +</P> +<PRE> + regNoun : Str -> Noun = \w -> + let + ws : Str = case w of { + _ + ("a" | "e" | "i" | "o") + "o" => w + "s" ; -- bamboo + _ + ("s" | "x" | "sh" | "o") => w + "es" ; -- bus, hero + _ + "z" => w + "zes" ;-- quiz + _ + ("a" | "e" | "o" | "u") + "y" => w + "s" ; -- boy + x + "y" => x + "ies" ;-- fly + _ => w + "s" -- car + } + in + mkNoun w ws +</PRE> +<P> +In this definition, we have used a local definition just in order to +structure the code, even though there is no multiple evaluation to eliminate. 
+In the case expression itself, we have used +</P> +<UL> +<LI><B>disjunctive patterns</B> <I>P</I> <CODE>|</CODE> <I>Q</I> +<LI><B>concatenation patterns</B> <I>P</I> <CODE>+</CODE> <I>Q</I> +</UL> + +<P> +The patterns are ordered in such a way that, for instance, +the suffix <CODE>"oo"</CODE> prevents <I>bamboo</I> from matching the suffix +<CODE>"o"</CODE>. +</P> +<P> +<B>Exercise</B>. The same rules that form plural nouns in English also +apply in the formation of third-person singular verbs. +Write a regular verb paradigm that uses this idea, but first +rewrite <CODE>regNoun</CODE> so that the analysis needed to build <I>s</I>-forms +is factored out as a separate <CODE>oper</CODE>, which is shared with +<CODE>regVerb</CODE>. +</P> +<P> +<B>Exercise</B>. Extend the verb paradigms to cover all verb forms +in English, with special care taken of variations with the suffix +<I>ed</I> (e.g. <I>try</I>-<I>tried</I>, <I>use</I>-<I>used</I>). +</P> +<P> +<B>Exercise</B>. Implement the German <B>Umlaut</B> operation on word stems. +The operation changes the vowel of the stressed stem syllable as follows: +<I>a</I> to <I>ä</I>, <I>au</I> to <I>äu</I>, <I>o</I> to <I>ö</I>, and <I>u</I> to <I>ü</I>. You +can assume that the operation only takes syllables as arguments. Test the +operation to see whether it correctly changes <I>Arzt</I> to <I>Ärzt</I>, +<I>Baum</I> to <I>Bäum</I>, <I>Topf</I> to <I>Töpf</I>, and <I>Kuh</I> to <I>Küh</I>. +</P> +<A NAME="toc64"></A> +<H3>Function types with variables</H3> +<P> +In <a href="#chapsix">the sixth chapter</a>, we will introduce <B>dependent function types</B>, where +the value type depends on the argument. 
To this end, we need a notation +that binds a variable to the argument type, as in +</P> +<PRE> + switchOff : (k : Kind) -> Action k +</PRE> +<P> +Function types <I>without</I> +variables are actually a shorthand notation: writing +</P> +<PRE> + PredVP : NP -> VP -> S +</PRE> +<P> +is shorthand for +</P> +<PRE> + PredVP : (x : NP) -> (y : VP) -> S +</PRE> +<P> +or any other naming of the variables. Actually the use of variables +sometimes shortens the code, since they can share a type: +</P> +<PRE> + octuple : (x,y,z,u,v,w,s,t : Str) -> Str +</PRE> +<P> +If a bound variable is not used, it can here, as elsewhere in GF, be replaced by +a wildcard: +</P> +<PRE> + octuple : (_,_,_,_,_,_,_,_ : Str) -> Str +</PRE> +<P> +A good practice for functions with many arguments of the same type +is to indicate the number of arguments: +</P> +<PRE> + octuple : (x1,_,_,_,_,_,_,x8 : Str) -> Str +</PRE> +<P> +One can also use heuristic variable names to document what +information each argument is expected to provide. +This is very handy in the types of inflection paradigms: +</P> +<PRE> + mkNoun : (mouse,mice : Str) -> Noun +</PRE> +<P></P> +<A NAME="toc65"></A> +<H3>Separating operation types and definitions</H3> +<P> +In grammars intended as libraries, it is useful to separate operation +definitions from their type signatures. The user is only interested +in the type, whereas the definition is kept for the implementor and +the maintainer. This is possible by using separate <CODE>oper</CODE> fragments +for the two parts: +</P> +<PRE> + oper regNoun : Str -> Noun ; + oper regNoun s = mkNoun s (s + "s") ; +</PRE> +<P> +The type checker combines the two into one <CODE>oper</CODE> judgement to see +if the definition matches the type. Notice that, in this syntax, it +is moreover possible to bind the argument variables on the left-hand side +instead of using lambda abstraction. 
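+</P>
+<P>
+For comparison, the two fragments above correspond to the following single
+judgement, written with an explicit lambda, in the style used earlier in this chapter:
+</P>
+<PRE>
+  oper regNoun : Str -> Noun = \s -> mkNoun s (s + "s") ;
+</PRE>
+<P>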
+</P> +<P> +In the library module, the type signatures are typically placed at +the beginning and the definitions at the end. A more radical separation +can be achieved by using the <CODE>interface</CODE> and <CODE>instance</CODE> module types +(see <a href="#secinterface">here</a>): the type signatures are placed in the interface +and the definitions in the instance. +</P> +<A NAME="toc66"></A> +<H3>Overloading of operations</H3> +<P> +Large libraries, such as the GF Resource Grammar Library, may define +hundreds of names. This can be impractical +for both the library author and the user: the author has to invent longer +and longer names which are not always intuitive, +and the user has to learn or at least be able to find all these names. +A solution to this problem, adopted by languages such as C++, +is <B>overloading</B>: one and the same name can be used for several functions. +When such a name is used, the +compiler performs <B>overload resolution</B> to find out which of +the possible functions is meant. Overload resolution is based on +the types of the functions: all functions that +have the same name must have different types. +</P> +<P> +In C++, functions with the same name can be scattered everywhere in the program. +In GF, they must be grouped together in <CODE>overload</CODE> groups. Here is an example +of an overload group, giving the different ways to define nouns in English: +</P> +<PRE> + oper mkN : overload { + mkN : (dog : Str) -> Noun ; -- regular nouns + mkN : (mouse,mice : Str) -> Noun ; -- irregular nouns + } +</PRE> +<P> +Intuitively, the function comes very close to the way in which +regular and irregular words are given in most dictionaries. If the +word is regular, just one form is needed. If it is irregular, +more forms are given. There is no need to use explicit paradigm +names. +</P> +<P> +The <CODE>mkN</CODE> example gives only the possible types of the overloaded +operation. 
Their definitions can be given separately, possibly in another module. +Here is a definition of the above overload group: +</P> +<PRE> + oper mkN = overload { + mkN : (dog : Str) -> Noun = regNoun ; + mkN : (mouse,mice : Str) -> Noun = mkNoun ; + } +</PRE> +<P> +Notice that the types of the branches must be repeated so that they can be +associated with proper definitions; the order of the branches has no +significance. +</P> +<P> +<B>Exercise</B>. Design a system of English verb paradigms presented by +an overload group. +</P> +<A NAME="toc67"></A> +<H3>Morphological analysis and morphology quiz</H3> +<P> +Even though morphology is in GF +mostly used as an auxiliary for syntax, it +can also be useful in its own right. The command <CODE>morpho_analyse = ma</CODE> +can be used to read a text and return for each word the analyses that +it has in the current concrete syntax. +</P> +<PRE> + > read_file bible.txt | morpho_analyse +</PRE> +<P> +In the same way as translation exercises, morphological exercises can +be generated, by the command <CODE>morpho_quiz = mq</CODE>. Usually, +the category is then set to some lexical category. For instance, +French irregular verbs in the resource grammar library can be practised as +follows: +</P> +<PRE> + % gf -path=alltenses:prelude $GF_LIB_PATH/alltenses/IrregFre.gfc + + > morpho_quiz -cat=V + + Welcome to GF Morphology Quiz. + ... + + réapparaître : VFin VCondit Pl P2 + réapparaitriez + > No, not réapparaitriez, but + réapparaîtriez + Score 0/1 +</PRE> +<P> +Just like translation exercises, a list of morphological exercises can be generated +off-line and saved in a +file for later use, by the command <CODE>morpho_list = ml</CODE>: +</P> +<PRE> + > morpho_list -number=25 -cat=V | write_file exx.txt +</PRE> +<P> +The <CODE>number</CODE> flag gives the number of exercises generated. 
+</P> +<A NAME="toc68"></A> +<H2>The Italian Foods grammar</H2> +<P> +<a name="secitalian"></a> +</P> +<P> +We conclude the parametrization of the Food grammar by presenting an +Italian variant, now complete with parameters, inflection, and +agreement. +</P> +<P> +The header part is similar to English: +</P> +<PRE> + --# -path=.:prelude + + concrete FoodsIta of Foods = open Prelude in { +</PRE> +<P> +Parameters include not only number but also gender. +</P> +<PRE> + param + Number = Sg | Pl ; + Gender = Masc | Fem ; +</PRE> +<P> +Qualities are inflected for gender and number, whereas kinds +have a parametric number (as in English) and an inherent gender. +Items have an inherent number (as in English) but also an inherent gender. +</P> +<PRE> + lincat + Phrase = SS ; + Quality = {s : Gender => Number => Str} ; + Kind = {s : Number => Str ; g : Gender} ; + Item = {s : Str ; g : Gender ; n : Number} ; +</PRE> +<P> +A Quality is expressed by an adjective, which in Italian has one form for each +gender-number combination. +</P> +<PRE> + oper + adjective : (_,_,_,_ : Str) -> {s : Gender => Number => Str} = + \nero,nera,neri,nere -> { + s = table { + Masc => table { + Sg => nero ; + Pl => neri + } ; + Fem => table { + Sg => nera ; + Pl => nere + } + } + } ; +</PRE> +<P> +The very common case of regular adjectives works by adding +endings to the stem. +</P> +<PRE> + regAdj : Str -> {s : Gender => Number => Str} = \nero -> + let ner = init nero + in adjective nero (ner + "a") (ner + "i") (ner + "e") ; +</PRE> +<P></P> +<P> +For noun inflection, there are several paradigms; since only two forms +are ever needed, we will just give them explicitly (the resource grammar +library also has a paradigm that takes the singular form and infers the +plural and the gender from it). 
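+</P>
+<P>
+To give an idea of what such an inferring paradigm involves, here is a rough
+sketch. It is not the library's actual code: the name <CODE>smartNoun</CODE> is
+hypothetical, the rules are deliberately simplified, and the guess made for
+nouns ending in <I>e</I> is wrong for many feminine nouns. It relies on the
+operation <CODE>noun</CODE>, whose explicit two-form definition follows this sketch.
+</P>
+<PRE>
+  smartNoun : Str -> {s : Number => Str ; g : Gender} = \w ->
+    case w of {
+      formagg + "io" => noun w (formagg + "i") Masc ;  -- formaggio, formaggi
+      vin     + "o"  => noun w (vin + "i")     Masc ;  -- vino, vini
+      pizz    + "a"  => noun w (pizz + "e")    Fem ;   -- pizza, pizze
+      pesc    + "e"  => noun w (pesc + "i")    Masc ;  -- pesce, pesci (gender is a guess)
+      _              => noun w w               Masc    -- fallback: treat as invariable
+    } ;
+</PRE>
+<P>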
+</P> +<PRE> + noun : Str -> Str -> Gender -> {s : Number => Str ; g : Gender} = + \vino,vini,g -> { + s = table { + Sg => vino ; + Pl => vini + } ; + g = g + } ; +</PRE> +<P> +As in <CODE>FoodsEng</CODE>, we need only number variation for the copula. +</P> +<PRE> + copula : Number -> Str = + \n -> case n of { + Sg => "è" ; + Pl => "sono" + } ; +</PRE> +<P> +Determination is more complex than in English, because of gender: +it uses separate determiner forms for the two genders, and selects +one of them as a function of the noun determined. +</P> +<PRE> + det : Number -> Str -> Str -> {s : Number => Str ; g : Gender} -> + {s : Str ; g : Gender ; n : Number} = + \n,m,f,cn -> { + s = case cn.g of {Masc => m ; Fem => f} ++ cn.s ! n ; + g = cn.g ; + n = n + } ; +</PRE> +<P> +Here is, finally, the complete set of linearization rules. +</P> +<PRE> + lin + Is item quality = + ss (item.s ++ copula item.n ++ quality.s ! item.g ! item.n) ; + This = det Sg "questo" "questa" ; + That = det Sg "quello" "quella" ; + These = det Pl "questi" "queste" ; + Those = det Pl "quelli" "quelle" ; + QKind quality kind = { + s = \\n => kind.s ! n ++ quality.s ! kind.g ! n ; + g = kind.g + } ; + Wine = noun "vino" "vini" Masc ; + Cheese = noun "formaggio" "formaggi" Masc ; + Fish = noun "pesce" "pesci" Masc ; + Pizza = noun "pizza" "pizze" Fem ; + Very qual = {s = \\g,n => "molto" ++ qual.s ! g ! n} ; + Fresh = adjective "fresco" "fresca" "freschi" "fresche" ; + Warm = regAdj "caldo" ; + Italian = regAdj "italiano" ; + Expensive = regAdj "caro" ; + Delicious = regAdj "delizioso" ; + Boring = regAdj "noioso" ; + } +</PRE> +<P></P> +<P> +<B>Exercise</B>. Experiment with multilingual generation and translation in the +<CODE>Foods</CODE> grammars. +</P> +<P> +<B>Exercise</B>. Add items, qualities, and determiners to the grammar, and try to get +their inflection and inherent features right. +</P> +<P> +<B>Exercise</B>. 
Write a concrete syntax of <CODE>Food</CODE> for a language of your choice, +now aiming for complete grammatical correctness by the use of parameters. +</P> +<P> +<B>Exercise</B>. Measure the size of the context-free grammar corresponding to +<CODE>FoodsIta</CODE>. You can do this by printing the grammar in the context-free format +(<CODE>print_grammar -printer=cfg</CODE>) and counting the lines. +</P> +<A NAME="toc69"></A> +<H2>Discontinuous constituents</H2> +<P> +A linearization type may contain more strings than one. +An example of where this is useful are English particle +verbs, such as <I>switch off</I>. The linearization of +a sentence may place the object between the verb and the particle: +<I>he switched it off</I>. +</P> +<P> +The following judgement defines transitive verbs as +<B>discontinuous constituents</B>, i.e. as having a linearization +type with two strings and not just one. +</P> +<PRE> + lincat TV = {s : Number => Str ; part : Str} ; +</PRE> +<P> +In the abstract syntax, we can now have a rule that combines a subject and an +object item with a transitive verb to form a sentence: +</P> +<PRE> + fun AppTV : Item -> TV -> Item -> Phrase ; +</PRE> +<P> +The linearization rule places the object between the two parts of the verb: +</P> +<PRE> + lin AppTV subj tv obj = + {s = subj.s ++ tv.s ! subj.n ++ obj.s ++ tv.part} ; +</PRE> +<P> +There is no restriction in the number of discontinuous constituents +(or other fields) a <CODE>lincat</CODE> may contain. The only condition is that +the fields must be built from records, tables, +parameters, and <CODE>Str</CODE>, but not functions. +</P> +<P> +Notice that the parsing and linearization commands only give accurate +results for categories whose linearization type has a unique <CODE>Str</CODE> +valued field labelled <CODE>s</CODE>. Therefore, discontinuous constituents +are not a good idea in top-level categories accessed by the users +of a grammar application. +</P> +<P> +<B>Exercise</B>. 
Define the language <CODE>a^n b^n c^n</CODE> in GF, i.e. +any number of <I>a</I>'s followed by the same number of <I>b</I>'s and +the same number of <I>c</I>'s. This language is not context-free, +but can be defined in GF by using discontinuous constituents. +</P> +<A NAME="toc70"></A> +<H2>Strings at compile time vs. run time</H2> +<P> +A common source of difficulty in GF is the set of conditions under which tokens +can be created. Tokens are created in the following ways: +</P> +<UL> +<LI>quoted string: <CODE>"foo"</CODE> +<LI>gluing: <CODE>t + s</CODE> +<LI>predefined operations <CODE>init, tail, tk, dp</CODE> +<LI>pattern matching over strings +</UL> + +<P> +The general principle is that +<I>tokens must be known at compile time</I>. This means that the above operations +may not have <B>run-time variables</B> in their arguments. Run-time variables, in +turn, are the variables that stand for function arguments in linearization rules. +</P> +<P> +Hence it is not legal to write +</P> +<PRE> + cat Noun ; + fun Plural : Noun -> Noun ; + lin Plural n = {s = n.s + "s"} ; +</PRE> +<P> +because <CODE>n</CODE> is a run-time variable. Also +</P> +<PRE> + lin Plural n = {s = (regNoun n).s ! Pl} ; +</PRE> +<P> +is incorrect with <CODE>regNoun</CODE> as defined <a href="#secinflection">here</a>, because the run-time +variable is eventually sent to string pattern matching and gluing. +</P> +<P> +Writing tokens together without a space is an often-wanted behaviour, for instance, +with punctuation marks. Thus one might try to write +</P> +<PRE> + lin Question p = {s = p + "?"} ; +</PRE> +<P> +which is incorrect. The way to go is to use an <B>unlexer</B> that creates correct spacing +after linearization. Correspondingly, a <B>lexer</B> that e.g. analyses <CODE>"warm?"</CODE> into +two tokens is needed before parsing. This can be done by using flags: +</P> +<PRE> + flags lexer=text ; unlexer=text ; +</PRE> +<P> +works in the desired way for English text. 
More on lexers and unlexers will be +told <a href="#seclexing">here</a>. +</P> +<A NAME="toc71"></A> +<H2>Summary of GF language features</H2> +<A NAME="toc72"></A> +<H3>Parameter and table types</H3> +<P> +A judgement of the form +<center> + <CODE>param</CODE> <I>P</I> <CODE>=</CODE> <I>C1</I> <I>X1</I> <CODE>|</CODE> ... <CODE>|</CODE> <I>Cn</I> <I>Xn</I> +</center> +defines a <B>parameter type</B> <I>P</I> with <B>constructors</B> <I>C1</I> ... <I>Cn</I>. +Each constructor has a <B>context</B> <I>X</I>, which is a (possibly empty) +sequence of parameter types. A <B>parameter value</B> is an application +of a constructor to a sequence of parameter values from each type in +its context. +</P> +<P> +In addition to types defined in <CODE>param</CODE> judgements, also +records of parameter types are parameter types. Their values are records +of corresponding field values. +</P> +<P> +Moreover, the type <CODE>Ints</CODE> <I>n</I> is a parameter type for any positive +integer <I>n</I>, and its values are <CODE>0</CODE>, ..., <I>n-1</I>. +</P> +<P> +A <B>table type</B> <I>P</I> <CODE>=></CODE> <I>T</I> must have a parameter type <I>P</I> as +its argument type. The normal form of an object of this type is a <B>table</B> +<center> + <CODE>table {</CODE> <I>V1</I> <CODE>=></CODE> <I>t1</I> <CODE>;</CODE> ... <CODE>;</CODE> <I>Vm</I> <CODE>=></CODE> <I>tm</I> <CODE>}</CODE> +</center> +which has a <B>branch</B> for every parameter value <I>Vi</I> of type <I>P</I>. +A table can be given in many other ways by using pattern matching. +</P> +<P> +Tables with only one branch are a common special case. +GF provides syntactic sugar for writing one-branch tables concisely: +</P> +<PRE> + \\P,...,Q => t === table {P => ... table {Q => t} ...} +</PRE> +<P></P> +<A NAME="toc73"></A> +<H3>Pattern matching</H3> +<P> +<a name="secmatching"></a> +</P> +<P> +We will list all forms of patterns that can be used in table branches. 
+The following are available for any parameter types, as well +as for the types <CODE>Int</CODE> and <CODE>Str</CODE>: +</P> +<UL> +<LI>a constructor pattern <I>C P1 ... Pn</I> matches any value <I>C V1 ... Vn</I> where + each <I>Vi</I> matches <I>Pi</I>, + and binds the union of all variables bound in the subpatterns <I>Pi</I> +<LI>a record pattern + <CODE>{</CODE> <I>r1</I> <CODE>=</CODE> <I>P1</I> <CODE>;</CODE> ... <CODE>;</CODE> <I>rn</I> <CODE>=</CODE> <I>Pn</I> <CODE>}</CODE> + matches any record whose corresponding fields have values matching the subpatterns, + and binds the union of all variables bound in the subpatterns <I>Pi</I> +<LI>a variable pattern <I>x</I> + (identifier other than constant parameter) matches any value, and + binds <I>x</I> to this value +<LI>the wild card <CODE>_</CODE> matches any value +<LI>a disjunctive pattern <I>P</I> <CODE>|</CODE> <I>Q</I> matches anything that + either <I>P</I> or <I>Q</I> matches; bindings must be the same in both +<LI>a negative pattern <CODE>-</CODE><I>P</I> matches anything that <I>P</I> does not match; + no bindings are returned +<LI>an alias pattern <I>x</I> <CODE>@</CODE> <I>P</I> matches whatever value <I>P</I> matches and + binds <I>x</I> to this value; also the bindings in <I>P</I> are returned +</UL> + +<P> +The following patterns are only available for the type <CODE>Str</CODE>: +</P> +<UL> +<LI>a string literal pattern, e.g. <CODE>"s"</CODE>, matches the same string +<LI>a concatenation pattern <I>P</I> <CODE>+</CODE> <I>Q</I> matches any string that consists + of a prefix matching <I>P</I> and a suffix matching <I>Q</I>; + the union of bindings is returned +<LI>a repetition pattern <I>P</I><CODE>*</CODE> matches any string that can be decomposed + into strings that match <I>P</I>; no bindings are returned +</UL> + +<P> +The following pattern is only available for the types <CODE>Int</CODE> and <CODE>Ints</CODE> <I>n</I>: +</P> +<UL> +<LI>an integer literal pattern, e.g. 
<CODE>214</CODE>, matches the same integer +</UL> + +<P> +Pattern matching is performed in the order in which the branches +appear in the table: the branch of the first matching pattern is followed. +The type checker rejects sets of patterns that are not exhaustive, and +warns about completely overshadowed patterns. +To guarantee exhaustivity when the infinite types <CODE>Int</CODE> and <CODE>Str</CODE> are +used as argument types, the last pattern must be a "catch-all" variable +or wild card. +</P> +<P> +It follows from the definition of record pattern matching +that it can utilize partial records: the branch +</P> +<PRE> + {g = Fem} => t +</PRE> +<P> +in a table of type <CODE>{g : Gender ; n : Number} => T</CODE> means the same as +</P> +<PRE> + {g = Fem ; n = _} => t +</PRE> +<P> +Variables in regular expression patterns +are always bound to the <B>first match</B>, which is the first +in the sequence of binding lists. For example: +</P> +<UL> +<LI><CODE>x + "e" + y</CODE> matches <CODE>"peter"</CODE> with <CODE>x = "p", y = "ter"</CODE> +<LI><CODE>x + "er"*</CODE> matches <CODE>"burgerer"</CODE> with <CODE>x = "burg"</CODE> +</UL> + +<A NAME="toc74"></A> +<H3>Overloading</H3> +<P> +Judgements of the <CODE>oper</CODE> form can introduce overloaded functions. +The syntax is record-like, but all fields must have the same +name and different types. +</P> +<PRE> + oper mkN = overload { + mkN : (dog : Str) -> Noun = regNoun ; + mkN : (mouse,mice : Str) -> Noun = mkNoun ; + } +</PRE> +<P> +To give just the type of an overloaded operation, the record type +syntax is used. +</P> +<PRE> + oper mkN : overload { + mkN : (dog : Str) -> Noun ; -- regular nouns + mkN : (mouse,mice : Str) -> Noun ; -- irregular nouns + } +</PRE> +<P> +Overloading is not possible in other forms of judgement. 
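+</P>
+<P>
+As a minimal sketch of how such an overloaded operation might be used, assume that
+<CODE>Noun</CODE>, <CODE>regNoun</CODE>, and <CODE>mkNoun</CODE> are defined as in the
+overload group above; the names <CODE>dog_N</CODE> and <CODE>mouse_N</CODE> are
+hypothetical. A call to <CODE>mkN</CODE> is then resolved by the number and types of
+its arguments:
+</P>
+<PRE>
+  oper
+    dog_N   : Noun = mkN "dog" ;           -- one Str argument: should resolve to regNoun
+    mouse_N : Noun = mkN "mouse" "mice" ;  -- two Str arguments: should resolve to mkNoun
+</PRE>
+<P>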
+</P> +<A NAME="toc75"></A> +<H3>Local definitions</H3> +<P> +Local definitions ("<CODE>let</CODE> expressions") can appear in groups: +</P> +<PRE> + oper regNoun : Str -> Noun = \vino -> + let + vin : Str = init vino ; + o = last vino + in + ... +</PRE> +<P> +The type can be omitted if it can be inferred. Later definitions may +refer to earlier ones. +</P> +<A NAME="toc76"></A> +<H3>Supplementary constructs</H3> +<P> +The rest of the GF language constructs are presented for the sake +of completeness. They will not be used in the rest of this tutorial. +</P> +<H4>Record extension and subtyping</H4> +<P> +Record types and records can be <B>extended</B> with new fields. For instance, +in German it is natural to see transitive verbs as verbs with a case, which +is usually accusative or dative, and is passed to the object of the verb. +The symbol <CODE>**</CODE> is used for both record types and record objects. +</P> +<PRE> + lincat TV = Verb ** {c : Case} ; + + lin Follow = regVerb "folgen" ** {c = Dative} ; +</PRE> +<P> +To extend a record type or a record with a field whose label it +already has is a type error. It is also an error to extend a type or +object that is not a record. +</P> +<P> +A record type <I>T</I> is a <B>subtype</B> of another one <I>R</I>, if <I>T</I> has +all the fields of <I>R</I> and possibly other fields. For instance, +an extension of a record type is always a subtype of it. +If <I>T</I> is a subtype of <I>R</I>, then <I>R</I> is a <B>supertype</B> of <I>T</I>. +</P> +<P> +If <I>T</I> is a subtype of <I>R</I>, an object of <I>T</I> can be used whenever +an object of <I>R</I> is required. +For instance, a transitive verb can be used whenever a verb is required. +</P> +<P> +<B>Covariance</B> means that a function returning a record <I>T</I> as value can +also be used to return a value of a supertype <I>R</I>. +<B>Contravariance</B> means that a function taking an <I>R</I> as argument +can also be applied to any object of a subtype <I>T</I>. 
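+</P>
+<P>
+The following sketch illustrates how subtyping might be exploited in a resource module.
+It assumes a simplified, hypothetical verb type with a single string field and a
+hypothetical parameter type <CODE>Case</CODE>; the types of an actual German grammar
+would be richer, and all names introduced here are for illustration only.
+</P>
+<PRE>
+  param Case = Accusative | Dative ;       -- hypothetical two-case system
+
+  oper
+    Verb : Type = {s : Str} ;              -- simplified, hypothetical verb type
+    TV   : Type = Verb ** {c : Case} ;     -- an extension, hence a subtype of Verb
+
+    verbStr : Verb -> Str = \v -> v.s ;    -- expects an object of type Verb
+
+    folgen_TV : TV = {s = "folgen"} ** {c = Dative} ;
+
+    -- a TV is accepted where a Verb is required, since TV has all fields of Verb
+    folgenStr : Str = verbStr folgen_TV ;
+</PRE>
+<P>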
+</P> +<H4>Tuples and product types</H4> +<P> +Product types and tuples are syntactic sugar for record types and records: +</P> +<PRE> + T1 * ... * Tn === {p1 : T1 ; ... ; pn : Tn} + <t1, ..., tn> === {p1 = t1 ; ... ; pn = tn} +</PRE> +<P> +Thus the labels <CODE>p1, p2,...</CODE> are hard-coded. +As patterns, tuples are translated to record patterns in the +same way as tuples to records; partial patterns make it +possible to write, slightly surprisingly, +</P> +<PRE> + case <g,n,p> of { + <Fem> => t + ... + } +</PRE> +<P></P> +<H4>Prefix-dependent choices</H4> +<P> +Sometimes a token has different forms depending on the token +that follows. An example is the English indefinite article, +which is <I>an</I> if a vowel follows, <I>a</I> otherwise. +Which form is chosen can only be decided at run time, i.e. +when a string is actually built. GF has a special construct for +such tokens, the <CODE>pre</CODE> construct exemplified in +</P> +<PRE> + oper artIndef : Str = + pre {"a" ; "an" / strs {"a" ; "e" ; "i" ; "o"}} ; +</PRE> +<P> +Thus +</P> +<PRE> + artIndef ++ "cheese" ---> "a" ++ "cheese" + artIndef ++ "apple" ---> "an" ++ "apple" +</PRE> +<P> +This very example does not work in all situations: the prefix +<I>u</I> has no general rules, and some problematic words are +<I>euphemism, one-eyed, n-gram</I>. Since the branches are matched in +order, it is possible to write +</P> +<PRE> + oper artIndef : Str = + pre {"a" ; + "a" / strs {"eu" ; "one"} ; + "an" / strs {"a" ; "e" ; "i" ; "o" ; "n-"} + } ; +</PRE> +<P> +Somewhat illogically, the default value is given as the first element in the list. +</P> +<P> +<I>Prefix-dependent choice may be deprecated in GF version 3.</I> +</P> +<A NAME="toc77"></A> +<H1>Using the resource grammar library</H1> +<P> +<a name="chapfive"></a> +</P> +<P> +In this chapter, we will take a look at the GF resource grammar library. 
+We will use the library to implement the <CODE>Foods</CODE> grammar of the +previous chapter +and port it to some new languages. Some new concepts of GF's module system +are also introduced, most notably the technique of <B>parametrized modules</B>, +which has become an important "design pattern" for multilingual grammars. +</P> +<A NAME="toc78"></A> +<H2>The coverage of the library</H2> +<P> +The GF Resource Grammar Library contains grammar rules for +10 languages. In addition, 2 languages are available as yet incomplete +implementations, and a few more are under construction. The purpose +of the library is to define the low-level morphological and syntactic +rules of languages, and thereby enable application programmers +to concentrate on the semantic and stylistic +aspects of their grammars. The guiding principle is that +<center> +grammar checking becomes type checking +</center> +that is, whatever is type-correct in the resource grammar is also +grammatically correct. +</P> +<P> +The intended level of application grammarians +is that of a skilled programmer with +a practical knowledge of the target languages, but without +theoretical knowledge about their grammars. +Such a combination of +skills is typical of programmers who, for instance, want to localize +language software to new languages. +</P> +<P> +The current resource languages are +</P> +<UL> +<LI><CODE>Ara</CODE>bic (incomplete) +<LI><CODE>Cat</CODE>alan (incomplete) +<LI><CODE>Dan</CODE>ish +<LI><CODE>Eng</CODE>lish +<LI><CODE>Fin</CODE>nish +<LI><CODE>Fre</CODE>nch +<LI><CODE>Ger</CODE>man +<LI><CODE>Ita</CODE>lian +<LI><CODE>Nor</CODE>wegian +<LI><CODE>Rus</CODE>sian +<LI><CODE>Spa</CODE>nish +<LI><CODE>Swe</CODE>dish +</UL> + +<P> +The first three letters (<CODE>Eng</CODE> etc) are used in grammar module names. +We use the three-letter codes for languages from the ISO 639 standard. 
+</P> +<P> +The incomplete Arabic and Catalan implementations are +sufficient for use in some applications; they both contain, among other +things, complete inflectional morphology. +</P> +<A NAME="toc79"></A> +<H2>The structure of the library</H2> +<P> +<a name="seclexical"></a> +</P> +<A NAME="toc80"></A> +<H3>Lexical vs. phrasal rules</H3> +<P> +So far we have looked at grammars from a semantic point of view: +a grammar defines a system of meanings (specified in the abstract syntax) and +tells how they are expressed in some language (as specified in the concrete syntax). +In resource grammars, as often in the linguistic tradition, the goal is more modest: +to specify the <B>grammatically correct combinations of words</B>, whatever their +meanings are. With this more modest goal, it is possible to achieve a much +wider coverage than with semantic grammars. +</P> +<P> +Given the focus on <I>words</I> and their combinations, +the resource grammar has two kinds of categories and two kinds of rules: +</P> +<UL> +<LI>lexical: + <UL> + <LI>lexical categories, to classify words + <LI>lexical rules, to define words and their properties + </UL> +</UL> + +<UL> +<LI>phrasal (combinatorial, syntactic): + <UL> + <LI>phrasal categories, to classify phrases of arbitrary size + <LI>phrasal rules, to combine phrases into larger phrases + </UL> +</UL> + +<P> +Some grammar formalisms make a formal distinction between +the lexical and syntactic +components; sometimes it is necessary to use separate formalisms for these +two kinds of rules. GF has no such restrictions. +Nevertheless, it has turned out +to be a good discipline to maintain a distinction between +the lexical and syntactic components in the resource grammar. This also +fits well with what is needed in applications: while syntactic structures +are more or less the same across applications, vocabularies can be +very different. 
+</P> +<A NAME="toc81"></A> +<H3>Lexical categories</H3> +<P> +Within lexical categories, there is a further classification +into <B>closed</B> and <B>open</B> categories. The defining property +of closed categories is that the +words in them can easily be enumerated; it is very seldom that any +new words are introduced in them. In general, closed categories +contain <B>structural words</B>, also known as <B>function words</B>. +Examples of closed categories are +</P> +<PRE> + QuantSg ; -- singular quantifier e.g. "this" + QuantPl ; -- plural quantifier e.g. "those" + AdA ; -- adadjective e.g. "very" +</PRE> +<P> +We have already used words of all these categories in the <CODE>Food</CODE> +examples; they have just not been assigned a category, but +treated as <B>syncategorematic</B>. In GF, a syncategorematic +word is one that is introduced in a linearization rule of +some construction alongside some other expressions that +are combined; there is no abstract syntax tree for that word +alone. Thus in the rules +</P> +<PRE> + fun That : Kind -> Item ; + lin That k = {s = "that" ++ k.s} ; +</PRE> +<P> +the word <I>that</I> is syncategorematic. In linguistically motivated +grammars, syncategorematic words are avoided, whereas in +semantically motivated grammars, structural words are typically treated +as syncategorematic. This is partly so because the function expressed +by a structural word in one language is often expressed by some other +means than an individual word in another. For instance, the definite +article <I>the</I> is a determiner word in English, whereas Swedish expresses +determination by inflecting the determined noun: <I>the wine</I> is <I>vinet</I> +in Swedish. +</P> +<P> +As for open categories, we will start with these two: +</P> +<PRE> + N ; -- noun e.g. "pizza" + A ; -- adjective e.g. "good" +</PRE> +<P> +Later in this chapter we will also need verbs of different kinds. +</P> +<P> +<I>Note</I>. 
Having adadjectives as a closed category is not quite right, because +one can form adadjectives from adjectives: <I>incredibly warm</I>. +</P> +<A NAME="toc82"></A> +<H3>Lexical rules</H3> +<P> +The words of closed categories can be listed once and for all in a +library. In the first example, the <CODE>Foods</CODE> grammar of the previous section, +we will use the following structural words from the <CODE>Syntax</CODE> module: +</P> +<PRE> + this_QuantSg, that_QuantSg : QuantSg ; + these_QuantPl, those_QuantPl : QuantPl ; + very_AdA : AdA ; +</PRE> +<P> +The naming convention for lexical rules is that we use a word followed by +the category. In this way we can for instance distinguish the quantifier +<I>that</I> from the conjunction <I>that</I>. +</P> +<P> +Open lexical categories have no objects in <CODE>Syntax</CODE>. Such objects +will be built as they are needed in applications. The abstract +syntax of words in applications is already familiar, e.g. +</P> +<PRE> + fun Wine : Kind ; +</PRE> +<P> +The concrete syntax can be given directly, e.g. +</P> +<PRE> + lin Wine = mkN "wine" ; +</PRE> +<P> +by using the morphological paradigm library <CODE>ParadigmsEng</CODE>. +However, there are some advantages in giving the concrete syntax +indirectly, via the creation of a <B>resource lexicon</B>. In this lexicon, +there will be entries such as +</P> +<PRE> + oper wine_N : N = mkN "wine" ; +</PRE> +<P> +which can then be used in the linearization rules, +</P> +<PRE> + lin Wine = wine_N ; +</PRE> +<P> +One advantage of this indirect method is that each new word gives +an addition to a reusable resource lexicon, instead of just doing +the job of implementing the application. Another advantage will +be shown <a href="#secfunctor">here</a>: the possibility to write functors over +lexicon interfaces. +</P> +<A NAME="toc83"></A> +<H3>Phrasal categories</H3> +<P> +There are just four phrasal categories needed in the first application: +</P> +<PRE> + Cl ; -- clause e.g. 
"this pizza is good" + NP ; -- noun phrase e.g. "this pizza" + CN ; -- common noun e.g. "warm pizza" + AP ; -- adjectival phrase e.g. "very warm" +</PRE> +<P> +Clauses are, roughly, the same as declarative sentences; we will +define <a href="#secextended">here</a> a sentence <CODE>S</CODE> as a clause that has a fixed tense. +The distinction between common nouns and noun phrases is that common nouns +cannot generally be used alone as subjects (?<I>dog sleeps</I>), +whereas noun phrases can (<I>the dog sleeps</I>). +Noun phrases can be built from common nouns by adding determiners, +such as quantifiers; but there are also other kinds of noun phrases, e.g. +pronouns. +</P> +<P> +The syntactic combinations we need are the following: +</P> +<PRE> + mkCl : NP -> AP -> Cl ; -- e.g. "this pizza is very warm" + mkNP : QuantSg -> CN -> NP ; -- e.g. "this pizza" + mkNP : QuantPl -> CN -> NP ; -- e.g. "these pizzas" + mkCN : AP -> CN -> CN ; -- e.g. "warm pizza" + mkAP : AdA -> AP -> AP ; -- e.g. "very warm" +</PRE> +<P> +To start building phrases, we need rules of <B>lexical insertion</B>, which +form phrases from single words: +</P> +<PRE> + mkCN : N -> CN ; + mkAP : A -> AP ; +</PRE> +<P> +Notice that all (or as many as possible) operations in the resource library +have the name <CODE>mk</CODE><I>C</I>, where <I>C</I> is the value category of the operation. +This of course means heavy overloading. For instance, the current library +(version 1.2) has no less than 23 operations named <CODE>mkNP</CODE>! 
+</P> +<P> +Now the sentence +<center> +<I>these very warm pizzas are Italian</I> +</center> +can be built as follows: +</P> +<PRE> + mkCl + (mkNP these_QuantPl + (mkCN (mkAP very_AdA (mkAP warm_A)) (mkCN pizza_N))) + (mkAP italian_A) +</PRE> +<P> +The task we are facing now is to define the concrete syntax of <CODE>Foods</CODE> so that +this syntactic tree gives the value of linearizing the semantic tree +</P> +<PRE> + Is (These (QKind (Very Warm) Pizza)) Italian +</PRE> +<P></P> +<A NAME="toc84"></A> +<H2>The resource API</H2> +<P> +The resource library API is divided into language-specific +and language-independent parts. To put it roughly, +</P> +<UL> +<LI>the syntax API is language-independent, i.e. has the same types and + functions for all languages. + Its name is <CODE>Syntax</CODE><I>L</I> for each language <I>L</I> +<LI>the morphology API is language-specific, i.e. has partly + different types and functions + for different languages. + Its name is <CODE>Paradigms</CODE><I>L</I> for each language <I>L</I> +</UL> + +<P> +Full documentation of the API is available on-line in the +<B>resource synopsis</B>. +For the examples of this chapter, we will only need a +fragment of the full API. The fragment needed for <CODE>Foods</CODE> has +already been introduced, but let us summarize the descriptions +by giving tables of the same form as used in the resource synopsis. +</P> +<P> +Thus we will make use of the following categories from the module <CODE>Syntax</CODE>. 
+</P> +<TABLE CELLPADDING="4" BORDER="1"> +<TR> +<TH>Category</TH> +<TH>Explanation</TH> +<TH COLSPAN="2">Example</TH> +</TR> +<TR> +<TD><CODE>Cl</CODE></TD> +<TD>clause (sentence), with all tenses</TD> +<TD><I>she looks at this</I></TD> +</TR> +<TR> +<TD><CODE>AP</CODE></TD> +<TD>adjectival phrase</TD> +<TD><I>very warm</I></TD> +</TR> +<TR> +<TD><CODE>CN</CODE></TD> +<TD>common noun (without determiner)</TD> +<TD><I>red house</I></TD> +</TR> +<TR> +<TD><CODE>NP</CODE></TD> +<TD>noun phrase (subject or object)</TD> +<TD><I>the red house</I></TD> +</TR> +<TR> +<TD><CODE>AdA</CODE></TD> +<TD>adjective-modifying adverb</TD> +<TD><I>very</I></TD> +</TR> +<TR> +<TD><CODE>QuantSg</CODE></TD> +<TD>singular quantifier</TD> +<TD><I>this</I></TD> +</TR> +<TR> +<TD><CODE>QuantPl</CODE></TD> +<TD>plural quantifier</TD> +<TD><I>these</I></TD> +</TR> +<TR> +<TD><CODE>A</CODE></TD> +<TD>one-place adjective</TD> +<TD><I>warm</I></TD> +</TR> +<TR> +<TD><CODE>N</CODE></TD> +<TD>common noun</TD> +<TD><I>house</I></TD> +</TR> +</TABLE> + +<P></P> +<P> +We will use the following syntax rules from <CODE>Syntax</CODE>. 
+</P> +<TABLE CELLPADDING="4" BORDER="1"> +<TR> +<TH>Function</TH> +<TH>Type</TH> +<TH COLSPAN="2">Example</TH> +</TR> +<TR> +<TD><CODE>mkCl</CODE></TD> +<TD><CODE>NP -> AP -> Cl</CODE></TD> +<TD><I>John is very old</I></TD> +</TR> +<TR> +<TD><CODE>mkNP</CODE></TD> +<TD><CODE>QuantSg -> CN -> NP</CODE></TD> +<TD><I>this old man</I></TD> +</TR> +<TR> +<TD><CODE>mkNP</CODE></TD> +<TD><CODE>QuantPl -> CN -> NP</CODE></TD> +<TD><I>these old men</I></TD> +</TR> +<TR> +<TD><CODE>mkCN</CODE></TD> +<TD><CODE>N -> CN</CODE></TD> +<TD><I>house</I></TD> +</TR> +<TR> +<TD><CODE>mkCN</CODE></TD> +<TD><CODE>AP -> CN -> CN</CODE></TD> +<TD><I>very big blue house</I></TD> +</TR> +<TR> +<TD><CODE>mkAP</CODE></TD> +<TD><CODE>A -> AP</CODE></TD> +<TD><I>old</I></TD> +</TR> +<TR> +<TD><CODE>mkAP</CODE></TD> +<TD><CODE>AdA -> AP -> AP</CODE></TD> +<TD><I>very very old</I></TD> +</TR> +</TABLE> + +<P></P> +<P> +We will use the following structural words from <CODE>Syntax</CODE>. +</P> +<TABLE CELLPADDING="4" BORDER="1"> +<TR> +<TH>Function</TH> +<TH>Type</TH> +<TH COLSPAN="2">In English</TH> +</TR> +<TR> +<TD><CODE>this_QuantSg</CODE></TD> +<TD><CODE>QuantSg</CODE></TD> +<TD><I>this</I></TD> +</TR> +<TR> +<TD><CODE>that_QuantSg</CODE></TD> +<TD><CODE>QuantSg</CODE></TD> +<TD><I>that</I></TD> +</TR> +<TR> +<TD><CODE>these_QuantPl</CODE></TD> +<TD><CODE>QuantPl</CODE></TD> +<TD><I>these</I></TD> +</TR> +<TR> +<TD><CODE>those_QuantPl</CODE></TD> +<TD><CODE>QuantPl</CODE></TD> +<TD><I>those</I></TD> +</TR> +<TR> +<TD><CODE>very_AdA</CODE></TD> +<TD><CODE>AdA</CODE></TD> +<TD><I>very</I></TD> +</TR> +</TABLE> + +<P></P> +<P> +For English, we will use the following part of <CODE>ParadigmsEng</CODE>. 
+</P> +<TABLE CELLPADDING="4" BORDER="1"> +<TR> +<TH>Function</TH> +<TH COLSPAN="2">Type</TH> +</TR> +<TR> +<TD><CODE>mkN</CODE></TD> +<TD><CODE>(dog : Str) -> N</CODE></TD> +</TR> +<TR> +<TD><CODE>mkN</CODE></TD> +<TD><CODE>(man,men : Str) -> N</CODE></TD> +</TR> +<TR> +<TD><CODE>mkA</CODE></TD> +<TD><CODE>(cold : Str) -> A</CODE></TD> +</TR> +</TABLE> + +<P></P> +<P> +For Italian, we need just the following part of <CODE>ParadigmsIta</CODE> +(Exercise). The "smart" paradigms will take care of variations +such as <I>formaggio</I>-<I>formaggi</I>, and also infer the genders +correctly. +</P> +<TABLE CELLPADDING="4" BORDER="1"> +<TR> +<TH>Function</TH> +<TH COLSPAN="2">Type</TH> +</TR> +<TR> +<TD><CODE>mkN</CODE></TD> +<TD><CODE>(vino : Str) -> N</CODE></TD> +</TR> +<TR> +<TD><CODE>mkA</CODE></TD> +<TD><CODE>(caro : Str) -> A</CODE></TD> +</TR> +</TABLE> + +<P></P> +<P> +For German, we will use the following part of <CODE>ParadigmsGer</CODE>. +</P> +<TABLE CELLPADDING="4" BORDER="1"> +<TR> +<TH>Function</TH> +<TH COLSPAN="2">Type</TH> +</TR> +<TR> +<TD><CODE>Gender</CODE></TD> +<TD><CODE>Type</CODE></TD> +</TR> +<TR> +<TD><CODE>masculine</CODE></TD> +<TD><CODE>Gender</CODE></TD> +</TR> +<TR> +<TD><CODE>feminine</CODE></TD> +<TD><CODE>Gender</CODE></TD> +</TR> +<TR> +<TD><CODE>neuter</CODE></TD> +<TD><CODE>Gender</CODE></TD> +</TR> +<TR> +<TD><CODE>mkN</CODE></TD> +<TD><CODE>(Stufe : Str) -> N</CODE></TD> +</TR> +<TR> +<TD><CODE>mkN</CODE></TD> +<TD><CODE>(Bild,Bilder : Str) -> Gender -> N</CODE></TD> +</TR> +<TR> +<TD><CODE>mkA</CODE></TD> +<TD><CODE>(klein : Str) -> A</CODE></TD> +</TR> +<TR> +<TD><CODE>mkA</CODE></TD> +<TD><CODE>(gut,besser,beste : Str) -> A</CODE></TD> +</TR> +</TABLE> + +<P></P> +<P> +For Finnish, we only need the smart regular paradigms: +</P> +<TABLE CELLPADDING="4" BORDER="1"> +<TR> +<TH>Function</TH> +<TH COLSPAN="2">Type</TH> +</TR> +<TR> +<TD><CODE>mkN</CODE></TD> +<TD><CODE>(talo : Str) -> N</CODE></TD> +</TR> +<TR> 
+<TD><CODE>mkA</CODE></TD> +<TD><CODE>(hieno : Str) -> A</CODE></TD> +</TR> +</TABLE> + +<P></P> +<P> +<B>Exercise</B>. Try out the morphological paradigms in different languages. Do +as follows: +</P> +<PRE> + > i -path=alltenses:prelude -retain alltenses/ParadigmsGer.gfr + > cc mkN "Farbe" + > cc mkA "gut" "besser" "beste" +</PRE> +<P></P> +<A NAME="toc85"></A> +<H2>Example: English</H2> +<P> +<a name="secenglish"></a> +</P> +<P> +We work with the abstract syntax <CODE>Foods</CODE> from <a href="#chaptwo">the fourth chapter</a>, and +build first an English implementation. Now we can do it without +thinking about inflection and agreement, by just picking appropriate +functions from the resource grammar library. +</P> +<P> +The concrete syntax opens <CODE>SyntaxEng</CODE> and <CODE>ParadigmsEng</CODE> +to get access to the resource libraries needed. In order to find +the libraries, a <CODE>path</CODE> directive is prepended. It contains +two resource subdirectories --- <CODE>present</CODE> and <CODE>prelude</CODE> --- +which are found relative to the environment variable <CODE>GF_LIB_PATH</CODE>. +It also contains the current directory <CODE>.</CODE> and the directory <CODE>../foods</CODE>, +in which <CODE>Foods.gf</CODE> resides. +</P> +<PRE> + --# -path=.:../foods:present:prelude + + concrete FoodsEng of Foods = open SyntaxEng,ParadigmsEng in { +</PRE> +<P> +As linearization types, we will use clauses for <CODE>Phrase</CODE>, noun phrases +for <CODE>Item</CODE>, common nouns for <CODE>Kind</CODE>, and adjectival phrases for <CODE>Quality</CODE>. 
+</P> +<PRE> + lincat + Phrase = Cl ; + Item = NP ; + Kind = CN ; + Quality = AP ; +</PRE> +<P> +These types fit perfectly with the way we have used the categories +in the application; hence +the combination rules we need almost write themselves: +</P> +<PRE> + lin + Is item quality = mkCl item quality ; + This kind = mkNP this_QuantSg kind ; + That kind = mkNP that_QuantSg kind ; + These kind = mkNP these_QuantPl kind ; + Those kind = mkNP those_QuantPl kind ; + QKind quality kind = mkCN quality kind ; + Very quality = mkAP very_AdA quality ; +</PRE> +<P> +We write the lexical part of the grammar by using resource paradigms directly. +Notice that we have to apply the lexical insertion rules to get type-correct +linearizations. Notice also that we need to use the two-place noun paradigm for +<I>fish</I>, but everything else is regular. +</P> +<PRE> + Wine = mkCN (mkN "wine") ; + Pizza = mkCN (mkN "pizza") ; + Cheese = mkCN (mkN "cheese") ; + Fish = mkCN (mkN "fish" "fish") ; + Fresh = mkAP (mkA "fresh") ; + Warm = mkAP (mkA "warm") ; + Italian = mkAP (mkA "Italian") ; + Expensive = mkAP (mkA "expensive") ; + Delicious = mkAP (mkA "delicious") ; + Boring = mkAP (mkA "boring") ; + } +</PRE> +<P></P> +<P> +<B>Exercise</B>. Compile the grammar <CODE>FoodsEng</CODE> and generate +and parse some sentences. +</P> +<P> +<B>Exercise</B>. Write a concrete syntax of <CODE>Foods</CODE> for Italian +or some other language included in the resource library. You can +compare the results with the hand-written +grammars presented earlier in this tutorial. +</P> +<A NAME="toc86"></A> +<H2>Functor implementation of multilingual grammars</H2> +<P> +<a name="secfunctor"></a> +</P> +<P> +If you did the exercise of writing a concrete syntax of <CODE>Foods</CODE> for some other +language, you probably noticed that much of the code looks exactly the same +as for English. The reason for this is that the <CODE>Syntax</CODE> API is the +same for all languages. 
This is in turn possible because +all languages (at least those in the resource package) +implement the same syntactic structures. Moreover, languages tend to use the +syntactic structures in similar ways, though not without exceptions. +But usually, it is only the lexical parts of a concrete syntax that +we need to write anew for a new language. Thus, to port a grammar to +a new language, you +</P> +<OL> +<LI>copy the concrete syntax of a given language +<LI>change the words (strings and inflection paradigms) +</OL> + +<P> +Now, programming by copy-and-paste is not worthy of a functional programmer! +So, can we write a <I>function</I> that takes care of the shared parts of grammar modules? +Yes, we can. It is not a function in the <CODE>fun</CODE> or <CODE>oper</CODE> sense, but +a function operating on modules, called a <B>functor</B>. This construct +is familiar from the functional programming +languages ML and OCaml, but it does not +exist in Haskell. It also bears some resemblance to templates in C++. +Functors are also known as <B>parametrized modules</B>. +</P> +<P> +In GF, a functor is a module that <CODE>open</CODE>s one or more <B>interfaces</B>. +An <CODE>interface</CODE> is a module similar to a <CODE>resource</CODE>, but it only +contains the <I>types</I> of <CODE>oper</CODE>s, not their definitions. You can think +of an interface as a kind of record type. The <CODE>oper</CODE> names are the +labels of this record type. The corresponding <I>record</I> is called an +<B>instance</B> of the interface. +Thus a functor is a module-level function taking instances as +arguments and producing modules as values. +</P> +<P> +Let us now write a functor implementation of the <CODE>Foods</CODE> grammar. +Consider its module header first: +</P> +<PRE> + incomplete concrete FoodsI of Foods = open Syntax, LexFoods in +</PRE> +<P> +A functor is distinguished from an ordinary module by the leading +keyword <CODE>incomplete</CODE>. 
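+</P>
+<P>
+The record analogy mentioned above can be spelled out inside GF itself. The
+following is only an illustrative sketch and not part of the <CODE>Foods</CODE>
+grammars: the module name <CODE>FunctorAnalogy</CODE> and all of its constants are
+hypothetical. A record type plays the role of an interface, records of that type
+play the role of instances, and an ordinary function over the record plays the
+role of a functor.
+</P>
+<PRE>
+  resource FunctorAnalogy = {
+    oper
+      -- corresponds to an interface: labels with their types, no definitions
+      Lex : Type = {wine : Str ; fish : Str} ;
+
+      -- correspond to instances: one record per language
+      lexEng : Lex = {wine = "wine" ; fish = "fish"} ;
+      lexGer : Lex = {wine = "Wein" ; fish = "Fisch"} ;
+
+      -- corresponds to a functor: shared code that refers only to the labels
+      wineFish : Lex -> Str = \lex -> lex.wine ++ lex.fish ;
+  }
+</PRE>
+<P>
+In this sketch, applying <CODE>wineFish</CODE> to <CODE>lexGer</CODE> corresponds to
+instantiating a functor for German. Real functors differ in that they operate on
+whole modules rather than on record values.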
+</P> +<P> +In the functor-function analogy, <CODE>FoodsI</CODE> would be presented as a function +with the following type signature: +</P> +<PRE> + FoodsI : + instance of Syntax -> instance of LexFoods -> concrete of Foods +</PRE> +<P> +It takes as arguments instances of two interfaces: +</P> +<UL> +<LI><CODE>Syntax</CODE>, the resource grammar interface +<LI><CODE>LexFoods</CODE>, the domain-specific lexicon interface +</UL> + +<P> +Functors opening <CODE>Syntax</CODE> and a domain lexicon interface are in fact +so typical in GF applications that this structure could be called +a <B>design pattern</B> +for GF grammars. What makes this pattern so useful is, again, that +languages tend to use the same syntactic structures and only differ in words. +</P> +<P> +We will show the exact syntax of interfaces and instances in the next section. +Here it is enough to know that we have +</P> +<UL> +<LI><CODE>SyntaxGer</CODE>, an instance of <CODE>Syntax</CODE> +<LI><CODE>LexFoodsGer</CODE>, an instance of <CODE>LexFoods</CODE> +</UL> + +<P> +Then we can complete the German implementation by "applying" the functor: +</P> +<PRE> + FoodsI SyntaxGer LexFoodsGer : concrete of Foods +</PRE> +<P> +The GF syntax for doing so is +</P> +<PRE> + concrete FoodsGer of Foods = FoodsI with + (Syntax = SyntaxGer), + (LexFoods = LexFoodsGer) ; +</PRE> +<P> +Notice that this is the <I>whole</I> module, not just a header of it. +The module body is received from <CODE>FoodsI</CODE>, by instantiating the +interface constants with their definitions given in the German +instances. A module of this form, characterized by the keyword <CODE>with</CODE>, is +called a <B>functor instantiation</B>. 
+</P> +<P> +Here is the complete code for the functor <CODE>FoodsI</CODE>: +</P> +<PRE> + --# -path=.:../foods:present:prelude + + incomplete concrete FoodsI of Foods = open Syntax, LexFoods in { + lincat + Phrase = Cl ; + Item = NP ; + Kind = CN ; + Quality = AP ; + lin + Is item quality = mkCl item quality ; + This kind = mkNP this_QuantSg kind ; + That kind = mkNP that_QuantSg kind ; + These kind = mkNP these_QuantPl kind ; + Those kind = mkNP those_QuantPl kind ; + QKind quality kind = mkCN quality kind ; + Very quality = mkAP very_AdA quality ; + + Wine = mkCN wine_N ; + Pizza = mkCN pizza_N ; + Cheese = mkCN cheese_N ; + Fish = mkCN fish_N ; + Fresh = mkAP fresh_A ; + Warm = mkAP warm_A ; + Italian = mkAP italian_A ; + Expensive = mkAP expensive_A ; + Delicious = mkAP delicious_A ; + Boring = mkAP boring_A ; + } +</PRE> +<P></P> +<A NAME="toc87"></A> +<H2>Interfaces and instances</H2> +<P> +<a name="secinterface"></a> +</P> +<P> +Let us now define the <CODE>LexFoods</CODE> interface: +</P> +<PRE> + interface LexFoods = open Syntax in { + oper + wine_N : N ; + pizza_N : N ; + cheese_N : N ; + fish_N : N ; + fresh_A : A ; + warm_A : A ; + italian_A : A ; + expensive_A : A ; + delicious_A : A ; + boring_A : A ; + } +</PRE> +<P> +In this interface, only lexical items are declared. In general, an +interface can declare any functions and also types. The <CODE>Syntax</CODE> +interface does so. +</P> +<P> +Here is a German instance of the interface. 
+</P> +<PRE> + instance LexFoodsGer of LexFoods = open SyntaxGer, ParadigmsGer in { + oper + wine_N = mkN "Wein" ; + pizza_N = mkN "Pizza" "Pizzen" feminine ; + cheese_N = mkN "Käse" "Käsen" masculine ; + fish_N = mkN "Fisch" ; + fresh_A = mkA "frisch" ; + warm_A = mkA "warm" "wärmer" "wärmste" ; + italian_A = mkA "italienisch" ; + expensive_A = mkA "teuer" ; + delicious_A = mkA "köstlich" ; + boring_A = mkA "langweilig" ; + } +</PRE> +<P> +Notice that when an interface opens an interface, such as <CODE>Syntax</CODE>, +here, then its instance has to open an instance of it. But the instance +may also open some other resources --- very typically, like here, +a domain lexicon instance opens a <CODE>Paradigms</CODE> module. +</P> +<P> +Just to complete the picture, we repeat the German functor instantiation +for <CODE>FoodsI</CODE>, this time with a path directive that makes it compilable. +</P> +<PRE> + --# -path=.:../foods:present:prelude + + concrete FoodsGer of Foods = FoodsI with + (Syntax = SyntaxGer), + (LexFoods = LexFoodsGer) ; +</PRE> +<P></P> +<P> +<B>Exercise</B>. Compile and test <CODE>FoodsGer</CODE>. +</P> +<P> +<B>Exercise</B>. Refactor <CODE>FoodsEng</CODE> into a functor instantiation. +</P> +<A NAME="toc88"></A> +<H2>Adding languages to a functor implementation</H2> +<P> +Once we have an application grammar defined by using a functor, +adding a new language is simple. Just two modules need to be written: +</P> +<UL> +<LI>a domain lexicon instance +<LI>a functor instantiation +</UL> + +<P> +The functor instantiation is completely mechanical to write. +Here is one for Finnish: +</P> +<PRE> + --# -path=.:../foods:present:prelude + + concrete FoodsFin of Foods = FoodsI with + (Syntax = SyntaxFin), + (LexFoods = LexFoodsFin) ; +</PRE> +<P> +The domain lexicon instance requires some knowledge of the words of the +language: what words are used for which concepts, how the words are +inflected, plus features such as genders. 
Here is a lexicon instance for +Finnish: +</P> +<PRE> + instance LexFoodsFin of LexFoods = open SyntaxFin, ParadigmsFin in { + oper + wine_N = mkN "viini" ; + pizza_N = mkN "pizza" ; + cheese_N = mkN "juusto" ; + fish_N = mkN "kala" ; + fresh_A = mkA "tuore" ; + warm_A = mkA "lämmin" ; + italian_A = mkA "italialainen" ; + expensive_A = mkA "kallis" ; + delicious_A = mkA "herkullinen" ; + boring_A = mkA "tylsä" ; + } +</PRE> +<P></P> +<P> +<B>Exercise</B>. Instantiate the functor <CODE>FoodsI</CODE> to some language of +your choice. +</P> +<A NAME="toc89"></A> +<H2>Division of labour revisited</H2> +<P> +One purpose with the resource grammars was stated to be a division +of labour between linguists and application grammarians. We can now +reflect on what this means more precisely, by asking ourselves what +skills are required of grammarians working on different components. +</P> +<P> +Building a GF application starts from the abstract syntax. Writing +an abstract syntax requires +</P> +<UL> +<LI>understanding of the semantic structure of the application domain +<LI>knowledge of the GF fragment with categories and functions +</UL> + +<P> +If the concrete syntax is written by using a functor, the programmer +has to decide what parts of the implementation are put to the interface +and what parts are shared in the functor. This requires +</P> +<UL> +<LI>knowing how the domain concepts are expressed in natural language +<LI>knowledge of the resource grammar library --- the categories and combinators +<LI>understanding what parts are likely to be expressed in language-dependent + ways, so that they are put to an interface and not the functor +<LI>knowledge of the GF fragment with function applications and strings +</UL> + +<P> +Instantiating a ready-made functor to a new language is less demanding. 
+It requires essentially +</P> +<UL> +<LI>knowing how the domain words are expressed in the language +<LI>knowing, roughly, how these words are inflected +<LI>knowledge of the paradigms available in the library +<LI>knowledge of the GF fragment with function applications and strings +</UL> + +<P> +Notice that none of these tasks requires the use of GF records, tables, +or parameters. Thus only a small fragment of GF is needed; the rest of +GF is only relevant for those who write the libraries. Essentially, +all the machinery introduced in <a href="#chaptwo">the fourth chapter</a> is unnecessary! +</P> +<P> +Of course, grammar writing is not always just straightforward usage of libraries. +For example, GF can be used for other languages than just those in the +libraries --- for both natural and formal languages. A knowledge of records +and tables can, unfortunately, also be needed for understanding GF's error +messages. +</P> +<P> +<B>Exercise</B>. Design a small grammar that can be used for controlling +an MP3 player. The grammar should be able to recognize commands such +as <I>play this song</I>, with the following variations: +</P> +<UL> +<LI>verbs: <I>play</I>, <I>remove</I> +<LI>objects: <I>song</I>, <I>artist</I> +<LI>determiners: <I>this</I>, <I>the previous</I> +<LI>verbs without arguments: <I>stop</I>, <I>pause</I> +</UL> + +<P> +The implementation goes in the following phases: +</P> +<OL> +<LI>abstract syntax +<LI>functor and lexicon interface +<LI>lexicon instance for the first language +<LI>functor instantiation for the first language +<LI>lexicon instance for the second language +<LI>functor instantiation for the second language +<LI>... +</OL> + +<A NAME="toc90"></A> +<H2>Restricted inheritance</H2> +<P> +A functor implementation using the resource <CODE>Syntax</CODE> interface +works well as long as all concepts are expressed by using the same structures +in all languages. 
If this is not the case, the deviant linearization can +be made into a parameter and moved to the domain lexicon interface. +</P> +<P> +The <CODE>Foods</CODE> grammar works so well that we have to +take a contrived example: assume that English has +no word for <CODE>Pizza</CODE>, but has to use the paraphrase <I>Italian pie</I>. +This paraphrase is no longer a noun <CODE>N</CODE>, but a complex phrase +in the category <CODE>CN</CODE>. An obvious way to solve this problem is +to change interface <CODE>LexFoods</CODE> so that the constant declared for +<CODE>Pizza</CODE> gets a new type: +</P> +<PRE> + oper pizza_CN : CN ; +</PRE> +<P> +But this solution is unstable: we may end up changing the interface +and the functor with each new language, and we must every time also +change the interface instances for the old languages to maintain +type correctness. +</P> +<P> +A better solution is to use <B>restricted inheritance</B>: the English +instantiation inherits the functor implementation except for the +constant <CODE>Pizza</CODE>. This is how we write it: +</P> +<PRE> + --# -path=.:../foods:present:prelude + + concrete FoodsEng of Foods = FoodsI - [Pizza] with + (Syntax = SyntaxEng), + (LexFoods = LexFoodsEng) ** + open SyntaxEng, ParadigmsEng in { + + lin Pizza = mkCN (mkA "Italian") (mkN "pie") ; + } +</PRE> +<P> +Restricted inheritance is available for all inherited modules. One can for +instance exclude some mushrooms and pick up just some fruit in +the <CODE>Foodmarket</CODE> example: +</P> +<PRE> + abstract Foodmarket = Food, Fruit [Peach], Mushroom - [Agaric] +</PRE> +<P> +A concrete syntax of <CODE>Foodmarket</CODE> must then have the same inheritance +restrictions, in order to be well-typed with respect to the abstract syntax. 
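+</P>
+<P>
+As a sketch of what this means in practice, the header of an English concrete
+syntax could repeat the restrictions as follows. The module names
+<CODE>FoodmarketEng</CODE>, <CODE>FoodEng</CODE>, <CODE>FruitEng</CODE>, and
+<CODE>MushroomEng</CODE> are assumed here for illustration only; they stand for
+concrete syntaxes of the corresponding abstract modules. The sketch also assumes,
+as in the header above, that <CODE>Foodmarket</CODE> adds no functions of its own.
+</P>
+<PRE>
+  concrete FoodmarketEng of Foodmarket =
+    FoodEng, FruitEng [Peach], MushroomEng - [Agaric] ** {
+      -- nothing new is added here: all rules are inherited,
+      -- with the same inclusion and exclusion lists as in the abstract syntax
+    }
+</PRE>
+<P>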
+</P> +<A NAME="toc91"></A> +<H2>Grammar reuse</H2> +<P> +The alert reader has certainly noticed an analogy between <CODE>abstract</CODE> +and <CODE>concrete</CODE>, on the one hand, and <CODE>interface</CODE> and <CODE>instance</CODE>, +on the other. Why are these two pairs of module types kept separate +at all? There is, in fact, a very close correspondence between +judgements in the two kinds of modules: +</P> +<PRE> + cat C <---> oper C : Type + + fun f : A <---> oper f : A + + lincat C = T <---> oper C : Type = T + + lin f = t <---> oper f : A = t +</PRE> +<P> +But there are also some differences: +</P> +<UL> +<LI><CODE>abstract</CODE> and <CODE>concrete</CODE> modules define <B>top-level grammars</B>, i.e. + grammars that can be used for parsing and linearization; this is because +<LI>the types and terms in <CODE>concrete</CODE> modules are restricted to a subset + of those available in <CODE>interface</CODE>, <CODE>instance</CODE>, and <CODE>resource</CODE> +<LI><CODE>param</CODE> judgements have no counterparts in top-level grammars +</UL> + +<P> +The term that can be used for interfaces, instances, and resources is +<B>resource-level grammars</B>. +From these explanations and the above translations it follows that top-level +grammars are, in a sense, a special case of resource-level grammars. +</P> +<P> +Thus, indeed, abstract syntax modules can be used like interfaces, and concrete syntaxes +as their instances. The use of top-level grammars as resources +is called <B>grammar reuse</B>. Whether a library module is a top-level or a +resource-level module is mostly invisible to application programmers +(see the Summary <a href="#seclock">here</a> +for an exception to this). 
The GF resource grammar +library itself is in fact built in two layers: +</P> +<UL> +<LI>the <B>ground resource</B>: a set of top-level grammars for syntactic structures +<LI>the <B>surface resource</B>: a resource-level grammar with overloaded operations + defined in terms of the ground resource +</UL> + +<P> +Both the ground +resource and the surface resource can be used by application programmers, +but it is the surface resource that we use in this book. Because of overloading, +it has far fewer function names and also flatter trees. For instance, the clause +<center> +<I>these very warm pizzas are Italian</I> +</center> +which in the surface resource can be built as +</P> +<PRE> + mkCl + (mkNP these_QuantPl + (mkCN (mkAP very_AdA (mkAP warm_A)) (mkCN pizza_N))) + (mkAP italian_A) +</PRE> +<P> +has in the ground resource the much more complex tree +</P> +<PRE> + PredVP + (DetCN (DetPl (PlQuant this_Quant) NoNum NoOrd) + (AdjCN (AdAP very_AdA (PositA warm_A)) (UseN pizza_N))) + (UseComp (CompAP (PositA italian_A))) +</PRE> +<P> +The main advantage of using the ground resource is that the trees can then be found +by using the parser, as shown in the next section. Otherwise, the overloaded surface +resource constants are much easier to use. +</P> +<P> +Needless to say, once a library has been defined in some way, it is easy to +build layers of <B>derived libraries</B> on top of it, by using grammar reuse +and, in the case of multilingual libraries, functors. This is indeed how +the surface resource has been implemented: as a functor parametrized on +the abstract syntax of the ground resource. +</P> +<A NAME="toc92"></A> +<H2>Browsing the resource with GF commands</H2> +<P> +<a name="secbrowsing"></a> +</P> +<P> +In addition to reading the +<A HREF="../../lib/resource-1.0/synopsis.html">resource synopsis</A>, you +can find resource function combinations by using the parser. 
This +is so because the resource library is in the end implemented as +a top-level <CODE>abstract-concrete</CODE> grammar, on which parsing +and linearization work. +</P> +<P> +Unfortunately, currently (GF 2.8) +only English and the Scandinavian languages can be +parsed within acceptable computer resource limits when the full +resource is used. +</P> +<P> +To look for a syntax tree in the overload API by parsing, do like this: +</P> +<PRE> + % gf -path=alltenses:prelude $GF_LIB_PATH/alltenses/OverLangEng.gfc + + > p -cat=S -overload "this grammar is too big" + mkS (mkCl (mkNP this_QuantSg grammar_N) (mkAP too_AdA big_A)) +</PRE> +<P> +The <CODE>-overload</CODE> option given to the parser is a directive to find the +shallowest overloaded term that matches the parse tree. +</P> +<P> +To view linearizations in all languages by parsing from English: +</P> +<PRE> + % gf $GF_LIB_PATH/alltenses/langs.gfcm + + > p -cat=S -lang=LangEng "this grammar is too big" | tb + UseCl TPres ASimul PPos (PredVP (DetCN (DetSg (SgQuant this_Quant) + NoOrd) (UseN grammar_N)) (UseComp (CompAP (AdAP too_AdA (PositA big_A))))) + Den här grammatiken är för stor + Esta gramática es demasiado grande + (Cyrillic: eta grammatika govorit des'at' jazykov) + Denne grammatikken er for stor + Questa grammatica è troppo grande + Diese Grammatik ist zu groß + Cette grammaire est trop grande + Tämä kielioppi on liian suuri + This grammar is too big + Denne grammatik er for stor +</PRE> +<P> +This method shows the unambiguous ground resource functions and not +the overloaded ones. It uses a precompiled grammar package of the GFCM or GFCC +format; see <a href="#chapeight">the eighth chapter</a> for more information on this. +</P> +<P> +Unfortunately, the Russian grammar uses at the moment a different +character encoding than the rest and is therefore not displayed correctly +in a terminal window. 
However, the GF syntax editor does display all +examples correctly --- again, using the ground resource: +</P> +<PRE> + % gfeditor $GF_LIB_PATH/alltenses/langs.gfcm +</PRE> +<P> +When you have constructed the tree, you will see the following screen: +</P> +<P> +<center> +</P> +<P> + <IMG ALIGN="right" SRC="10lang-small.png" BORDER="0" ALT=""> +</P> +<P> +</center> +</P> +<P> +<B>Exercise</B>. Find the resource grammar translations for the following +English phrases (parse in the category <CODE>Phr</CODE>). You can first try to +build the terms manually. +</P> +<P> +<I>every man loves a woman</I> +</P> +<P> +<I>this grammar speaks more than ten languages</I> +</P> +<P> +<I>which languages aren't in the grammar</I> +</P> +<P> +<I>which languages did you want to speak</I> +</P> +<A NAME="toc93"></A> +<H2>An extended Foods grammar</H2> +<P> +<a name="secextended"></a> +</P> +<P> +Now that we know how to find information in the resource grammar, +we can easily extend the <CODE>Foods</CODE> fragment considerably. We shall enable +the following new expressions: +</P> +<UL> +<LI>questions: <I>Is this pizza Italian?</I> <I>Which pizza do you want to eat?</I> +<LI>imperatives: <I>Eat that pizza please!</I> +<LI>denials: <I>These pizzas are not Italian.</I> +<LI>verbs: <I>eat</I>, <I>pay</I> +<LI>guests, in addition to food items: <I>I, you, this lady</I> +</UL> + +<A NAME="toc94"></A> +<H3>Abstract syntax</H3> +<P> +Since we don't want to change the already existing <CODE>Foods</CODE> module, +we build an extension of it, <CODE>ExtFoods</CODE>: +</P> +<PRE> + abstract ExtFoods = Foods ** { + + flags startcat=Move ; + + cat + Move ; -- dialogue move: declarative, question, or imperative + Verb ; -- transitive verb + Guest ; -- guest in restaurant + GuestKind ; -- type of guest + + fun + MAssert : Phrase -> Move ; -- This pizza is warm. + MDeny : Phrase -> Move ; -- This pizza isn't warm. + MAsk : Phrase -> Move ; -- Is this pizza warm? 
+ + PVerb : Guest -> Verb -> Item -> Phrase ; -- we eat this pizza + PVerbWant : Guest -> Verb -> Item -> Phrase ; -- we want to eat this pizza + + WhichVerb : + Kind -> Guest -> Verb -> Move ; -- Which pizza do you eat? + WhichVerbWant : + Kind -> Guest -> Verb -> Move ; -- Which pizza do you want to eat? + WhichIs : Kind -> Quality -> Move ; -- Which wine is Italian? + + Do : Verb -> Item -> Move ; -- Pay this wine! + DoPlease : Verb -> Item -> Move ; -- Pay this wine please! + + I, You, We : Guest ; + + GThis, GThat, GThese, GThose : GuestKind -> Guest ; + + Eat, Drink, Pay : Verb ; + + Lady, Gentleman : GuestKind ; + } +</PRE> +<P> +The concrete syntax is implemented by a functor that extends the +already defined functor <CODE>FoodsI</CODE>. +</P> +<PRE> + incomplete concrete ExtFoodsI of ExtFoods = + FoodsI ** open Syntax, LexFoods in { + + flags lexer=text ; unlexer=text ; +</PRE> +<P> +The flags set up a lexer and unlexer that can deal with sentence-initial +capital letters and proper spacing with punctuation (see <a href="#seclexing">here</a> +for more information on lexers and unlexers). +</P> +<A NAME="toc95"></A> +<H3>Linearization types</H3> +<P> +If we look at the resource documentation, we find several categories +that are above the clause level and can thus host different kinds +of dialogue moves: +</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>Category</TH> +<TH>Explanation</TH> +<TH COLSPAN="2">Example</TH> +</TR> +<TR> +<TD><CODE>Text</CODE></TD> +<TD>text consisting of phrases</TD> +<TD><I>He is here. 
Why?</I></TD> +</TR> +<TR> +<TD><CODE>Phr</CODE></TD> +<TD>phrase in a text</TD> +<TD><I>but be quiet please</I></TD> +</TR> +<TR> +<TD><CODE>Utt</CODE></TD> +<TD>sentence, question, word...</TD> +<TD><I>be quiet</I></TD> +</TR> +<TR> +<TD><CODE>S</CODE></TD> +<TD>declarative sentence</TD> +<TD><I>she lived here</I></TD> +</TR> +<TR> +<TD><CODE>QS</CODE></TD> +<TD>question</TD> +<TD><I>where did she live</I></TD> +</TR> +<TR> +<TD><CODE>Imp</CODE></TD> +<TD>imperative</TD> +<TD><I>look at this</I></TD> +</TR> +<TR> +<TD><CODE>QCl</CODE></TD> +<TD>question clause, with all tenses</TD> +<TD><I>why does she walk</I></TD> +</TR> +</TABLE> + +<P></P> +<P> +We also find that only the category <CODE>Text</CODE> contains punctuation marks. +So we choose this as the linearization type of <CODE>Move</CODE>. The other types +are quite obvious. +</P> +<PRE> + lincat + Move = Text ; + Verb = V2 ; + Guest = NP ; + GuestKind = CN ; +</PRE> +<P> +The category <CODE>V2</CODE> of <B>two-place verbs</B> includes both +<B>transitive verbs</B> that take <B>direct objects</B> (e.g. <I>we watch him</I>) +and verbs that take other kinds of <B>complements</B>, often with +prepositions (<I>we look at him</I>). In a multilingual grammar, it is +not guaranteed that transitive verbs are transitive in all languages, +so the more general notion of two-place verb is more appropriate. +</P> +<A NAME="toc96"></A> +<H3>Linearization rules</H3> +<P> +Now we need to find constructors that combine the new categories in +appropriate ways. 
To form a text from a clause, we first make it into +a sentence with <CODE>mkS</CODE>, and then apply <CODE>mkText</CODE>: +</P> +<PRE> + lin MAssert p = mkText (mkS p) ; +</PRE> +<P> +The function <CODE>mkS</CODE> has in the resource synopsis been given the type +</P> +<PRE> + mkS : (Tense) -> (Ant) -> (Pol) -> Cl -> S +</PRE> +<P> +Parentheses around type names do not make any difference for the GF compiler, +but in the synopsis notation they indicate <B>optionality</B>: any of the +optional arguments can be omitted, and there is an instance of <CODE>mkS</CODE> +available. For each optional type, it uses the <B>default value</B> for that +type, which for the <B>polarity</B> <CODE>Pol</CODE> is positive i.e. unnegated. +To build a negative sentence, we use an explicit polarity constructor: +</P> +<PRE> + MDeny p = mkText (mkS negativePol p) ; +</PRE> +<P> +Of course, we could have used <CODE>positivePol</CODE> in the first rule, instead of +relying on the default. (The types <CODE>Tense</CODE> and <CODE>Ant</CODE> will be explained +<a href="#sectense">here</a>.) +</P> +<P> +Phrases can be made into <B>question sentences</B>, which in turn can be +made into texts in a similar way as sentences; the default +punctuation mark is not the full stop but the question mark. +</P> +<PRE> + MAsk p = mkText (mkQS p) ; +</PRE> +<P> +There is an <CODE>mkCl</CODE> instance that directly builds a clause from a noun phrase, +a two-place verb, and another noun phrase. +</P> +<PRE> + PVerb = mkCl ; +</PRE> +<P> +The auxiliary verb <I>want</I> requires a <B>verb phrase</B> (<CODE>VP</CODE>) as its complement. It +can be built from a two-place verb and its noun phrase complement. +</P> +<PRE> + PVerbWant guest verb item = mkCl guest want_VV (mkVP verb item) ; +</PRE> +<P> +The <B>interrogative determiner</B> (<CODE>IDet</CODE>) <I>which</I> can be combined with +a common noun to form an <B>interrogative phrase</B> (<CODE>IP</CODE>). 
This <CODE>IP</CODE> can then +be used as a subject in a <B>question clause</B> (<CODE>QCl</CODE>), which in turn is +made into a <CODE>QS</CODE> and finally into a <CODE>Text</CODE>. +</P> +<PRE> + WhichIs kind quality = + mkText (mkQS (mkQCl (mkIP whichSg_IDet kind) (mkVP quality))) ; +</PRE> +<P> +When interrogative phrases are used as <I>objects</I>, the resource library +uses a category named <CODE>Slash</CODE> of +objectless sentences. The name comes from the <B>slash categories</B> of the +GPSG grammar formalism +(Gazdar et al. 1985). Slashes can be formed from subjects and two-place verbs, +also with an intervening auxiliary verb. +</P> +<PRE> + WhichVerb kind guest verb = + mkText (mkQS (mkQCl (mkIP whichSg_IDet kind) + (mkSlash guest verb))) ; + WhichVerbWant kind guest verb = + mkText (mkQS (mkQCl (mkIP whichSg_IDet kind) + (mkSlash guest want_VV verb))) ; +</PRE> +<P> +Finally, we form the <B>imperative</B> (<CODE>Imp</CODE>) of a transitive verb +and its object. We make it into a <B>polite</B> form utterance, and finally +into a <CODE>Text</CODE> with an exclamation mark. +</P> +<PRE> + Do verb item = + mkText + (mkPhr (mkUtt politeImpForm (mkImp verb item))) exclMarkPunct ; + DoPlease verb item = + mkText + (mkPhr (mkUtt politeImpForm (mkImp verb item)) please_Voc) + exclMarkPunct ; +</PRE> +<P> +The rest of the concrete syntax is straightforward use of structural words, +</P> +<PRE> + I = mkNP i_Pron ; + You = mkNP youPol_Pron ; + We = mkNP we_Pron ; + GThis = mkNP this_QuantSg ; + GThat = mkNP that_QuantSg ; + GThese = mkNP these_QuantPl ; + GThose = mkNP those_QuantPl ; +</PRE> +<P> +and of the food lexicon, +</P> +<PRE> + Eat = eat_V2 ; + Drink = drink_V2 ; + Pay = pay_V2 ; + Lady = mkCN lady_N ; + Gentleman = mkCN gentleman_N ; + } +</PRE> +<P> +Notice that we have no reason to build an extension of <CODE>LexFoods</CODE>, but we just +add words to the old one. 
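+</P>
+<P>
+The additions to the <CODE>LexFoods</CODE> interface are not spelled out in the text.
+A plausible sketch, using the constant names that appear in the rules above and
+assuming that nouns are declared in the category <CODE>N</CODE> like the earlier
+lexical items, is
+</P>
+<PRE>
+  oper
+    eat_V2 : V2 ;
+    drink_V2 : V2 ;
+    pay_V2 : V2 ;
+    lady_N : N ;
+    gentleman_N : N ;
+</PRE>
+<P>
+Each instance then needs matching definitions. For English they might read as
+follows; this assumes a hypothetical instance <CODE>LexFoodsEng</CODE> that opens
+<CODE>ParadigmsEng</CODE>, and that the overloaded <CODE>mkV</CODE> and <CODE>mkV2</CODE>
+paradigms are available in the library version at hand.
+</P>
+<PRE>
+  oper
+    -- irregular verbs are given with three forms: infinitive, past, participle
+    eat_V2 = mkV2 (mkV "eat" "ate" "eaten") ;
+    drink_V2 = mkV2 (mkV "drink" "drank" "drunk") ;
+    pay_V2 = mkV2 (mkV "pay" "paid" "paid") ;
+    lady_N = mkN "lady" ;
+    -- irregular plural, hence the two-place noun paradigm
+    gentleman_N = mkN "gentleman" "gentlemen" ;
+</PRE>
+<P>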
Since <CODE>LexFoods</CODE> instances are resource modules, +the superfluous definitions that they contain have no effect on the +modules that just <CODE>open</CODE> them, and thus the smaller <CODE>Foods</CODE> grammars +don't suffer from the additions we make. +</P> +<P> +<B>Exercise</B>. Port the <CODE>ExtFoods</CODE> grammars to some new languages, building +on the <CODE>Foods</CODE> implementations from previous sections, and using the functor +defined in this section. +</P> +<A NAME="toc97"></A> +<H2>Tenses</H2> +<P> +<a name="sectense"></a> +</P> +<P> +When compiling the <CODE>ExtFoods</CODE> grammars, we have used the path +</P> +<PRE> + --# -path=.:../foods:present:prelude +</PRE> +<P> +where the library subdirectory <CODE>present</CODE> refers to a restricted version +of the resource that covers only the present tense of verbs and sentences. +Having this version available is motivated by efficiency reasons: tenses +produce in many languages a manifold of forms and combinations, which +multiply the size of the grammar; at the same time, many applications, +both technical ones and spoken dialogues, only need the present tense. +</P> +<P> +But it is easy to change the grammars so that they admit the full set +of tenses. It is enough to change the path to +</P> +<PRE> + --# -path=.:../foods:alltenses:prelude +</PRE> +<P> +and recompile the grammars from source (flag <CODE>-src</CODE>); the libraries are +not recompiled, because their sources cannot be found on the path list. 
+Then it is possible to see all the tenses of +phrases, by using the <CODE>-all</CODE> flag in linearization: +</P> +<PRE> + > gr -cat=Phrase | l -all + This wine is delicious + Is this wine delicious + This wine isn't delicious + Isn't this wine delicious + This wine is not delicious + Is this wine not delicious + This wine has been delicious + Has this wine been delicious + This wine hasn't been delicious + Hasn't this wine been delicious + This wine has not been delicious + Has this wine not been delicious + This wine was delicious + Was this wine delicious + This wine wasn't delicious + Wasn't this wine delicious + This wine was not delicious + Was this wine not delicious + This wine had been delicious + Had this wine been delicious + This wine hadn't been delicious + Hadn't this wine been delicious + This wine had not been delicious + Had this wine not been delicious + This wine will be delicious + Will this wine be delicious + This wine won't be delicious + Won't this wine be delicious + This wine will not be delicious + Will this wine not be delicious + This wine will have been delicious + Will this wine have been delicious + This wine won't have been delicious + Won't this wine have been delicious + This wine will not have been delicious + Will this wine not have been delicious + This wine would be delicious + Would this wine be delicious + This wine wouldn't be delicious + Wouldn't this wine be delicious + This wine would not be delicious + Would this wine not be delicious + This wine would have been delicious + Would this wine have been delicious + This wine wouldn't have been delicious + Wouldn't this wine have been delicious + This wine would not have been delicious + Would this wine not have been delicious +</PRE> +<P> +In addition to tenses, the linearization writes all parametric +variations --- polarity and word order (direct vs. inverted) --- as +well as the variation between contracted and full negation words. 
+Of course, the list is even longer in languages that have more +tenses and moods, e.g. the Romance languages. +</P> +<P> +In the <CODE>ExtFoods</CODE> grammar, tenses never find their way to the +top level of <CODE>Move</CODE>s. Therefore it is useless to carry around +the clause and verb tenses given in the <CODE>alltenses</CODE> set of libraries. +But with the library, it is easy to add tenses to <CODE>Move</CODE>s. For +instance, one can add the rules +</P> +<PRE> + fun MAssertFut : Phrase -> Move ; -- I will pay this wine + fun MAssertPastPerf : Phrase -> Move ; -- I had paid that wine + lin MAssertFut p = mkText (mkS futureTense p) ; + lin MAssertPastPerf p = mkText (mkS pastTense anteriorAnt p) ; +</PRE> +<P> +Comparison with <CODE>MAssert</CODE> above shows that the absence of the tense +and anteriority features defaults to present simultaneous tenses. +</P> +<P> +<B>Exercise</B>. Measure the size of the context-free grammar corresponding to +some concrete syntax of <CODE>ExtFoods</CODE> with all tenses. +You can do this by printing the grammar in the context-free format +(<CODE>print_grammar -printer=cfg</CODE>) and counting the lines. +</P> +<A NAME="toc98"></A> +<H2>Summary of GF language features</H2> +<A NAME="toc99"></A> +<H3>Interfaces and instances</H3> +<P> +An <B>interface module</B> (<CODE>interface</CODE> <I>I</I>) is like a <CODE>resource</CODE> module, +the difference being that it does not need to give definitions in +its <CODE>oper</CODE> and <CODE>param</CODE> judgements. Definitions are, however, +allowed, and they may use constants that appear undefined in the +module. 
For example, here is an interface for predication, which +is parametrized on NP case and agreement features, and on the constituent +order: +</P> +<PRE> + interface Predication = { + param + Case ; + Agreement ; + oper + subject : Case ; + object : Case ; + order : (verb,subj,obj : Str) -> Str ; + + NP : Type = {s : Case => Str ; a : Agreement} ; + TV : Type = {s : Agreement => Str} ; + + sentence : TV -> NP -> NP -> {s : Str} = \verb,subj,obj -> { + s = order (verb.s ! subj.a) (subj.s ! subject) (obj.s ! object) + } ; + } +</PRE> +<P> +An <B>instance module</B> (<CODE>instance</CODE> <I>J</I> <CODE>of</CODE> <I>I</I>) is also like a +<CODE>resource</CODE>, but it is compiled in union with the interface that it +is an instance <CODE>of</CODE>. This means that the definitions given in the +instance are type-checked with respect to the types given in the +interface. Moreover, overriding types or definitions given in the interface +is not allowed. But it is legal for an instance to contain definitions +not included in the corresponding interface. Here is an instance of +<CODE>Predication</CODE>, suitable for languages like English. +</P> +<PRE> + instance PredicationSimpleSVO of Predication = { + param + Case = Nom | Acc | Gen ; + Agreement = Agr Number Person ; + + -- two new types + Number = Sg | Pl ; + Person = P1 | P2 | P3 ; + + oper + subject = Nom ; + object = Acc ; + order = \verb,subj,obj -> subj ++ verb ++ obj ; + + -- the rest of the definitions don't need repetition + } +</PRE> +<P></P> +<A NAME="toc100"></A> +<H3>Grammar reuse</H3> +<P> +<a name="seclock"></a> +</P> +<P> +Abstract syntax modules can be used like interfaces, and concrete syntaxes +as their instances. The following translations then take place: +</P> +<PRE> + cat C ---> oper C : Type + + fun f : A ---> oper f : A* + + lincat C = T ---> oper C : Type = T' + + lin f = t ---> oper f : A* = t' +</PRE> +<P> +This translation is called <B>grammar reuse</B>.
It uses a homomorphism +from abstract types and terms to the concrete types and terms. For the +sake of more type safety, the types are not exactly the same. Currently +(GF 2.8), the type <I>T'</I> formed from the linearization type <I>T</I> of +a category <I>C</I> is <I>T</I> extended with a dummy <B>lock field</B>. Thus +</P> +<PRE> + lincat C = T ---> oper C = T ** {lock_C : {}} +</PRE> +<P> +and the linearization terms are lifted correspondingly. The user of +a GF library should never see any lock fields; when they appear in +the compiler's warnings, they indicate that some library category is +constructed improperly by a user program. +</P> +<A NAME="toc101"></A> +<H3>Functors</H3> +<P> +A <B>parametrized module</B>, aka. an <B>incomplete module</B>, or a +<B>functor</B>, is any module that <CODE>open</CODE>s an <CODE>interface</CODE> (or +an <CODE>abstract</CODE>). Several interfaces may be opened by one +functor. The module header must be prefixed by the word <CODE>incomplete</CODE>. +Here is a typical example, using the resource <CODE>Syntax</CODE> and +a domain specific lexicon: +</P> +<PRE> + incomplete concrete DomainI of Domain = open Syntax, Lex in {...} ; +</PRE> +<P> +A <B>functor instantiation</B> is a module that inherits a functor and +provides an instance to each of its open interfaces. Here is an example: +</P> +<PRE> + concrete DomainSwe of Domain = DomainI with + (Syntax = SyntaxSwe), + (Lex = LexSwe) ; +</PRE> +<P></P> +<A NAME="toc102"></A> +<H3>Restricted inheritance</H3> +<P> +A module of any type can make <B>restricted inheritance</B>, which is +either exclusion or inclusion: +</P> +<PRE> + module M = A[f,g], B-[k] ** ... +</PRE> +<P> +A concrete syntax given to an abstract syntax that uses restricted inheritance +must make the corresponding restrictions. In addition, the concrete syntax can +make its own restrictions in order to redefine inherited linearization types and +rules. 
+</P> +<P> +Overriding old definitions without explicit restrictions is not allowed. +</P> +<A NAME="toc103"></A> +<H1>Refining semantics in abstract syntax</H1> +<P> +<a name="chapsix"></a> +</P> +<P> +While the concrete syntax constructs of GF have been already +covered, there is much more that can be done in the abstract +syntax. The techniques of <B>dependent types</B> and +<B>higher order abstract syntax</B> are introduced in this chapter, +which thereby concludes the presentation of the GF language. +</P> +<P> +Many of the examples in this chapter are somewhat less close to +applications than the ones shown before. Moreover, the tools for +embedded grammars in <a href="#chapeight">the eighth chapter</a> do not yet fully support dependent +types and higher order abstract syntax. +</P> +<A NAME="toc104"></A> +<H2>GF as a logical framework</H2> +<P> +In this chapter, we will show how +to encode advanced semantic concepts in an abstract syntax. +We use concepts inherited from <B>type theory</B>. Type theory +is the basis of many systems known as <B>logical frameworks</B>, which are +used for representing mathematical theorems and their proofs on a computer. +In fact, GF has a logical framework as its proper part: +this part is the abstract syntax. +</P> +<P> +In a logical framework, the formalization of a mathematical theory +is a set of type and function declarations. The following is an example +of such a theory, represented as an <CODE>abstract</CODE> module in GF. +</P> +<PRE> + abstract Arithm = { + cat + Prop ; -- proposition + Nat ; -- natural number + fun + Zero : Nat ; -- 0 + Succ : Nat -> Nat ; -- the successor of x + Even : Nat -> Prop ; -- x is even + And : Prop -> Prop -> Prop ; -- A and B + } +</PRE> +<P> +This example does not show any new type-theoretical constructs yet, but +it could nevertheless be used as a part of a proof system for arithmetic. +</P> +<P> +<B>Exercise</B>. 
Give a concrete syntax of <CODE>Arithm</CODE>, preferably +by using the resource library. +</P> +<A NAME="toc105"></A> +<H2>Dependent types</H2> +<P> +<a name="secsmarthouse"></a> +</P> +<P> +<B>Dependent types</B> are a characteristic feature of GF, +inherited from the <B>constructive type theory</B> of Martin-Löf and +distinguishing GF from most other grammar formalisms and +functional programming languages. +</P> +<P> +Dependent types can be used for stating stronger +<B>conditions of well-formedness</B> than ordinary types. +A simple example is a "smart house" system, which +defines voice commands for household appliances. This example +is borrowed from the +Regulus Book +(Rayner & al. 2006). +</P> +<P> +One who enters a smart house can use a spoken <CODE>Command</CODE> to dim lights, switch +on the fan, etc. For <CODE>Device</CODE>s of each <CODE>Kind</CODE>, there is a set of +<CODE>Action</CODE>s that can be performed on them; thus one can dim the lights but + not the fan, for example. These dependencies can be expressed +by making the type <CODE>Action</CODE> dependent on <CODE>Kind</CODE>. We express these +dependencies in <CODE>cat</CODE> declarations by attaching argument types to +categories: +</P> +<PRE> + cat + Command ; + Kind ; + Device Kind ; -- argument type Kind + Action Kind ; +</PRE> +<P> +The crucial use of the dependencies is made in the rule for forming commands: +</P> +<PRE> + fun CAction : (k : Kind) -> Action k -> Device k -> Command ; +</PRE> +<P> +In other words: an action and a device can be combined into a command only +if they are of the same <CODE>Kind</CODE> <CODE>k</CODE>. 
If we have the functions +</P> +<PRE> + DKindOne : (k : Kind) -> Device k ; -- the light + + light, fan : Kind ; + dim : Action light ; +</PRE> +<P> +we can form the syntax tree +</P> +<PRE> + CAction light dim (DKindOne light) +</PRE> +<P> +but we cannot form the trees +</P> +<PRE> + CAction light dim (DKindOne fan) + CAction fan dim (DKindOne light) + CAction fan dim (DKindOne fan) +</PRE> +<P> +Linearization rules are written as usual: the concrete syntax does not +know if a category is a dependent type. In English, one could write as follows: +</P> +<PRE> + lincat Action = {s : Str} ; + lin CAction _ act dev = {s = act.s ++ dev.s} ; +</PRE> +<P> +Notice that the argument for <CODE>Kind</CODE> does not appear in the linearization; +therefore it is good practice to make this clear by +using a wild card for it, rather than a real +variable. +As we will show, +the type checker can reconstruct the kind from the <CODE>dev</CODE> argument. +</P> +<P> +Parsing with dependent types is performed in two phases: +</P> +<OL> +<LI>context-free parsing +<LI>filtering through type checker +</OL> + +<P> +If you just parse in the usual way, you don't enter the second phase, and +the <CODE>kind</CODE> argument is not found: +</P> +<PRE> + > parse "dim the light" + CAction ? dim (DKindOne light) +</PRE> +<P> +Moreover, type-incorrect commands are not rejected: +</P> +<PRE> + > parse "dim the fan" + CAction ? dim (DKindOne fan) +</PRE> +<P> +The question mark <CODE>?</CODE> is a <B>metavariable</B>, and is returned by the parser +for any subtree that is suppressed by a linearization rule. +These are exactly the same kind of metavariables as were used <a href="#secediting">here</a> +to mark incomplete parts of trees in the syntax editor. +</P> +<P> +To get rid of metavariables, we must feed the parse result into the +second phase of <B>solving</B> them. The <CODE>solve</CODE> process uses the dependent +type checker to restore the values of the metavariables. 
It is invoked by +the command <CODE>put_tree = pt</CODE> with the flag <CODE>-transform=solve</CODE>: +</P> +<PRE> + > parse "dim the light" | put_tree -transform=solve + CAction light dim (DKindOne light) +</PRE> +<P> +The <CODE>solve</CODE> process may fail, in which case no tree is returned: +</P> +<PRE> + > parse "dim the fan" | put_tree -transform=solve + no tree found +</PRE> +<P></P> +<P> +<B>Exercise</B>. Write an abstract syntax module with above contents +and an appropriate English concrete syntax. Try to parse the commands +<I>dim the light</I> and <I>dim the fan</I>, with and without <CODE>solve</CODE> filtering. +</P> +<P> +<B>Exercise</B>. Perform random and exhaustive generation, with and without +<CODE>solve</CODE> filtering. +</P> +<P> +<B>Exercise</B>. Add some device kinds and actions to the grammar. +</P> +<A NAME="toc106"></A> +<H2>Polymorphism</H2> +<P> +<a name="secpolymorphic"></a> +</P> +<P> +Sometimes an action can be performed on all kinds of devices. It would be +possible to introduce separate <CODE>fun</CODE> constants for each kind-action pair, +but this would be tedious. Instead, one can use <B>polymorphic</B> actions, +i.e. actions that take a <CODE>Kind</CODE> as an argument and produce an <CODE>Action</CODE> +for that <CODE>Kind</CODE>: +</P> +<PRE> + fun switchOn, switchOff : (k : Kind) -> Action k ; +</PRE> +<P> +Functions that are not polymorphic are <B>monomorphic</B>. However, the +dichotomy into monomorphism and full polymorphism is not always sufficient +for good semantic modelling: very typically, some actions are defined +for a proper subset of devices, but not just one. For instance, both doors and +windows can be opened, whereas lights cannot. +We will return to this problem by introducing the +concept of <B>restricted polymorphism</B> later, +after a section on proof objects. +</P> +<P> +<B>Exercise</B>. 
The grammar <CODE>ExtFoods</CODE> <a href="#secextended">here</a> permits the +formation of phrases such as <I>we drink this fish</I> and <I>we eat this wine</I>. +A way to prevent them is to distinguish between eatable and drinkable food items. +Another, related problem is that there is some duplicated code +due to a category distinction between guests and food items, for instance, +two constructors for the determiner <I>this</I>. This problem can also +be solved by dependent types. Rewrite the abstract syntax in <CODE>Foods</CODE> and +<CODE>ExtFoods</CODE> by using such a type system, and also update the concrete syntaxes. +If you do this right, you only have to change the functor modules +<CODE>FoodsI</CODE> and <CODE>ExtFoodsI</CODE> in the concrete syntax. +</P> +<A NAME="toc107"></A> +<H3>Digression: dependent types in concrete syntax</H3> +<P> +The <B>functional fragment</B> of GF +terms and types comprises function types, applications, lambda +abstracts, constants, and variables. This fragment is the same in +abstract and concrete syntax. In particular, +dependent types are also available in concrete syntax. +We have not made use of them yet, +but we will now look at one example of how they +can be used. +</P> +<P> +Those readers who are familiar with functional programming languages +like ML and Haskell may already have missed <B>polymorphic</B> +functions. For instance, Haskell programmers have access to +the functions +</P> +<PRE> + const :: a -> b -> a + const c _ = c + + flip :: (a -> b -> c) -> b -> a -> c + flip f y x = f x y +</PRE> +<P> +which can be used for any given types <CODE>a</CODE>, <CODE>b</CODE>, and <CODE>c</CODE>. +</P> +<P> +The GF counterparts of polymorphic functions are <B>monomorphic</B> +functions with explicit <B>type variables</B> --- a technique that we already +used in abstract syntax for modelling actions that can be performed +on all kinds of devices.
Thus the above definitions can be written +</P> +<PRE> + oper const : (a,b : Type) -> a -> b -> a = + \_,_,c,_ -> c ; + + oper flip : (a,b,c : Type) -> (a -> b -> c) -> b -> a -> c = + \_,_,_,f,x,y -> f y x ; +</PRE> +<P> +When the operations are used, the type checker requires +them to be equipped with all their arguments, including the type arguments; this may be a nuisance +for a Haskell or ML programmer. Such operations have not been used very much, +except in the <CODE>Coordination</CODE> module of the resource library. +</P> +<A NAME="toc108"></A> +<H2>Proof objects</H2> +<P> +Perhaps the most well-known idea in constructive type theory is +the <B>Curry-Howard isomorphism</B>, also known as the +<B>propositions as types principle</B>. Its earliest formulations +were attempts to give semantics to the logical systems of +propositional and predicate calculus. In this section, we will consider +a more elementary example, showing how the notion of proof is useful +outside mathematics, as well. +</P> +<P> +We use the already shown category of unary (also known as Peano-style) +natural numbers: +</P> +<PRE> + cat Nat ; + fun Zero : Nat ; + fun Succ : Nat -> Nat ; +</PRE> +<P> +The <B>successor function</B> <CODE>Succ</CODE> generates an infinite +sequence of natural numbers, beginning from <CODE>Zero</CODE>. +</P> +<P> +We then define what it means for a number <I>x</I> to be <I>less than</I> +a number <I>y</I>. Our definition is based on two axioms: +</P> +<UL> +<LI><CODE>Zero</CODE> is less than <CODE>Succ</CODE> <I>y</I> for any <I>y</I>. +<LI>If <I>x</I> is less than <I>y</I>, then <CODE>Succ</CODE> <I>x</I> is less than <CODE>Succ</CODE> <I>y</I>.
+</UL> + +<P> +The most straightforward way of expressing these axioms in type theory +is with a dependent type <CODE>Less</CODE> <I>x y</I>, and two functions constructing +its objects: +</P> +<PRE> + cat Less Nat Nat ; + fun lessZ : (y : Nat) -> Less Zero (Succ y) ; + fun lessS : (x,y : Nat) -> Less x y -> Less (Succ x) (Succ y) ; +</PRE> +<P> +Objects formed by <CODE>lessZ</CODE> and <CODE>lessS</CODE> are +called <B>proof objects</B>: they establish the truth of certain +mathematical propositions. +For instance, the fact that 2 is less than +4 has the proof object +</P> +<PRE> + lessS (Succ Zero) (Succ (Succ (Succ Zero))) + (lessS Zero (Succ (Succ Zero)) (lessZ (Succ Zero))) +</PRE> +<P> +whose type is +</P> +<PRE> + Less (Succ (Succ Zero)) (Succ (Succ (Succ (Succ Zero)))) +</PRE> +<P> +which is the formalization of the proposition that 2 is less than 4. +</P> +<P> +GF grammars can be used to provide a <B>semantic control</B> of +well-formedness of expressions. We have already seen an example of this: +the grammar of well-formed actions on household devices. By introducing proof objects +we have now added an even more powerful technique for expressing semantic conditions. +</P> +<P> +A simple example of the use of proof objects is the definition of +well-formed <I>time spans</I>: a time span is expected to be from an earlier to +a later time: +</P> +<PRE> + from 3 to 8 +</PRE> +<P> +is thus well-formed, whereas +</P> +<PRE> + from 8 to 3 +</PRE> +<P> +is not. The following rules for spans impose this condition +by using the <CODE>Less</CODE> predicate: +</P> +<PRE> + cat Span ; + fun span : (m,n : Nat) -> Less m n -> Span ; +</PRE> +<P></P> +<P> +<B>Exercise</B>. Write an abstract and concrete syntax with the +concepts of this section, and experiment with it in GF. +</P> +<P> +<B>Exercise</B>. Define the notions of "even" and "odd" in terms +of proof objects. <B>Hint</B>.
You need one function for proving +that 0 is even, and two other functions for propagating the +properties. +</P> +<A NAME="toc109"></A> +<H3>Proof-carrying documents</H3> +<P> +Another possible application of proof objects is <B>proof-carrying documents</B>: +to be semantically well-formed, the abstract syntax of a document must contain a proof +of some property, although the proof is not shown in the concrete document. +Think, for instance, of small documents describing flight connections: +</P> +<P> +<I>To fly from Gothenburg to Prague, first take LH3043 to Frankfurt, then OK0537 to Prague.</I> +</P> +<P> +The well-formedness of this text is partly expressible by dependent typing: +</P> +<PRE> + cat + City ; + Flight City City ; + fun + Gothenburg, Frankfurt, Prague : City ; + LH3043 : Flight Gothenburg Frankfurt ; + OK0537 : Flight Frankfurt Prague ; +</PRE> +<P> +This rules out texts saying <I>take OK0537 from Gothenburg to Prague</I>. +However, there is a +further condition saying that it must be possible to +change from LH3043 to OK0537 in Frankfurt. +This can be modelled as a proof object of a suitable type, +which is required by the constructor +that connects flights. 
+</P> +<PRE> + cat + IsPossible (x,y,z : City)(Flight x y)(Flight y z) ; + fun + Connect : (x,y,z : City) -> + (u : Flight x y) -> (v : Flight y z) -> + IsPossible x y z u v -> Flight x z ; +</PRE> +<P></P> +<A NAME="toc110"></A> +<H2>Restricted polymorphism</H2> +<P> +In the first version of the smart house grammar <CODE>Smart</CODE>, +all Actions were either +</P> +<UL> +<LI><B>monomorphic</B>: defined for one Kind +<LI><B>polymorphic</B>: defined for all Kinds +</UL> + +<P> +To make this scale up for new Kinds, we can refine this to +<B>restricted polymorphism</B>: defined for Kinds of a certain <B>class</B>. +</P> +<P> +The notion of class can be expressed in abstract syntax +by using the Curry-Howard isomorphism as follows: +</P> +<UL> +<LI>a class is a <B>predicate</B> of Kinds --- i.e. a type depending on Kinds +<LI>a Kind is in a class if there is a proof object of this type +</UL> + +<P> +Here is an example with switching and dimming. The classes are called +<CODE>Switchable</CODE> and <CODE>Dimmable</CODE>. +</P> +<PRE> + cat + Switchable Kind ; + Dimmable Kind ; + fun + switchable_light : Switchable light ; + switchable_fan : Switchable fan ; + dimmable_light : Dimmable light ; + + switchOn : (k : Kind) -> Switchable k -> Action k ; + dim : (k : Kind) -> Dimmable k -> Action k ; +</PRE> +<P> +One advantage of this formalization is that classes for new +actions can be added incrementally. +</P> +<P> +<B>Exercise</B>. Write a new version of the <CODE>Smart</CODE> grammar with +classes, and test it in GF. +</P> +<P> +<B>Exercise</B>. Add some actions, kinds, and classes to the grammar. +Try to port the grammar to a new language. You will probably find +out that restricted polymorphism works differently in different languages. +For instance, in Finnish not only doors but also TVs and radios +can be "opened", which means switching them on.
+</P> +<A NAME="toc111"></A> +<H2>Variable bindings</H2> +<P> +<a name="secbinding"></a> +</P> +<P> +Mathematical notation and programming languages have +expressions that <B>bind</B> variables. For instance, +a universally quantified proposition +</P> +<PRE> + (All x)B(x) +</PRE> +<P> +consists of the <B>binding</B> <CODE>(All x)</CODE> of the variable <CODE>x</CODE>, +and the <B>body</B> <CODE>B(x)</CODE>, where the variable <CODE>x</CODE> can have +<B>bound occurrences</B>. +</P> +<P> +Variable bindings appear in informal mathematical language as well, for +instance, +</P> +<PRE> + for all x, x is equal to x + + the function that for any numbers x and y returns the maximum of x+y + and x*y + + Let x be a natural number. Assume that x is even. Then x + 3 is odd. +</PRE> +<P> +In type theory, variable-binding expression forms can be formalized +as functions that take functions as arguments. The universal +quantifier is defined +</P> +<PRE> + fun All : (Ind -> Prop) -> Prop +</PRE> +<P> +where <CODE>Ind</CODE> is the type of individuals and <CODE>Prop</CODE> +the type of propositions. If we have, for instance, the equality predicate +</P> +<PRE> + fun Eq : Ind -> Ind -> Prop +</PRE> +<P> +we may form the tree +</P> +<PRE> + All (\x -> Eq x x) +</PRE> +<P> +which corresponds to the ordinary notation +</P> +<PRE> + (All x)(x = x). +</PRE> +<P> +An abstract syntax where trees have functions as arguments, as in +the two examples above, has turned out to be precisely the right +thing for the semantics and computer implementation of +variable-binding expressions. The advantage lies in the fact that +only one variable-binding expression form is needed, the lambda abstract +<CODE>\x -> b</CODE>, and all other bindings can be reduced to it. +This makes it easier to implement mathematical theories and reason +about them, since variable binding is tricky to implement and +to reason about.
The idea of using functions as arguments of +syntactic constructors is known as <B>higher-order abstract syntax</B>. +</P> +<P> +The question now arises: how to define linearization rules +for variable-binding expressions? +Let us first consider universal quantification, +</P> +<PRE> + fun All : (Ind -> Prop) -> Prop +</PRE> +<P> +In GF, we write +</P> +<PRE> + lin All B = {s = "(" ++ "All" ++ B.$0 ++ ")" ++ B.s} +</PRE> +<P> +to obtain the form shown above. +This linearization rule brings in a new GF concept --- the <CODE>$0</CODE> +field of <CODE>B</CODE> containing a bound variable symbol. +The general rule is that, if an argument type of a function is +itself a function type <CODE>A -> C</CODE>, the linearization type of +this argument is the linearization type of <CODE>C</CODE> +together with a new field <CODE>$0 : Str</CODE>. In the linearization rule +for <CODE>All</CODE>, the argument <CODE>B</CODE> thus has the linearization +type +</P> +<PRE> + {$0 : Str ; s : Str}, +</PRE> +<P> +since the linearization type of <CODE>Prop</CODE> is +</P> +<PRE> + {s : Str} +</PRE> +<P> +In other words, the linearization of a function +consists of a linearization of the body together with a +field for a linearization of the bound variable. +Those familiar with type theory or lambda calculus +should notice that GF requires trees to be in +<B>eta-expanded</B> form in order for this to make sense: +for any function of type +</P> +<PRE> + A -> B +</PRE> +<P> +an eta-expanded syntax tree has the form +</P> +<PRE> + \x -> b +</PRE> +<P> +where <CODE>b : B</CODE> under the assumption <CODE>x : A</CODE>. +It is in this form that an expression can be analysed +as having a bound variable and a body, which can be put into +a linearization record. 
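+</P>
+<P>
+As a small illustration of eta expansion, assume a unary predicate
+<CODE>Small</CODE>. This name is hypothetical and is introduced only for this sketch;
+it is not part of any grammar in this text. The predicate alone already has the
+type <CODE>Ind -> Prop</CODE>, but as an argument of <CODE>All</CODE> it must be written as
+a lambda abstract, so that a bound variable and a body can be told apart:
+</P>
+<PRE>
+  fun Small : Ind -> Prop ;   -- hypothetical predicate, assumed for this sketch
+
+  All Small                   -- well-typed, but not eta-expanded
+  All (\x -> Small x)         -- eta-expanded: bound variable x, body Small x
+</PRE>
+<P>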
+</P> +<P> +Given the linearization rule +</P> +<PRE> + lin Eq a b = {s = "(" ++ a.s ++ "=" ++ b.s ++ ")"} +</PRE> +<P> +the linearization of +</P> +<PRE> + \x -> Eq x x +</PRE> +<P> +is the record +</P> +<PRE> + {$0 = "x" ; s = ["( x = x )"]} +</PRE> +<P> +Thus we can compute the linearization of the formula, +</P> +<PRE> + All (\x -> Eq x x) --> {s = ["( All x ) ( x = x )"]}. +</PRE> +<P> +But how did we get the linearization of the variable <CODE>x</CODE> +into the string <CODE>"x"</CODE>? GF grammars have no rules for +this: it is just hard-wired in GF that variable symbols are +linearized into the same strings that represent them in +the print-out of the abstract syntax. +</P> +<P> +To be able to <I>parse</I> variable symbols, however, GF needs to know what +to look for (instead of e.g. trying to parse <I>any</I> +string as a variable). What strings are parsed as variable symbols +is defined in the lexical analysis part of GF parsing: +</P> +<PRE> + > p -cat=Prop -lexer=codevars "(All x)(x = x)" + All (\x -> Eq x x) +</PRE> +<P> +(see more details on lexers <a href="#seclexing">here</a>). If several variables are bound in the +same argument, the labels are <CODE>$0, $1, $2</CODE>, etc. +</P> +<P> +<B>Exercise</B>. Write an abstract syntax of the whole +<B>predicate calculus</B>, with the +<B>connectives</B> "and", "or", "implies", and "not", and the +<B>quantifiers</B> "exists" and "for all". Use higher-order functions +to guarantee that unbound variables do not occur. +</P> +<P> +<B>Exercise</B>. Write a concrete syntax for your favourite +notation of predicate calculus. Use LaTeX as target language +if you want nice output. You can also try producing boolean +expressions of some programming language. Use as many parentheses as you need to +guarantee non-ambiguity.
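+</P>
+<P>
+As a starting point for these exercises, the rules shown in this section can be
+collected into a pair of modules. The following is only a sketch: the module names
+<CODE>Logic</CODE> and <CODE>LogicSym</CODE> are assumptions made here, each module is
+expected to live in a file of its own, and nothing beyond the two functions
+<CODE>All</CODE> and <CODE>Eq</CODE> discussed above is included.
+</P>
+<PRE>
+  -- hypothetical module names; only All and Eq come from the text above
+  abstract Logic = {
+    cat
+      Prop ;   -- proposition
+      Ind ;    -- individual
+    fun
+      All : (Ind -> Prop) -> Prop ;   -- binds one variable
+      Eq  : Ind -> Ind -> Prop ;
+  }
+
+  concrete LogicSym of Logic = {
+    lincat
+      Prop = {s : Str} ;
+      Ind  = {s : Str} ;
+    lin
+      -- B has the fields $0 (bound variable symbol) and s (body)
+      All B  = {s = "(" ++ "All" ++ B.$0 ++ ")" ++ B.s} ;
+      Eq a b = {s = "(" ++ a.s ++ "=" ++ b.s ++ ")"} ;
+  }
+</PRE>
+<P>
+Since <CODE>Ind</CODE> has no constants in this sketch, the only individuals that can
+occur in a closed tree are bound variables, as in <CODE>All (\x -> Eq x x)</CODE>.
+</P>
+<P>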
+</P> +<A NAME="toc112"></A> +<H2>Semantic definitions</H2> +<P> +<a name="secdefdef"></a> +</P> +<P> +Just like any functional programming language, abstract syntax in +GF has declarations of functions, telling what the type of a function is. +But we have not yet shown how to <B>compute</B> +these functions: all we can do is provide them with arguments +and linearize the resulting terms. +Since our main interest is the well-formedness of expressions, +this has not yet bothered +us very much. As we will see, however, computation does play a role +even in the well-formedness of expressions when dependent types are +present. +</P> +<P> +GF has a form of judgement for <B>semantic definitions</B>, +marked by the key word <CODE>def</CODE>. At its simplest, it is just +the definition of one constant, e.g. +</P> +<PRE> + fun one : Nat ; + def one = Succ Zero ; +</PRE> +<P> +Notice that a <CODE>def</CODE> definition can only be given to names declared by +<CODE>fun</CODE> judgements in the same module; it is not possible to define +an inherited name. +</P> +<P> +We can also define a function with arguments, +</P> +<PRE> + fun twice : Nat -> Nat ; + def twice x = plus x x ; +</PRE> +<P> +which is still a special case of the most general notion of +definition, that of a group of <B>pattern equations</B>: +</P> +<PRE> + fun plus : Nat -> Nat -> Nat ; + def + plus x Zero = x ; + plus x (Succ y) = Succ (plus x y) ; +</PRE> +<P> +To compute a term is, as in functional programming languages, +simply to follow a chain of reductions until no definition +can be applied. For instance, we compute +</P> +<PRE> + plus one one --> + plus (Succ Zero) (Succ Zero) --> + Succ (plus (Succ Zero) Zero) --> + Succ (Succ Zero) +</PRE> +<P> +Computation in GF is performed with the <CODE>pt</CODE> command and the +<CODE>compute</CODE> transformation, e.g.
+</P> +<PRE> + > p -tr "1 + 1" | pt -transform=compute -tr | l + plus one one + Succ (Succ Zero) + s(s(0)) +</PRE> +<P></P> +<P> +The <CODE>def</CODE> definitions of a grammar induce a notion of +<B>definitional equality</B> among trees: two trees are +definitionally equal if they compute into the same tree. +Thus, trivially, all trees in a chain of computation +(such as the one above) are definitionally equal to each other. +In general, there can be infinitely many definitionally equal trees. +</P> +<P> +An important property of definitional equality is that it is +<B>extensional</B>, i.e. has to do with the sameness of semantic value. +Linearization, on the other hand, is an <B>intensional</B> operation, +i.e. has to do with the sameness of expression. This means that +<CODE>def</CODE> definitions are <I>not</I> evaluated as linearization steps. +Intensionality is a crucial property of linearization, since we want +to use it for things like tracing a chain of evaluation. +For instance, each of the steps of the computation above +has a different linearization into standard arithmetic notation: +</P> +<PRE> + 1 + 1 + s(0) + s(0) + s(s(0) + 0) + s(s(0)) +</PRE> +<P> +In most programming languages, the operations that can be performed on +expressions are extensional, i.e. give equal values to equal arguments. +But GF has both extensional and intensional operations. +Type checking is extensional: +in the type theory with dependent types, types may depend on terms, +and types depending on definitionally equal terms are +equal types. For instance, +</P> +<PRE> + Less Zero one + Less Zero (Succ Zero) +</PRE> +<P> +are equal types. Hence, any tree that type checks as a proof that +0 is less than 1 also type checks as a proof that 0 is less than the successor of 0. +(Recall, in this connection, that the +arguments a category depends on never play any role +in the linearization of trees of that category, +nor in the definition of the linearization type.)
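+</P>
+<P>
+To experiment with the difference between computation and linearization, the
+definitions of this section can be collected into one grammar. The following is a
+minimal sketch under stated assumptions: the module names <CODE>Arith</CODE> and
+<CODE>ArithStd</CODE> are hypothetical, each module is expected to live in a file of
+its own, and the <CODE>data</CODE> judgement (explained later in this section) is
+included so that <CODE>Zero</CODE> and <CODE>Succ</CODE> can be matched in the
+<CODE>def</CODE> equations.
+</P>
+<PRE>
+  -- hypothetical module names, assumed for this sketch
+  abstract Arith = {
+    cat Nat ;
+    fun
+      Zero : Nat ;
+      Succ : Nat -> Nat ;
+      one  : Nat ;
+      plus : Nat -> Nat -> Nat ;
+    data Nat = Zero | Succ ;   -- constructors, needed for matching in def
+    def
+      one = Succ Zero ;
+      plus x Zero = x ;
+      plus x (Succ y) = Succ (plus x y) ;
+  }
+
+  concrete ArithStd of Arith = {
+    lincat Nat = {s : Str} ;
+    lin
+      Zero = {s = "0"} ;
+      Succ x = {s = "s" ++ "(" ++ x.s ++ ")"} ;
+      one = {s = "1"} ;
+      plus x y = {s = x.s ++ "+" ++ y.s} ;   -- no def equation is applied here
+  }
+</PRE>
+<P>
+With such a grammar, the trees <CODE>plus one one</CODE> and <CODE>Succ (Succ Zero)</CODE>
+are definitionally equal, yet the <CODE>lin</CODE> rules above give them different token
+sequences. How the tokens are spaced in the output depends on the lexer and unlexer
+settings, which this sketch leaves open.
+</P>
+<P>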
+</P> +<P> +When pattern matching is performed with <CODE>def</CODE> equations, it is +crucial to distinguish between <B>constructors</B> and other functions +(cf. <a href="#secmatching">here</a> on pattern matching in concrete syntax). +GF has a judgement form <CODE>data</CODE> to tell that a category has +certain functions as constructors: +</P> +<PRE> + data Nat = Succ | Zero ; +</PRE> +<P> +Unlike in Haskell and ML, new constructors can be added to +a type with new <CODE>data</CODE> judgements. The type signatures of constructors +are given separately, in ordinary <CODE>fun</CODE> judgements. +One can also write directly +</P> +<PRE> + data Succ : Nat -> Nat ; +</PRE> +<P> +which is syntactic sugar for the pair of judgements +</P> +<PRE> + fun Succ : Nat -> Nat ; + data Nat = Succ ; +</PRE> +<P> +If we did not mark <CODE>Zero</CODE> as <CODE>data</CODE>, the definition +</P> +<PRE> + fun isZero : Nat -> Bool ; + def isZero Zero = True ; + def isZero _ = False ; +</PRE> +<P> +would return <CODE>True</CODE> for all arguments, because the pattern <CODE>Zero</CODE> +would be treated as a variable and it would hence match all values! +This is a common pitfall in GF. +</P> +<P> +<B>Exercise</B>. Implement an interpreter of a small functional programming +language with natural numbers, lists, pairs, lambdas, etc. Use higher-order +abstract syntax with semantic definitions. As the object language, use +your favourite programming language. +</P> +<A NAME="toc113"></A> +<H2>Summary of GF language features</H2> +<A NAME="toc114"></A> +<H3>Judgements</H3> +<P> +We have generalized the <CODE>cat</CODE> judgement form and introduced two new forms +for abstract syntax: +</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>form</TH> +<TH COLSPAN="2">reading</TH> +</TR> +<TR> +<TD><CODE>cat</CODE> <I>C</I> <I>G</I></TD> +<TD><I>C</I> is a category in context <I>G</I></TD> +</TR> +<TR> +<TD><CODE>def</CODE> <I>f</I> <I>P1</I> ...
<I>Pn</I> <CODE>=</CODE> t</TD> +<TD>function <I>f</I> applied to <I>P1</I>...<I>Pn</I> has value <I>t</I></TD> +</TR> +<TR> +<TD><CODE>data</CODE> <I>C</I> <CODE>=</CODE> <I>C1</I> <CODE>|</CODE> ... <CODE>|</CODE> <I>Cn</I></TD> +<TD>category <I>C</I> has constructors <I>C1</I>...<I>Cn</I></TD> +</TR> +</TABLE> + +<P></P> +<P> +The <B>context</B> in the <CODE>cat</CODE> judgement has the form +</P> +<PRE> + (x1 : T1) ... (xn : Tn) +</PRE> +<P> +where the types <I>T1 ... Tn</I> may be increasingly dependent. To form a +type, <I>C</I> must be equipped with arguments of each type in the +context, satisfying the dependencies. As syntactic sugar, we have +</P> +<PRE> + T G === (x : T) G +</PRE> +<P> +if <I>x</I> does not occur in <I>G</I>. The linearization type definition of a +category does not mention the context. +</P> +<P> +In <CODE>def</CODE> judgements, the arguments <I>P1</I>...<I>Pn</I> can be constructor and +variable patterns as well as wild cards, and the binding and +evaluation rules are the same as <a href="#secmatching">here</a>. +</P> +<P> +A <CODE>data</CODE> judgement states that the names on the right-hand side are constructors +of the category on the left-hand side. The precise types of the constructors are +given in the <CODE>fun</CODE> judgements introducing them; the value type of a constructor +of <I>C</I> must be of the form <I>C a1 ... am</I>. As syntactic sugar, +</P> +<PRE> + data f : A1 ... An -> C a1 ... am === + fun f : A1 ... An -> C a1 ... am ; data C = f ; +</PRE> +<P></P> +<A NAME="toc115"></A> +<H3>Dependent function types</H3> +<P> +A <B>dependent function type</B> has the form +</P> +<PRE> + (x : A) -> B +</PRE> +<P> +where <I>B</I> depends on a variable <I>x</I> of type <I>A</I>. 
We have the +following syntactic sugar: +</P> +<PRE> + (x,y : A) -> B === (x : A) -> (y : A) -> B + + (_ : A) -> B === (x : A) -> B if B does not depend on x + + A -> B === (_ : A) -> B +</PRE> +<P> +A <CODE>fun</CODE> function in abstract syntax may have function types as +argument types. This is called <B>higher-order abstract syntax</B>. +The linearization of an argument +</P> +<PRE> + \z0, ..., zn -> b : (x0 : A0) -> ... -> (xn : An) -> B +</PRE> +<P> +is formed from the linearization <I>b*</I> of <I>b</I> by adding +fields that hold the variable symbols: +</P> +<PRE> + b* ** {$0 = "z0" ; ... ; $n = "zn"} +</PRE> +<P> +If an argument function is itself a higher-order function, its +bound variables cannot be reached in linearization. Thus, in a sense, +the higher-order abstract syntax of GF is just <B>second-order abstract syntax</B>. +</P> +<P> +A <B>syntax tree</B> is a well-typed term in <B>beta-eta normal form</B>, which +means that +</P> +<UL> +<LI>its type is a basic type, i.e. it is not a partial application; +<LI>its arguments are in eta normal form, i.e. either full applications or + lambda abstractions with bodies that are full applications; +<LI>it has no beta redexes, i.e. applications of abstractions. +</UL> + +<P> +Terms that are not in this form may occur as arguments of dependent types +and in <CODE>def</CODE> judgements, but they cannot be linearized. +</P> +<A NAME="toc116"></A> +<H1>Grammars of formal languages</H1> +<P> +<a name="chapseven"></a> +</P> +<P> +In this chapter, we will write a grammar for arithmetic expressions as known +from school mathematics and many programming languages. We will see how to +define precedences in GF, how to include built-in integers in grammars, and +how to deal with spaces between tokens in desired ways. As an alternative concrete +syntax, we will generate code for a JVM-like stack machine. 
We will conclude +by extending the language with variable declarations and assignments, which +are handled in a type-safe way by using higher-order abstract syntax. +</P> +<P> +Writing grammars for formal languages is usually less challenging than for +natural languages. There are standard tools for this task, such as the YACC +family of parser generators. Using GF would be overkill for many projects, +and would come with a penalty in efficiency. However, it is still worthwhile to +look at this task. Typical applications of GF are natural-language interfaces +to formal systems: in such applications, the translation between natural and +formal language can be defined as a multilingual grammar. The use of higher-order +abstract syntax, together with dependent types, provides a way to define a +complete compiler in GF. +</P> +<A NAME="toc117"></A> +<H2>Arithmetic expressions</H2> +<A NAME="toc118"></A> +<H3>Abstract syntax</H3> +<P> +We want to write a grammar for what is usually called <B>expressions</B> +in programming languages. The expressions are built from integers by +the binary operations of addition, subtraction, multiplication, and +division. The abstract syntax is easy to write. We call it <CODE>Calculator</CODE>, +since it can be used as the basis of a calculator. +</P> +<PRE> + abstract Calculator = { + + cat Exp ; + + fun + EPlus, EMinus, ETimes, EDiv : Exp -> Exp -> Exp ; + EInt : Int -> Exp ; + } +</PRE> +<P> +Notice the use of the category <CODE>Int</CODE>. It is a built-in category of +integers. Its syntax trees are denoted by <B>integer literals</B>, which are +sequences of digits. For instance, +</P> +<PRE> + 5457455814608954681 : Int +</PRE> +<P> +These are the only objects of type <CODE>Int</CODE>: +grammars are not allowed to declare functions with <CODE>Int</CODE> as value type. +</P> +<A NAME="toc119"></A> +<H3>Concrete syntax: a simple approach</H3> +<P> +Arithmetic expressions should be unambiguous. 
If we write +</P> +<PRE> + 2 + 3 * 4 +</PRE> +<P> +it should be parsed as one, but not both, of +</P> +<PRE> + EPlus (EInt 2) (ETimes (EInt 3) (EInt 4)) + ETimes (EPlus (EInt 2) (EInt 3)) (EInt 4) +</PRE> +<P> +Under normal conventions, the former is chosen, because +multiplication has <B>higher precedence</B> than addition. +If we want to express the latter tree, we have to use +parentheses: +</P> +<PRE> + (2 + 3) * 4 +</PRE> +<P> +However, it is not completely trivial to decide when to use +parentheses and when not. We will therefore begin with a +concrete syntax that always uses parentheses around binary +operator applications. +</P> +<PRE> + concrete CalculatorP of Calculator = open Prelude in { + + lincat + Exp = SS ; + lin + EPlus = infix "+" ; + EMinus = infix "-" ; + ETimes = infix "*" ; + EDiv = infix "/" ; + EInt i = i ; + + oper + infix : Str -> SS -> SS -> SS = \f,x,y -> + ss ("(" ++ x.s ++ f ++ y.s ++ ")") ; + } +</PRE> +<P> +Now we will obtain +</P> +<PRE> + > linearize EPlus (EInt 2) (ETimes (EInt 3) (EInt 4)) + ( 2 + ( 3 * 4 ) ) +</PRE> +<P> +The first problem, even more urgent than superfluous parentheses, is +to get rid of superfluous spaces and to recognize integer literals +in the parser. +</P> +<A NAME="toc120"></A> +<H2>Lexing and unlexing</H2> +<P> +<a name="seclexing"></a> +</P> +<P> +The input of parsing in GF is not just a string, but a list of +<B>tokens</B>. By default, a list of tokens is obtained from a string +by analysing it into <B>words</B>, which means chunks separated by +spaces. Thus for instance +</P> +<PRE> + "(12 + (3 * 4))" +</PRE> +<P> +is split into the tokens +</P> +<PRE> + "(12", "+", "(3", "*", "4))" +</PRE> +<P> +The parser then tries to find each of these tokens among the terminals +of the grammar, i.e. among the strings that can appear in linearizations. +In our example, only the tokens <CODE>"+"</CODE> and <CODE>"*"</CODE> can be found, and +parsing therefore fails. 
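+</P>
+<P>
+The failure described above can be reproduced outside GF with a few lines of
+Haskell. This is only a sketch of word-based splitting, not GF's own lexer:
+the names <CODE>terminals</CODE>, <CODE>wordTokens</CODE> and <CODE>classify</CODE> are made up for the
+illustration, and the terminal list is assumed to be that of the
+fully parenthesizing calculator grammar.
+</P>
```haskell
import Data.List (partition)

-- Assumed terminals of the parenthesizing calculator grammar.
terminals :: [String]
terminals = ["(", ")", "+", "-", "*", "/"]

-- Word-based splitting: tokens are chunks separated by white space.
wordTokens :: String -> [String]
wordTokens = words

-- Split the tokens into those found among the terminals
-- and those that are unknown to the grammar.
classify :: String -> ([String], [String])
classify = partition (`elem` terminals) . wordTokens
```
+<P>
+In GHCi, <CODE>classify "(12 + (3 * 4))"</CODE> puts only <CODE>"+"</CODE> and <CODE>"*"</CODE> in the first
+component; the chunks <CODE>"(12"</CODE>, <CODE>"(3"</CODE> and <CODE>"4))"</CODE> remain unknown.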
+</P> +<P> +The proper way to split the above string into tokens would be +</P> +<PRE> + "(", "12", "+", "(", "3", "*", "4", ")", ")" +</PRE> +<P> +Moreover, the tokens <CODE>"12"</CODE>, <CODE>"3"</CODE>, and <CODE>"4"</CODE> should not be sought +among the terminals in the grammar, but treated as integer tokens, which +are defined outside the grammar. Since GF aims to be fully general, such +conventions are not built in: it must be possible for a grammar to have +tokens such as <CODE>"12"</CODE> and <CODE>"12)"</CODE>. Therefore, GF has a way to select +a <B>lexer</B>, a function that splits strings into tokens and classifies +them into terminals, literals, etc. +</P> +<P> +A lexer can be given as a flag to the parsing command: +</P> +<PRE> + > parse -cat=Exp -lexer=codelit "(2 + (3 * 4))" + EPlus (EInt 2) (ETimes (EInt 3) (EInt 4)) +</PRE> +<P> +Since the lexer is usually a part of the language specification, it +makes sense to put it in the concrete syntax by using the judgement +</P> +<PRE> + flags lexer = codelit ; +</PRE> +<P> +The problem of getting correct spacing after linearization is likewise solved +by an <B>unlexer</B>: +</P> +<PRE> + > l -unlexer=code EPlus (EInt 2) (ETimes (EInt 3) (EInt 4)) + (2 + (3 * 4)) +</PRE> +<P> +This flag, too, is usually put into the concrete syntax file. +</P> +<P> +The lexers and unlexers that are available in the GF system can be +seen by +</P> +<PRE> + > help -lexer + > help -unlexer +</PRE> +<P> +A table of the most common lexers and unlexers is given in the Summary +section 7.8. +</P> +<A NAME="toc121"></A> +<H2>Precedence and fixity</H2> +<P> +<a name="secprecedence"></a> +</P> +<P> +Here is a summary of the usual +precedence rules in mathematics and programming languages: +</P> +<UL> +<LI>Integer constants and expressions in parentheses have the highest precedence. +<LI>Multiplication and division have equal precedence, lower than the highest + but higher than addition and subtraction, which are again equal. 
+<LI>All four binary operations are <B>left-associative</B>, which means that + e.g. <CODE>1 + 2 + 3</CODE> means the same as <CODE>(1 + 2) + 3</CODE>. +</UL> + +<P> +One way of dealing with precedences in compiler books is by dividing expressions +into three categories: +</P> +<UL> +<LI>expressions: addition and subtraction +<LI>terms: multiplication and division +<LI>factors: constants and expressions in parentheses +</UL> + +<P> +The context-free grammar, also taking care of associativity, is the following: +</P> +<PRE> + Exp ::= Exp "+" Term | Exp "-" Term | Term ; + Term ::= Term "*" Fact | Term "/" Fact | Fact ; + Fact ::= Int | "(" Exp ")" ; +</PRE> +<P> +A compiler, however, does not want to make a semantic distinction between the +three categories. Nor does it want to build syntax trees with the +<B>coercions</B> that enable the use of higher-level expressions on a lower level, and +encode the use of parentheses. In compiler tools such as YACC, building abstract +syntax trees is performed as a <B>semantic action</B>. For instance, if the parser +recognizes an expression in parentheses, the action is to return only the +expression, without encoding the parentheses. +</P> +<P> +In GF, semantic actions could be encoded by using <CODE>def</CODE> definitions introduced +<a href="#secdefdef">here</a>. But there is a more straightforward way of thinking about +precedences: we introduce a parameter for precedence, and treat it as +an inherent feature of expressions: +</P> +<PRE> + oper + param Prec = Ints 2 ; + TermPrec : Type = {s : Str ; p : Prec} ; + + mkPrec : Prec -> Str -> TermPrec = \p,s -> {s = s ; p = p} ; + + lincat + Exp = TermPrec ; +</PRE> +<P> +This example shows another way to use built-in integers in GF: +the type <CODE>Ints 2</CODE> is a parameter type, whose values are the integers +<CODE>0,1,2</CODE>. These are the three precedence levels we need. 
The main idea +is to compare the inherent precedence of an expression with the context +in which it is used. If the precedence is higher than or equal to +the expected precedence, then +no parentheses are needed. Otherwise they are. We encode this rule in +the operation +</P> +<PRE> + oper usePrec : TermPrec -> Prec -> Str = \x,p -> + case lessPrec x.p p of { + True => "(" ++ x.s ++ ")" ; + False => x.s + } ; +</PRE> +<P> +With this operation, we can build another one, that can be used for +defining left-associative infix expressions: +</P> +<PRE> + infixl : Prec -> Str -> (_,_ : TermPrec) -> TermPrec = \p,f,x,y -> + mkPrec p (usePrec x p ++ f ++ usePrec y (nextPrec p)) ; +</PRE> +<P> +Constant-like expressions (the highest level) can be built simply by +</P> +<PRE> + constant : Str -> TermPrec = mkPrec 2 ; +</PRE> +<P> +All these operations can be found in the library module <CODE>lib/prelude/Formal</CODE>, +so we don't have to define them in our own code. Also the auxiliary operations +<CODE>nextPrec</CODE> and <CODE>lessPrec</CODE> used in their definitions are defined there. +The library has 5 levels instead of 3. +</P> +<P> +Now we can express the whole concrete syntax of <CODE>Calculator</CODE> compactly: +</P> +<PRE> + concrete CalculatorC of Calculator = open Formal, Prelude in { + + flags lexer = codelit ; unlexer = code ; startcat = Exp ; + + lincat Exp = TermPrec ; + + lin + EPlus = infixl 0 "+" ; + EMinus = infixl 0 "-" ; + ETimes = infixl 1 "*" ; + EDiv = infixl 1 "/" ; + EInt i = constant i.s ; + } +</PRE> +<P> +Let us just take one more look at the operation <CODE>usePrec</CODE>, which decides whether +to put parentheses around a term or not. The case where parentheses are not needed +around a string was defined as the string itself. +However, this would imply that superfluous parentheses +are never correct. 
A more liberal grammar is obtained by using the operation +</P> +<PRE> + parenthOpt : Str -> Str = \s -> variants {s ; "(" ++ s ++ ")"} ; +</PRE> +<P> +which is actually used in the <CODE>Formal</CODE> library. +But even in this way, we can only allow one pair of superfluous parentheses. +Thus the parameter-based grammar has not quite reached the goal +of implementing the same language as the expression-term-factor grammar. +But it has the advantage of eliminating precedence distinctions from the +abstract syntax. +</P> +<P> +<B>Exercise</B>. Define non-associative and right-associative infix operations +analogous to <CODE>infixl</CODE>. +</P> +<P> +<B>Exercise</B>. Add a constructor that puts parentheses around expressions +to raise their precedence, but that is eliminated by a <CODE>def</CODE> definition. +Test parsing with and without a pipe to <CODE>pt -transform=compute</CODE>. +</P> +<A NAME="toc122"></A> +<H2>Code generation as linearization</H2> +<P> +The classical use of grammars of programming languages is in <B>compilers</B>, +which translate one language into another. Typically the source language of +a compiler is a high-level language and the target language is a machine +language. The hub of a compiler is abstract syntax: the <B>front end</B> of +the compiler parses source language strings into abstract syntax trees, and +the <B>back end</B> linearizes these trees into the target language. This processing +model is of course what GF uses for natural language translation; the main +difference is that, in GF, the compiler could run in the opposite direction as +well, that is, function as a <B>decompiler</B>. (In full-size compilers, the +abstract syntax is also transformed by several layers of semantic analysis +and optimizations, before the target code is generated; this can destroy +reversibility and hence decompilation.) 
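+</P>
+<P>
+The back-end role of linearization can be imitated in Haskell, using the
+precedence rule of the previous section. The following is a sketch under
+assumptions, not the code of the <CODE>Formal</CODE> library: the datatype <CODE>Exp</CODE> mirrors the
+<CODE>Calculator</CODE> abstract syntax by hand, <CODE>infixL</CODE> stands for <CODE>infixl</CODE> (a reserved word
+in Haskell), and spacing is hard-coded instead of being left to an unlexer.
+</P>
```haskell
-- Hand-written mirror of the Calculator abstract syntax.
data Exp
  = EPlus Exp Exp
  | EMinus Exp Exp
  | ETimes Exp Exp
  | EDiv Exp Exp
  | EInt Integer

-- A string together with its inherent precedence (0, 1 or 2).
type TermPrec = (String, Int)

-- Parenthesize if the inherent precedence is lower than the expected one.
usePrec :: TermPrec -> Int -> String
usePrec (s, p) expected
  | p < expected = "(" ++ s ++ ")"
  | otherwise    = s

-- Left-associative infix at level p: the right operand must have
-- a strictly higher precedence to appear without parentheses.
infixL :: Int -> String -> TermPrec -> TermPrec -> TermPrec
infixL p op x y = (usePrec x p ++ " " ++ op ++ " " ++ usePrec y (p + 1), p)

linExp :: Exp -> TermPrec
linExp e = case e of
  EPlus x y  -> infixL 0 "+" (linExp x) (linExp y)
  EMinus x y -> infixL 0 "-" (linExp x) (linExp y)
  ETimes x y -> infixL 1 "*" (linExp x) (linExp y)
  EDiv x y   -> infixL 1 "/" (linExp x) (linExp y)
  EInt n     -> (show n, 2)

render :: Exp -> String
render = fst . linExp
```
+<P>
+With these definitions, <CODE>render (ETimes (EPlus (EInt 2) (EInt 3)) (EInt 4))</CODE>
+gives <CODE>"(2 + 3) * 4"</CODE>, whereas the tree with <CODE>EPlus</CODE> on top needs no parentheses.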
+</P> +<P> +More for the sake of illustration +than as a serious compiler, let us write a concrete +syntax of <CODE>Calculator</CODE> that generates machine code similar to JVM (Java Virtual +Machine). JVM is a so-called <B>stack machine</B>, whose code follows the +<B>postfix</B> notation, also known as <B>reverse Polish</B> notation. Thus the +expression +</P> +<PRE> + 2 + 3 * 4 +</PRE> +<P> +is translated to +</P> +<PRE> + iconst 2 ; iconst 3 ; iconst 4 ; imul ; iadd +</PRE> +<P> +The linearization rules are not difficult to give: +</P> +<PRE> + lin + EPlus = postfix "iadd" ; + EMinus = postfix "isub" ; + ETimes = postfix "imul" ; + EDiv = postfix "idiv" ; + EInt i = ss ("iconst" ++ i.s) ; + oper + postfix : Str -> SS -> SS -> SS = \op,x,y -> + ss (x.s ++ ";" ++ y.s ++ ";" ++ op) ; +</PRE> +<P></P> +<A NAME="toc123"></A> +<H2>Speaking aloud arithmetic expressions</H2> +<P> +Natural languages sometimes have difficulties in expressing mathematical +formulas unambiguously, because they have no universal device of parentheses. +For arithmetic formulas, a solution exists: +</P> +<PRE> + 2 + 3 * 4 +</PRE> +<P> +can be expressed +</P> +<PRE> + the sum of 2 and the product of 3 and 4 +</PRE> +<P> +However, this format is very verbose and unnatural, and becomes +impossible to understand when the complexity of expressions grows. +Fortunately, spoken language +has a very nice way of using <B>pauses</B> for disambiguation. This device was +introduced by Torbjörn Lager (personal communication, 2003) +as an input mechanism to a calculator dialogue +system; it seems to correspond very closely to how we actually speak when we +want to communicate arithmetic expressions. Another application would be as +a part of a programming assistant that reads aloud code. +</P> +<P> +The idea is that, after every completed operation, there is a pause. 
Try this +by speaking aloud the following lines, making a pause instead of pronouncing the +word <CODE>PAUSE</CODE>: +</P> +<PRE> + 2 plus 3 times 4 PAUSE + 2 plus 3 PAUSE times 4 PAUSE +</PRE> +<P> +A grammar implementing this convention is again simple to write: +</P> +<PRE> + lin + EPlus = infix "plus" ; + EMinus = infix "minus" ; + ETimes = infix "times" ; + EDiv = infix ["divided by"] ; + EInt i = i ; + oper + infix : Str -> SS -> SS -> SS = \op,x,y -> + ss (x.s ++ op ++ y.s ++ "PAUSE") ; +</PRE> +<P> +Intuitively, a pause is taken to give the hearer time to compute an +intermediate result. +</P> +<P> +<B>Exercise</B>. Is the pause-based grammar unambiguous? Test with random examples! +</P> +<A NAME="toc124"></A> +<H2>Programs with variables</H2> +<P> +A useful extension of arithmetic expressions is a <B>straight code</B> programming +language. The programs of this language are <B>assignments</B> of the form <CODE>x = exp</CODE>, +which assign expressions to variables. Expressions can moreover contain variables +that have been given values in previous assignments. +</P> +<P> +In this language, we use two new categories: programs and variables. +A program is a sequence of assignments, where a variable is given a value. +Logically, we want to distinguish <B>initializations</B> from other assignments: +these are the assignments where a variable is given a value for the first time. +Just like in C-like languages, +we prefix an initializing assignment with the type of the variable. 
+Here is an example of a piece of code written in the language: +</P> +<PRE> + int x = 2 + 3 ; + int y = x + 1 ; + x = x + 9 * y ; +</PRE> +<P> +We define programs by the following constructors: +</P> +<PRE> + fun + PEmpty : Prog ; + PInit : Exp -> (Var -> Prog) -> Prog ; + PAss : Var -> Exp -> Prog -> Prog ; +</PRE> +<P> +The interesting constructor is <CODE>PInit</CODE>, which uses +higher-order abstract syntax for making the initialized variable available in +the <B>continuation</B> of the program. The abstract syntax tree for the above code +is +</P> +<PRE> + PInit (EPlus (EInt 2) (EInt 3)) (\x -> + PInit (EPlus (EVar x) (EInt 1)) (\y -> + PAss x (EPlus (EVar x) (ETimes (EInt 9) (EVar y))) + PEmpty)) +</PRE> +<P> +Since we want to prevent the use of uninitialized variables in programs, we +don't give any constructors for <CODE>Var</CODE>! We just have a rule for using variables +as expressions: +</P> +<PRE> + fun EVar : Var -> Exp ; +</PRE> +<P> +The rest of the grammar is just the same as for arithmetic expressions +<a href="#secprecedence">here</a>. The best way to implement it is perhaps by writing a +module that extends the expression module. The most natural start category +of the extension is <CODE>Prog</CODE>. +</P> +<P> +<B>Exercise</B>. Extend the straight-code language to expressions of type <CODE>float</CODE>. +To guarantee type safety, you can define a category <CODE>Typ</CODE> of types, and +make <CODE>Exp</CODE> and <CODE>Var</CODE> dependent on <CODE>Typ</CODE>. Basic floating point expressions +can be formed from literal of the built-in GF type <CODE>Float</CODE>. The arithmetic +operations should be made polymorphic (as <a href="#secpolymorphic">here</a>). +</P> +<A NAME="toc125"></A> +<H3>The concrete syntax of assignments</H3> +<P> +We can define a C-like concrete syntax by using GF's <CODE>$</CODE> variables, as explained +<a href="#secbinding">here</a>. 
+</P> +<P> +In a JVM-like syntax, we need two more instructions: <CODE>iload</CODE> <I>x</I>, which +loads (pushes on the stack) the value of the variable <I>x</I>, and <CODE>istore</CODE> <I>x</I>, +which stores the value of the currently topmost expression in the variable <I>x</I>. +Thus the code for the example in the previous section is +</P> +<PRE> + iconst 2 ; iconst 3 ; iadd ; istore x ; + iload x ; iconst 1 ; iadd ; istore y ; + iload x ; iconst 9 ; iload y ; imul ; iadd ; istore x ; +</PRE> +<P> +Those familiar with JVM will notice that we are using <B>symbolic addresses</B>, i.e. +variable names instead of integer offsets in the memory. Neither real JVM nor +our variant makes any distinction between the initialization and reassignment +of a variable. +</P> +<P> +<B>Exercise</B>. Finish the implementation of the +C-to-JVM compiler by extending the expression modules +to straight code programs. +</P> +<P> +<B>Exercise</B>. If you did the exercise of adding floating point numbers to +the language, you can now cash in on the main advantage of type checking +for code generation: selecting type-correct JVM instructions. The floating +point instructions are precisely the same as the integer ones, except that +the prefix is <CODE>f</CODE> instead of <CODE>i</CODE>, and that <CODE>fconst</CODE> takes floating +point literals as arguments. +</P> +<A NAME="toc126"></A> +<H3>A liberal syntax of variables</H3> +<P> +In many applications, the task of GF is just linearization and parsing; +keeping track of bound variables and other semantic constraints is +the task of other parts of the program. For instance, if we want to +write a natural language interface that reads aloud C code, we can +just as well use a context-free grammar of C, and leave it to the C +compiler to check that variables make sense. In such a program, we may +want to treat variables as <I>Strings</I>, i.e. 
to have a constructor +</P> +<PRE> + fun VString : String -> Var ; +</PRE> +<P> +The built-in category <CODE>String</CODE> has as its values <B>string literals</B>, +which are strings in double quotes. The lexer and unlexer <CODE>codelit</CODE> +restore and remove the quotes; when the lexer finds a token that is +neither a terminal in the grammar nor an integer literal, it sends +it to the parser as a string literal. +</P> +<P> +<B>Exercise</B>. Write a grammar for straight code without higher-order +abstract syntax. +</P> +<P> +<B>Exercise</B>. Extend the liberal straight code grammar to <CODE>while</CODE> loops and +some other program constructs, and investigate if you can build a reasonable spoken +language generator for this fragment. +</P> +<A NAME="toc127"></A> +<H2>Conclusion</H2> +<P> +Since formal languages are syntactically simpler than natural languages, it +is no wonder that their grammars can be defined in GF. Some thought is needed +for dealing with precedences and spacing, but much of it is encoded in GF's +libraries and built-in lexers and unlexers. If the sole purpose of a grammar +is to implement a programming language, then the <B>BNF Converter</B> tool +(BNFC) is more appropriate than GF: +<center> +<CODE>www.cs.chalmers.se/~markus/BNFC/</CODE> +</center> +BNFC uses standard YACC-like parser tools. GF has flags for printing +grammars in the BNFC format. +</P> +<P> +The most common applications of GF grammars of formal languages +are in natural-language interfaces of various kinds. +These systems don't usually need semantic control in GF abstract +syntax. However, the situation can be different if the interface also comprises +an interactive syntax editor, as in the GF-Key system +(Beckert & al. 2006, Burke & Johannisson 2005). +In that system, the editor is used for guiding programmers only to write +type-correct code. 
+</P> +<P> +The technique of continuations in modelling programming languages has recently +been applied to natural language, for processing <B>anaphoric reference</B>, +e.g. pronouns. It may be good to know that GF has the machinery available; +for the time being, however (GF 2.8), dependent types and +higher-order abstract syntax are not supported by the embedded GF implementations +in Haskell and Java. +</P> +<P> +<B>Exercise</B>. The book <I>C programming language</I> by Kernighan and Ritchie +(p. 123, 2nd edition, 1988) describes an English-like syntax for pointer and +array declarations, and a C program for translating between English and C. +The following example pair shows all the expression forms needed: +</P> +<PRE> + char (*(*x[3])())[5] + + x: array[3] of pointer to function returning + pointer to array[5] of char +</PRE> +<P> +Implement these translations by a GF grammar. +</P> +<P> +<B>Exercise</B>. Design a natural-language interface to Unix command lines. +It should be able to express verbally commands such as +<CODE>cat, cd, grep, ls, mv, rm, wc</CODE> and also +pipes built from them. +</P> +<A NAME="toc128"></A> +<H2>Summary of GF language constructs</H2> +<A NAME="toc129"></A> +<H3>Lexers and unlexers</H3> +<P> +Lexers are set by the flag <CODE>lexer</CODE> and unlexers by the flag <CODE>unlexer</CODE>. 
+</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>lexer</TH> +<TH COLSPAN="2">description</TH> +</TR> +<TR> +<TD><CODE>words</CODE></TD> +<TD>(default) tokens are separated by spaces or newlines</TD> +</TR> +<TR> +<TD><CODE>literals</CODE></TD> +<TD>like words, but integer and string literals recognized</TD> +</TR> +<TR> +<TD><CODE>chars</CODE></TD> +<TD>each character is a token</TD> +</TR> +<TR> +<TD><CODE>code</CODE></TD> +<TD>program code conventions (uses Haskell's lex)</TD> +</TR> +<TR> +<TD><CODE>text</CODE></TD> +<TD>with conventions on punctuation and capital letters</TD> +</TR> +<TR> +<TD><CODE>codelit</CODE></TD> +<TD>like code, but recognize literals (unknown words as strings)</TD> +</TR> +<TR> +<TD><CODE>textlit</CODE></TD> +<TD>like text, but recognize literals (unknown words as strings)</TD> +</TR> +</TABLE> + +<P></P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>unlexer</TH> +<TH COLSPAN="2">description</TH> +</TR> +<TR> +<TD><CODE>unwords</CODE></TD> +<TD>(default) space-separated token list</TD> +</TR> +<TR> +<TD><CODE>text</CODE></TD> +<TD>format as text: punctuation, capitals, paragraph <p></TD> +</TR> +<TR> +<TD><CODE>code</CODE></TD> +<TD>format as code (spacing, indentation)</TD> +</TR> +<TR> +<TD><CODE>textlit</CODE></TD> +<TD>like text, but remove string literal quotes</TD> +</TR> +<TR> +<TD><CODE>codelit</CODE></TD> +<TD>like code, but remove string literal quotes</TD> +</TR> +<TR> +<TD><CODE>concat</CODE></TD> +<TD>remove all spaces</TD> +</TR> +</TABLE> + +<P></P> +<A NAME="toc130"></A> +<H3>Built-in abstract syntax types</H3> +<P> +There are three built-in types. Their syntax trees are literals of corresponding kinds: +</P> +<UL> +<LI><CODE>Int</CODE>, with nonnegative integer literals e.g. <CODE>987031434</CODE> +<LI><CODE>Float</CODE>, with nonnegative floating-point literals e.g. <CODE>907.219807</CODE> +<LI><CODE>String</CODE>, with string literals e.g. 
<CODE>"foo"</CODE> +</UL> + +<P> +Their linearization type is uniformly <CODE>{s : Str}</CODE>. +</P> +<A NAME="toc131"></A> +<H1>Embedded grammars</H1> +<P> +<a name="chapeight"></a> +</P> +<P> +GF grammars can be used as parts of programs written in other programming +languages, such as Haskell and Java. +This facility is based on several components: +</P> +<UL> +<LI>a portable format for multilingual GF grammars +<LI>an interpreter for this format written in the host language +<LI>an API that enables reading grammar files and calling the interpreter +<LI>a way to manipulate abstract syntax trees in the host language +</UL> + +<P> +In this chapter, we will show the basic ways of producing such +<B>embedded grammars</B> and using them in Haskell, Java, and JavaScript programs. +We will build a simple example application in each language: +</P> +<UL> +<LI>a question-answering system in Haskell +<LI>a translator GUI in Java +<LI>a multilingual syntax editor in JavaScript +</UL> + +<P> +Moreover, we will show how grammar applications can be extended to +spoken language by generating <B>language models</B> for speech recognition +in various standard formats. +</P> +<A NAME="toc132"></A> +<H2>The portable grammar format</H2> +<P> +The portable format is called GFCC, "GF Canonical Compiled". A file +of this form can be produced from GF by the command +</P> +<PRE> + > print_multi -printer=gfcc | write_file FILE.gfcc +</PRE> +<P> +Files written in this format can also be imported in the GF system, +which recognizes the suffix <CODE>.gfcc</CODE> and builds the multilingual +grammar in memory. +</P> +<P> +<I>This applies to GF version 3 and upwards. Older GF used a format suffixed</I> +<CODE>.gfcm</CODE>. 
+<I>At the time of writing, the Java interpreter also still uses the GFCM format.</I> +</P> +<P> +GFCC is, in fact, the recommended format in +which final grammar products are distributed, because GFCC files +are stripped of superfluous information and can be started and applied +faster than sets of separate modules. +</P> +<P> +Application programmers never have any need to read or modify GFCC files. +Also in this sense, they play the same role as machine code in +general-purpose programming. +</P> +<A NAME="toc133"></A> +<H2>The embedded interpreter and its API</H2> +<P> +The interpreter is a kind of miniature GF system, which can parse and +linearize with grammars. But it can only perform a subset of the commands of +the GF system. For instance, it +cannot compile source grammars into the GFCC format; the compiler is the most +heavy-weight component of the GF system, and should not be carried around +in end-user applications. +Since GFCC is much +simpler than source GF, building an interpreter is relatively easy. +Full-scale interpreters currently exist in Haskell and Java, and partial +ones in C++, JavaScript, and Prolog. We will in this chapter focus +on Haskell, Java, and JavaScript. +</P> +<P> +Application programmers never need to read or modify the interpreter. +They only need to access it via its API. +</P> +<A NAME="toc134"></A> +<H2>Embedded GF applications in Haskell</H2> +<P> +Readers unfamiliar with Haskell, or who just want to program in Java, can safely +skip this section. Everything will be repeated in the corresponding Java +section. However, seeing the Haskell code may still be helpful because +Haskell is in many ways closer to GF than Java is. In particular, recursive +types of syntax trees and pattern matching over them are very similar in +Haskell and GF, +but require a complex encoding with classes and visitors in Java. 
+</P> +<A NAME="toc135"></A> +<H3>The EmbedAPI module</H3> +<P> +The Haskell API contains (among other things) the following types and functions: +</P> +<PRE> + module EmbedAPI where + + type MultiGrammar + type Language + type Category + type Tree + + file2grammar :: FilePath -> IO MultiGrammar + + linearize :: MultiGrammar -> Language -> Tree -> String + parse :: MultiGrammar -> Language -> Category -> String -> [Tree] + + linearizeAll :: MultiGrammar -> Tree -> [String] + linearizeAllLang :: MultiGrammar -> Tree -> [(Language,String)] + + parseAll :: MultiGrammar -> Category -> String -> [[Tree]] + parseAllLang :: MultiGrammar -> Category -> String -> [(Language,[Tree])] + + languages :: MultiGrammar -> [Language] + categories :: MultiGrammar -> [Category] + startCat :: MultiGrammar -> Category +</PRE> +<P> +This is the only module that needs to be imported in the Haskell application. +It is available as a part of the GF distribution, in the file +<CODE>src/GF/GFCC/API.hs</CODE>. +</P> +<A NAME="toc136"></A> +<H3>First application: a translator</H3> +<P> +Let us first build a stand-alone translator, which can translate +in any multilingual grammar between any languages in the grammar. +The whole code for this translator is here: +</P> +<PRE> + module Main where + + import GF.GFCC.API + import System.Environment (getArgs) + + main :: IO () + main = do + file:_ <- getArgs + gr <- file2grammar file + interact (translate gr) + + translate :: MultiGrammar -> String -> String + translate gr s = case parseAllLang gr (startCat gr) s of + (lg,t:_):_ -> unlines [linearize gr l t | l <- languages gr, l /= lg] + _ -> "NO PARSE" +</PRE> +<P> +To run the translator, first compile it by +</P> +<PRE> + % ghc --make -o trans Translator.hs +</PRE> +<P> +Then produce a GFCC file. 
For instance, the <CODE>Food</CODE> grammar set can be +compiled as follows: +</P> +<PRE> + % gfc --make FoodEng.gf FoodIta.gf +</PRE> +<P> +This produces the file <CODE>Food.gfcc</CODE> (its name comes from the abstract syntax). +</P> +<P> +<I>The gfc batch compiler program is available in GF 3 and upwards.</I> +<I>In earlier versions, the appropriate command can be piped to gf:</I> +</P> +<PRE> + % echo "pm -printer=gfcc | wf Food.gfcc" | gf FoodEng.gf FoodIta.gf +</PRE> +<P> +Equivalently, the grammars could be read into GF shell and the <CODE>pm</CODE> command +issued from there. But the unix command has the advantage that it can +be put into a <CODE>Makefile</CODE> to automate the compilation of an application. +</P> +<P> +The Haskell library function <CODE>interact</CODE> makes the <CODE>trans</CODE> program work +like a Unix filter, which reads from standard input and writes to standard +output. Therefore it can be a part of a pipe and read and write files. +The simplest way to translate is to <CODE>echo</CODE> input to the program: +</P> +<PRE> + % echo "this wine is delicious" | ./trans Food.gfcc + questo vino è delizioso +</PRE> +<P> +The result is given in all languages except the input language. +</P> +<A NAME="toc137"></A> +<H3>A looping translator</H3> +<P> +If the user wants to translate many expressions in a sequence, it +is cumbersome to have to start the translator over and over again, +because reading the grammar and building the parser always takes +time. The translator of the previous section is easy to modify +to enable this: just change <CODE>interact</CODE> in the main function to +<CODE>loop</CODE>. It is not a standard Haskell function, so its definition has +to be included: +</P> +<PRE> + loop :: (String -> String) -> IO () + loop trans = do + s <- getLine + if s == "quit" then putStrLn "bye" else do + putStrLn $ trans s + loop trans +</PRE> +<P> +The loop keeps on translating line by line until the input line +is <CODE>quit</CODE>. 
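+</P>
+<P>
+The <CODE>loop</CODE> above fails with an exception if the input ends before a <CODE>quit</CODE> line,
+for instance when the program reads from a pipe. A possible variant, given here
+only as a sketch, separates a pure function on lines from the input and output;
+the names <CODE>runLines</CODE> and <CODE>loopLines</CODE> are made up for this illustration. It relies on
+the lazy behaviour of <CODE>interact</CODE>, so how promptly each answer appears depends on
+the buffering of the terminal.
+</P>
```haskell
-- Pure core: translate each line until "quit" or the end of input,
-- then say goodbye.
runLines :: (String -> String) -> [String] -> [String]
runLines trans ls = map trans (takeWhile (/= "quit") ls) ++ ["bye"]

-- Hypothetical replacement for loop: works line by line on standard input.
loopLines :: (String -> String) -> IO ()
loopLines trans = interact (unlines . runLines trans . lines)
```
+<P>
+In the translator, <CODE>loopLines (translate gr)</CODE> would then be used in the place of
+<CODE>loop (translate gr)</CODE>.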
+</P> +<A NAME="toc138"></A> +<H3>A question-answer system</H3> +<P> +<a name="secmathprogram"></a> +</P> +<P> +The next application is also a translator, but it adds a +<B>transfer</B> component to the grammar. Transfer is a function that +maps the input syntax tree to some other syntax tree, which is +then linearized and shown back to the user. The transfer function we +are going to use is one that maps a question to its answer. +The program accepts simple questions about arithmetic and answers +"yes" or "no" in the language in which the question was asked: +</P> +<PRE> + Is 123 prime? + No. + 77 est impair ? + Oui. +</PRE> +<P> +The main change that is needed to the pure translator is to give +<CODE>translate</CODE> an extra argument: a transfer function. +</P> +<PRE> + translate :: (Tree -> Tree) -> MultiGrammar -> String -> String +</PRE> +<P> +You can think of ordinary translation as a special case where +transfer is the identity function (<CODE>id</CODE> in Haskell). +</P> +<P> +Also the behaviour of returning the reply in different languages +should be changed so that the reply is returned in the <I>same</I> language. +Here is the complete definition of <CODE>translate</CODE> in the new form. +</P> +<PRE> + translate tr gr s = case parseAllLang gr (startCat gr) s of + (lg,t:_):_ -> linearize gr lg (tr t) + _ -> "NO PARSE" +</PRE> +<P> +To complete the system, we have to define the transfer function. +So, how can we write a function from abstract syntax trees +to abstract syntax trees? The embedded API does not make +the constructors of the type <CODE>Tree</CODE> available for users. Even if it did, it would +be quite complicated to use the type, and programs would be likely +to produce trees that are ill-typed in GF and therefore cannot +be linearized.
+</P> +<A NAME="toc139"></A> +<H3>Exporting GF datatypes</H3> +<P> +The way to define transfer is to use GF's tree constructors, that +is, the <CODE>fun</CODE> functions, as if they were Haskell's data constructors. +There is enough resemblance between GF and Haskell to make this possible +in most cases. It is even possible in Java, as we shall see later. +</P> +<P> +Thus every category of GF is translated into a Haskell datatype, where the +functions producing a value of that category are treated as constructors. +The translation is obtained by using the batch compiler with the command +</P> +<PRE> + % gfc -haskell Food.gfcc +</PRE> +<P> +It is also possible to produce the Haskell file together with GFCC, by +</P> +<PRE> + % gfc --make -haskell FoodEng.gf FoodIta.gf +</PRE> +<P> +The result is a file named <CODE>GSyntax.hs</CODE>, containing a +module named <CODE>GSyntax</CODE>. +</P> +<P> +<I>In GF before version 3, the same result is obtained from within GF, by the command</I> +</P> +<PRE> + > print_grammar -printer=gfcc_haskell | write_file GSyntax.hs +</PRE> +<P></P> +<P> +As an example, we take +the grammar we are going to use for queries. The abstract syntax is +</P> +<PRE> + abstract Math = { + + flags startcat = Question ; + + cat Answer ; Question ; Object ; + + fun + Even : Object -> Question ; + Odd : Object -> Question ; + Prime : Object -> Question ; + Number : Int -> Object ; + + Yes : Answer ; + No : Answer ; + } +</PRE> +<P> +It is translated to the following system of datatypes: +</P> +<PRE> + newtype GInt = GInt Integer + + data GAnswer = + GYes + | GNo + + data GObject = GNumber GInt + + data GQuestion = + GPrime GObject + | GOdd GObject + | GEven GObject +</PRE> +<P> +All type and constructor names are prefixed with a <CODE>G</CODE> to prevent clashes. +</P> +<P> +Now it is possible to define functions from and to these datatypes in Haskell. +Haskell's type checker guarantees that the functions are well-typed also with +respect to GF.
Here is a question-to-answer function for this language: +</P> +<PRE> + answer :: GQuestion -> GAnswer + answer p = case p of + GOdd x -> test odd x + GEven x -> test even x + GPrime x -> test prime x + + value :: GObject -> Int + value e = case e of + GNumber (GInt i) -> fromInteger i + + test :: (Int -> Bool) -> GObject -> GAnswer + test f x = if f (value x) then GYes else GNo +</PRE> +<P> +So it is the function <CODE>answer</CODE> that we want to apply as transfer. +The only problem is the <I>type</I> of this function: the parsing and +linearization methods of <CODE>API</CODE> work with <CODE>Tree</CODE>s and not +with <CODE>GQuestion</CODE>s and <CODE>GAnswer</CODE>s. +</P> +<P> +Fortunately the Haskell translation of GF takes care of translating +between trees and the generated datatypes. This is done by using +a class with the required translation methods: +</P> +<PRE> + class Gf a where + gf :: a -> Tree + fg :: Tree -> a +</PRE> +<P> +The Haskell code generator also generates an instance of this class +for each datatype, for example, +</P> +<PRE> + instance Gf GQuestion where + gf (GEven x1) = DTr [] (AC (CId "Even")) [gf x1] + gf (GOdd x1) = DTr [] (AC (CId "Odd")) [gf x1] + gf (GPrime x1) = DTr [] (AC (CId "Prime")) [gf x1] + fg t = + case t of + DTr [] (AC (CId "Even")) [x1] -> GEven (fg x1) + DTr [] (AC (CId "Odd")) [x1] -> GOdd (fg x1) + DTr [] (AC (CId "Prime")) [x1] -> GPrime (fg x1) + _ -> error ("no Question " ++ show t) +</PRE> +<P> +Needless to say, <CODE>GSyntax</CODE> is a module that a programmer +never needs to look into, let alone change: it is enough to know that it +contains a systematic encoding and decoding between an abstract syntax +and Haskell datatypes, where +</P> +<UL> +<LI>all GF names are prefixed with <CODE>G</CODE> in Haskell +<LI><CODE>gf</CODE> translates from Haskell to GF +<LI><CODE>fg</CODE> translates from GF to Haskell +</UL> + +<A NAME="toc140"></A> +<H3>Putting it all together</H3> +<P> +Here is the complete code for
the Haskell module <CODE>TransferLoop.hs</CODE>. +</P> +<PRE> + module Main where + + import GF.GFCC.API + import TransferDef (transfer) + + main :: IO () + main = do + gr <- file2grammar "Math.gfcc" + loop (translate transfer gr) + + loop :: (String -> String) -> IO () + loop trans = do + s <- getLine + if s == "quit" then putStrLn "bye" else do + putStrLn $ trans s + loop trans + + translate :: (Tree -> Tree) -> MultiGrammar -> String -> String + translate tr gr s = case parseAllLang gr (startCat gr) s of + (lg,t:_):_ -> linearize gr lg (tr t) + _ -> "NO PARSE" +</PRE> +<P> +This is the <CODE>Main</CODE> module, which just needs a function <CODE>transfer</CODE> from +<CODE>TransferDef</CODE> in order to compile. In the current application, the <CODE>TransferDef</CODE> module +looks as follows: +</P> +<PRE> + module TransferDef where + + import GF.GFCC.API (Tree) + import GSyntax + + transfer :: Tree -> Tree + transfer = gf . answer . fg + + answer :: GQuestion -> GAnswer + answer p = case p of + GOdd x -> test odd x + GEven x -> test even x + GPrime x -> test prime x + + value :: GObject -> Int + value e = case e of + GNumber (GInt i) -> fromInteger i + + test :: (Int -> Bool) -> GObject -> GAnswer + test f x = if f (value x) then GYes else GNo + + prime :: Int -> Bool + prime x = elem x primes where + primes = sieve [2 .. x] + sieve (p:xs) = p : sieve [ n | n <- xs, n `mod` p > 0 ] + sieve [] = [] +</PRE> +<P> +This module, in turn, needs <CODE>GSyntax</CODE> to compile, and the main module +needs <CODE>Math.gfcc</CODE> to run. To automate the production of the system, +we write a <CODE>Makefile</CODE> as follows: +</P> +<PRE> + all: + gfc --make -haskell MathEng.gf MathFre.gf + ghc --make -o ./math TransferLoop.hs + strip math +</PRE> +<P> +(Notice that the indentation at the start of each command line in a Makefile must be a tab.)
+Now we can compile the whole system by just typing +</P> +<PRE> + make +</PRE> +<P> +Then you can run it by typing +</P> +<PRE> + ./math +</PRE> +<P> +Well --- you will of course need some concrete syntaxes of <CODE>Math</CODE> in order +to succeed. We have defined ours by using the resource functor design pattern, +as explained <a href="#secfunctor">here</a>. +</P> +<P> +Just to summarize, the source of the application consists of the following files: +</P> +<PRE> + Makefile -- a makefile + Math.gf -- abstract syntax + Math???.gf -- concrete syntaxes + TransferDef.hs -- definition of question-to-answer function + TransferLoop.hs -- Haskell Main module +</PRE> +<P></P> +<A NAME="toc141"></A> +<H2>Embedded GF applications in Java</H2> +<P> +When an API for GFCC in Java is available, +we will write the same applications in Java as +were written in Haskell above. Until then, we will +build another kind of application, which does not require +modification of generated Java code. +</P> +<P> +More information on embedded GF grammars in Java can be found in the document +</P> +<PRE> + www.cs.chalmers.se/~bringert/gf/gf-java.html +</PRE> +<P> +by Björn Bringert. +</P> +<A NAME="toc142"></A> +<H3>Translets</H3> +<P> +A Java system needs many more files than a Haskell system. +To get started, one can fetch the package <CODE>gfc2java</CODE> from +</P> +<PRE> + www.cs.chalmers.se/~bringert/darcs/gfc2java/ +</PRE> +<P> +by using the Darcs version control system as described in the <CODE>gf-java</CODE> home page. +</P> +<P> +The <CODE>gfc2java</CODE> package contains a script <CODE>build-translet</CODE>, which can be applied +to any <CODE>.gfcm</CODE> file to create a <B>translet</B>, a small translation GUI. 
For the <CODE>Food</CODE> +grammars of <a href="#chapthree">the third chapter</a>, we first create a file <CODE>food.gfcm</CODE> by +</P> +<PRE> + % echo "pm | wf food.gfcm" | gf FoodEng.gf FoodIta.gf +</PRE> +<P> +and then run +</P> +<PRE> + % build_translet food.gfcm +</PRE> +<P> +The resulting file <CODE>translate-food.jar</CODE> can be run with +</P> +<PRE> + % java -jar translate-food.jar +</PRE> +<P> +The translet looks like this: +</P> +<P> + <IMG ALIGN="right" SRC="food-translet.png" BORDER="0" ALT=""> +</P> +<A NAME="toc143"></A> +<H3>Dialogue systems</H3> +<P> +A question-answer system is a special case of a <B>dialogue system</B>, where the user and +the computer communicate by writing or, even more properly, by speech. The <CODE>gf-java</CODE> +homepage provides an example of the simplest dialogue system imaginable, where +the conversation has just two rules: +</P> +<UL> +<LI>if the user says <I>here you go</I>, the system says <I>thanks</I> +<LI>if the user says <I>thanks</I>, the system says <I>you are welcome</I> +</UL> + +<P> +The conversation can be carried out in either English or Swedish; the user's choice of language +decides which language the system replies in. Thus the structure is very similar +to the <CODE>math</CODE> program <a href="#secmathprogram">here</a>. The GF and +Java sources of the program can be +found in +</P> +<PRE> + www.cs.chalmers.se/~bringert/darcs/simpledemo +</PRE> +<P> +again accessible with the Darcs version control system. +</P> +<A NAME="toc144"></A> +<H2>Language models for speech recognition</H2> +<P> +The standard way of using GF in speech recognition is by building +<B>grammar-based language models</B>. To this end, GF comes with compilers +into several formats that are used in speech recognition systems. +One such format is GSL, used in the <A HREF="http://www.nuance.com">Nuance speech recognizer</A>. +It is produced from GF simply by printing a grammar with the flag +<CODE>-printer=gsl</CODE>.
The following example uses the smart house grammar defined +<a href="#secsmarthouse">here</a>. +</P> +<PRE> + > import -conversion=finite SmartEng.gf + > print_grammar -printer=gsl + + ;GSL2.0 + ; Nuance speech recognition grammar for SmartEng + ; Generated by GF + + .MAIN SmartEng_2 + + SmartEng_0 [("switch" "off") ("switch" "on")] + SmartEng_1 ["dim" ("switch" "off") + ("switch" "on")] + SmartEng_2 [(SmartEng_0 SmartEng_3) + (SmartEng_1 SmartEng_4)] + SmartEng_3 ("the" SmartEng_5) + SmartEng_4 ("the" SmartEng_6) + SmartEng_5 "fan" + SmartEng_6 "light" +</PRE> +<P> +Other formats available via the <CODE>-printer</CODE> flag include: +</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>Format</TH> +<TH COLSPAN="2">Description</TH> +</TR> +<TR> +<TD><CODE>gsl</CODE></TD> +<TD>Nuance GSL speech recognition grammar</TD> +</TR> +<TR> +<TD><CODE>jsgf</CODE></TD> +<TD>Java Speech Grammar Format (JSGF)</TD> +</TR> +<TR> +<TD><CODE>jsgf_sisr_old</CODE></TD> +<TD>JSGF with semantic tags in SISR WD 20030401 format</TD> +</TR> +<TR> +<TD><CODE>srgs_abnf</CODE></TD> +<TD>SRGS ABNF format</TD> +</TR> +<TR> +<TD><CODE>srgs_xml</CODE></TD> +<TD>SRGS XML format</TD> +</TR> +<TR> +<TD><CODE>srgs_xml_prob</CODE></TD> +<TD>SRGS XML format, with weights</TD> +</TR> +<TR> +<TD><CODE>slf</CODE></TD> +<TD>finite automaton in the HTK SLF format</TD> +</TR> +<TR> +<TD><CODE>slf_sub</CODE></TD> +<TD>finite automaton with sub-automata in HTK SLF</TD> +</TR> +</TABLE> + +<P></P> +<P> +All currently available formats can be seen in gf with <CODE>help -printer</CODE>. +</P> +<A NAME="toc145"></A> +<H2>Dependent types and spoken language models</H2> +<P> +We have used dependent types to control semantic well-formedness +in grammars. This is important in traditional type theory +applications such as proof assistants, where only mathematically +meaningful formulas should be constructed. 
But semantic filtering has +also proved important in speech recognition, because it reduces the +ambiguity of the results. +</P> +<P> +Now, GSL is a context-free format, so how does it cope with dependent types? +In general, dependent types can give rise to infinitely many basic types +(exercise!), whereas a context-free grammar can by definition only have +finitely many nonterminals. +</P> +<P> +This is where the flag <CODE>-conversion=finite</CODE> is needed in the <CODE>import</CODE> +command. Its effect is to convert a GF grammar with dependent types to +one without, so that each instance of a dependent type is replaced by +an atomic type. This can then be used as a nonterminal in a context-free +grammar. The <CODE>finite</CODE> conversion presupposes that every +dependent type has only finitely many instances, which is in fact +the case in the <CODE>Smart</CODE> grammar. +</P> +<P> +<B>Exercise</B>. If you have access to the Nuance speech recognizer, +test it with GF-generated language models for <CODE>SmartEng</CODE>. Do this +both with and without <CODE>-conversion=finite</CODE>. +</P> +<P> +<B>Exercise</B>. Construct an abstract syntax with infinitely many instances +of dependent types. +</P> +<A NAME="toc146"></A> +<H3>Statistical language models</H3> +<P> +An alternative to grammar-based language models is provided by +<B>statistical language models</B> (<B>SLM</B>s). An SLM is +built from a <B>corpus</B>, i.e. a set of utterances. It specifies the +probability of each <B>n-gram</B>, i.e. sequence of <I>n</I> words. The +typical value of <I>n</I> is 2 (bigrams) or 3 (trigrams). +</P> +<P> +One advantage of SLMs over grammar-based models is that they are +<B>robust</B>, i.e. they can be used to recognize sequences that +fall outside the grammar or the corpus. Another advantage is that +an SLM can be built "for free" if a corpus is available.
+</P> +<P> +However, collecting a corpus can require a lot of work, and writing +a grammar can be less demanding, especially with tools such as GF or +Regulus. This advantage of grammars can be combined with robustness +by creating a back-up SLM from a <B>synthesized corpus</B>. This means +simply that the grammar is used for generating such a corpus. +In GF, this can be done with the <CODE>generate_trees</CODE> command. +As with grammar-based models, the quality of the SLM is better +if meaningless utterances are excluded from the corpus. Thus +a good way to generate an SLM from a GF grammar is by using +dependent types and filtering the results through the type checker: +</P> +<PRE> + > generate_trees | put_trees -transform=solve | linearize +</PRE> +<P> +The method of creating statistical language models from corpora synthesized +from GF grammars is applied and evaluated in (Jonson 2006). +</P> +<P> +<B>Exercise</B>. Measure the size of the corpus generated from +<CODE>SmartEng</CODE> (defined <a href="#secsmarthouse">here</a>), with and without type checker filtering. +</P> + +<!-- html code generated by txt2tags 2.3 (http://txt2tags.sf.net) --> +<!-- cmdline: txt2tags -thtml -\-toc gf-tutorial.txt --> +</BODY></HTML> diff --git a/doc/mytree.png b/doc/mytree.png Binary files differnew file mode 100644 index 000000000..fafcc8772 --- /dev/null +++ b/doc/mytree.png |
