diff options
| author | aarne <aarne@cs.chalmers.se> | 2007-12-21 15:10:38 +0000 |
|---|---|---|
| committer | aarne <aarne@cs.chalmers.se> | 2007-12-21 15:10:38 +0000 |
| commit | 5ee1714fd23e974d1cf2511fa398b6ce310a9807 (patch) | |
| tree | 7a82f85d4f4681086430fdefd7903e4a26015c3f /doc/gf-tutorial.html | |
| parent | c5017f28aad7702838b9861aa3f6cbf7b3bacca5 (diff) | |
new tutorial and reference manual
Diffstat (limited to 'doc/gf-tutorial.html')
| -rw-r--r-- | doc/gf-tutorial.html | 7952 |
1 files changed, 7952 insertions, 0 deletions
diff --git a/doc/gf-tutorial.html b/doc/gf-tutorial.html new file mode 100644 index 000000000..1e6d961b8 --- /dev/null +++ b/doc/gf-tutorial.html @@ -0,0 +1,7952 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> +<HTML> +<HEAD> +<META NAME="generator" CONTENT="http://txt2tags.sf.net"> +<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> +<TITLE>Grammatical Framework Tutorial</TITLE> +</HEAD><BODY BGCOLOR="white" TEXT="black"> +<P ALIGN="center"><CENTER><H1>Grammatical Framework Tutorial</H1> +<FONT SIZE="4"> +<I>Aarne Ranta</I><BR> +Draft, November 2007 +</FONT></CENTER> + +<P></P> +<HR NOSHADE SIZE=1> +<P></P> + <UL> + <LI><A HREF="#toc1">Getting started with GF</A> + <UL> + <LI><A HREF="#toc2">What GF is</A> + <LI><A HREF="#toc3">Getting the GF system</A> + <LI><A HREF="#toc4">Running the GF system</A> + <LI><A HREF="#toc5">A "Hello World" grammar</A> + <UL> + <LI><A HREF="#toc6">The program: abstract syntax and concrete syntaxes</A> + <LI><A HREF="#toc7">Using the grammar in the GF system</A> + </UL> + <LI><A HREF="#toc8">Using grammars from outside GF</A> + <LI><A HREF="#toc9">What else can be done with the grammar</A> + <LI><A HREF="#toc10">Summary of GF language features</A> + <UL> + <LI><A HREF="#toc11">Modules</A> + <LI><A HREF="#toc12">Judgements</A> + <LI><A HREF="#toc13">Types and terms</A> + <LI><A HREF="#toc14">Type checking</A> + </UL> + </UL> + <LI><A HREF="#toc15">Designing a grammar for complex phrases</A> + <UL> + <LI><A HREF="#toc16">The abstract syntax Food</A> + <LI><A HREF="#toc17">The concrete syntax FoodEng</A> + <LI><A HREF="#toc18">Commands for testing grammars</A> + <UL> + <LI><A HREF="#toc19">Generating trees and strings</A> + <LI><A HREF="#toc20">More on pipes; tracing</A> + <LI><A HREF="#toc21">Writing and reading files</A> + <LI><A HREF="#toc22">Visualizing trees</A> + <LI><A HREF="#toc23">System commands</A> + </UL> + <LI><A HREF="#toc24">An Italian concrete syntax</A> + <LI><A HREF="#toc25">Free variation</A> + <LI><A HREF="#toc26">More application of multilingual grammars</A> + <UL> + <LI><A HREF="#toc27">Multilingual treebanks</A> + <LI><A HREF="#toc28">Translation session</A> + <LI><A HREF="#toc29">Translation quiz</A> + <LI><A HREF="#toc30">Multilingual syntax editing</A> + </UL> + <LI><A HREF="#toc31">Context-free grammars and GF</A> + <UL> + <LI><A HREF="#toc32">The "cf" grammar format</A> + <LI><A HREF="#toc33">Restrictions of context-free grammars</A> + </UL> + <LI><A HREF="#toc34">Modules and files</A> + <LI><A HREF="#toc35">Using operations and resource modules</A> + <UL> + <LI><A HREF="#toc36">The golden rule of functional programming</A> + <LI><A HREF="#toc37">Operation definitions</A> + <LI><A HREF="#toc38">The ``resource`` module type</A> + <LI><A HREF="#toc39">Opening a resource</A> + <LI><A HREF="#toc40">Partial application</A> + <LI><A HREF="#toc41">Testing resource modules</A> + </UL> + <LI><A HREF="#toc42">Grammar architecture</A> + <UL> + <LI><A HREF="#toc43">Extending a grammar</A> + <LI><A HREF="#toc44">Multiple inheritance</A> + <LI><A HREF="#toc45">Visualizing module structure</A> + </UL> + <LI><A HREF="#toc46">Summary of GF language features</A> + <UL> + <LI><A HREF="#toc47">Modules</A> + <LI><A HREF="#toc48">Judgements</A> + <LI><A HREF="#toc49">Free variation</A> + <LI><A HREF="#toc50">The context-free grammar format</A> + <LI><A HREF="#toc51">Character encoding</A> + </UL> + </UL> + <LI><A HREF="#toc52">Grammars with parameters</A> + <UL> + <LI><A HREF="#toc53">The problem: words have to be inflected</A> + <LI><A HREF="#toc54">Parameters and tables</A> + <LI><A HREF="#toc55">Inflection tables and paradigms</A> + <LI><A HREF="#toc56">Using parameters in concrete syntax</A> + <UL> + <LI><A HREF="#toc57">Agreement</A> + <LI><A HREF="#toc58">Determiners</A> + <LI><A HREF="#toc59">Parametric vs. inherent features</A> + </UL> + <LI><A HREF="#toc60">An English concrete syntax for Foods with parameters</A> + <LI><A HREF="#toc61">More on inflection paradigms</A> + <UL> + <LI><A HREF="#toc62">Worst-case functions</A> + <LI><A HREF="#toc63">Intelligent paradigms</A> + <LI><A HREF="#toc64">Function types with variables</A> + <LI><A HREF="#toc65">Separating operation types and definitions</A> + <LI><A HREF="#toc66">Overloading of operations</A> + <LI><A HREF="#toc67">Morphological analysis and morphology quiz</A> + </UL> + <LI><A HREF="#toc68">The Italian Foods grammar</A> + <LI><A HREF="#toc69">Discontinuous constituents</A> + <LI><A HREF="#toc70">Strings at compile time vs. run time</A> + <LI><A HREF="#toc71">Summary of GF language features</A> + <UL> + <LI><A HREF="#toc72">Parameter and table types</A> + <LI><A HREF="#toc73">Pattern matching</A> + <LI><A HREF="#toc74">Overloading</A> + <LI><A HREF="#toc75">Local definitions</A> + <LI><A HREF="#toc76">Supplementary constructs</A> + </UL> + </UL> + <LI><A HREF="#toc77">Using the resource grammar library</A> + <UL> + <LI><A HREF="#toc78">The coverage of the library</A> + <LI><A HREF="#toc79">The structure of the library</A> + <UL> + <LI><A HREF="#toc80">Lexical vs. phrasal rules</A> + <LI><A HREF="#toc81">Lexical categories</A> + <LI><A HREF="#toc82">Lexical rules</A> + <LI><A HREF="#toc83">Phrasal categories</A> + </UL> + <LI><A HREF="#toc84">The resource API</A> + <LI><A HREF="#toc85">Example: English</A> + <LI><A HREF="#toc86">Functor implementation of multilingual grammars</A> + <LI><A HREF="#toc87">Interfaces and instances</A> + <LI><A HREF="#toc88">Adding languages to a functor implementation</A> + <LI><A HREF="#toc89">Division of labour revisited</A> + <LI><A HREF="#toc90">Restricted inheritance</A> + <LI><A HREF="#toc91">Grammar reuse</A> + <LI><A HREF="#toc92">Browsing the resource with GF commands</A> + <LI><A HREF="#toc93">An extended Foods grammar</A> + <UL> + <LI><A HREF="#toc94">Abstract syntax</A> + <LI><A HREF="#toc95">Linearization types</A> + <LI><A HREF="#toc96">Linearization rules</A> + </UL> + <LI><A HREF="#toc97">Tenses</A> + <LI><A HREF="#toc98">Summary of GF language features</A> + <UL> + <LI><A HREF="#toc99">Interfaces and instances</A> + <LI><A HREF="#toc100">Grammar reuse</A> + <LI><A HREF="#toc101">Functors</A> + <LI><A HREF="#toc102">Restricted inheritance</A> + </UL> + </UL> + <LI><A HREF="#toc103">Refining semantics in abstract syntax</A> + <UL> + <LI><A HREF="#toc104">GF as a logical framework</A> + <LI><A HREF="#toc105">Dependent types</A> + <LI><A HREF="#toc106">Polymorphism</A> + <UL> + <LI><A HREF="#toc107">Digression: dependent types in concrete syntax</A> + </UL> + <LI><A HREF="#toc108">Proof objects</A> + <UL> + <LI><A HREF="#toc109">Proof-carrying documents</A> + </UL> + <LI><A HREF="#toc110">Restricted polymorphism</A> + <LI><A HREF="#toc111">Variable bindings</A> + <LI><A HREF="#toc112">Semantic definitions</A> + <LI><A HREF="#toc113">Summary of GF language features</A> + <UL> + <LI><A HREF="#toc114">Judgements</A> + <LI><A HREF="#toc115">Dependent function types</A> + </UL> + </UL> + <LI><A HREF="#toc116">Grammars of formal languages</A> + <UL> + <LI><A HREF="#toc117">Arithmetic expressions</A> + <UL> + <LI><A HREF="#toc118">Abstract syntax</A> + <LI><A HREF="#toc119">Concrete syntax: a simple approach</A> + </UL> + <LI><A HREF="#toc120">Lexing and unlexing</A> + <LI><A HREF="#toc121">Precedence and fixity</A> + <LI><A HREF="#toc122">Code generation as linearization</A> + <LI><A HREF="#toc123">Speaking aloud arithmetic expressions</A> + <LI><A HREF="#toc124">Programs with variables</A> + <UL> + <LI><A HREF="#toc125">The concrete syntax of assignments</A> + <LI><A HREF="#toc126">A liberal syntax of variables</A> + </UL> + <LI><A HREF="#toc127">Conclusion</A> + <LI><A HREF="#toc128">Summary of GF language constructs</A> + <UL> + <LI><A HREF="#toc129">Lexers and unlexers</A> + <LI><A HREF="#toc130">Built-in abstract syntax types</A> + </UL> + </UL> + <LI><A HREF="#toc131">Embedded grammars</A> + <UL> + <LI><A HREF="#toc132">The portable grammar format</A> + <LI><A HREF="#toc133">The embedded interpreter and its API</A> + <LI><A HREF="#toc134">Embedded GF applications in Haskell</A> + <UL> + <LI><A HREF="#toc135">The EmbedAPI module</A> + <LI><A HREF="#toc136">First application: a translator</A> + <LI><A HREF="#toc137">A looping translator</A> + <LI><A HREF="#toc138">A question-answer system</A> + <LI><A HREF="#toc139">Exporting GF datatypes</A> + <LI><A HREF="#toc140">Putting it all together</A> + </UL> + <LI><A HREF="#toc141">Embedded GF applications in Java</A> + <UL> + <LI><A HREF="#toc142">Translets</A> + <LI><A HREF="#toc143">Dialogue systems</A> + </UL> + <LI><A HREF="#toc144">Language models for speech recognition</A> + <LI><A HREF="#toc145">Dependent types and spoken language models</A> + <UL> + <LI><A HREF="#toc146">Statistical language models</A> + </UL> + </UL> + </UL> + +<P></P> +<HR NOSHADE SIZE=1> +<P></P> +<P> +<h2>Overview</h2> +</P> +<P> +This tutorial gives a hands-on introduction to grammar writing in GF. +It has been written for all programmers +who want to learn to write grammars in GF. +It will go through the programming concepts of GF, and also +explain, without presupposing them, the main ingredients of GF: +linguistics, functional programming, and type theory. +This knowledge will be introduced as a part of grammar writing +practice. +Thus the tutorial should be accessible to anyone who has some +previous experience from any programming language; the basics +of using computers are also presupposed, e.g. the use of +text editors and the management of files. +</P> +<P> +We start in <a href="#chaptwo">the second chapter</a> +by building a "Hello World" grammar, which covers greetings +in three languages: English (<I>hello world</I>), +Finnish (<I>terve maailma</I>), and Italian (<I>ciao mondo</I>). +This <B>multilingual grammar</B> is based on the most central idea of GF: +the distinction between <B>abstract syntax</B> +(the logical structure) and <B>concrete syntax</B> (the +sequence of words). +</P> +<P> +From the "Hello World" example, we proceed +in <a href="#chapthree">the third chapter</a> +to a larger grammar for the domain of food. +In this grammar, you can say things like +<center> +<I>this Italian cheese is delicious</I> +</center> +in English and Italian. This grammar illustrates how translation is +more than just replacement of words. For instance, the order of +words may have to be changed: +<center> +<I>Italian cheese</I> +</P> +<P> +<I>formaggio italiano</I> +</center> +Moreover, words can have different forms, and which forms +they have vary from language to language. For instance, +Italian adjectives usually have four forms where English +has just one: +<center> +<I>delicious</I> (<I>wine, wines, pizza, pizzas</I>) +</P> +<P> +<I>vino delizioso, vini deliziosi, pizza deliziosa, pizze deliziose</I> +</center> +The <B>morphology</B> of a language describes the +forms of its words, and the basics of implementing morphology and +integrating it with syntax are covered in <a href="#chaptwo">the fourth chapter</a>. +</P> +<P> +The complete description of morphology and syntax in natural +languages is in GF preferably left to the <B>resource grammar library</B>. +Its use is therefore an important part of GF programming, and +it is covered in <a href="#chapfive">the fifth chapter</a>. How to contribute to resource +grammars as an author will only be covered in Part III; +however, the tutorial does go through all the +programming concepts of GF, including those involved in +resource grammars. +</P> +<P> +In addition to multilinguality, <B>semantics</B> is an important aspect of GF +grammars. The "purely linguistic" aspects (morphology and syntax) belong to +the concrete syntax part of GF, whereas semantics is expressed in the abstract +syntax. After the presentation of concrete syntax constructs, we proceed +in <a href="#chapsix">the sixth chapter</a> to the enrichment of abstract syntax with <B>dependent types</B>, +<B>variable bindings</B>, and <B>semantic definitions</B>. +<a href="#chapseven">the seventh chapter</a> concludes the tutorial by technical tips for implementing formal +languages. It will also illustrate the close relation between GF grammars +and compilers by actually implementing a small compiler from C-like statements +and expressions to machine code similar to Java Virtual Machine. +</P> +<P> +English and Italian are used as example languages in many grammars. +Of course, we will not presuppose that the reader knows any Italian. +We have chosen Italian because it has a rich structure +that illustrates very well the capacities of GF. +Moreover, even those readers who don't know Italian, will find many of +its words familiar, due to the Latin heritage. +The exercises will encourage the reader to +port the examples to other languages as well; in particular, +it should be instructive for the reader to look at her +own native language from the point of view of writing a grammar +implementation. +</P> +<P> +To learn how to write GF grammars is not the only goal of +this tutorial. We will also explain the most important +commands of the GF system, mostly in passing. With these commands, +simple application programs such as translation and +quiz systems, can be built simply by writing scripts for the +GF system. More complicated applications, such as natural-language +interfaces and dialogue systems, moreover require programming in +some general-purpose language; such applications are covered in <a href="#chapeight">the eighth chapter</a>. +</P> +<A NAME="toc1"></A> +<H1>Getting started with GF</H1> +<P> +<a name="chaptwo"></a> +</P> +<P> +In this chapter, we will introduce the GF system and write the first GF grammar, +a "Hello World" grammar. While extremely small, this grammar already illustrates +how GF can be used for the tasks of translation and multilingual +generation. +</P> +<A NAME="toc2"></A> +<H2>What GF is</H2> +<P> +We use the term GF for three different things: +</P> +<UL> +<LI>a <B>system</B> (computer program) used for working with grammars +<LI>a <B>programming language</B> in which grammars can be written +<LI>a <B>theory</B> about grammars and languages +</UL> + +<P> +The relation between these things is obvious: the GF system is an implementation +of the GF programming language, which in turn is built on the ideas of the +GF theory. The main focus of this book is on the GF programming language. +We learn how grammars are written in this language. At the same time, we learn +the way of thinking in the GF theory. To make this all useful and fun, and +to encourage experimenting, we make the grammars run on a computer by +using the GF system. +</P> +<P> +A GF program is called a <B>grammar</B>. A grammar is, traditionally, a +definition of a language. From this definition, different language +processing components can be derived: +</P> +<UL> +<LI><B>parsing</B>: to analyse the language +<LI><B>linearization</B>: to generate the language +<LI><B>translation</B>: to analyse one language and generate another +</UL> + +<P> +A GF grammar is thus a declarative program from which these +procedures can be automatically derived. In general, a GF grammar +is <B>multilingual</B>: it defines many languages and translations between them. +</P> +<A NAME="toc3"></A> +<H2>Getting the GF system</H2> +<P> +The GF system is open-source free software, which can be downloaded via the +GF Homepage: +<center> +<CODE>gf.digitalgrammars.com</CODE> +</center> +There you can download +</P> +<UL> +<LI>binaries for Linux, Mac OS X, and Windows +<LI>source code and documentation +<LI>grammar libraries and examples +</UL> + +<P> +In particular, many of the examples in this book are included in the +subdirectory <CODE>examples/tutorial</CODE> of the source distribution package. +This directory is also available +<A HREF="http://digitalgrammars.com/gf/examples/tutorial">online</A>. +</P> +<P> +If you want to compile GF from source, you need a Haskell compiler. +To compile the interactive editor, you also need a Java compilers. +But normally you don't have to compile anything yourself, and you definitely +don't need to know Haskell or Java to use GF. +</P> +<P> +We are assuming the availability of a Unix shell. Linux and Mac OS X users +have it automatically, the latter under the name "terminal". +Windows users are recommended to install Cywgin, the free Unix shell for Windows. +</P> +<A NAME="toc4"></A> +<H2>Running the GF system</H2> +<P> +To start the GF system, assuming you have installed it, just type +<CODE>gf</CODE> in the Unix (or Cygwin) shell: +</P> +<PRE> + % gf +</PRE> +<P> +You will see GF's welcome message and the prompt <CODE>></CODE>. +The command +</P> +<PRE> + > help +</PRE> +<P> +will give you a list of available commands. +</P> +<P> +As a common convention in this book, we will use +</P> +<UL> +<LI><CODE>%</CODE> as a prompt that marks system commands +<LI><CODE>></CODE> as a prompt that marks GF commands +</UL> + +<P> +Thus you should not type these prompts, but only the characters that +follow them. +</P> +<A NAME="toc5"></A> +<H2>A "Hello World" grammar</H2> +<P> +The tradition in programming language tutorials is to start with a +program that prints "Hello World" on the terminal. GF should be no +exception. But our program has features that distinguish it from +most "Hello World" programs: +</P> +<UL> +<LI><B>Multilinguality</B>: the message is printed in many languages. +<LI><B>Reversibility</B>: in addition to printing, you can <B>parse</B> the + message and <B>translate</B> it to other languages. +</UL> + +<A NAME="toc6"></A> +<H3>The program: abstract syntax and concrete syntaxes</H3> +<P> +A GF program, in general, is a <B>multilingual grammar</B>. Its main parts +are +</P> +<UL> +<LI>an <B>abstract syntax</B> +<LI>one or more <B>concrete syntaxes</B> +</UL> + +<P> +The abstract syntax defines, in a language-independent way, what <B>meanings</B> +can be expressed in the grammar. In the "Hello World" grammar we want +to express <I>Greetings</I>, where we greet a <I>Recipient</I>, which can be +<I>World</I> or <I>Mum</I> or <I>Friends</I>. Here is the entire +GF code for the abstract syntax: +</P> +<PRE> + -- a "Hello World" grammar + abstract Hello = { + + flags startcat = Greeting ; + + cat Greeting ; Recipient ; + + fun + Hello : Recipient -> Greeting ; + World, Mum, Friends : Recipient ; + } +</PRE> +<P> +The code has the following parts: +</P> +<UL> +<LI>a <B>comment</B> (optional), saying what the module is doing +<LI>a <B>module header</B> indicating that it is an abstract syntax + module named <CODE>Hello</CODE> +<LI>a <B>module body</B> in braces, consisting of + <UL> + <LI>a <B>startcat flag declaration</B> stating that <CODE>Greeting</CODE> is the + main category, i.e. the one in which parsing and generation are + performed by default + <LI><B>category declarations</B> stating that <CODE>Greeting</CODE> and <CODE>Recipient</CODE> + are categories, i.e. types of meanings + <LI><B>function declarations</B> stating what meaning-building functions there + are; these are the function <CODE>Hello</CODE> constructing a greeting from a recipient, + as well as the three possible recipients + </UL> +</UL> + +<P> +A concrete syntax defines a mapping from the abstract meanings to their +expressions in a language. We first give an English concrete syntax: +</P> +<PRE> + concrete HelloEng of Hello = { + + lincat Greeting, Recipient = {s : Str} ; + + lin + Hello recip = {s = "hello" ++ recip.s} ; + World = {s = "world"} ; + Mum = {s = "mum"} ; + Friends = {s = "friends"} ; + } +</PRE> +<P> +The major parts of this code are: +</P> +<UL> +<LI>a module header indicating that it is a concrete syntax of the abstract syntax + <CODE>Hello</CODE>, itself named <CODE>HelloEng</CODE> +<LI>a module body in curly brackets, consisting of + <UL> + <LI><B>linearization type definitions</B> stating that + <CODE>Greeting</CODE> and <CODE>Recipient</CODE> are <B>records</B> with a <B>string</B> <CODE>s</CODE> + <LI><B>linearization definitions</B> telling what records are assigned to + each of the meanings defined in the abstract syntax; the recipients are + linearized to records containing single words, whereas the <CODE>Hello</CODE> greeting + has a function telling that the word <CODE>hello</CODE> is prefixed to the string + <CODE>s</CODE> contained in the record <CODE>recip</CODE> + </UL> +</UL> + +<P> +To make the grammar truly multilingual, we add a Finnish and an Italian concrete +syntax: +</P> +<PRE> + concrete HelloFin of Hello = { + lincat Greeting, Recipient = {s : Str} ; + lin + Hello recip = {s = "terve" ++ recip.s} ; + World = {s = "maailma"} ; + Mum = {s = "äiti"} ; + Friends = {s = "ystävät"} ; + } + + concrete HelloIta of Hello = { + lincat Greeting, Recipient = {s : Str} ; + lin + Hello recip = {s = "ciao" ++ recip.s} ; + World = {s = "mondo"} ; + Mum = {s = "mamma"} ; + Friends = {s = "amici"} ; + } +</PRE> +<P> +Now we have a trilingual grammar usable for translation and +many other tasks, which we will now start experimenting with. +</P> +<A NAME="toc7"></A> +<H3>Using the grammar in the GF system</H3> +<P> +In order to compile the grammar in GF, each of the four modules +has to be put into a file named <I>Modulename</I><CODE>.gf</CODE>: +</P> +<PRE> + Hello.gf HelloEng.gf HelloFin.gf HelloIta.gf +</PRE> +<P> +The first GF command needed when using a grammar is to <B>import</B> it. +The command has a long name, <CODE>import</CODE>, and a short name, <CODE>i</CODE>. +When you have started GF (by the shell command <CODE>gf</CODE>), you can thus type either +</P> +<PRE> + > import HelloEng.gf +</PRE> +<P> +or +</P> +<PRE> + > i HelloEng.gf +</PRE> +<P> +to get the same effect. In general, all GF commands have a long and a short name; +short names are convenient when typing commands by hand, whereas long command +names are more readable in scripts, i.e. files that include sequences of commands. +</P> +<P> +The effect of <CODE>import</CODE> is that the GF system <B>compiles</B> your grammar +into an internal representation, and shows a new prompt when it is ready. +It will also show how much CPU time was consumed: +</P> +<PRE> + > i HelloEng.gf + - compiling Hello.gf... wrote file Hello.gfc 8 msec + - compiling HelloEng.gf... wrote file HelloEng.gfc 12 msec + + 12 msec + > +</PRE> +<P> +You can now use GF for <B>parsing</B>: +</P> +<PRE> + > parse "hello world" + Hello World +</PRE> +<P> +The <CODE>parse</CODE> (= <CODE>p</CODE>) command takes a <B>string</B> +(in double quotes) and returns an <B>abstract syntax tree</B> --- the meaning +of the string as defined in the abstract syntax. +A tree is, in general, something easier than a string +for a machine to understand and to process further, although this +is not so obvious in this simple grammar. The syntax for trees is that +of <B>function application</B>, which in GF is written +</P> +<PRE> + function argument1 ... argumentn +</PRE> +<P> +Parentheses are only needed for grouping. For instance, <CODE>f (a b)</CODE> is +<CODE>f</CODE> applied to the application of <CODE>a</CODE> to <CODE>b</CODE>. This is different +from <CODE>f a b</CODE>, which is <CODE>f</CODE> applied to <CODE>a</CODE> and <CODE>b</CODE>. +</P> +<P> +Strings that return a tree when parsed do so in virtue of the grammar +you imported. Try to parse something that is not in grammar, and you will fail +</P> +<PRE> + > parse "hello dad" + Unknown words: dad + + > parse "world hello" + no tree found +</PRE> +<P> +In the first example, the failure is caused by an unknown word. +In the second example, the combination of words is ungrammatical. +</P> +<P> +In addition to parsing, you can also use GF for <B>linearization</B> +(<CODE>linearize = l</CODE>). This is the inverse of +parsing, taking trees into strings: +</P> +<PRE> + > linearize Hello World + hello world +</PRE> +<P> +What is the use of this? Typically not that you type in a tree at +the GF prompt. The utility of linearization comes from the fact that +you can obtain a tree from somewhere else --- for instance, from +a parser. A prime example of this is <B>translation</B>: you parse +with one concrete syntax and linearize with another. Let us +now do this by first importing the Italian grammar: +</P> +<PRE> + > import HelloIta.gf +</PRE> +<P> +We can now parse with <CODE>HelloEng</CODE> and <B>pipe</B> the result +into linearizing with <CODE>HelloIta</CODE>: +</P> +<PRE> + > parse -lang=HelloEng "hello mum" | linearize -lang=HelloIta + ciao mamma +</PRE> +<P> +Notice that, since there are now two concrete syntaxes read into the +system, the commands use a <B>language flag</B> to indicate +which concrete syntax is used in each operation. If no language flag is +given, the last-imported language is applied. +</P> +<P> +To conclude the translation exercise, we import the Finnish grammar +and pipe English parsing into <B>multilingual generation</B>: +</P> +<PRE> + > parse -lang=HelloEng "hello friends" | linearize -multi + terve ystävät + ciao amici + hello friends +</PRE> +<P></P> +<P> +<B>Exercise</B>. Test the parsing and translation examples shown above, as well as +some other examples, in different combinations of languages. +</P> +<P> +<B>Exercise</B>. Extend the grammar <CODE>Hello.gf</CODE> and some of the +concrete syntaxes by five new recipients and one new greeting +form. +</P> +<P> +<B>Exercise</B>. Add a concrete syntax for some other +languages you might know. +</P> +<P> +<B>Exercise</B>. Add a pair of greetings that are expressed in one and the same way in +one language and in two different ways in another. For instance, <I>good morning</I> +and <I>good afternoon</I> in English are both expressed as <I>buongiorno</I> in Italian. +Test what happens when you translate <I>buongiorno</I> to English in GF. +</P> +<P> +<B>Exercise</B>. Inject errors in the <CODE>Hello</CODE> grammars, for example, leave out +some line, omit a variable in a <CODE>lin</CODE> rule, or change the name in one occurrence +of a variable. Inspect the error messages generated by GF. +</P> +<A NAME="toc8"></A> +<H2>Using grammars from outside GF</H2> +<P> +A normal "hello world" program written in C is executable from the +Unix shell and print its output on the terminal. This is possible in GF +as well, by using the <CODE>gf</CODE> program in a Unix pipe. Invoking <CODE>gf</CODE> +can be made with grammar names as arguments, +</P> +<PRE> + % gf HelloEng.gf HelloFin.gf HelloIta.gf +</PRE> +<P> +which has the same effect as opening <CODE>gf</CODE> and then importing the +grammars. A command can be send to this <CODE>gf</CODE> state by piping it from +Unix's <CODE>echo</CODE> command: +</P> +<PRE> + % echo "l -multi Hello Wordl" | gf HelloEng.gf HelloFin.gf HelloIta.gf +</PRE> +<P> +which will execute the command and then quit. Alternatively, one can write +a <B>script</B>, a file containing the lines +</P> +<PRE> + import HelloEng.gf + import HelloFin.gf + import HelloIta.gf + linearize -multi Hello World +</PRE> +<P> +If we name this script <CODE>hello.gfs</CODE>, we can do +</P> +<PRE> + $ gf -batch -s <hello.gfs s + + ciao mondo + terve maailma + hello world +</PRE> +<P> +The options <CODE>-batch</CODE> and <CODE>-s</CODE> ("silent") remove prompts, CPU time, +and other messages. Writing GF scripts and Unix shell scripts that call +GF is the simplest way to build application programs that use GF grammars. +In <a href="#chapeight">the eighth chapter</a>, we will see how to build stand-alone programs that don't need +the GF system to run. +</P> +<P> +<B>Exercise</B>. (For Unix hackers.) Write a GF application that reads +an English string from the standard input and writes an Italian +translation to the output. +</P> +<A NAME="toc9"></A> +<H2>What else can be done with the grammar</H2> +<P> +Now we have built our first multilingual grammar and seen the basic +functionalities of GF: parsing and linearization. We have tested +these functionalities inside the GF program. In the forthcoming +chapters, we will build larger grammars and can then get more out of +these functionalities. But we will also introduce new ones: +</P> +<UL> +<LI><B>morphological analysis</B>: find out the possible inflection forms of words +<LI><B>morphological synthesis</B>: generate all inflection forms of words +<LI><B>random generation</B>: generate random expressions +<LI><B>corpus generation</B>: generate all expressions +<LI><B>treebank generation</B>: generate a list of trees with their linearizations +<LI><B>teaching quizzes</B>: train morphology and translation +<LI><B>multilingual authoring</B>: create a document in many languages simultaneously +<LI><B>speech input</B>: optimize a speech recognition system for a grammar +</UL> + +<P> +The usefulness of GF would be quite limited if grammars were +usable only inside the GF system. In <a href="#chapeight">the eighth chapter</a>, +we will see other ways of using grammars: +</P> +<UL> +<LI>compile them to new formats, such as speech recognition grammars +<LI>embed them in Java and Haskell programs +<LI>build applications using compilation and embedding: + <UL> + <LI>voice commands + <LI>spoken language translators + <LI>dialogue systems + <LI>user interfaces + <LI>localization: parametrize the messages printed by a program + to support different languages + </UL> +</UL> + +<P> +All GF functionalities, both those inside the GF program and those +ported to other environments, +are of course already applicable to the simplest of grammars, +such as the <CODE>Hello</CODE> grammars presented above. But the main focus +of this tutorial will be on grammar writing. Thus we will show +how larger and more expressive grammars can be built by using +the constructs of the GF programming language, before entering the +applications. +</P> +<A NAME="toc10"></A> +<H2>Summary of GF language features</H2> +<P> +As the last section of each chapter, we will give a summary of the GF language +features covered in the chapter. The presentation is rather technical and intended +as a reference for later use, rather than to be read at once. The summaries +may cover some new features, which complement the discussion in the main chapter. +</P> +<A NAME="toc11"></A> +<H3>Modules</H3> +<P> +A GF grammar consists of <B>modules</B>, +into which judgements are grouped. The most important +module forms are +</P> +<UL> +<LI><CODE>abstract</CODE> A <CODE>= {...}</CODE> , abstract syntax A with judgements in + the <B>module body</B> <CODE>{...}</CODE>. +<LI><CODE>concrete</CODE> C <CODE>of</CODE> A <CODE>= {...}</CODE>, concrete syntax C of the + abstract syntax A, with judgements in the module body <CODE>{...}</CODE>. +</UL> + +<P> +Each module is written in a file named <I>Modulename</I><CODE>.gf</CODE>. +</P> +<A NAME="toc12"></A> +<H3>Judgements</H3> +<P> +<a name="secjment"></a> +</P> +<P> +Rules in a module body are called <B>judgements</B>. Keywords such as +<CODE>fun</CODE> and <CODE>lin</CODE> are used for distinguishing between +<B>judgement forms</B>. Here is a summary of the most important +judgement forms, which we have considered by now: +</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>form</TH> +<TH>reading</TH> +<TH COLSPAN="2">module type</TH> +</TR> +<TR> +<TD><CODE>cat</CODE> <I>C</I></TD> +<TD><I>C</I> is a category</TD> +<TD>abstract</TD> +</TR> +<TR> +<TD><CODE>fun</CODE> <I>f</I> <CODE>:</CODE> <I>A</I></TD> +<TD><I>f</I> is a function of type <I>A</I></TD> +<TD>abstract</TD> +</TR> +<TR> +<TD><CODE>lincat</CODE> <I>C</I> <CODE>=</CODE> <I>T</I></TD> +<TD>category <I>C</I> has linearization type <I>T</I></TD> +<TD>concrete</TD> +</TR> +<TR> +<TD><CODE>lin</CODE> <I>f <i>x</i><sub>1</sub> ... <i>x</i><sub>n</sub></I> <CODE>=</CODE> <I>t</I></TD> +<TD>function <I>f</I> has linearization <I>t</I></TD> +<TD>concrete</TD> +</TR> +<TR> +<TD><CODE>flags</CODE> <I>p</I> <CODE>=</CODE> <I>v</I></TD> +<TD>flag <I>p</I> has value <I>v</I></TD> +<TD>any</TD> +</TR> +</TABLE> + +<P></P> +<P> +Both abstract and concrete modules may moreover contain <B>comments</B> of the forms +</P> +<UL> +<LI><CODE>--</CODE> <I>anything until a newline</I> +<LI><CODE>{-</CODE> <I>anything except hyphen followed by closing brace</I> <CODE>-}</CODE> +</UL> + +<P> +Judgements are terminated by semicolons. Shorthands permit the sharing of +the keyword in subsequent judgements, +</P> +<PRE> + cat C ; D ; === cat C ; cat D ; +</PRE> +<P> +and of the right-hand-side in subsequent judgements of the same form +</P> +<PRE> + fun f, g : A ; === fun f : A ; g : A ; +</PRE> +<P> +We will use the symbol <CODE>===</CODE> to indicate <B>syntactic sugar</B> when +speaking about GF. Thus it is not a symbol of the GF language. +</P> +<P> +Each judgement declares a <B>name</B>, which is an <B>identifier</B>. +An identifier is a letter followed by a sequence of letters, digits, and +characters <CODE>'</CODE> or <CODE>_</CODE>. Each identifier can only be +defined once in the same module (that is, as next to the judgement keyword; +local variables such as those in <CODE>lin</CODE> judgemenrs can be +reused in other judgements). +</P> +<P> +Names are in <B>scope</B> in the rest of the module, i.e. usable in the other +judgements of the module (subject to type restrictions, of course). Also +the name of the module is an identifier in scope. +</P> +<P> +The order of judgements in a module is free. In particular, an identifier +need not be declared before it is used. +</P> +<A NAME="toc13"></A> +<H3>Types and terms</H3> +<P> +A <B>type</B> in an abstract syntax are either a <B>basic type</B>, +i.e. one introduced in a <CODE>cat</CODE> judgement, or a +<B>function type</B> of the form +</P> +<PRE> + A1 -> ... -> An -> A +</PRE> +<P> +where each of <CODE>A1, ..., An, A</CODE> is a basic type. +The last type in the arrow-separated sequence +is the <B>value type</B> of the function type, and the earlier types are +its <B>argument types</B>. +</P> +<P> +In a concrete syntax, the available types include +</P> +<UL> +<LI>the type of <B>token lists</B>, <CODE>Str</CODE> +<LI><B>record types</B> of form <CODE>{</CODE> r1 : T1 ; ... ; rn : Tn <CODE>}</CODE> +</UL> + +<P> +Token lists are often briefly called <B>strings</B>. +</P> +<P> +Each semi-colon separated part in a record type is called a +<B>field</B>. The identifier introduced by the left-hand-side of a field +is called a <B>label</B>. +</P> +<P> +A <B>term</B> in abstract syntax is a <B>function application</B> of form +</P> +<PRE> + f a1 ... an +</PRE> +<P> +where <CODE>f</CODE> is a function declared in a <CODE>fun</CODE> judgement and <CODE>a1 ... an</CODE> +are terms. These terms are also called <B>abstract syntax trees</B>, or just +<B>trees</B>. +The tree above is well-typed and has the type A, if +</P> +<PRE> + f : A1 -> ... -> An -> A +</PRE> +<P> +and each <CODE>ai</CODE> has type <CODE>an</CODE>. +</P> +<P> +A term used in concrete syntax has one the forms +</P> +<UL> +<LI>quoted string: <CODE>"foo"</CODE>, of type <CODE>Str</CODE> +<LI>concatenation of strings: <CODE>"foo" ++ "bar"</CODE>, +<LI>record: <CODE>{</CODE> r1 = t1 ; ... ; rn = Tn <CODE>}</CODE>, + of type <CODE>{</CODE> r1 : R1 ; ... ; rn : Rn <CODE>}</CODE> +<LI>projection <CODE>t.r</CODE> of a term <CODE>t</CODE> that has a record type, + with the record label <CODE>r</CODE>; the projection has the corresponding record + field type +<LI>argument variable <CODE>x</CODE> bound by the left-hand-side of a <CODE>lin</CODE> rule, + of the corresponding linearization type +</UL> + +<P> +Each quoted string is treated as one <B>token</B>, and strings concatenated by +<CODE>++</CODE> are treated as separate tokens. Tokens are, by default, written with +a space in between. This behaviour can be changed by <CODE>lexer</CODE> and <CODE>unlexer</CODE> +flags, as will be explained later "Rseclexing. Therefore it is usually +not correct to have a space in a token. Writing +</P> +<PRE> + "hello world" +</PRE> +<P> +in a grammar would give the parser the task to find a token with a space +in it, rather than two tokens <CODE>"hello"</CODE> and <CODE>"world"</CODE>. If the latter is +what is meant, it is possible to use the shorthand +</P> +<PRE> + ["hello world"] === "hello" ++ "world" +</PRE> +<P> +The <B>empty string</B> is denoted by <CODE>[]</CODE> or, equivalently, <CODE>`` or ``[]</CODE>. +</P> +<A NAME="toc14"></A> +<H3>Type checking</H3> +<P> +An important functionality of the GF system is <B>static type checking</B>. +This means that the grammars are controlled to be well-formed, so that all +run-time errors are eliminated. The main type checking principles are the +following: +</P> +<UL> +<LI>a concrete syntax must define the <CODE>lincat</CODE> of each <CODE>cat</CODE> and a <CODE>lin</CODE> + for each <CODE>fun</CODE> in the abstract syntax that it is "<CODE>of</CODE>" +<LI><CODE>lin</CODE> rules are type checked with respect to the <CODE>lincat</CODE> and <CODE>fun</CODE> + rules +<LI>terms have types as defined in the previous section +</UL> + +<A NAME="toc15"></A> +<H1>Designing a grammar for complex phrases</H1> +<P> +<a name="chapthree"></a> +</P> +<P> +In this chapter, we will write a grammar that has much more structure than +the <CODE>Hello</CODE> grammar. We will look at how the abstract syntax +is divided into suitable categories, and how infinitely many +phrases can be generated by using recursive rules. We will also +introduce modularity by showing how a grammar can be +divided into modules, and how functional programming +can be used to share code in and among modules. +</P> +<A NAME="toc16"></A> +<H2>The abstract syntax Food</H2> +<P> +We will write a grammar that +defines a set of phrases usable for speaking about food: +</P> +<UL> +<LI>the start category is <CODE>Phrase</CODE> +<LI>a <CODE>Phrase</CODE> can be built by assigning a <CODE>Quality</CODE> to an <CODE>Item</CODE> + (e.g. <I>this cheese is Italian</I>) +<LI>an<CODE>Item</CODE> are build from a <CODE>Kind</CODE> by prefixing <I>this</I> or <I>that</I> + (e.g. <I>this wine</I>) +<LI>a <CODE>Kind</CODE> is either <B>atomic</B> (e.g. <I>cheese</I>), or formed + qualifying a given <CODE>Kind</CODE> with a <CODE>Quality</CODE> (e.g. <I>Italian cheese</I>) +<LI>a <CODE>Quality</CODE> is either atomic (e.g. <I>Italian</I>, + or built by modifying a given <CODE>Quality</CODE> with the word <I>very</I> (e.g. <I>very warm</I>) +</UL> + +<P> +These verbal descriptions can be expressed as the following abstract syntax: +</P> +<PRE> + abstract Food = { + + flags startcat = Phrase ; + + cat + Phrase ; Item ; Kind ; Quality ; + + fun + Is : Item -> Quality -> Phrase ; + This, That : Kind -> Item ; + QKind : Quality -> Kind -> Kind ; + Wine, Cheese, Fish : Kind ; + Very : Quality -> Quality ; + Fresh, Warm, Italian, Expensive, Delicious, Boring : Quality ; + } +</PRE> +<P> +In this abstract syntax, we can build <CODE>Phrase</CODE>s such as +</P> +<PRE> + Is (This (QKind Delicious (QKind Italian Wine))) (Very (Very Expensive)) +</PRE> +<P> +In the English concrete syntax, we will want to linearize this into +</P> +<PRE> + this delicious Italian wine is very very expensive +</PRE> +<P></P> +<A NAME="toc17"></A> +<H2>The concrete syntax FoodEng</H2> +<P> +The English concrete syntax gives no surprises: +</P> +<PRE> + concrete FoodEng of Food = { + + lincat + Phrase, Item, Kind, Quality = {s : Str} ; + + lin + Is item quality = {s = item.s ++ "is" ++ quality.s} ; + This kind = {s = "this" ++ kind.s} ; + That kind = {s = "that" ++ kind.s} ; + QKind quality kind = {s = quality.s ++ kind.s} ; + Wine = {s = "wine"} ; + Cheese = {s = "cheese"} ; + Fish = {s = "fish"} ; + Very quality = {s = "very" ++ quality.s} ; + Fresh = {s = "fresh"} ; + Warm = {s = "warm"} ; + Italian = {s = "Italian"} ; + Expensive = {s = "expensive"} ; + Delicious = {s = "delicious"} ; + Boring = {s = "boring"} ; + } +</PRE> +<P> +Let us test how the grammar works in parsing: +</P> +<PRE> + > import FoodEng.gf + > parse "this delicious wine is very very Italian" + Is (This (QKind Delicious Wine)) (Very (Very Italian)) +</PRE> +<P> +We can also try parsing in other categories than the <CODE>startcat</CODE>, +by setting the command-line <CODE>cat</CODE> flag: +</P> +<PRE> + p -cat=Kind "very Italian wine" + QKind (Very Italian) Wine +</PRE> +<P></P> +<P> +<B>Exercise</B>. Extend the <CODE>Food</CODE> grammar by ten new food kinds and +qualities, and run the parser with new kinds of examples. +</P> +<P> +<B>Exercise</B>. Add a rule that enables question phrases of the form +<I>is this cheese Italian</I>. +</P> +<P> +<B>Exercise</B>. Enable the optional prefixing of +phrases with the words "excuse me but". Do this in such a way that +the prefix can occur at most once. +</P> +<A NAME="toc18"></A> +<H2>Commands for testing grammars</H2> +<A NAME="toc19"></A> +<H3>Generating trees and strings</H3> +<P> +When we have a grammar above a trivial size, especially a recursive +one, we need more efficient ways of testing it than just by parsing +sentences that happen to come to our minds. One way to do this is +based on automatic generation, which can be either +<B>random generation</B> or <B>exhaustive generation</B>. +</P> +<P> +Random generation (<CODE>generate_random = gr</CODE>) is an operation that +builds a random tree in accordance with an abstract syntax: +</P> +<PRE> + > generate_random + Is (This (QKind Italian Fish)) Fresh +</PRE> +<P> +By using a pipe, random generation can be fed into linearization: +</P> +<PRE> + > generate_random | linearize + this Italian fish is fresh +</PRE> +<P> +Random generation is a good way to test a grammar. It can also give results +that are surprising, which shows how fast we lose intuition +when we write complex grammars. +</P> +<P> +By using the <CODE>number</CODE> flag, several trees can be generated +in one command: +</P> +<PRE> + > gr -number=10 | l + that wine is boring + that fresh cheese is fresh + that cheese is very boring + this cheese is Italian + that expensive cheese is expensive + that fish is fresh + that wine is very Italian + this wine is Italian + this cheese is boring + this fish is boring +</PRE> +<P> +To generate <I>all</I> phrases that a grammar can produce, +GF provides the command <CODE>generate_trees = gt</CODE>. +</P> +<PRE> + > generate_trees | l + that cheese is very Italian + that cheese is very boring + that cheese is very delicious + that cheese is very expensive + that cheese is very fresh + ... + this wine is expensive + this wine is fresh + this wine is warm + +</PRE> +<P> +We get quite a few trees but not all of them: only up to a given +<B>depth</B> of trees. The default depth is 3; the depth can be +set by using the <CODE>depth</CODE> flag: +</P> +<PRE> + > generate_trees -depth=5 | l +</PRE> +<P> +Other options to the generation commands (like all commands) can be seen +by GF's <CODE>help = h</CODE> command: +</P> +<PRE> + > help gr + > help gt +</PRE> +<P></P> +<P> +<B>Exercise</B>. If the command <CODE>gt</CODE> generated all +trees in your grammar, it would never terminate. Why? +</P> +<P> +<B>Exercise</B>. Measure how many trees the grammar gives with depths 4 and 5, +respectively. <B>Hint</B>. You can +use the Unix <B>word count</B> command <CODE>wc</CODE> to count lines. +</P> +<A NAME="toc20"></A> +<H3>More on pipes; tracing</H3> +<P> +A pipe of GF commands can have any length, but the "output type" +(either string or tree) of one command must always match the "input type" +of the next command, in order for the result to make sense. +</P> +<P> +The intermediate results in a pipe can be observed by putting the +<B>tracing</B> option <CODE>-tr</CODE> to each command whose output you +want to see: +</P> +<PRE> + > gr -tr | l -tr | p + + Is (This Cheese) Boring + this cheese is boring + Is (This Cheese) Boring +</PRE> +<P> +This facility is useful for test purposes: the pipe above can show +if a grammar is <B>ambiguous</B>, i.e. +contains strings that can be parsed in more than one way. +</P> +<P> +<B>Exercise</B>. Extend the <CODE>Food</CODE> grammar so that it produces ambiguous +strings, and try out the ambiguity test. +</P> +<A NAME="toc21"></A> +<H3>Writing and reading files</H3> +<P> +To save the outputs of GF commands into a file, you can +pipe it to the <CODE>write_file = wf</CODE> command, +</P> +<PRE> + > gr -number=10 | linearize | write_file exx.tmp +</PRE> +<P> +You can read the file back to GF with the +<CODE>read_file = rf</CODE> command, +</P> +<PRE> + > read_file exx.tmp | parse -lines +</PRE> +<P> +Notice the flag <CODE>-lines</CODE> given to the parsing +command. This flag tells GF to parse each line of +the file separately. Without the flag, the grammar could +not recognize the string in the file, because it is not +a sentence but a sequence of ten sentences. +</P> +<P> +Files with examples can be used for <B>regression testing</B> +of grammars. The most systematic way to do this is by +generating treebanks; see <a href="#sectreebank">here</a>. +</P> +<A NAME="toc22"></A> +<H3>Visualizing trees</H3> +<P> +The gibberish code with parentheses returned by the parser does not +look like trees. Why is it called so? From the abstract mathematical +point of view, trees are a data structure that +represents <B>nesting</B>: trees are branching entities, and the branches +are themselves trees. Parentheses give a linear representation of trees, +useful for the computer. But the human eye may prefer to see a visualization; +for this purpose, GF provides the command <CODE>visualize_tree = vt</CODE>, to which +parsing (and any other tree-producing command) can be piped: +</P> +<PRE> + > parse "this delicious cheese is very Italian" | visualize_tree +</PRE> +<P></P> +<P> +<IMG ALIGN="middle" SRC="mytree.png" BORDER="0" ALT=""> +</P> +<P> +This command uses the programs Graphviz and Ghostview, which you +might not have, but which are freely available on the web. +</P> +<P> +Alternatively, you can print the tree into a file +e.g. a <CODE>.png</CODE> file that +can be be viewed with e.g. an HTML browser and also included in an +HTML document. You can do this +by saving the file <CODE>grphtmp.dot</CODE>, which the command <CODE>vt</CODE> +produces. Then you can process this file with the <CODE>dot</CODE> +program (from the Graphviz package). +</P> +<PRE> + % dot -Tpng grphtmp.dot > mytree.png +</PRE> +<P></P> +<A NAME="toc23"></A> +<H3>System commands</H3> +<P> +If you don't have Ghostview, or want to view graphs in some other way, +you can call <CODE>dot</CODE> and a suitable +viewer (e.g. <CODE>open</CODE> in Mac) without leaving GF, by using +a <B>system command</B>: <CODE>!</CODE> followed by a Unix command, +</P> +<PRE> + > ! dot -Tpng grphtmp.dot > mytree.png + > ! open mytree.png +</PRE> +<P> +Another form of system commands are those that receive arguments from +GF pipes. The escape symbol +is then <CODE>?</CODE>. +</P> +<PRE> + > generate_trees | ? wc +</PRE> +<P></P> +<P> +<B>Exercise</B>. (Exercise drom 3.3.1 revisited.) +Measure how many trees the grammar <CODE>FoodEng</CODE> gives with depths 4 and 5, +respectively. Use the Unix <B>word count</B> command <CODE>wc</CODE> to count lines, and +a pipe from a GF command into a Unix command. +</P> +<A NAME="toc24"></A> +<H2>An Italian concrete syntax</H2> +<P> +<a name="secanitalian"></a> +</P> +<P> +We write the Italian grammar in a straightforward way, by replacing +English words with their dictionary equivalents: +</P> +<PRE> + concrete FoodIta of Food = { + + lincat + Phrase, Item, Kind, Quality = {s : Str} ; + + lin + Is item quality = {s = item.s ++ "è" ++ quality.s} ; + This kind = {s = "questo" ++ kind.s} ; + That kind = {s = "quello" ++ kind.s} ; + QKind quality kind = {s = kind.s ++ quality.s} ; + Wine = {s = "vino"} ; + Cheese = {s = "formaggio"} ; + Fish = {s = "pesce"} ; + Very quality = {s = "molto" ++ quality.s} ; + Fresh = {s = "fresco"} ; + Warm = {s = "caldo"} ; + Italian = {s = "italiano"} ; + Expensive = {s = "caro"} ; + Delicious = {s = "delizioso"} ; + Boring = {s = "noioso"} ; + } +</PRE> +<P> +An alert reader, or one who already knows Italian, may notice one point in +which the change is more substantial than just replacement of words: the order of +a quality and the kind it modifies in +</P> +<PRE> + QKind quality kind = {s = kind.s ++ quality.s} ; +</PRE> +<P> +Thus Italian says <CODE>vino italiano</CODE> for <CODE>Italian wine</CODE>. (Some Italian adjectives +are put before the noun. This distinction can be controlled by parameters, which +are introduced in <a href="#chaptwo">the fourth chapter</a>.) +</P> +<P> +<B>Exercise</B>. Write a concrete syntax of <CODE>Food</CODE> for some other language. +You will probably end up with grammatically incorrect linearizations --- but don't +worry about this yet. +</P> +<P> +<B>Exercise</B>. If you have written <CODE>Food</CODE> for German, Swedish, or some +other language, test with random or exhaustive generation what constructs +come out incorrect, and prepare a list of those ones that cannot be helped +with the currently available fragment of GF. You can return to your list +after having worked out <a href="#chaptwo">the fourth chapter</a>. +</P> +<A NAME="toc25"></A> +<H2>Free variation</H2> +<P> +Sometimes there are alternative ways to define a concrete syntax. +For instance, if we use the <CODE>Food</CODE> grammars in a restaurant phrase +book, we may want to accept different words for expressing the quality +"delicious" ---- and different languages can differ in how many +such words they have. Then we don't want to put the distinctions into +the abstract syntax, but into concrete syntaxes. Such semantically +neutral distinctions are known as <B>free variation</B> in linguistics. +</P> +<P> +The <CODE>variants</CODE> construct of GF expresses free variation. For example, +</P> +<PRE> + lin Delicious = {s = variants {"delicious" ; "exquisit" ; "tasty"}} ; +</PRE> +<P> +says that <CODE>Delicious</CODE> can be linearized to any of <I>delicious</I>, +<I>exquisit</I>, and <I>tasty</I>. As a consequence, both these words result in the +tree <CODE>Delicious</CODE> when parsed. By default, the <CODE>linearize</CODE> command +shows only the first variant from each <CODE>variants</CODE> list; to see them +all, the option <CODE>-all</CODE> can be used: +</P> +<PRE> + > p "this exquisit wine is delicious" | l -all + this delicious wine is delicious + this delicious wine is exquisit + ... +</PRE> +<P> +In linguistics, it is well known that free variation is almost +non-existing, if all aspects of expressions are taken into account, including style. +Therefore, free variation should not be used in grammars that are meant as +libraries for other grammars, as in <a href="#chapfive">the fifth chapter</a>. However, in a specific +application, free variation is an excellent way to scale up the parser to +variations in user input that make no difference in the semantic +treatment. +</P> +<P> +An example that clearly illustrates these points is the +English negation. If we added to the <CODE>Food</CODE> grammar the negation +of a quality, we could accept both contracted and uncontracted <I>not</I>: +</P> +<PRE> + fun IsNot : Item -> Quality -> Phrase ; + lin IsNot item qual = + {s = item.s ++ variants {"isn't" ; ["is not"]} ++ qual.s} ; +</PRE> +<P> +Both forms are likely to occur in user input. Since there is no +corresponding contrast in Italian, we do not want to put the distinction +in the abstract syntax. Yet there is a stylistic difference between +these two forms. In particular, if we are doing generation rather +than parsing, we will want to choose the one or +the other depending on the kind of language we want to generate. +</P> +<P> +A limiting case of free variation is an empty variant list +</P> +<PRE> + variants {} +</PRE> +<P> +It can be used e.g. if a word lacks a certain inflection form. +</P> +<P> +Free variation works for all types in concrete syntax; all terms in +a <CODE>variants</CODE> list must be of the same type. +</P> +<P> +<B>Exercise</B>. Modify <CODE>FoodIta</CODE> in such a way that a quality can +be assigned to an item by using two different word orders, exemplified +by <I>questo vino è delizioso</I> and <I>è delizioso questo vino</I> +(a real variation in Italian), +and that it is impossible to say that something is boring +(a rather contrived example). +</P> +<A NAME="toc26"></A> +<H2>More application of multilingual grammars</H2> +<A NAME="toc27"></A> +<H3>Multilingual treebanks</H3> +<P> +<a name="sectreebank"></a> +</P> +<P> +A <B>multilingual treebank</B> is a set of trees with their +translations in different languages: +</P> +<PRE> + > gr -number=2 | tree_bank + + Is (That Cheese) (Very Boring) + quello formaggio è molto noioso + that cheese is very boring + + Is (That Cheese) Fresh + quello formaggio è fresco + that cheese is fresh +</PRE> +<P> +There is also an XML format for treebanks and a set of commands +suitable for regression testing; see <CODE>help tb</CODE> for more details. +</P> +<A NAME="toc28"></A> +<H3>Translation session</H3> +<P> +If translation is what you want to do with a set of grammars, a convenient +way to do it is to open a <CODE>translation_session = ts</CODE>. In this session, +you can translate between all the languages that are in scope. +A dot <CODE>.</CODE> terminates the translation session. +</P> +<PRE> + > ts + + trans> that very warm cheese is boring + quello formaggio molto caldo è noioso + that very warm cheese is boring + + trans> questo vino molto italiano è molto delizioso + questo vino molto italiano è molto delizioso + this very Italian wine is very delicious + + trans> . + > +</PRE> +<P></P> +<A NAME="toc29"></A> +<H3>Translation quiz</H3> +<P> +This is a simple language exercise that can be automatically +generated from a multilingual grammar. The system generates a set of +random sentences, displays them in one language, and checks the user's +answer given in another language. The command <CODE>translation_quiz = tq</CODE> +makes this in a subshell of GF. +</P> +<PRE> + > translation_quiz FoodEng FoodIta + + Welcome to GF Translation Quiz. + The quiz is over when you have done at least 10 examples + with at least 75 % success. + You can interrupt the quiz by entering a line consisting of a dot ('.'). + + this fish is warm + questo pesce è caldo + > Yes. + Score 1/1 + + this cheese is Italian + questo formaggio è noioso + > No, not questo formaggio è noioso, but + questo formaggio è italiano + + Score 1/2 + this fish is expensive +</PRE> +<P> +You can also generate a list of translation exercises and save it in a +file for later use, by the command <CODE>translation_list = tl</CODE> +</P> +<PRE> + > translation_list -number=25 FoodEng FoodIta | write_file transl.txt +</PRE> +<P> +The <CODE>number</CODE> flag gives the number of sentences generated. +</P> +<A NAME="toc30"></A> +<H3>Multilingual syntax editing</H3> +<P> +<a name="secediting"></a> +</P> +<P> +Any multilingual grammar can be used in the graphical syntax editor, which is +opened by the shell +command <CODE>gfeditor</CODE> followed by the names of the grammar files. +Thus +</P> +<PRE> + % gfeditor FoodEng.gf FoodIta.gf +</PRE> +<P> +opens the editor for the two <CODE>Food</CODE> grammars. +</P> +<P> +The editor supports commands for manipulating an abstract syntax tree. +The process is started by choosing a category from the "New" menu. +Choosing <CODE>Phrase</CODE> creates a new tree of type <CODE>Phrase</CODE>. A new tree +is in general completely unknown: it consists of a <B>metavariable</B> +<CODE>?1</CODE>. However, since the category <CODE>Phrase</CODE> in <CODE>Food</CODE> has +only one possible constructor, <CODE>Is</CODE>, the tree is readily +given the form <CODE>Is ?1 ?2</CODE>. Here is what the editor looks like at +this stage: +</P> +<P> + <IMG ALIGN="right" SRC="food1.png" BORDER="0" ALT=""> +</P> +<P> +Editing goes on by <B>refinements</B>, i.e. choices of constructors from +the menu, until no metavariables remain. Here is a tree resulting from the +current editing session: +</P> +<P> + <IMG ALIGN="right" SRC="food2.png" BORDER="0" ALT=""> +</P> +<P> +Editing can be continued even when the tree is finished. The user can shift +the <B>focus</B> to some of the subtrees by clicking at it or the corresponding +part of a linearization. In the picture, the focus is on "fish". +Since there are no metavariables, +the menu shows no refinements, but some other possible actions: +</P> +<UL> +<LI>to <B>change</B> "fish" to "cheese" or "wine" +<LI>to <B>delete</B> "fish", i.e. change it to a metavariable +<LI>to <B>wrap</B> "fish" in a qualification, i.e. change it to + <CODE>QKind ? Fish</CODE>, where the quality can be given in a later refinement +</UL> + +<P> +In addition to menu-based editing, the tool supports refinement by parsing, +which is accessible by middle-clicking in the tree or in the linearization field. +</P> +<P> +<B>Exercise</B>. Construct the sentence +<I>this very expensive cheese is very very delicious</I> +and its Italian translation by using <CODE>gfeditor</CODE>. +</P> +<A NAME="toc31"></A> +<H2>Context-free grammars and GF</H2> +<P> +Readers not familar with context-free grammars, also known as BNF grammars, can +skip this section. Those that are familar with them will find here the exact +relation between GF and context-free grammars. We will moreover show how +the BNF format can be used as input to the GF program; it is often more +concise than GF proper, but also more restricted in expressive power. +</P> +<A NAME="toc32"></A> +<H3>The "cf" grammar format</H3> +<P> +The grammar <CODE>FoodEng</CODE> could be written in a BNF format as follows: +</P> +<PRE> + Is. Phrase ::= Item "is" Quality ; + That. Item ::= "that" Kind ; + This. Item ::= "this" Kind ; + QKind. Kind ::= Quality Kind ; + Cheese. Kind ::= "cheese" ; + Fish. Kind ::= "fish" ; + Wine. Kind ::= "wine" ; + Italian. Quality ::= "Italian" ; + Boring. Quality ::= "boring" ; + Delicious. Quality ::= "delicious" ; + Expensive. Quality ::= "expensive" ; + Fresh. Quality ::= "fresh" ; + Very. Quality ::= "very" Quality ; + Warm. Quality ::= "warm" ; +</PRE> +<P> +In this format, each rule is prefixed by a <B>label</B> that gives +the constructor function GF gives in its <CODE>fun</CODE> rules. In fact, +each context-free rule is a fusion of a <CODE>fun</CODE> and a <CODE>lin</CODE> rule: +it states simultaneously that +</P> +<UL> +<LI>the label is a function from the nonterminal categories + on the right-hand side to the category on the left-hand side; + the first rule gives +<PRE> + fun Is : Item -> Quality -> Phrase +</PRE> +<LI>trees built by the label are linearized in the way indicated + by the right-hand side; + the first rule gives +<PRE> + lin Is item quality = {s = item.s ++ "is" ++ quality.s} +</PRE> +</UL> + +<P> +The translation from BNF to GF described above is in fact used in +the GF system to convert BNF grammars into GF. BNF files are recognized +by the file name suffix <CODE>.cf</CODE>; thus the grammar above can be +put into a file named <CODE>food.cf</CODE> and read into GF by +</P> +<PRE> + > import food.cf +</PRE> +<P></P> +<A NAME="toc33"></A> +<H3>Restrictions of context-free grammars</H3> +<P> +Even though we managed to write <CODE>FoodEng</CODE> in the context-free format, +we cannot do this for GF grammars in general. It is enough to try this +with <CODE>FoodIta</CODE> at the same time as <CODE>FoodEng</CODE>, +we lose an important aspect of multilinguality: +that the order of constituents is defined only in concrete syntax. +Thus we could not use context-free <CODE>FoodEng</CODE> and <CODE>FoodIta</CODE> in a multilingual +grammar that supports translation via common abstract syntax: the +qualification function <CODE>QKind</CODE> has different types in the two +grammars. +</P> +<P> +In general terms, the separation of concrete and abstract syntax allows +three deviations from context-free grammar: +</P> +<UL> +<LI><B>permutation</B>: changing the order of constituents +<LI><B>suppression</B>: omitting constituents +<LI><B>reduplication</B>: repeating constituents +</UL> + +<P> +The third property is the one that definitely shows that GF is +stronger than context-free: GF can define the <B>copy language</B> +<CODE>{x x | x <- (a|b)*}</CODE>, which is known not to be context-free. +The other properties have more to do with the kind of trees that +the grammar can associate with strings: permutation is important +in multilingual grammars, and suppression is exploited in grammars +where trees carry some hidden semantic information (see <a href="#chapsix">the sixth chapter</a> +below). +</P> +<P> +Of course, context-free grammars are also restricted from the +grammar engineering point of view. They give no support to +modules, functions, and parameters, which are so central +for the productivity of GF. Despite the initial conciseness +of context-free grammars, GF can easily produce grammars where +30 lines of GF code would need hundreds of lines of +context-free grammar code to produce; see exercises +<a href="#secitalian">here</a> and <a href="#sectense">here</a>. +</P> +<P> +<B>Exercise</B>. GF can also interpret unlabelled BNF grammars, by +creating labels automatically. The right-hand sides of BNF rules +can moreover be disjunctions, e.g. +</P> +<PRE> + Quality ::= "fresh" | "Italian" | "very" Quality ; +</PRE> +<P> +Experiment with this format in GF, possibly with a grammar that +you import from some other source, such as a programming language +document. +</P> +<P> +<B>Exercise</B>. Define the copy language <CODE>{x x | x <- (a|b)*}</CODE> in GF. +</P> +<A NAME="toc34"></A> +<H2>Modules and files</H2> +<P> +GF uses suffixes to recognize different file formats. The most +important ones are: +</P> +<UL> +<LI>Source files: <I>Modulename</I><CODE>.gf</CODE> +<LI>Target files: <I>Modulename</I><CODE>.gfc</CODE> +</UL> + +<P> +When you import <CODE>FoodEng.gf</CODE>, you see the target files being +generated: +</P> +<PRE> + > i FoodEng.gf + - compiling Food.gf... wrote file Food.gfc 16 msec + - compiling FoodEng.gf... wrote file FoodEng.gfc 20 msec +</PRE> +<P> +You also see that the GF program does not only read the file +<CODE>FoodEng.gf</CODE>, but also all other files that it +depends on --- in this case, <CODE>Food.gf</CODE>. +</P> +<P> +For each file that is compiled, a <CODE>.gfc</CODE> file +is generated. The GFC format (="GF Canonical") is the +"machine code" of GF, which is faster to process than +GF source files. When reading a module, GF decides whether +to use an existing <CODE>.gfc</CODE> file or to generate +a new one, by looking at modification times. +</P> +<P> +<I>In GF version 3, the</I> <CODE>gfc</CODE> <I>format is replaced by the format suffixed</I> +<CODE>gfo</CODE>, <I>"GF object"</I>. +</P> +<P> +<B>Exercise</B>. What happens when you import <CODE>FoodEng.gf</CODE> for +a second time? Try this in different situations: +</P> +<UL> +<LI>Right after importing it the first time (the modules are kept in + the memory of GF and need no reloading). +<LI>After issuing the command <CODE>empty</CODE> (<CODE>e</CODE>), which clears the memory + of GF. +<LI>After making a small change in <CODE>FoodEng.gf</CODE>, be it only an added space. +<LI>After making a change in <CODE>Food.gf</CODE>. +</UL> + +<A NAME="toc35"></A> +<H2>Using operations and resource modules</H2> +<A NAME="toc36"></A> +<H3>The golden rule of functional programming</H3> +<P> +When writing a grammar, you have to type lots of +characters. You have probably +done this by the copy-and-paste method, which is a universally +available way to avoid repeating work. +</P> +<P> +However, there is a more elegant way to avoid repeating work than +the copy-and-paste +method. The <B>golden rule of functional programming</B> says that +</P> +<UL> +<LI>whenever you find yourself programming by copy-and-paste, + write a function instead. +</UL> + +<P> +A function separates the shared parts of different computations from the +changing parts, its <B>arguments</B>, or <B>parameters</B>. +In functional programming languages, such as +Haskell, it is possible to share much more +code with functions than in languages such as C and Java, because +of higher-order functions (functions that takes functions as arguments). +</P> +<A NAME="toc37"></A> +<H3>Operation definitions</H3> +<P> +GF is a functional programming language, not only in the sense that +the abstract syntax is a system of functions (<CODE>fun</CODE>), but also because +functional programming can be used when defining concrete syntax. This is +done by using a new form of judgement, with the keyword <CODE>oper</CODE> (for +<B>operation</B>), distinct from <CODE>fun</CODE> for the sake of clarity. +Here is a simple example of an operation: +</P> +<PRE> + oper ss : Str -> {s : Str} = \x -> {s = x} ; +</PRE> +<P> +The operation can be <B>applied</B> to an argument, and GF will +<B>compute</B> the application into a value. For instance, +</P> +<PRE> + ss "boy" ===> {s = "boy"} +</PRE> +<P> +We use the symbol <CODE>===</CODE> to indicate how an expression is +computed into a value; this symbol is not a part of GF. +</P> +<P> +Thus an <CODE>oper</CODE> judgement includes the name of the defined operation, +its type, and an expression defining it. As for the syntax of the defining +expression, notice the <B>lambda abstraction</B> form <CODE>\</CODE><I>x</I> <CODE>-></CODE> <I>t</I> of +the function. It reads: function with variable <I>x</I> and <B>function body</B> +<I>t</I>. Any occurrence of <I>x</I> in <I>t</I> is said to be <B>bound</B> in <I>t</I>. +</P> +<P> +For lambda abstraction with multiple arguments, we have the shorthand +</P> +<PRE> + \x,y -> t === \x -> \y -> t +</PRE> +<P> +The notation we have used for linearization rules, where +variables are bound on the left-hand side, is actually syntactic +sugar for abstraction: +</P> +<PRE> + lin f x = t === lin f = \x -> t +</PRE> +<P></P> +<A NAME="toc38"></A> +<H3>The ``resource`` module type</H3> +<P> +Operator definitions can be included in a concrete syntax. +But they are usually not really tied to a particular +set of linearization rules. +They should rather be seen as <B>resources</B> +usable in many concrete syntaxes. +</P> +<P> +The <CODE>resource</CODE> module type is used to package +<CODE>oper</CODE> definitions into reusable resources. Here is +an example, with a handful of operations to manipulate +strings and records. +</P> +<PRE> + resource StringOper = { + oper + SS : Type = {s : Str} ; + ss : Str -> SS = \x -> {s = x} ; + cc : SS -> SS -> SS = \x,y -> ss (x.s ++ y.s) ; + prefix : Str -> SS -> SS = \p,x -> ss (p ++ x.s) ; + } +</PRE> +<P></P> +<A NAME="toc39"></A> +<H3>Opening a resource</H3> +<P> +Any number of <CODE>resource</CODE> modules can be +<B>open</B>ed in a <CODE>concrete</CODE> syntax, which +makes definitions contained +in the resource usable in the concrete syntax. Here is +an example, where the resource <CODE>StringOper</CODE> is +opened in a new version of <CODE>FoodEng</CODE>. +</P> +<PRE> + concrete FoodEng of Food = open StringOper in { + + lincat + S, Item, Kind, Quality = SS ; + + lin + Is item quality = cc item (prefix "is" quality) ; + This k = prefix "this" k ; + That k = prefix "that" k ; + QKind k q = cc k q ; + Wine = ss "wine" ; + Cheese = ss "cheese" ; + Fish = ss "fish" ; + Very = prefix "very" ; + Fresh = ss "fresh" ; + Warm = ss "warm" ; + Italian = ss "Italian" ; + Expensive = ss "expensive" ; + Delicious = ss "delicious" ; + Boring = ss "boring" ; + } +</PRE> +<P></P> +<P> +<B>Exercise</B>. Use the same string operations to write <CODE>FoodIta</CODE> +more concisely. +</P> +<A NAME="toc40"></A> +<H3>Partial application</H3> +<P> +<a name="secpartapp"></a> +</P> +<P> +GF, like Haskell, permits <B>partial application</B> of +functions. An example of this is the rule +</P> +<PRE> + lin This k = prefix "this" k ; +</PRE> +<P> +which can be written more concisely +</P> +<PRE> + lin This = prefix "this" ; +</PRE> +<P> +The first form is perhaps more intuitive to write +but, once you get used to partial application, you will appreciate its +conciseness and elegance. The logic of partial application +is known as <B>currying</B>, with a reference to Haskell B. Curry. +The idea is that any <I>n</I>-place function can be seen as a 1-place +function whose value is an <I>n-</I>1 -place function. Thus +</P> +<PRE> + oper prefix : Str -> SS -> SS ; +</PRE> +<P> +can be used as a 1-place function that takes a <CODE>Str</CODE> into a +function <CODE>SS -> SS</CODE>. The expected linearization of <CODE>This</CODE> is exactly +a function of such a type, operating on an argument of type <CODE>Kind</CODE> +whose linearization is of type <CODE>SS</CODE>. Thus we can define the +linearization directly as <CODE>prefix "this"</CODE>. +</P> +<P> +An important part of the art of functional programming is to decide the order +of arguments in a function, so that partial application can be used as much +as possible. For instance, of the operation <CODE>prefix</CODE> we know that it +will be typically applied to linearization variables with constant strings. +This is the reason to put the <CODE>Str</CODE> argument before the <CODE>SS</CODE> argument --- not +the prefixity. A <CODE>postfix</CODE> function would have exactly the same order of arguments. +</P> +<P> +<B>Exercise</B>. Define an operation <CODE>infix</CODE> analogous to <CODE>prefix</CODE>, +such that it allows you to write +</P> +<PRE> + lin Is = infix "is" ; +</PRE> +<P></P> +<A NAME="toc41"></A> +<H3>Testing resource modules</H3> +<P> +To test a <CODE>resource</CODE> module independently, you must import it +with the flag <CODE>-retain</CODE>, which tells GF to retain <CODE>oper</CODE> definitions +in the memory; the usual behaviour is that <CODE>oper</CODE> definitions +are just applied to compile linearization rules +(this is called <B>inlining</B>) and then thrown away. +</P> +<PRE> + > import -retain StringOper.gf +</PRE> +<P> +The command <CODE>compute_concrete = cc</CODE> computes any expression +formed by operations and other GF constructs. For example, +</P> +<PRE> + > compute_concrete prefix "in" (ss "addition") + { + s : Str = "in" ++ "addition" + } +</PRE> +<P></P> +<A NAME="toc42"></A> +<H2>Grammar architecture</H2> +<P> +<a name="secarchitecture"></a> +</P> +<A NAME="toc43"></A> +<H3>Extending a grammar</H3> +<P> +The module system of GF makes it possible to write a new module that <B>extend</B>s +an old one. The syntax of extension is +shown by the following example. We extend <CODE>Food</CODE> into <CODE>MoreFood</CODE> by +adding a category of questions and two new functions. +</P> +<PRE> + abstract Morefood = Food ** { + cat + Question ; + fun + QIs : Item -> Quality -> Question ; + Pizza : Kind ; + + } +</PRE> +<P> +Parallel to the abstract syntax, extensions can +be built for concrete syntaxes: +</P> +<PRE> + concrete MorefoodEng of Morefood = FoodEng ** { + lincat + Question = {s : Str} ; + lin + QIs item quality = {s = "is" ++ item.s ++ quality.s} ; + Pizza = {s = "pizza"} ; + } +</PRE> +<P> +The effect of extension is that all of the contents of the extended +and extending module are put together. We also say that the new +module <B>inherits</B> the contents of the old module. +</P> +<P> +At the same time as extending a module of the same type, a concrete +syntax module may open resources. Since <CODE>open</CODE> takes effect in +the module body and not in the extended module, its logical place +in the module header is after the extend part: +</P> +<PRE> + concrete MorefoodIta of Morefood = FoodIta ** open StringOper in { + lincat + Question = SS ; + lin + QIs item quality = ss (item.s ++ "è" ++ quality.s) ; + Pizza = ss "pizza" ; + } +</PRE> +<P> +Resource modules can extend other resource modules, in the +same way as modules of other types can extend modules of the +same type. Thus it is possible to build resource hierarchies. +</P> +<A NAME="toc44"></A> +<H3>Multiple inheritance</H3> +<P> +Specialized vocabularies can be represented as small grammars that +only do "one thing" each. For instance, the following are grammars +for fruit and mushrooms +</P> +<PRE> + abstract Fruit = { + cat Fruit ; + fun Apple, Peach : Fruit ; + } + + abstract Mushroom = { + cat Mushroom ; + fun Cep, Agaric : Mushroom ; + } +</PRE> +<P> +They can afterwards be combined into bigger grammars by using +<B>multiple inheritance</B>, i.e. extension of several grammars at the +same time: +</P> +<PRE> + abstract Foodmarket = Food, Fruit, Mushroom ** { + fun + FruitKind : Fruit -> Kind ; + MushroomKind : Mushroom -> Kind ; + } +</PRE> +<P> +The main advantages with splitting a grammar to modules are +<B>reusability</B>, <B>separate compilation</B>, and <B>division of labour</B>. +Reusability means +that one and the same module can be put into different uses; for instance, +a module with mushroom names might be used in a mycological information system +as well as in a restaurant phrasebook. Separate compilation means that a module +once compiled into <CODE>.gfc</CODE> need not be compiled again unless changes have +taken place. +Division of labour means simply that programmers that are experts in +special areas can work on modules belonging to those areas. +</P> +<P> +<B>Exercise</B>. Refactor <CODE>Food</CODE> by taking apart <CODE>Wine</CODE> into a special +<CODE>Drink</CODE> module. +</P> +<A NAME="toc45"></A> +<H3>Visualizing module structure</H3> +<P> +When you have created all the abstract syntaxes and +one set of concrete syntaxes needed for <CODE>Foodmarket</CODE>, +your grammar consists of eight GF modules. To see how their +dependences look like, you can use the command +<CODE>visualize_graph = vg</CODE>, +</P> +<PRE> + > visualize_graph +</PRE> +<P> +and the graph will pop up in a separate window: +</P> +<P> +<IMG ALIGN="middle" SRC="foodmarket.png" BORDER="0" ALT=""> +</P> +<P> +The graph uses +</P> +<UL> +<LI>oval boxes for abstract modules +<LI>square boxes for concrete modules +<LI>black-headed arrows for inheritance +<LI>white-headed arrows for the concrete-of-abstract relation +</UL> + +<P> +Just as the <CODE>visualize_tree = vt</CODE> command, the freely available tools +Ghostview and Graphviz are needed. As an alternative, you can again print +the graph into a <CODE>.dot</CODE> file by using the command <CODE>print_multi = pm</CODE>: +</P> +<PRE> + > print_multi -printer=graph | write_file Foodmarket.dot + > ! dot -Tpng Foodmarket.dot > Foodmarket.png +</PRE> +<P></P> +<A NAME="toc46"></A> +<H2>Summary of GF language features</H2> +<A NAME="toc47"></A> +<H3>Modules</H3> +<P> +The general form of a module is +<center> + <I>Moduletype</I> <I>M</I> <I>Of</I> <CODE>=</CODE> (<I>Extends</I> <CODE>**</CODE>)? (<CODE>open</CODE> <I>Opens</I> <CODE>in</CODE>)? <I>Body</I> +</center> +where <I>Moduletype</I> is one of <CODE>abstract</CODE>, <CODE>concrete</CODE>, and <CODE>resource</CODE>. +</P> +<P> +If <I>Moduletype</I> is <CODE>concrete</CODE>, the <I>Of</I>-part has the form <CODE>of</CODE> <I>A</I>, +where <I>A</I> is the name of an abstract module. Otherwise it is empty. +</P> +<P> +The name of the module is given by the identifier <I>M</I>. +</P> +<P> +The optional <I>Extends</I> part is a comma-separated +list of module names, which have to be modules of +the same <I>Moduletype</I>. The contents of these modules are <B>inherited</B> by +<I>M</I>. This means that they are both usable in <I>Body</I> and exported by <I>M</I>, +i.e. inherited when <I>M</I> is inherited and available when <I>M</I> is opened. +(Exception: <CODE>oper</CODE> and <CODE>param</CODE> judgements are not inherited from +<CODE>concrete</CODE> modules.) +</P> +<P> +The optional <I>Opens</I> part is a comma-separated +list of resource module names. The contents of these +modules are usable in the <I>Body</I>, but they are not exported. +</P> +<P> +Opening can be <B>qualified</B>, e.g. +</P> +<PRE> + concrete C of A = open (P = Prelude) in ... +</PRE> +<P> +This means that the names from <CODE>Prelude</CODE> are only available in the form +<CODE>P.name</CODE>. This form of qualifying a name is always possible, and it can +be used to resolve <B>name conflicts</B>, which result when the same name is +declared in more than one module that is in scope. +</P> +<A NAME="toc48"></A> +<H3>Judgements</H3> +<P> +The <I>Body</I> part consists of judgements. The judgement form table #secjment +is extended with the following forms: +</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>form</TH> +<TH>reading</TH> +<TH COLSPAN="2">module type</TH> +</TR> +<TR> +<TD ALIGN="center"><CODE>oper</CODE> <I>h</I> <CODE>:</CODE> <I>T</I> <CODE>=</CODE> <I>t</I></TD> +<TD>operation <I>h</I> of type <I>T</I> is defined as <I>t</I></TD> +<TD>resource, concrete</TD> +</TR> +<TR> +<TD ALIGN="right"><CODE>param</CODE> <I>P</I> <CODE>=</CODE> <I>C1</I> <CODE>|</CODE> ... <CODE>|</CODE> <I>Cn</I></TD> +<TD>parameter type P has constructors <I>C1...Cn</I></TD> +<TD>resource, concrete</TD> +</TR> +</TABLE> + +<P></P> +<P> +The <CODE>param</CODE> judgement will be explained in the next chapter. +</P> +<P> +The type part of an <CODE>oper</CODE> judgement can be omitted, if the type can be inferred +by the GF compiler. +</P> +<PRE> + oper hello = "hello" ++ "world" ; +</PRE> +<P> +As a rule, type inference works for all terms except lambda abstracts. +</P> +<P> +<B>Lambda abstracts</B> are expressions of the form <CODE>\</CODE><I>x</I> <CODE>-></CODE> <I>t</I>, +where <I>x</I> is a variable <B>bound</B> in the expression <I>t</I>, which is the +<B>body</B> of the lambda abstract. The type of the lambda abstract is +<I>A</I> <CODE>-></CODE><I>B</I>, where <I>A</I> is the type of the variable <CODE>x</CODE> and +<I>B</I> the type of the body <I>t</I>. +</P> +<P> +For multiple lambda abstractions, there is a shorthand +</P> +<PRE> + \x,y -> t === \x -> \y -> t +</PRE> +<P> +For <CODE>lin</CODE> judgements, there is the shorthand +</P> +<PRE> + lin f x = t === lin f = \x -> t +</PRE> +<P></P> +<A NAME="toc49"></A> +<H3>Free variation</H3> +<P> +The <CODE>variants</CODE> construct of GF can be used to give a list of +concrete syntax terms, of the same type, in free variation. For example, +</P> +<PRE> + variants {["does not"] ; "doesn't"} +</PRE> +<P> +A limiting case is the empty variant list <CODE>variants {}</CODE>. +</P> +<A NAME="toc50"></A> +<H3>The context-free grammar format</H3> +<P> +The <CODE>.cf</CODE> file format is used for <B>context-free grammars</B>, which are +always interpretable as GF grammars. Files of this format consist of +rules of the form +<center> + (<I>Label</I> <CODE>.</CODE>)? <I>Cat</I> <CODE>::=</CODE> <I>RHS</I> <CODE>;</CODE> +</center> +where the <I>RHS</I> is a sequence of terminals (quoted strings) and +nonterminals (identifiers). The optional <I>Label</I> gives the abstract +syntax function created. If it is omitted, a function name is generated +automatically. Then it is also possible to have more than one <I>RHS</I>, +separated by <I>|</I>. An empty <I>RHS</I> is interpreted as an empty sequence +of terminals, not as an empty disjunction. +</P> +<P> +The <B>Extended BNF</B> format (<B>EBNF</B>) can also be used, in files suffixed <CODE>.ebnf</CODE>. +This format does not allow user-written labels. The right-hand-side of a rule +can contain everything that is possible in the <CODE>.cf</CODE> format, but also +optional parts (<CODE>p ?</CODE>), sequences (<CODE>p *</CODE>) and non-empty sequences (<CODE>p +</CODE>). +For example, the phrases in <CODE>FoodEng</CODE> could be recognized with the following +EBNF grammar: +</P> +<PRE> + Phrase ::= + ("this" | "that") Quality* ("wine" | "cheese" | "fish") "is" Quality ; + Quality ::= + ("very"* ("fresh" | "warm" | "boring" | "Italian" | "expensive")) ; +</PRE> +<P></P> +<A NAME="toc51"></A> +<H3>Character encoding</H3> +<P> +The default encoding is iso-latin-1. UTF-8 can be set by the flag <CODE>coding=utf8</CODE> +in the grammar. The resource grammar libraries are in iso-latin-1, except Russian +and Arabic, which are in UTF-8. The resources may be changed to UTF-8 in future. +Letters in identifiers must currently be iso-latin-1. +</P> +<A NAME="toc52"></A> +<H1>Grammars with parameters</H1> +<P> +<a name="chapfour"></a> +</P> +<P> +In this chapter, we will introduce the techniques needed for +describing the inflection of words, as well as the rules by +which correct word forms are selected in syntactic combinations. +These techniques are already needed in a very slight extension +of the Food grammar of the previous chapter. While explaining +how the linguistic problems are solved for English and Italian, +we also cover all the language constructs GF has for +defining concrete syntax. +</P> +<P> +It is in principle possible to skip this chapter and go directly +to the next, since the use of the GF Resource Grammar library +makes it unnecessary to use any more constructs of GF than we +have already covered: parameters could be left to library implementors. +</P> +<A NAME="toc53"></A> +<H2>The problem: words have to be inflected</H2> +<P> +Suppose we want to say, with the vocabulary included in +<CODE>Food.gf</CODE>, things like +<center> +<I>these Italian wines are delicious</I> +</center> +The new grammatical facility we need are the plural forms +of nouns and verbs (<I>wines, are</I>), as opposed to their +singular forms. +</P> +<P> +The introduction of plural forms requires two things: +</P> +<UL> +<LI>the <B>inflection</B> of nouns and verbs in singular and plural +<LI>the <B>agreement</B> of the verb to subject: + the verb must have the same number as the subject +</UL> + +<P> +Different languages have different types of inflection and agreement. +For instance, Italian has also agreement in gender (masculine vs. feminine). +In a multilingual grammar, +we want to express such differences between languages in the +concrete syntax while ignoring them in the abstract syntax. +</P> +<P> +To be able to do all this, we need one new judgement form +and some new expression forms. +We also need to generalize linearization types +from strings to more complex types. +</P> +<P> +<B>Exercise</B>. Make a list of the possible forms that nouns, +adjectives, and verbs can have in some languages that you know. +</P> +<A NAME="toc54"></A> +<H2>Parameters and tables</H2> +<P> +We define the <B>parameter type</B> of number in English by +using a new form of judgement: +</P> +<PRE> + param Number = Sg | Pl ; +</PRE> +<P> +This judgement defines the parameter type <CODE>Number</CODE> by listing +its two <B>constructors</B>, <CODE>Sg</CODE> and <CODE>Pl</CODE> (common shorthands for +singular and plural). +</P> +<P> +To state that <CODE>Kind</CODE> expressions in English have a linearization +depending on number, we replace the linearization type <CODE>{s : Str}</CODE> +with a type where the <CODE>s</CODE> field is a <B>table</B> depending on number: +</P> +<PRE> + lincat Kind = {s : Number => Str} ; +</PRE> +<P> +The <B>table type</B> <CODE>Number => Str</CODE> is in many respects similar to +a function type (<CODE>Number -> Str</CODE>). The main difference is that the +argument type of a table type must always be a parameter type. This means +that the argument-value pairs can be listed in a finite table. The following +example shows such a table: +</P> +<PRE> + lin Cheese = { + s = table { + Sg => "cheese" ; + Pl => "cheeses" + } + } ; +</PRE> +<P> +The table consists of <B>branches</B>, where a <B>pattern</B> on the +left of the arrow <CODE>=></CODE> is assigned a <B>value</B> on the right. +</P> +<P> +The application of a table to a parameter is done by the <B>selection</B> +operator <CODE>!</CODE>, which is computed by <B>pattern matching</B>: it returns +the value from the first branch whose pattern matches the +selection argument. For instance, +</P> +<PRE> + table {Sg => "cheese" ; Pl => "cheeses"} ! Pl + ===> "cheeses" +</PRE> +<P> +As syntactic sugar for table selections, we can define the +<B>case expressions</B>, which are common in functional programming and also +handy to use in GF. +</P> +<PRE> + case e of {...} === table {...} ! e +</PRE> +<P></P> +<P> +A parameter type can have any number of constructors, and these can +also take arguments from other parameter types. For instance, an accurate +type system for English verbs (except <I>be</I>) is +</P> +<PRE> + param VerbForm = VPresent Number | VPast | VPastPart | VPresPart ; +</PRE> +<P> +This system expresses accurately the fact that only the present tense has +number variation. (Agreement also requires variation in person, but +this can be defined in syntax rules, by picking the singular form for third person +singular subjects and the plural forms for all others). As an example of +a table, here are the forms of the verb <I>drink</I>: +</P> +<PRE> + table { + VPresent Sg => "drinks" ; + VPresent Pl => "drink" ; + VPast => "drank" ; + VPastPart => "drunk" ; + VPresPart => "drinking" + } +</PRE> +<P></P> +<P> +<B>Exercise</B>. In an earlier exercise (previous section), +you made a list of the possible +forms that nouns, adjectives, and verbs can have in some languages that +you know. Now take some of the results and implement them by +using parameter type definitions and tables. Write them into a <CODE>resource</CODE> +module, which you can test by using the command <CODE>compute_concrete</CODE>. +</P> +<A NAME="toc55"></A> +<H2>Inflection tables and paradigms</H2> +<P> +All English common nouns are inflected for number, most of them in the +same way: the plural form is obtained from the singular by adding the +ending <I>s</I>. This rule is an example of +a <B>paradigm</B> --- a formula telling how a class of words is inflected. +</P> +<P> +From the GF point of view, a paradigm is a function that takes +a <B>lemma</B> --- also known as a <B>dictionary form</B> or a <B>citation form</B> --- and +returns an inflection +table of desired type. Paradigms are not functions in the sense of the +<CODE>fun</CODE> judgements of abstract syntax (which operate on trees and not +on strings), but operations defined in <CODE>oper</CODE> judgements. +The following operation defines the regular noun paradigm of English: +</P> +<PRE> + oper regNoun : Str -> {s : Number => Str} = \dog -> { + s = table { + Sg => dog ; + Pl => dog + "s" + } + } ; +</PRE> +<P> +The <B>gluing</B> operator <CODE>+</CODE> tells that +the string held in the variable <CODE>dog</CODE> and the ending <CODE>"s"</CODE> +are written together to form one <B>token</B>. Thus, for instance, +</P> +<PRE> + (regNoun "cheese").s ! Pl ===> "cheese" + "s" ===> "cheeses" +</PRE> +<P> +A more complex example are regular verbs: +</P> +<PRE> + oper regVerb : Str -> {s : VerbForm => Str} = \talk -> { + s = table { + VPresent Sg => talk + "s" ; + VPresent Pl => talk ; + VPresPart => talk + "ing" ; + _ => talk + "ed" + } + } ; +</PRE> +<P> +Notice how a catch-all case for the past tense and the past participle +is expressed by using a <B>wild card</B> pattern <CODE>_</CODE>. Here again, pattern matching +tries all patterns in order until it finds a matching pattern; +and it is the wild card that is the first match for both <CODE>VPast</CODE> and +<CODE>VPastPart</CODE>. +</P> +<P> +<B>Exercise</B>. Identify cases in which the <CODE>regNoun</CODE> paradigm does not +apply in English, and implement some alternative paradigms. +</P> +<P> +<B>Exercise</B>. Implement some regular paradigms for other languages you have +considered in earlier exercises. +</P> +<A NAME="toc56"></A> +<H2>Using parameters in concrete syntax</H2> +<P> +We can now enrich the concrete syntax definitions to +comprise morphology. This will permit a more radical +variation between languages (e.g. English and Italian) +than just the use of different words. In general, +parameters and linearization types are different in +different languages --- but this does not prevent using a +the common abstract syntax. +</P> +<P> +We consider a grammar <CODE>Foods</CODE>, which is similar to +<CODE>Food</CODE>, with the addition two rules for forming plural items: +</P> +<PRE> + fun These, Those : Kind -> Item ; +</PRE> +<P> +We also add a noun which in Italian has the feminine case; all nouns in +<CODE>Food</CODE> were carefully chosen to be masculine! +</P> +<PRE> + fun Pizza : Kind ; +</PRE> +<P> +This noun will force us to deal with gender in the Italian grammar, +which is what we need for the grammar to scale up for larger applications. +</P> +<A NAME="toc57"></A> +<H3>Agreement</H3> +<P> +In the English <CODE>Foods</CODE> grammar, we need just one type of parameters: +<CODE>Number</CODE> as defined above. The phrase-forming rule +</P> +<PRE> + fun Is : Item -> Quality -> Phrase ; +</PRE> +<P> +is affected by the number because of <B>subject-verb agreement</B>. +In English, agreement says that the verb of a sentence +must be inflected in the number of the subject. Thus we will linearize +</P> +<PRE> + Is (This Pizza) Warm ===> "this pizza is warm" + Is (These Pizza) Warm ===> "these pizzas are warm" +</PRE> +<P> +Here it is the <B>copula</B>, i.e. the verb <I>be</I> that is affected. We define +the copula as the operation +</P> +<PRE> + oper copula : Number -> Str = \n -> + case n of { + Sg => "is" ; + Pl => "are" + } ; +</PRE> +<P> +We don't need to inflect the copula for person and tense in this grammar. +</P> +<P> +The form of the copula in a sentence depends on the +<B>subject</B> of the sentence, i.e. the item +that is qualified. This means that an <CODE>Item</CODE> must have such a number to provide. +The obvious way to guarantee this is by including a number field in +the linearization type: +</P> +<PRE> + lincat Item = {s : Str ; n : Number} ; +</PRE> +<P> +Now we can write precisely the <CODE>Is</CODE> rule that expresses agreement: +</P> +<PRE> + lin Is item qual = {s = item.s ++ copula item.n ++ qual.s} ; +</PRE> +<P> +The copula receives the number that it needs from the subject item. +</P> +<A NAME="toc58"></A> +<H3>Determiners</H3> +<P> +Let us turn to <CODE>Item</CODE> subjects and see how they receive their +numbers. The two rules +</P> +<PRE> + fun This, These : Kind -> Item ; +</PRE> +<P> +form <CODE>Item</CODE>s from <CODE>Kind</CODE>s by adding <B>determiners</B>, either +<I>this</I> or <I>these</I>. The determiners +require different numbers of their <CODE>Kind</CODE> arguments: <CODE>This</CODE> +requires the singular (<I>this pizza</I>) and <CODE>These</CODE> the plural +(<I>these pizzas</I>). The <CODE>Kind</CODE> is the same in both cases: <CODE>Pizza</CODE>. +Thus a <CODE>Kind</CODE> must have both singular and plural forms. +The obvious way to express this is by using a table: +</P> +<PRE> + lincat Kind = {s : Number => Str} ; +</PRE> +<P> +The linearization rules for <CODE>This</CODE> and <CODE>These</CODE> can now be written +</P> +<PRE> + lin This kind = { + s = "this" ++ kind.s ! Sg ; + n = Sg + } ; + + lin These kind = { + s = "these" ++ kind.s ! Pl ; + n = Pl + } ; +</PRE> +<P> +The grammatical relation between the determiner and the noun is similar to +agreement, but due to some differences into which we don't go here +it is often called <B>government</B>. +</P> +<P> +Since the same pattern for determination is used four times in +the <CODE>FoodsEng</CODE> grammar, we codify it as an operation, +</P> +<PRE> + oper det : + Str -> Number -> {s : Number => Str} -> {s : Str ; n : Number} = + \det,n,kind -> { + s = det ++ kind.s ! n ; + n = n + } ; +</PRE> +<P> +Now we can write, for instance, +</P> +<PRE> + lin This = det Sg "this" ; + lin These = det Pl "these" ; +</PRE> +<P> +Notice the order of arguments that permits partial +application (<a href="#secpartapp">here</a>). +</P> +<P> +In a more <B>lexicalized</B> grammar, determiners would be made into a +category of their own and given an inherent number: +</P> +<PRE> + lincat Det = {s : Str ; n : Number} ; + fun Det : Det -> Kind -> Item ; + lin Det det kind = { + s = det.s ++ kind.s ! det.n ; + n = det.n + } ; +</PRE> +<P> +Linguistically motivated grammars, such as the GF resource grammars, +usually favour lexicalized treatments of words; see <a href="#seclexical">here</a> below. +Notice that the fields of the record in <CODE>Det</CODE> are precisely the two +arguments needed in the <CODE>det</CODE> operation. +</P> +<A NAME="toc59"></A> +<H3>Parametric vs. inherent features</H3> +<P> +<CODE>Kind</CODE>s, as in general <B>common nouns</B> in English, have both singular +and plural forms; what form is chosen is determined by the construction +in which the noun is used. We say that the number is a +<B>parametric feature</B> of nouns. In GF, parametric features +appear as argument types of tables in linearization types. +</P> +<PRE> + lincat Kind = {s : Number => Str} ; +</PRE> +<P> +<CODE>Item</CODE>s, as in general <B>noun phrases</B> in English, don't +have variation in number. The number is instead an <B>inherent feature</B>, +which the noun phrase passes to the verb. In GF, inherent features +appear as record fields in linearization types. +</P> +<PRE> + lincat Item = {s : Str ; n : Number} ; +</PRE> +<P> +A category can have both parametric and inherent features. As we will see +in the Italian <CODE>Foods</CODE> grammar, nouns have parametric number and +inherent gender: +</P> +<PRE> + lincat Kind = {s : Number => Str ; g : Gender} ; +</PRE> +<P> +Nothing prevents the same parameter type from appearing both +as parametric and inherent feature, or the appearance of several inherent +features of the same type, etc. Determining the linearization types +of categories is one of the most crucial steps in the design of a GF +grammar. These two conditions must be in balance: +</P> +<UL> +<LI>existence: what forms are possible to build by morphological and + other means? +<LI>need: what features are expected via agreement or government? +</UL> + +<P> +Grammar books and dictionaries give good advice on existence; for instance, +an Italian dictionary has entries such as +<center> +<B>uomo</B>, pl. <I>uomini</I>, n.m. "man" +</center> +which tells that <I>uomo</I> is a masculine noun with the plural form <I>uomini</I>. +From this alone, or with a couple more examples, we can generalize to the type +for all nouns in Italian: they have both singular and plural forms and thus +a parametric number, and they have an inherent gender. +</P> +<P> +The distinction between parametric and inherent features can be stated in +object-oriented programming terms: a linearization type is like a <B>class</B>, +which has a <B>method</B> for linearization and also some <B>attributes</B>. +In this class, the parametric features appear as arguments to the +linearization method, whereas the inherent features appear as attributes. +</P> +<P> +For words, inherent features are usually given <I>ad hoc</I> as lexical information. +For combinations, they are typically <I>inherited</I> from some part of the construction. +For instance, qualified noun constructs in Italian inherit their gender from noun part +(called the <B>head</B> of the construction in linguistics): +</P> +<PRE> + lin QKind qual kind = + let gen = kind.g in { + s = table {n => kind.s ! n ++ qual.s ! gen ! n} ; + g = gen + } ; +</PRE> +<P> +This rule uses a <B>local definition</B> (also known as a <B>let expression</B>) to +avoid computing <CODE>kind.g</CODE> twice, and also to express the linguistic +generalization that it is the same gender that is both passed to +the adjective and inherited by the construct. +The parametric number feature is in this rule passed to both the noun and +the adjective. In the table, a <B>variable pattern</B> is used to match +any possible number. Variables introduced in patterns are in scope in +the right-hand sides of corresponding branches. Again, it is good to +use a variable to express the linguistic generalization that the number +is passed to the parts, rather than expand the table into <CODE>Sg</CODE> and <CODE>Pl</CODE> +branches. +</P> +<P> +Sometimes the puzzle of making agreement and government work in a grammar has +several solutions. For instance, <B>precedence</B> in programming languages can +be equivalently described by a parametric or an inherent feature +(see <a href="#secprecedence">here</a> below). +</P> +<P> +In natural language applications that use the resource grammar library, +all parameters are hidden from the user, who thereby does not need to bother +about them. The only thing that she has to think about is what linguistic +categories are given as linearization types to each semantic category. +</P> +<P> +For instance, the GF resource grammar library has a category <CODE>NP</CODE> of +noun phrases, <CODE>AP</CODE> of adjectival phrases, and <CODE>Cl</CODE> of sentence-like clauses. +In the implementation of <CODE>Foods</CODE> <a href="#secenglish">here</a>, we will define +</P> +<PRE> + lincat Phrase = Cl ; Item = NP ; Quality = AP ; +</PRE> +<P> +To express that an item has a quality, we will use a resource function +</P> +<PRE> + mkCl : NP -> AP -> Cl ; +</PRE> +<P> +in the linearization rule: +</P> +<PRE> + lin Is = mkCl ; +</PRE> +<P> +In this way, we have no need to think about parameters and agreement. +<a href="#chapfive">the fifth chapter</a> will show a complete implementation of <CODE>Foods</CODE> by the +resource grammar, port it to many new languages, and extend it with +many new constructs. +</P> +<A NAME="toc60"></A> +<H2>An English concrete syntax for Foods with parameters</H2> +<P> +We repeat some of the rules above by showing the entire +module <CODE>FoodsEng</CODE>, equipped with parameters. The parameters and +operations are, for the sake of brevity, included in the same module +and not in a separate <CODE>resource</CODE>. However, some string operations +from the library <CODE>Prelude</CODE> are used. +</P> +<PRE> + --# -path=.:prelude + + concrete FoodsEng of Foods = open Prelude in { + + lincat + S, Quality = SS ; + Kind = {s : Number => Str} ; + Item = {s : Str ; n : Number} ; + + lin + Is item quality = ss (item.s ++ copula item.n ++ quality.s) ; + This = det Sg "this" ; + That = det Sg "that" ; + These = det Pl "these" ; + Those = det Pl "those" ; + QKind quality kind = {s = table {n => quality.s ++ kind.s ! n}} ; + Wine = regNoun "wine" ; + Cheese = regNoun "cheese" ; + Fish = noun "fish" "fish" ; + Pizza = regNoun "pizza" ; + Very = prefixSS "very" ; + Fresh = ss "fresh" ; + Warm = ss "warm" ; + Italian = ss "Italian" ; + Expensive = ss "expensive" ; + Delicious = ss "delicious" ; + Boring = ss "boring" ; + + param + Number = Sg | Pl ; + + oper + det : Number -> Str -> {s : Number => Str} -> {s : Str ; n : Number} = + \n,d,cn -> { + s = d ++ cn.s ! n ; + n = n + } ; + noun : Str -> Str -> {s : Number => Str} = + \man,men -> {s = table { + Sg => man ; + Pl => men + } + } ; + regNoun : Str -> {s : Number => Str} = + \car -> noun car (car + "s") ; + copula : Number -> Str = + \n -> case n of { + Sg => "is" ; + Pl => "are" + } ; + } +</PRE> +<P> +To find the Prelude library --- or in general, +GF files located in other directories, a <B>path directive</B> is needed +either on the command line or as the first line of +the topmost file compiled. +The paths in the path list are separated by colons (<CODE>:</CODE>), and every item +is interpreted primarily relative to the current directory and, secondarily, +to the value of <CODE>GF_LIB_PATH</CODE> (<B>GF library path</B>). Hence it is a +good idea to make <CODE>GF_LIB_PATH</CODE> to point into your <CODE>GF/lib/</CODE> whenever +you start working in GF. For instance, in the Bash shell this is done by +</P> +<PRE> + % export GF_LIB_PATH=<the location of GF/lib in your file system> +</PRE> +<P></P> +<A NAME="toc61"></A> +<H2>More on inflection paradigms</H2> +<P> +<a name="secinflection"></a> +</P> +<P> +Let us try to extend the English noun paradigms so that we can +deal with all nouns, not just the regular ones. The goal is to +provide a morphology module that is maximally easy to use when +words are added to the lexicon. In fact, we can think of a +division of labour where a linguistically trained grammarian +writes a morphology and hands it over to the lexicon writer +who knows much less about the rules of inflection. +</P> +<P> +In passing, we will introduce some new GF constructs: local definitions, +regular expression patterns, and operation overloading. +</P> +<A NAME="toc62"></A> +<H3>Worst-case functions</H3> +<P> +To start with, it is useful to perform <B>data abstraction</B> from the type +of nouns by writing a constructor operation, a <B>worst-case function</B>: +</P> +<PRE> + oper mkNoun : Str -> Str -> Noun = \x,y -> { + s = table { + Sg => x ; + Pl => y + } + } ; +</PRE> +<P> +This presupposes that we have defined +</P> +<PRE> + oper Noun : Type = {s : Number => Str} ; +</PRE> +<P> +Using <CODE>mkNoun</CODE>, we can define +</P> +<PRE> + lin Mouse = mkNoun "mouse" "mice" ; +</PRE> +<P> +and +</P> +<PRE> + oper regNoun : Str -> Noun = \x -> mkNoun x (x + "s") ; +</PRE> +<P> +instead of writing the inflection tables explicitly. +</P> +<P> +Nouns like <I>mouse</I>-<I>mice</I>, are so irregular that +it hardly makes sense to see them as instances of a +paradigm that forms the plural from the singular form. +But in general, as we will see, there can be different +regular patterns in a language. +</P> +<P> +The grammar engineering advantage of worst-case functions is that +the author of the resource module may change the definitions of +<CODE>Noun</CODE> and <CODE>mkNoun</CODE>, and still retain the +interface (i.e. the system of type signatures) that makes it +correct to use these functions in concrete modules. In programming +terms, <CODE>Noun</CODE> is then treated as an <B>abstract datatype</B>: +its definition is not available, but only an indirect way of constructing +its objects. +</P> +<P> +A case where a change of the <CODE>Noun</CODE> type could +actually happen is if we introduces <B>case</B> (nominative or +genitive) in the noun inflection: +</P> +<PRE> + param Case = Nom | Gen ; + + oper Noun : Type = {s : Number => Case => Str} ; +</PRE> +<P> +Now we have to redefine the worst-case function +</P> +<PRE> + oper mkNoun : Str -> Str -> Noun = \x,y -> { + s = table { + Sg => table { + Nom => x ; + Gen => x + "'s" + } ; + Pl => table { + Nom => y ; + Gen => y + case last y of { + "s" => "'" ; + _ => "'s" + } + } + } ; +</PRE> +<P> +But up from this level, we can retain the old definitions +</P> +<PRE> + lin Mouse = mkNoun "mouse" "mice" ; + oper regNoun : Str -> Noun = \x -> mkNoun x (x + "s") ; +</PRE> +<P> +which will just compute to different values now. +</P> +<P> +In the last definition of <CODE>mkNoun</CODE>, we used a case expression +on the last character of the plural form to decide if the genitive +should be formed with an <CODE>'</CODE> (as in <I>dogs</I>-<I>dogs'</I>) or with +<CODE>'s</CODE> (as in <I>mice</I>-<I>mice's</I>). The expression <CODE>last y</CODE> +uses the <CODE>Prelude</CODE> operation +</P> +<PRE> + last : Str -> Str ; +</PRE> +<P> +The case expression uses <B>pattern matching over strings</B>, which +is supported in GF, alongside with pattern matching over +parameters. +</P> +<A NAME="toc63"></A> +<H3>Intelligent paradigms</H3> +<P> +Between the completely regular <I>dog</I>-<I>dogs</I> and the completely +irregular <I>mouse</I>-<I>mice</I>, there are some +predictable variations: +</P> +<UL> +<LI>nouns ending with an <I>y</I>: <I>fly</I>-<I>flies</I>, except if + a vowel precedes the <I>y</I>: <I>boy</I>-<I>boys</I> +<LI>nouns ending with <I>s</I>, <I>ch</I>, and a number of + other endings: <I>bus</I>-<I>buses</I>, <I>leech</I>-<I>leeches</I> +</UL> + +<P> +One way to deal with them would be to provide alternative paradigms: +</P> +<PRE> + noun_y : Str -> Noun = \fly -> mkNoun fly (init fly + "ies") ; + noun_s : Str -> Noun = \bus -> mkNoun bus (bus + "es") ; +</PRE> +<P> +The Prelude function <CODE>init</CODE> drops the last character of a token. +But this solution has some drawbacks: +</P> +<UL> +<LI>it can be difficult to select the correct paradigm +<LI>it can be difficult to remember the names of the different paradigms +</UL> + +<P> +To help the lexicon builder in this task, the morphology programmer +can put some intelligence in the regular noun paradigm. The easiest +way to express this in GF is by the use of <B>regular expression patterns</B>: +</P> +<PRE> + regNoun : Str -> Noun = \w -> + let + ws : Str = case w of { + _ + ("a" | "e" | "i" | "o") + "o" => w + "s" ; -- bamboo + _ + ("s" | "x" | "sh" | "o") => w + "es" ; -- bus, hero + _ + "z" => w + "zes" ;-- quiz + _ + ("a" | "e" | "o" | "u") + "y" => w + "s" ; -- boy + x + "y" => x + "ies" ;-- fly + _ => w + "s" -- car + } + in + mkNoun w ws +</PRE> +<P> +In this definition, we have used a local definition just in order to +structure the code, even though there is no multiple evaluation to eliminate. +In the case expression itself, we have used +</P> +<UL> +<LI><B>disjunctive patterns</B> <I>P</I> <CODE>|</CODE> <I>Q</I> +<LI><B>concatenation patterns</B> <I>P</I> <CODE>+</CODE> <I>Q</I> +</UL> + +<P> +The patterns are ordered in such a way that, for instance, +the suffix <CODE>"oo"</CODE> prevents <I>bamboo</I> from matching the suffix +<CODE>"o"</CODE>. +</P> +<P> +<B>Exercise</B>. The same rules that form plural nouns in English also +apply in the formation of third-person singular verbs. +Write a regular verb paradigm that uses this idea, but first +rewrite <CODE>regNoun</CODE> so that the analysis needed to build <I>s</I>-forms +is factored out as a separate <CODE>oper</CODE>, which is shared with +<CODE>regVerb</CODE>. +</P> +<P> +<B>Exercise</B>. Extend the verb paradigms to cover all verb forms +in English, with special care taken of variations with the suffix +<I>ed</I> (e.g. <I>try</I>-<I>tried</I>, <I>use</I>-<I>used</I>). +</P> +<P> +<B>Exercise</B>. Implement the German <B>Umlaut</B> operation on word stems. +The operation changes the vowel of the stressed stem syllable as follows: +<I>a</I> to <I>ä</I>, <I>au</I> to <I>äu</I>, <I>o</I> to <I>ö</I>, and <I>u</I> to <I>ü</I>. You +can assume that the operation only takes syllables as arguments. Test the +operation to see whether it correctly changes <I>Arzt</I> to <I>Ärzt</I>, +<I>Baum</I> to <I>Bäum</I>, <I>Topf</I> to <I>Töpf</I>, and <I>Kuh</I> to <I>Küh</I>. +</P> +<A NAME="toc64"></A> +<H3>Function types with variables</H3> +<P> +In <a href="#chapsix">the sixth chapter</a>, we will introduce <B>dependent function types</B>, where +the value type depends on the argument. For this end, we need a notation +that binds a variable to the argument type, as in +</P> +<PRE> + switchOff : (k : Kind) -> Action k +</PRE> +<P> +Function types <I>without</I> +variables are actually a shorthand notation: writing +</P> +<PRE> + PredVP : NP -> VP -> S +</PRE> +<P> +is shorthand for +</P> +<PRE> + PredVP : (x : NP) -> (y : VP) -> S +</PRE> +<P> +or any other naming of the variables. Actually the use of variables +sometimes shortens the code, since they can share a type: +</P> +<PRE> + octuple : (x,y,z,u,v,w,s,t : Str) -> Str +</PRE> +<P> +If a bound variable is not used, it can here, as elsewhere in GF, be replaced by +a wildcard: +</P> +<PRE> + octuple : (_,_,_,_,_,_,_,_ : Str) -> Str +</PRE> +<P> +A good practice for functions with many arguments of the same type +is to indicate the number of arguments: +</P> +<PRE> + octuple : (x1,_,_,_,_,_,_,x8 : Str) -> Str +</PRE> +<P> +One can also use heuristic variable names to document what +information each argument is expected to provide. +This is very handy in the types of inflection paradigms: +</P> +<PRE> + mkNoun : (mouse,mice : Str) -> Noun +</PRE> +<P></P> +<A NAME="toc65"></A> +<H3>Separating operation types and definitions</H3> +<P> +In grammars intended as libraries, it is useful to separate oparation +definitions from their type signatures. The user is only interested +in the type, whereas the definition is kept for the implementor and +the maintainer. This is possible by using separate <CODE>oper</CODE> fragments +for the two parts: +</P> +<PRE> + oper regNoun : Str -> Noun ; + oper regNoun s = mkNoun s (s + "s") ; +</PRE> +<P> +The type checker combines the two into one <CODE>oper</CODE> judgement to see +if the definition matches the type. Notice that, in this syntax, it +is moreover possible to bind the argument variables on the left hand side +instead of using lambda abstration. +</P> +<P> +In the library module, the type signatures are typically placed in +the beginning and the definitions in the end. A more radical separation +can be achieved by using the <CODE>interface</CODE> and <CODE>instance</CODE> module types +(see <a href="#secinterface">here</a>): the type signatures are placed in the interface +and the definitions in the instance. +</P> +<A NAME="toc66"></A> +<H3>Overloading of operations</H3> +<P> +Large libraries, such as the GF Resource Grammar Library, may define +hundreds of names. This can be unpractical +for both the library author and the user: the author has to invent longer +and longer names which are not always intuitive, +and the author has to learn or at least be able to find all these names. +A solution to this problem, adopted by languages such as C++, +is <B>overloading</B>: one and the same name can be used for several functions. +When such a name is used, the +compiler performs <B>overload resolution</B> to find out which of +the possible functions is meant. Overload resolution is based on +the types of the functions: all functions that +have the same name must have different types. +</P> +<P> +In C++, functions with the same name can be scattered everywhere in the program. +In GF, they must be grouped together in <CODE>overload</CODE> groups. Here is an example +of an overload group, giving the different ways to define nouns in English: +</P> +<PRE> + oper mkN : overload { + mkN : (dog : Str) -> Noun ; -- regular nouns + mkN : (mouse,mice : Str) -> Noun ; -- irregular nouns + } +</PRE> +<P> +Intuitively, the function comes very close to the way in which +regular and irregular words are given in most dictionaries. If the +word is regular, just one form is needed. If it is irregular, +more forms are given. There is no need to use explicit paradigm +names. +</P> +<P> +The <CODE>mkN</CODE> example gives only the possible types of the overloaded +operation. Their definitions can be given separately, possibly in another module. +Here is a definition of the above overload group: +</P> +<PRE> + oper mkN = overload { + mkN : (dog : Str) -> Noun = regNoun ; + mkN : (mouse,mice : Str) -> Noun = mkNoun ; + } +</PRE> +<P> +Notice that the types of the branches must be repeated so that they can be +associated with proper definitions; the order of the branches has no +significance. +</P> +<P> +<B>Exercise</B>. Design a system of English verb paradigms presented by +an overload group. +</P> +<A NAME="toc67"></A> +<H3>Morphological analysis and morphology quiz</H3> +<P> +Even though morphology is in GF +mostly used as an auxiliary for syntax, it +can also be useful on its own right. The command <CODE>morpho_analyse = ma</CODE> +can be used to read a text and return for each word the analyses that +it has in the current concrete syntax. +</P> +<PRE> + > read_file bible.txt | morpho_analyse +</PRE> +<P> +In the same way as translation exercises, morphological exercises can +be generated, by the command <CODE>morpho_quiz = mq</CODE>. Usually, +the category is then set to some lexical category. For instance, +French irregular verbs in the resource grammar library can be trained as +follows: +</P> +<PRE> + % gf -path=alltenses:prelude $GF_LIB_PATH/alltenses/IrregFre.gfc + + > morpho_quiz -cat=V + + Welcome to GF Morphology Quiz. + ... + + réapparaître : VFin VCondit Pl P2 + réapparaitriez + > No, not réapparaitriez, but + réapparaîtriez + Score 0/1 +</PRE> +<P> +Just like translation exercises, a list of morphological exercises can be generated +off-line and saved in a +file for later use, by the command <CODE>morpho_list = ml</CODE> +</P> +<PRE> + > morpho_list -number=25 -cat=V | write_file exx.txt +</PRE> +<P> +The <CODE>number</CODE> flag gives the number of exercises generated. +</P> +<A NAME="toc68"></A> +<H2>The Italian Foods grammar</H2> +<P> +<a name="secitalian"></a> +</P> +<P> +We conclude the parametrization of the Food grammar by presenting an +Italian variant, now complete with parameters, inflection, and +agreement. +</P> +<P> +The header part is similar to English: +</P> +<PRE> + --# -path=.:prelude + + concrete FoodsIta of Foods = open Prelude in { +</PRE> +<P> +Parameters include not only number but also gender. +</P> +<PRE> + param + Number = Sg | Pl ; + Gender = Masc | Fem ; +</PRE> +<P> +Qualities are inflected for gender and number, whereas kinds +have a parametric number (as in English) and an inherent gender. +Items have an inherent number (as in English) but also gender. +</P> +<PRE> + lincat + Phr = SS ; + Quality = {s : Gender => Number => Str} ; + Kind = {s : Number => Str ; g : Gender} ; + Item = {s : Str ; g : Gender ; n : Number} ; +</PRE> +<P> +A Quality is expressed by an adjective, which in Italian has one form for each +gender-number combination. +</P> +<PRE> + oper + adjective : (_,_,_,_ : Str) -> {s : Gender => Number => Str} = + \nero,nera,neri,nere -> { + s = table { + Masc => table { + Sg => nero ; + Pl => neri + } ; + Fem => table { + Sg => nera ; + Pl => nere + } + } + } ; +</PRE> +<P> +The very common case of regular adjectives works by adding +endings to the stem. +</P> +<PRE> + regAdj : Str -> {s : Gender => Number => Str} = \nero -> + let ner = init nero + in adjective nero (ner + "a") (ner + "i") (ner + "e") ; +</PRE> +<P></P> +<P> +For noun inflection, there are several paradigms; since only two forms +are ever needed, we will just give them explicitly (the resource grammar +library also has a paradigm that takes the singular form and infers the +plural and the gender from it). +</P> +<PRE> + noun : Str -> Str -> Gender -> {s : Number => Str ; g : Gender} = + \vino,vini,g -> { + s = table { + Sg => vino ; + Pl => vini + } ; + g = g + } ; +</PRE> +<P> +As in <CODE>FoodEng</CODE>, we need only number variation for the copula. +</P> +<PRE> + copula : Number -> Str = + \n -> case n of { + Sg => "è" ; + Pl => "sono" + } ; +</PRE> +<P> +Determination is more complex than in English, because of gender: +it uses separate determiner forms for the two genders, and selects +one of them as function of the noun determined. +</P> +<PRE> + det : Number -> Str -> Str -> {s : Number => Str ; g : Gender} -> + {s : Str ; g : Gender ; n : Number} = + \n,m,f,cn -> { + s = case cn.g of {Masc => m ; Fem => f} ++ cn.s ! n ; + g = cn.g ; + n = n + } ; +</PRE> +<P> +Here is, finally, the complete set of linearization rules. +</P> +<PRE> + lin + Is item quality = + ss (item.s ++ copula item.n ++ quality.s ! item.g ! item.n) ; + This = det Sg "questo" "questa" ; + That = det Sg "quello" "quella" ; + These = det Pl "questi" "queste" ; + Those = det Pl "quelli" "quelle" ; + QKind quality kind = { + s = \\n => kind.s ! n ++ quality.s ! kind.g ! n ; + g = kind.g + } ; + Wine = noun "vino" "vini" Masc ; + Cheese = noun "formaggio" "formaggi" Masc ; + Fish = noun "pesce" "pesci" Masc ; + Pizza = noun "pizza" "pizze" Fem ; + Very qual = {s = \\g,n => "molto" ++ qual.s ! g ! n} ; + Fresh = adjective "fresco" "fresca" "freschi" "fresche" ; + Warm = regAdj "caldo" ; + Italian = regAdj "italiano" ; + Expensive = regAdj "caro" ; + Delicious = regAdj "delizioso" ; + Boring = regAdj "noioso" ; + } +</PRE> +<P></P> +<P> +<B>Exercise</B>. Experiment with multilingual generation and translation in the +<CODE>Foods</CODE> grammars. +</P> +<P> +<B>Exercise</B>. Add items, qualities, and determiners to the grammar, and try to get +their inflection and inherent features right. +</P> +<P> +<B>Exercise</B>. Write a concrete syntax of <CODE>Food</CODE> for a language of your choice, +now aiming for complete grammatical correctness by the use of parameters. +</P> +<P> +<B>Exercise</B>. Measure the size of the context-free grammar corresponding to +<CODE>FoodsIta</CODE>. You can do this by printing the grammar in the context-free format +(<CODE>print_grammar -printer=cfg</CODE>) and counting the lines. +</P> +<A NAME="toc69"></A> +<H2>Discontinuous constituents</H2> +<P> +A linearization type may contain more strings than one. +An example of where this is useful are English particle +verbs, such as <I>switch off</I>. The linearization of +a sentence may place the object between the verb and the particle: +<I>he switched it off</I>. +</P> +<P> +The following judgement defines transitive verbs as +<B>discontinuous constituents</B>, i.e. as having a linearization +type with two strings and not just one. +</P> +<PRE> + lincat TV = {s : Number => Str ; part : Str} ; +</PRE> +<P> +In the abstract syntax, we can now have a rule that combines a subject and an +object item with a transitive verb to form a sentence: +</P> +<PRE> + fun AppTV : Item -> TV -> Item -> Phrase ; +</PRE> +<P> +The linearization rule places the object between the two parts of the verb: +</P> +<PRE> + lin AppTV subj tv obj = + {s = subj.s ++ tv.s ! subj.n ++ obj.s ++ tv.part} ; +</PRE> +<P> +There is no restriction in the number of discontinuous constituents +(or other fields) a <CODE>lincat</CODE> may contain. The only condition is that +the fields must be built from records, tables, +parameters, and <CODE>Str</CODE>, but not functions. +</P> +<P> +Notice that the parsing and linearization commands only give accurate +results for categories whose linearization type has a unique <CODE>Str</CODE> +valued field labelled <CODE>s</CODE>. Therefore, discontinuous constituents +are not a good idea in top-level categories accessed by the users +of a grammar application. +</P> +<P> +<B>Exercise</B>. Define the language <CODE>a^n b^n c^n</CODE> in GF, i.e. +any number of <I>a</I>'s followed by the same number of <I>b</I>'s and +the same number of <I>c</I>'s. This language is not context-free, +but can be defined in GF by using discontinuous constituents. +</P> +<A NAME="toc70"></A> +<H2>Strings at compile time vs. run time</H2> +<P> +A common difficulty in GF are the conditions under which tokens +can be created. Tokens are created in the following ways: +</P> +<UL> +<LI>quoted string: <CODE>"foo"</CODE> +<LI>gluing : <CODE>t + s</CODE> +<LI>predefined operations <CODE>init, tail, tk, dp</CODE> +<LI>pattern matching over strings +</UL> + +<P> +The general principle is that +<I>tokens must be known at compile time</I>. This means that the above operations +may not have <B>run-time variables</B> in their arguments. Run-time variables, in +turn, are the variables that stand for function arguments in linearization rules. +</P> +<P> +Hence it is not legal to write +</P> +<PRE> + cat Noun ; + fun Plural : Noun -> Noun ; + lin Plural n = {s = n.s + "s"} ; +</PRE> +<P> +because <CODE>n</CODE> is a run-time variable. Also +</P> +<PRE> + lin Plural n = {s = (regNoun n).s ! Pl} ; +</PRE> +<P> +is incorrect with <CODE>regNoun</CODE> as defined <a href="#secinflection">here</a>, because the run-time +variable is eventually sent to string pattern matching and gluing. +</P> +<P> +Writing tokens together without a space is an often-wanted behaviour, for instance, +with punctuation marks. Thus one might try to write +</P> +<PRE> + lin Question p = {s = p + "?"} ; +</PRE> +<P> +which is incorrect. The way to go is to use an <B>unlexer</B> that creates correct spacing +after linearization. Correspondingly, a <B>lexer</B> that e.g. analyses <CODE>"warm?"</CODE> into +to tokens is needed before parsing. This can be done by using flags: +</P> +<PRE> + flags lexer=text ; unlexer=text ; +</PRE> +<P> +works in the desired way for English text. More on lexers and unlexers will be +told <a href="#seclexing">here</a>. +</P> +<A NAME="toc71"></A> +<H2>Summary of GF language features</H2> +<A NAME="toc72"></A> +<H3>Parameter and table types</H3> +<P> +A judgement of the form +<center> + <CODE>param</CODE> <I>P</I> <CODE>=</CODE> <I>C1</I> <I>X1</I> <CODE>|</CODE> ... <CODE>|</CODE> <I>Cn</I> <I>Xn</I> +</center> +defines a <B>parameter type</B> <I>P</I> with <B>constructors</B> <I>C1</I> ... <I>Cn</I>. +Each constructor has a <B>context</B> <I>X</I>, which is a (possibly empty) +sequence of parameter types. A <B>parameter value</B> is an application +of a constructor to a sequence of parameter values from each type in +its context. +</P> +<P> +In addition to types defined in <CODE>param</CODE> judgements, also +records of parameter types are parameter types. Their values are records +of corresponding field values. +</P> +<P> +Moreover, the type <CODE>Ints</CODE> <I>n</I> is a parameter type for any positive +integer <I>n</I>, and its values are <CODE>0</CODE>, ..., <I>n-1</I>. +</P> +<P> +A <B>table type</B> <I>P</I> <CODE>=></CODE> <I>T</I> must have a parameter type <I>P</I> as +its argument type. The normal form of an object of this type is a <B>table</B> +<center> + <CODE>table {</CODE> <I>V1</I> <CODE>=></CODE> <I>t1</I> <CODE>;</CODE> ... <CODE>;</CODE> <I>Vm</I> <CODE>=></CODE> <I>tm</I> <CODE>}</CODE> +</center> +which has a <B>branch</B> for every parameter value <I>Vi</I> of type <I>P</I>. +A table can be given in many other ways by using pattern matching. +</P> +<P> +Tables with only one branch are a common special case. +GF provides syntactic sugar for writing one-branch tables concisely: +</P> +<PRE> + \\P,...,Q => t === table {P => ... table {Q => t} ...} +</PRE> +<P></P> +<A NAME="toc73"></A> +<H3>Pattern matching</H3> +<P> +<a name="secmatching"></a> +</P> +<P> +We will list all forms of patterns that can be used in table branches. +the following are available for any parameter types, as well +as for the types <CODE>Int</CODE> and <CODE>Str</CODE> +</P> +<UL> +<LI>a constructor pattern <I>C P1 ... Pn</I> matches any value <I>C V1 ... Vn</I> where + each <I>Vi</I> matches <I>Pi</I>, + and binds the union of all variables bound in the subpatterns <I>Pi</I> +<LI>a record pattern + <CODE>{</CODE> <I>r1</I> <CODE>=</CODE> <I>P1</I> <CODE>;</CODE> ... <CODE>;</CODE> <I>r1</I> <CODE>=</CODE> <I>P1</I> <CODE>}</CODE> + matches any record that has values of the corresponding fields. + and binds the union of all variables bound in the subpatterns <I>Pi</I> +<LI>a variable pattern <I>x</I> + (identifier other than constant parameter) matches any value, and + binds <I>x</I> to this value +<LI>the wild card <CODE>_</CODE> matches any value +<LI>a disjunctive pattern <I>P</I> <CODE>|</CODE> <I>Q</I> matches anything that + either <I>P</I> or <I>Q</I> matches; bindings must be the same in both +<LI>a negative pattern <CODE>-</CODE><I>P</I> matches anything that <I>P</I> does not match; + no bindings are returned +<LI>an alias pattern <I>x</I> <CODE>@</CODE> <I>P</I> matches whatever value <I>P</I> matches and + binds <I>x</I> to this value; also the bindings in <I>P</I> are returned +</UL> + +<P> +The following patterns are only available for the type <CODE>Str</CODE>: +</P> +<UL> +<LI>a string literal pattern, e.g. <CODE>"s"</CODE>, matches the same string +<LI>a concatenation pattern <I>P</I> <CODE>+</CODE> <I>Q</I> matches any string that consists + of a prefix matching <I>P</I> and a suffix matching <I>Q</I>; + the union of bindings is returned +<LI>a repetition pattern <I>P</I><CODE>*</CODE> matches any string that can be decomposed + into strings that match <I>P</I>; no bindings are returned +</UL> + +<P> +The following pattern is only available for the types <CODE>Int</CODE> and <CODE>Ints</CODE> <I>n</I>: +</P> +<UL> +<LI>an integer literal pattern, e.g. <CODE>214</CODE>, matches the same integer +</UL> + +<P> +Pattern matching is performed in the order in which the branches +appear in the table: the branch of the first matching pattern is followed. +The type checker reject sets of patterns that are not exhaustive, and +warns for completely overshadowed patterns. +To guarantee exhaustivity when the infinite types <CODE>Int</CODE> and <CODE>Str</CODE> are +used as argument types, the last pattern must be a "catch-all" variable +or wild card. +</P> +<P> +It follows from the definition of record pattern matching +that it can utilize partial records: the branch +</P> +<PRE> + {g = Fem} => t +</PRE> +<P> +in a table of type <CODE>{g : Gender ; n : Number} => T</CODE> means the same as +</P> +<PRE> + {g = Fem ; n = _} => t +</PRE> +<P> +Variables in regular expression patterns +are always bound to the <B>first match</B>, which is the first +in the sequence of binding lists. For example: +</P> +<UL> +<LI><CODE>x + "e" + y</CODE> matches <CODE>"peter"</CODE> with <CODE>x = "p", y = "ter"</CODE> +<LI><CODE>x + "er"*</CODE> matches <CODE>"burgerer"</CODE> with ``x = "burg" +</UL> + +<A NAME="toc74"></A> +<H3>Overloading</H3> +<P> +Judgements of the <CODE>oper</CODE> form can introduce overloaded functions. +The syntax is record-like, but all fields must have the same +name and different types. +</P> +<PRE> + oper mkN = overload { + mkN : (dog : Str) -> Noun = regNoun ; + mkN : (mouse,mice : Str) -> Noun = mkNoun ; + } +</PRE> +<P> +To give just the type of an overloaded operation, the record type +syntax is used. +</P> +<PRE> + oper mkN : overload { + mkN : (dog : Str) -> Noun ; -- regular nouns + mkN : (mouse,mice : Str) -> Noun ; -- irregular nouns + } +</PRE> +<P> +Overloading is not possible in other forms of judgement. +</P> +<A NAME="toc75"></A> +<H3>Local definitions</H3> +<P> +Local definitions ("<CODE>let</CODE> expressions") can appear in groups: +</P> +<PRE> + oper regNoun : Str -> Noun = \vino -> + let + vin : Str = init vino ; + o = last vino + in + ... +</PRE> +<P> +The type can be omitted if it can be inferred. Later definitions may +refer to earlier ones. +</P> +<A NAME="toc76"></A> +<H3>Supplementary constructs</H3> +<P> +The rest of the GF language constructs are presented for the sake +of completeness. They will not be used in the rest of this tutorial. +</P> +<H4>Record extension and subtyping</H4> +<P> +Record types and records can be <B>extended</B> with new fields. For instance, +in German it is natural to see transitive verbs as verbs with a case, which +is usually accusative or dative, and is passed to the object of the verb. +The symbol <CODE>**</CODE> is used for both record types and record objects. +</P> +<PRE> + lincat TV = Verb ** {c : Case} ; + + lin Follow = regVerb "folgen" ** {c = Dative} ; +</PRE> +<P> +To extend a record type or a record with a field whose label it +already has is a type error. It is also an error to extend a type or +object that is not a record. +</P> +<P> +A record type <I>T</I> is a <B>subtype</B> of another one <I>R</I>, if <I>T</I> has +all the fields of <I>R</I> and possibly other fields. For instance, +an extension of a record type is always a subtype of it. +If <I>T</I> is a subtype of <I>R</I>, then <I>R</I> is a <B>supertype</B> of <I>T</I>. +</P> +<P> +If <I>T</I> is a subtype of <I>R</I>, an object of <I>T</I> can be used whenever +an object of <I>R</I> is required. +For instance, a transitive verb can be used whenever a verb is required. +</P> +<P> +<B>Covariance</B> means that a function returning a record <I>T</I> as value can +also be used to return a value of a supertype <I>R</I>. +<B>Contravariance</B> means that a function taking an <I>R</I> as argument +can also be applied to any object of a subtype <I>T</I>. +</P> +<H4>Tuples and product types</H4> +<P> +Product types and tuples are syntactic sugar for record types and records: +</P> +<PRE> + T1 * ... * Tn === {p1 : T1 ; ... ; pn : Tn} + <t1, ..., tn> === {p1 = T1 ; ... ; pn = Tn} +</PRE> +<P> +Thus the labels <CODE>p1, p2,...</CODE> are hard-coded. +As patterns, tuples are translated to record patterns in the +same way as tuples to records; partial patterns make it +possible to write, slightly surprisingly, +</P> +<PRE> + case <g,n,p> of { + <Fem> => t + ... + } +</PRE> +<P></P> +<H4>Prefix-dependent choices</H4> +<P> +Sometimes a token has different forms depending on the token +that follows. An example is the English indefinite article, +which is <I>an</I> if a vowel follows, <I>a</I> otherwise. +Which form is chosen can only be decided at run time, i.e. +when a string is actually build. GF has a special construct for +such tokens, the <CODE>pre</CODE> construct exemplified in +</P> +<PRE> + oper artIndef : Str = + pre {"a" ; "an" / strs {"a" ; "e" ; "i" ; "o"}} ; +</PRE> +<P> +Thus +</P> +<PRE> + artIndef ++ "cheese" ---> "a" ++ "cheese" + artIndef ++ "apple" ---> "an" ++ "apple" +</PRE> +<P> +This very example does not work in all situations: the prefix +<I>u</I> has no general rules, and some problematic words are +<I>euphemism, one-eyed, n-gram</I>. Since the branches are matched in +order, it is possible to write +</P> +<PRE> + oper artIndef : Str = + pre {"a" ; + "a" / strs {"eu" ; "one"} ; + "an" / strs {"a" ; "e" ; "i" ; "o" ; "n-"} + } ; +</PRE> +<P> +Somewhat illogically, the default value is given as the first element in the list. +</P> +<P> +<I>Prefix-dependent choice may be deprecated in GF version 3.</I> +</P> +<A NAME="toc77"></A> +<H1>Using the resource grammar library</H1> +<P> +<a name="chapfive"></a> +</P> +<P> +In this chapter, we will take a look at the GF resource grammar library. +We will use the library to implement the <CODE>Foods</CODE> grammar of the +previous chapter +and port it to some new languages. Some new concepts of GF's module system +are also introduced, most notably the technique of <B>parametrized modules</B>, +which has become an important "design pattern" for multilingual grammars. +</P> +<A NAME="toc78"></A> +<H2>The coverage of the library</H2> +<P> +The GF Resource Grammar Library contains grammar rules for +10 languages. In addition, 2 languages are available as yet incomplete +implementations, and a few more are under construction. The purpose +of the library is to define the low-level morphological and syntactic +rules of languages, and thereby enable application programmers +to concentrate on the semantic and stylistic +aspects of their grammars. The guiding principle is that +<center> +grammar checking becomes type checking +</center> +that is, whatever is type-correct in the resource grammar is also +grammatically correct. +</P> +<P> +The intended level of application grammarians +is that of a skilled programmer with +a practical knowledge of the target languages, but without +theoretical knowledge about their grammars. +Such a combination of +skills is typical of programmers who, for instance, want to localize +language software to new languages. +</P> +<P> +The current resource languages are +</P> +<UL> +<LI><CODE>Ara</CODE>bic (incomplete) +<LI><CODE>Cat</CODE>alan (incomplete) +<LI><CODE>Dan</CODE>ish +<LI><CODE>Eng</CODE>lish +<LI><CODE>Fin</CODE>nish +<LI><CODE>Fre</CODE>nch +<LI><CODE>Ger</CODE>man +<LI><CODE>Ita</CODE>lian +<LI><CODE>Nor</CODE>wegian +<LI><CODE>Rus</CODE>sian +<LI><CODE>Spa</CODE>nish +<LI><CODE>Swe</CODE>dish +</UL> + +<P> +The first three letters (<CODE>Eng</CODE> etc) are used in grammar module names. +We use the three-letter codes for languages from the ISO 639 standard. +</P> +<P> +The incomplete Arabic and Catalan implementations are +sufficient for use in some applications; they both contain, amoung other +things, complete inflectional morphology. +</P> +<A NAME="toc79"></A> +<H2>The structure of the library</H2> +<P> +<a name="seclexical"></a> +</P> +<A NAME="toc80"></A> +<H3>Lexical vs. phrasal rules</H3> +<P> +So far we have looked at grammars from a semantic point of view: +a grammar defines a system of meanings (specified in the abstract syntax) and +tells how they are expressed in some language (as specified in the concrete syntax). +In resource grammars, as often in the linguistic tradition, the goal is more modest: +to specify the <B>grammatically correct combinations of words</B>, whatever their +meanings are. With this more modest goal, it is possible to achieve a much +wider coverage than with semantic grammars. +</P> +<P> +Given the focus on <I>words</I> and their combinations, +the resource grammar has two kinds of categories and two kinds of rules: +</P> +<UL> +<LI>lexical: + <UL> + <LI>lexical categories, to classify words + <LI>lexical rules, to define words and their properties + </UL> +</UL> + +<UL> +<LI>phrasal (combinatorial, syntactic): + <UL> + <LI>phrasal categories, to classify phrases of arbitrary size + <LI>phrasal rules, to combine phrases into larger phrases + </UL> +</UL> + +<P> +Some grammar formalisms make a formal distinction between +the lexical and syntactic +components; sometimes it is necessary to use separate formalisms for these +two kinds of rules. GF has no such restrictions. +Nevertheless, it has turned out +to be a good discipline to maintain a distinction between +the lexical and syntactic components in the resource grammar. This fits +also well with what is needed in applications: while syntactic structures +are more or less the same across applications, vocabularies can be +very different. +</P> +<A NAME="toc81"></A> +<H3>Lexical categories</H3> +<P> +Within lexical categories, there is a further classification +into <B>closed</B> and <B>open</B> categories. The definining property +of closed categories is that the +words in them can easily be enumerated; it is very seldom that any +new words are introduced in them. In general, closed categories +contain <B>structural words</B>, also known as <B>function words</B>. +Examples of closed categories are +</P> +<PRE> + QuantSg ; -- singular quantifier e.g. "this" + QuantPl ; -- plural quantifier e.g. "those" + AdA ; -- adadjective e.g. "very" +</PRE> +<P> +We have already used words of all these categories in the <CODE>Food</CODE> +examples; they have just not been assigned a category, but +treated as <B>syncategorematic</B>. In GF, a syncategoramatic +word is one that is introduced in a linearization rule of +some construction alongside with some other expressions that +are combined; there is no abstract syntax tree for that word +alone. Thus in the rules +</P> +<PRE> + fun That : Kind -> Item ; + lin That k = {"that" ++ k.s} ; +</PRE> +<P> +the word <I>that</I> is syncategoramatic. In linguistically motivated +grammars, syncategorematic words are avoided, whereas in +semantically motivated grammars, structural words are typically treated +as syncategoramatic. This is partly so because the function expressed +by a structural word in one language is often expressed by some other +means than an individual word in another. For instance, the definite +article <I>the</I> is a determiner word in English, whereas Swedish expresses +determination by inflecting the determined noun: <I>the wine</I> is <I>vinet</I> +in Swedish. +</P> +<P> +As for open categories, we will start with these two: +</P> +<PRE> + N ; -- noun e.g. "pizza" + A ; -- adjective e.g. "good" +</PRE> +<P> +Later in this chapter we will also need verbs of different kinds. +</P> +<P> +<I>Note</I>. Having adadjectives as a closed category is not quite right, because +one can form adadjectives from adjectives: <I>incredibly warm</I>. +</P> +<A NAME="toc82"></A> +<H3>Lexical rules</H3> +<P> +The words of closed categories can be listed once and for all in a +library. In the first example, the <CODE>Foods</CODE> grammar of the previous section, +we will use the following structural words from the <CODE>Syntax</CODE> module: +</P> +<PRE> + this_QuantSg, that_QuantSg : QuantSg ; + these_QuantPl, those_QuantPl : QuantPl ; + very_AdA : AdA ; +</PRE> +<P> +The naming convention for lexical rules is that we use a word followed by +the category. In this way we can for instance distinguish the quantifier +<I>that</I> from the conjunction <I>that</I>. +</P> +<P> +Open lexical categories have no objects in <CODE>Syntax</CODE>. Such objects +will be built as they are needed in applications. The abstract +syntax of words in applications is already familiar, e.g. +</P> +<PRE> + fun Wine : Kind ; +</PRE> +<P> +The concrete syntax can be given directly, e.g. +</P> +<PRE> + lin Wine = mkN "wine" ; +</PRE> +<P> +by using the morphological paradigm library <CODE>ParadigmsEng</CODE>. +However, there are some advantages in giving the concrete syntax +indirectly, via the creation of a <B>resource lexicon</B>. In this lexicon, +there will be entries such as +</P> +<PRE> + oper wine_N : N = mkN "wine" ; +</PRE> +<P> +which can then be used in the linearization rules, +</P> +<PRE> + lin Wine = wine_N ; +</PRE> +<P> +One advantage of this indirect method is that each new word gives +an addition to a reusable resource lexicon, instead of just doing +the job of implementing the application. Another advantage will +be shown <a href="#secfunctor">here</a>: the possibility to write functors over +lexicon interfaces. +</P> +<A NAME="toc83"></A> +<H3>Phrasal categories</H3> +<P> +There are just four phrasal categories needed in the first application: +</P> +<PRE> + Cl ; -- clause e.g. "this pizza is good" + NP ; -- noun phrase e.g. "this pizza" + CN ; -- common noun e.g. "warm pizza" + AP ; -- adjectival phrase e.g. "very warm" +</PRE> +<P> +Clauses are, roughly, the same as declarative sentences; we will +define in <a href="#secextended">here</a> a sentence <CODE>S</CODE> as a clause that has a fixed tense. +The distinction between common nouns and noun phrases is that common nouns +cannot generally be used alone as subjects (?<I>dog sleeps</I>), +whereas noun phrases can (<I>the dog sleeps</I>). +Noun phrases can be built from common nouns by adding determiners, +such as quantifiers; but there are also other kinds of noun phrases, e.g. +pronouns. +</P> +<P> +The syntactic combinations we need are the following: +</P> +<PRE> + mkCl : NP -> AP -> Cl ; -- e.g. "this pizza is very warm" + mkNP : QuantSg -> CN -> NP ; -- e.g. "this pizza" + mkNP : QuantPl -> CN -> NP ; -- e.g. "these pizzas" + mkCN : AP -> CN -> CN ; -- e.g. "warm pizza" + mkAP : AdA -> AP -> AP ; -- e.g. "very warm" +</PRE> +<P> +To start building phrases, we need rules of <B>lexical insertion</B>, which +form phrases from single words: +</P> +<PRE> + mkCN : N -> NP ; + mkAP : A -> AP ; +</PRE> +<P> +Notice that all (or, as many as possible) operations in the resource library +have the name <CODE>mk</CODE><I>C</I>, where <I>C</I> is the value category of the operation. +This means of course heavy overloading. For instance, the current library +(version 1.2) has no less than 23 operations named <CODE>mkNP</CODE>! +</P> +<P> +Now the sentence +<center> +<I>these very warm pizzas are Italian</I> +</center> +can be built as follows: +</P> +<PRE> + mkCl + (mkNP these_QuantPl + (mkCN (mkAP very_AdA (mkAP warm_A)) (mkCN pizza_CN))) + (mkAP italian_AP) +</PRE> +<P> +The task we are facing now is to define the concrete syntax of <CODE>Foods</CODE> so that +this syntactic tree gives the value of linearizing the semantic tree +</P> +<PRE> + Is (These (QKind (Very Warm) Pizza)) Italian +</PRE> +<P></P> +<A NAME="toc84"></A> +<H2>The resource API</H2> +<P> +The resource library API is divided into language-specific +and language-independent parts. To put it roughly, +</P> +<UL> +<LI>the syntax API is language-independent, i.e. has the same types and + functions for all languages. + Its name is <CODE>Syntax</CODE><I>L</I> for each language <I>L</I> +<LI>the morphology API is language-specific, i.e. has partly + different types and functions + for different languages. + Its name is <CODE>Paradigms</CODE><I>L</I> for each language <I>L</I> +</UL> + +<P> +A full documentation of the API is available on-line in the +<B>resource synopsis</B>. +For the examples of this chapter, we will only need a +fragment of the full API. The fragment needed for <CODE>Foods</CODE> has +already been introduced, but let us summarize the descriptions +by giving tables of the same form as used in the resource synopsis. +</P> +<P> +Thus we will make use of the following categories from the module <CODE>Syntax</CODE>. +</P> +<TABLE CELLPADDING="4" BORDER="1"> +<TR> +<TH>Category</TH> +<TH>Explanation</TH> +<TH COLSPAN="2">Example</TH> +</TR> +<TR> +<TD><CODE>Cl</CODE></TD> +<TD>clause (sentence), with all tenses</TD> +<TD><I>she looks at this</I></TD> +</TR> +<TR> +<TD><CODE>AP</CODE></TD> +<TD>adjectival phrase</TD> +<TD><I>very warm</I></TD> +</TR> +<TR> +<TD><CODE>CN</CODE></TD> +<TD>common noun (without determiner)</TD> +<TD><I>red house</I></TD> +</TR> +<TR> +<TD><CODE>NP</CODE></TD> +<TD>noun phrase (subject or object)</TD> +<TD><I>the red house</I></TD> +</TR> +<TR> +<TD><CODE>AdA</CODE></TD> +<TD>adjective-modifying adverb,</TD> +<TD><I>very</I></TD> +</TR> +<TR> +<TD><CODE>QuantSg</CODE></TD> +<TD>singular quantifier</TD> +<TD><I>these</I></TD> +</TR> +<TR> +<TD><CODE>QuantPl</CODE></TD> +<TD>plural quantifier</TD> +<TD><I>these</I></TD> +</TR> +<TR> +<TD><CODE>A</CODE></TD> +<TD>one-place adjective</TD> +<TD><I>warm</I></TD> +</TR> +<TR> +<TD><CODE>N</CODE></TD> +<TD>common noun</TD> +<TD><I>house</I></TD> +</TR> +</TABLE> + +<P></P> +<P> +We will use the following syntax rules from <CODE>Syntax</CODE>. +</P> +<TABLE CELLPADDING="4" BORDER="1"> +<TR> +<TH>Function</TH> +<TH>Type</TH> +<TH COLSPAN="2">Example</TH> +</TR> +<TR> +<TD><CODE>mkCl</CODE></TD> +<TD><CODE>NP -> AP -> Cl</CODE></TD> +<TD><I>John is very old</I></TD> +</TR> +<TR> +<TD><CODE>mkNP</CODE></TD> +<TD><CODE>QuantSg -> CN -> NP</CODE></TD> +<TD><I>this old man</I></TD> +</TR> +<TR> +<TD><CODE>mkNP</CODE></TD> +<TD><CODE>QuantPl -> CN -> NP</CODE></TD> +<TD><I>these old man</I></TD> +</TR> +<TR> +<TD><CODE>mkCN</CODE></TD> +<TD><CODE>N -> CN</CODE></TD> +<TD><I>house</I></TD> +</TR> +<TR> +<TD><CODE>mkCN</CODE></TD> +<TD><CODE>AP -> CN -> CN</CODE></TD> +<TD><I>very big blue house</I></TD> +</TR> +<TR> +<TD><CODE>mkAP</CODE></TD> +<TD><CODE>A -> AP</CODE></TD> +<TD><I>old</I></TD> +</TR> +<TR> +<TD><CODE>mkAP</CODE></TD> +<TD><CODE>AdA -> AP -> AP</CODE></TD> +<TD><I>very very old</I></TD> +</TR> +</TABLE> + +<P></P> +<P> +We will use the following structural words from <CODE>Syntax</CODE>. +</P> +<TABLE CELLPADDING="4" BORDER="1"> +<TR> +<TH>Function</TH> +<TH>Type</TH> +<TH COLSPAN="2">In English</TH> +</TR> +<TR> +<TD><CODE>this_QuantSg</CODE></TD> +<TD><CODE>QuantSg</CODE></TD> +<TD><I>this</I></TD> +</TR> +<TR> +<TD><CODE>that_QuantSg</CODE></TD> +<TD><CODE>QuantSg</CODE></TD> +<TD><I>that</I></TD> +</TR> +<TR> +<TD><CODE>these_QuantPl</CODE></TD> +<TD><CODE>QuantPl</CODE></TD> +<TD><I>this</I></TD> +</TR> +<TR> +<TD><CODE>those_QuantPl</CODE></TD> +<TD><CODE>QuantPl</CODE></TD> +<TD><I>that</I></TD> +</TR> +<TR> +<TD><CODE>very_AdA</CODE></TD> +<TD><CODE>AdA</CODE></TD> +<TD><I>very</I></TD> +</TR> +</TABLE> + +<P></P> +<P> +For English, we will use the following part of <CODE>ParadigmsEng</CODE>. +</P> +<TABLE CELLPADDING="4" BORDER="1"> +<TR> +<TH>Function</TH> +<TH COLSPAN="2">Type</TH> +</TR> +<TR> +<TD><CODE>mkN</CODE></TD> +<TD><CODE>(dog : Str) -> N</CODE></TD> +</TR> +<TR> +<TD><CODE>mkN</CODE></TD> +<TD><CODE>(man,men : Str) -> N</CODE></TD> +</TR> +<TR> +<TD><CODE>mkA</CODE></TD> +<TD><CODE>(cold : Str) -> A</CODE></TD> +</TR> +</TABLE> + +<P></P> +<P> +For Italian, we need just the following part of <CODE>ParadigmsIta</CODE> +(Exercise). The "smart" paradigms will take care of variations +such as <I>formaggio</I>-<I>formaggi</I>, and also infer the genders +correctly. +</P> +<TABLE CELLPADDING="4" BORDER="1"> +<TR> +<TH>Function</TH> +<TH COLSPAN="2">Type</TH> +</TR> +<TR> +<TD><CODE>mkN</CODE></TD> +<TD><CODE>(vino : Str) -> N</CODE></TD> +</TR> +<TR> +<TD><CODE>mkA</CODE></TD> +<TD><CODE>(caro : Str) -> A</CODE></TD> +</TR> +</TABLE> + +<P></P> +<P> +For German, we will use the following part of <CODE>ParadigmsGer</CODE>. +</P> +<TABLE CELLPADDING="4" BORDER="1"> +<TR> +<TH>Function</TH> +<TH COLSPAN="2">Type</TH> +</TR> +<TR> +<TD><CODE>Gender</CODE></TD> +<TD><CODE>Type</CODE></TD> +</TR> +<TR> +<TD><CODE>masculine</CODE></TD> +<TD><CODE>Gender</CODE></TD> +</TR> +<TR> +<TD><CODE>feminine</CODE></TD> +<TD><CODE>Gender</CODE></TD> +</TR> +<TR> +<TD><CODE>neuter</CODE></TD> +<TD><CODE>Gender</CODE></TD> +</TR> +<TR> +<TD><CODE>mkN</CODE></TD> +<TD><CODE>(Stufe : Str) -> N</CODE></TD> +</TR> +<TR> +<TD><CODE>mkN</CODE></TD> +<TD><CODE>(Bild,Bilder : Str) -> Gender -> N</CODE></TD> +</TR> +<TR> +<TD><CODE>mkA</CODE></TD> +<TD><CODE>(klein : Str) -> A</CODE></TD> +</TR> +<TR> +<TD><CODE>mkA</CODE></TD> +<TD><CODE>(gut,besser,beste : Str) -> A</CODE></TD> +</TR> +</TABLE> + +<P></P> +<P> +For Finnish, we only need the smart regular paradigms: +</P> +<TABLE CELLPADDING="4" BORDER="1"> +<TR> +<TH>Function</TH> +<TH COLSPAN="2">Type</TH> +</TR> +<TR> +<TD><CODE>mkN</CODE></TD> +<TD><CODE>(talo : Str) -> N</CODE></TD> +</TR> +<TR> +<TD><CODE>mkA</CODE></TD> +<TD><CODE>(hieno : Str) -> A</CODE></TD> +</TR> +</TABLE> + +<P></P> +<P> +<B>Exercise</B>. Try out the morphological paradigms in different languages. Do +as follows: +</P> +<PRE> + > i -path=alltenses:prelude -retain alltenses/ParadigmsGer.gfr + > cc mkN "Farbe" + > cc mkA "gut" "besser" "beste" +</PRE> +<P></P> +<A NAME="toc85"></A> +<H2>Example: English</H2> +<P> +<a name="secenglish"></a> +</P> +<P> +We work with the abstract syntax <CODE>Foods</CODE> from <a href="#chaptwo">the fourth chapter</a>, and +build first an English implementation. Now we can do it without +thinking about inflection and agreement, by just picking appropriate +functions from the resource grammar library. +</P> +<P> +The concrete syntax opens <CODE>SyntaxEng</CODE> and <CODE>ParadigmsEng</CODE> +to get access to the resource libraries needed. In order to find +the libraries, a <CODE>path</CODE> directive is prepended. It contains +two resource subdirectories --- <CODE>present</CODE> and <CODE>prelude</CODE> --- +which are found relative to the environment variable <CODE>GF_LIB_PATH</CODE>. +It also contains the current directory <CODE>.</CODE> and the directory <CODE>../foods</CODE>, +in which <CODE>Foods.gf</CODE> resides. +</P> +<PRE> + --# -path=.:../foods:present:prelude + + concrete FoodsEng of Foods = open SyntaxEng,ParadigmsEng in { +</PRE> +<P> +As linearization types, we will use clauses for <CODE>Phrase</CODE>, noun phrases +for <CODE>Item</CODE>, common nouns for <CODE>Kind</CODE>, and adjectival phrases for <CODE>Quality</CODE>. +</P> +<PRE> + lincat + Phrase = Cl ; + Item = NP ; + Kind = CN ; + Quality = AP ; +</PRE> +<P> +These types fit perfectly with the way we have used the categories +in the application; hence +the combination rules we need almost write themselves automatically: +</P> +<PRE> + lin + Is item quality = mkCl item quality ; + This kind = mkNP this_QuantSg kind ; + That kind = mkNP that_QuantSg kind ; + These kind = mkNP these_QuantPl kind ; + Those kind = mkNP those_QuantPl kind ; + QKind quality kind = mkCN quality kind ; + Very quality = mkAP very_AdA quality ; +</PRE> +<P> +We write the lexical part of the grammar by using resource paradigms directly. +Notice that we have to apply the lexical insertion rules to get type-correct +linearizations. Notice also that we need to use the two-place noun paradigm for +<I>fish</I>, but everythins else is regular. +</P> +<PRE> + Wine = mkCN (mkN "wine") ; + Pizza = mkCN (mkN "pizza") ; + Cheese = mkCN (mkN "cheese") ; + Fish = mkCN (mkN "fish" "fish") ; + Fresh = mkAP (mkA "fresh") ; + Warm = mkAP (mkA "warm") ; + Italian = mkAP (mkA "Italian") ; + Expensive = mkAP (mkA "expensive") ; + Delicious = mkAP (mkA "delicious") ; + Boring = mkAP (mkA "boring") ; + } +</PRE> +<P></P> +<P> +<B>Exercise</B>. Compile the grammar <CODE>FoodsEng</CODE> and generate +and parse some sentences. +</P> +<P> +<B>Exercise</B>. Write a concrete syntax of <CODE>Foods</CODE> for Italian +or some other language included in the resource library. You can +compare the results with the hand-written +grammars presented earlier in this tutorial. +</P> +<A NAME="toc86"></A> +<H2>Functor implementation of multilingual grammars</H2> +<P> +<a name="secfunctor"></a> +</P> +<P> +If you did the exercise of writing a concrete syntax of <CODE>Foods</CODE> for some other +language, you probably noticed that much of the code looks exactly the same +as for English. The reason for this is that the <CODE>Syntax</CODE> API is the +same for all languages. This is in turn possible because +all languages (at least those in the resource package) +implement the same syntactic structures. Moreover, languages tend to use the +syntactic structures in similar ways, even though this is not exceptionless. +But usually, it is only the lexical parts of a concrete syntax that +we need to write anew for a new language. Thus, to port a grammar to +a new language, you +</P> +<OL> +<LI>copy the concrete syntax of a given language +<LI>change the words (strings and inflection paradigms) +</OL> + +<P> +Now, programming by copy-and-paste is not worthy of a functional programmer! +So, can we write a <I>function</I> that takes care of the shared parts of grammar modules? +Yes, we can. It is not a function in the <CODE>fun</CODE> or <CODE>oper</CODE> sense, but +a function operating on modules, called a <B>functor</B>. This construct +is familiar from the functional programming +languages ML and OCaml, but it does not +exist in Haskell. It also bears some resemblance to templates in C++. +Functors are also known as <B>parametrized modules</B>. +</P> +<P> +In GF, a functor is a module that <CODE>open</CODE>s one or more <B>interfaces</B>. +An <CODE>interface</CODE> is a module similar to a <CODE>resource</CODE>, but it only +contains the <I>types</I> of <CODE>oper</CODE>s, not their definitions. You can think +of an interface as a kind of a record type. The <CODE>oper</CODE> names are the +labels of this record type. The corresponding <I>record</I> is called an +<B>instance</B> of the interface. +Thus a functor is a module-level function taking instances as +arguments and producing modules as values. +</P> +<P> +Let us now write a functor implementation of the <CODE>Food</CODE> grammar. +Consider its module header first: +</P> +<PRE> + incomplete concrete FoodsI of Foods = open Syntax, LexFoods in +</PRE> +<P> +A functor is distinguished from an ordinary module by the leading +keyword <CODE>incomplete</CODE>. +</P> +<P> +In the functor-function analogy, <CODE>FoodsI</CODE> would be presented as a function +with the following type signature: +</P> +<PRE> + FoodsI : + instance of Syntax -> instance of LexFoods -> concrete of Foods +</PRE> +<P> +It takes as arguments instances of two interfaces: +</P> +<UL> +<LI><CODE>Syntax</CODE>, the resource grammar interface +<LI><CODE>LexFoods</CODE>, the domain-specific lexicon interface +</UL> + +<P> +Functors opening <CODE>Syntax</CODE> and a domain lexicon interface are in fact +so typical in GF applications, that this structure could be called +a <B>design pattern</B> +for GF grammars. What makes this pattern so useful is, again, that +languages tend to use the same syntactic structures and only differ in words. +</P> +<P> +We will show the exact syntax of interfaces and instances in next Section. +Here it is enough to know that we have +</P> +<UL> +<LI><CODE>SyntaxGer</CODE>, an instance of <CODE>Syntax</CODE> +<LI><CODE>LexFoodsGer</CODE>, an instance of <CODE>LexFoods</CODE> +</UL> + +<P> +Then we can complete the German implementation by "applying" the functor: +</P> +<PRE> + FoodI SyntaxGer LexFoodsGer : concrete of Foods +</PRE> +<P> +The GF syntax for doing so is +</P> +<PRE> + concrete FoodsGer of Foods = FoodsI with + (Syntax = SyntaxGer), + (LexFoods = LexFoodsGer) ; +</PRE> +<P> +Notice that this is the <I>whole</I> module, not just a header of it. +The module body is received from <CODE>FoodsI</CODE>, by instantiating the +interface constants with their definitions given in the German +instances. A module of this form, characterized by the keyword <CODE>with</CODE>, is +called a <B>functor instantiation</B>. +</P> +<P> +Here is the complete code for the functor <CODE>FoodsI</CODE>: +</P> +<PRE> + --# -path=.:../foods:present:prelude + + incomplete concrete FoodsI of Foods = open Syntax, LexFoods in { + lincat + Phrase = Cl ; + Item = NP ; + Kind = CN ; + Quality = AP ; + lin + Is item quality = mkCl item quality ; + This kind = mkNP this_QuantSg kind ; + That kind = mkNP that_QuantSg kind ; + These kind = mkNP these_QuantPl kind ; + Those kind = mkNP those_QuantPl kind ; + QKind quality kind = mkCN quality kind ; + Very quality = mkAP very_AdA quality ; + + Wine = mkCN wine_N ; + Pizza = mkCN pizza_N ; + Cheese = mkCN cheese_N ; + Fish = mkCN fish_N ; + Fresh = mkAP fresh_A ; + Warm = mkAP warm_A ; + Italian = mkAP italian_A ; + Expensive = mkAP expensive_A ; + Delicious = mkAP delicious_A ; + Boring = mkAP boring_A ; + } +</PRE> +<P></P> +<A NAME="toc87"></A> +<H2>Interfaces and instances</H2> +<P> +<a name="secinterface"></a> +</P> +<P> +Let us now define the <CODE>LexFoods</CODE> interface: +</P> +<PRE> + interface LexFoods = open Syntax in { + oper + wine_N : N ; + pizza_N : N ; + cheese_N : N ; + fish_N : N ; + fresh_A : A ; + warm_A : A ; + italian_A : A ; + expensive_A : A ; + delicious_A : A ; + boring_A : A ; + } +</PRE> +<P> +In this interface, only lexical items are declared. In general, an +interface can declare any functions and also types. The <CODE>Syntax</CODE> +interface does so. +</P> +<P> +Here is a German instance of the interface. +</P> +<PRE> + instance LexFoodsGer of LexFoods = open SyntaxGer, ParadigmsGer in { + oper + wine_N = mkN "Wein" ; + pizza_N = mkN "Pizza" "Pizzen" feminine ; + cheese_N = mkN "Käse" "Käsen" masculine ; + fish_N = mkN "Fisch" ; + fresh_A = mkA "frisch" ; + warm_A = mkA "warm" "wärmer" "wärmste" ; + italian_A = mkA "italienisch" ; + expensive_A = mkA "teuer" ; + delicious_A = mkA "köstlich" ; + boring_A = mkA "langweilig" ; + } +</PRE> +<P> +Notice that when an interface opens an interface, such as <CODE>Syntax</CODE>, +here, then its instance has to open an instance of it. But the instance +may also open some other resources --- very typically, like here, +a domain lexicon instance opens a <CODE>Paradigms</CODE> module. +</P> +<P> +Just to complete the picture, we repeat the German functor instantiation +for <CODE>FoodsI</CODE>, this time with a path directive that makes it compilable. +</P> +<PRE> + --# -path=.:../foods:present:prelude + + concrete FoodsGer of Foods = FoodsI with + (Syntax = SyntaxGer), + (LexFoods = LexFoodsGer) ; +</PRE> +<P></P> +<P> +<B>Exercise</B>. Compile and test <CODE>FoodsGer</CODE>. +</P> +<P> +<B>Exercise</B>. Refactor <CODE>FoodsEng</CODE> into a functor instantiation. +</P> +<A NAME="toc88"></A> +<H2>Adding languages to a functor implementation</H2> +<P> +Once we have an application grammar defined by using a functor, +adding a new language is simple. Just two modules need to be written: +</P> +<UL> +<LI>a domain lexicon instance +<LI>a functor instantiation +</UL> + +<P> +The functor instantiation is completely mechanical to write. +Here is one for Finnish: +</P> +<PRE> + --# -path=.:../foods:present:prelude + + concrete FoodsFin of Foods = FoodsI with + (Syntax = SyntaxFin), + (LexFoods = LexFoodsFin) ; +</PRE> +<P> +The domain lexicon instance requires some knowledge of the words of the +language: what words are used for which concepts, how the words are +inflected, plus features such as genders. Here is a lexicon instance for +Finnish: +</P> +<PRE> + instance LexFoodsFin of LexFoods = open SyntaxFin, ParadigmsFin in { + oper + wine_N = mkN "viini" ; + pizza_N = mkN "pizza" ; + cheese_N = mkN "juusto" ; + fish_N = mkN "kala" ; + fresh_A = mkA "tuore" ; + warm_A = mkA "lämmin" ; + italian_A = mkA "italialainen" ; + expensive_A = mkA "kallis" ; + delicious_A = mkA "herkullinen" ; + boring_A = mkA "tylsä" ; + } +</PRE> +<P></P> +<P> +<B>Exercise</B>. Instantiate the functor <CODE>FoodsI</CODE> to some language of +your choice. +</P> +<A NAME="toc89"></A> +<H2>Division of labour revisited</H2> +<P> +One purpose with the resource grammars was stated to be a division +of labour between linguists and application grammarians. We can now +reflect on what this means more precisely, by asking ourselves what +skills are required of grammarians working on different components. +</P> +<P> +Building a GF application starts from the abstract syntax. Writing +an abstract syntax requires +</P> +<UL> +<LI>understanding of the semantic structure of the application domain +<LI>knowledge of the GF fragment with categories and functions +</UL> + +<P> +If the concrete syntax is written by using a functor, the programmer +has to decide what parts of the implementation are put to the interface +and what parts are shared in the functor. This requires +</P> +<UL> +<LI>knowing how the domain concepts are expressed in natural language +<LI>knowledge of the resource grammar library --- the categories and combinators +<LI>understanding what parts are likely to be expressed in language-dependent + ways, so that they are put to an interface and not the functor +<LI>knowledge of the GF fragment with function applications and strings +</UL> + +<P> +Instantiating a ready-made functor to a new language is less demanding. +It requires essentially +</P> +<UL> +<LI>knowing how the domain words are expressed in the language +<LI>knowing, roughly, how these words are inflected +<LI>knowledge of the paradigms available in the library +<LI>knowledge of the GF fragment with function applications and strings +</UL> + +<P> +Notice that none of these tasks requires the use of GF records, tables, +or parameters. Thus only a small fragment of GF is needed; the rest of +GF is only relevant for those who write the libraries. Essentially, +all the machinery introduced in <a href="#chaptwo">the fourth chapter</a> is unnecessary! +</P> +<P> +Of course, grammar writing is not always just straightforward usage of libraries. +For example, GF can be used for other languages than just those in the +libraries --- for both natural and formal languages. A knowledge of records +and tables can, unfortunately, also be needed for understanding GF's error +messages. +</P> +<P> +<B>Exercise</B>. Design a small grammar that can be used for controlling +an MP3 player. The grammar should be able to recognize commands such +as <I>play this song</I>, with the following variations: +</P> +<UL> +<LI>verbs: <I>play</I>, <I>remove</I> +<LI>objects: <I>song</I>, <I>artist</I> +<LI>determiners: <I>this</I>, <I>the previous</I> +<LI>verbs without arguments: <I>stop</I>, <I>pause</I> +</UL> + +<P> +The implementation goes in the following phases: +</P> +<OL> +<LI>abstract syntax +<LI>functor and lexicon interface +<LI>lexicon instance for the first language +<LI>functor instantiation for the first language +<LI>lexicon instance for the second language +<LI>functor instantiation for the second language +<LI>... +</OL> + +<A NAME="toc90"></A> +<H2>Restricted inheritance</H2> +<P> +A functor implementation using the resource <CODE>Syntax</CODE> interface +works well as long as all concepts are expressed by using the same structures +in all languages. If this is not the case, the deviant linearization can +be made into a parameter and moved to the domain lexicon interface. +</P> +<P> +The <CODE>Foods</CODE> grammar works so well that we have to +take a contrived example: assume that English has +no word for <CODE>Pizza</CODE>, but has to use the paraphrase <I>Italian pie</I>. +This paraphrase is no longer a noun <CODE>N</CODE>, but a complex phrase +in the category <CODE>CN</CODE>. An obvious way to solve this problem is +to change interface <CODE>LexFoods</CODE> so that the constant declared for +<CODE>Pizza</CODE> gets a new type: +</P> +<PRE> + oper pizza_CN : CN ; +</PRE> +<P> +But this solution is unstable: we may end up changing the interface +and the function with each new language, and we must every time also +change the interface instances for the old languages to maintain +type correctness. +</P> +<P> +A better solution is to use <B>restricted inheritance</B>: the English +instantiation inherits the functor implementation except for the +constant <CODE>Pizza</CODE>. This is how we write: +</P> +<PRE> + --# -path=.:../foods:present:prelude + + concrete FoodsEng of Foods = FoodsI - [Pizza] with + (Syntax = SyntaxEng), + (LexFoods = LexFoodsEng) ** + open SyntaxEng, ParadigmsEng in { + + lin Pizza = mkCN (mkA "Italian") (mkN "pie") ; + } +</PRE> +<P> +Restricted inheritance is available for all inherited modules. One can for +instance exclude some mushrooms and pick up just some fruit in +the <CODE>FoodMarket</CODE> example "Rsecarchitecture: +</P> +<PRE> + abstract Foodmarket = Food, Fruit [Peach], Mushroom - [Agaric] +</PRE> +<P> +A concrete syntax of <CODE>Foodmarket</CODE> must then have the same inheritance +restrictions, in order to be well-typed with respect to the abstract syntax. +</P> +<A NAME="toc91"></A> +<H2>Grammar reuse</H2> +<P> +The alert reader has certainly noticed an analogy between <CODE>abstract</CODE> +and <CODE>concrete</CODE>, on the one hand, and <CODE>interface</CODE> and <CODE>instance</CODE>, +on the other. Why are these two pairs of module types kept separate +at all? There is, in fact, a very close correspondence between +judgements in the two kinds of modules: +</P> +<PRE> + cat C <---> oper C : Type + + fun f : A <---> oper f : A + + lincat C = T <---> oper C : Type = T + + lin f = t <---> oper f : A = t +</PRE> +<P> +But there are also some differences: +</P> +<UL> +<LI><CODE>abstract</CODE> and <CODE>concrete</CODE> modules define <B>top-level grammars</B>, i.e. + grammars that can be used for parsing and linearization; this is because +<LI>the types and terms in <CODE>concrete</CODE> modules are restricted to a subset + of those available in <CODE>interface</CODE>, <CODE>instance</CODE>, and <CODE>resource</CODE> +<LI><CODE>param</CODE> judgements have no counterparts in top-level grammars +</UL> + +<P> +The term that can be used for interfaces, instances, and resources is +<B>resource-level grammars</B>. +From these explanations and the above translations it follows that top-level +grammars are, in a sense, a special case of resource-level grammars. +</P> +<P> +Thus, indeed, abstract syntax modules can be used like interfaces, and concrete syntaxes +as their instances. The use of top-level grammars as resources +is called <B>grammar reuse</B>. Whether a library module is a top-level or a +resource-level module is mostly invisible to application programmers +(see the Summary <a href="#seclock">here</a> +for an exception to this). The GF resource grammar +library itself is in fact built in two layers: +</P> +<UL> +<LI>the <B>ground resource</B>: a set of top-level grammars for syntactic structures +<LI>the <B>surface resource</B>: a resource-level grammar with overloaded operations + defined in terms of the ground resource +</UL> + +<P> +Both the ground +resource and the surface resource can be used by application programmers, +but it is the surface resource that we use in this book. Because of overloading, +it has much fewer function names and also flatter trees. For instance, the clause +<center> +<I>these very warm pizzas are Italian</I> +</center> +which in the surface resource can be built as +</P> +<PRE> + mkCl + (mkNP these_QuantPl + (mkCN (mkAP very_AdA (mkAP warm_A)) (mkCN pizza_CN))) + (mkAP italian_AP) +</PRE> +<P> +has in the ground resource the much more complex tree +</P> +<PRE> + PredVP + (DetCN (DetPl (PlQuant this_Quant) NoNum NoOrd) + (AdjCN (AdAP very_AdA (PositA warm_A)) (UseN pizza_N))) + (UseComp (CompAP (PositA italian_A))) +</PRE> +<P> +The main advantage of using the ground resource is that the trees can then be found +by using the parser, as shown in the next section. Otherwise, the overloaded surface +resource constants are much easier to use. +</P> +<P> +Needless to say, once a library has been defined in some way, it is easy to +build layers of <B>derived libraries</B> on top of it, by using grammar reuse +and, in the case of multilingual libraries, functors. This is indeed how +the surface resource has been implemented: as a functored parametrized on +the abstract syntax of the ground resource. +</P> +<A NAME="toc92"></A> +<H2>Browsing the resource with GF commands</H2> +<P> +<a name="secbrowsing"></a> +</P> +<P> +In addition to reading the +<A HREF="../../lib/resource-1.0/synopsis.html">resource synopsis</A>, you +can find resource function combinations by using the parser. This +is so because the resource library is in the end implemented as +a top-level <CODE>abstract-concrete</CODE> grammar, on which parsing +and linearization work. +</P> +<P> +Unfortunately, currently (GF 2.8) +only English and the Scandinavian languages can be +parsed within acceptable computer resource limits when the full +resource is used. +</P> +<P> +To look for a syntax tree in the overload API by parsing, do like this: +</P> +<PRE> + % gf -path=alltenses:prelude $GF_LIB_PATH/alltenses/OverLangEng.gfc + + > p -cat=S -overload "this grammar is too big" + mkS (mkCl (mkNP this_QuantSg grammar_N) (mkAP too_AdA big_A)) +</PRE> +<P> +The <CODE>-overload</CODE> option given to the parser is a directive to find the +shallowest overloaded term that matches the parse tree. +</P> +<P> +To view linearizations in all languages by parsing from English: +</P> +<PRE> + % gf $GF_LIB_PATH/alltenses/langs.gfcm + + > p -cat=S -lang=LangEng "this grammar is too big" | tb + UseCl TPres ASimul PPos (PredVP (DetCN (DetSg (SgQuant this_Quant) + NoOrd) (UseN grammar_N)) (UseComp (CompAP (AdAP too_AdA (PositA big_A))))) + Den här grammatiken är för stor + Esta gramática es demasiado grande + (Cyrillic: eta grammatika govorit des'at' jazykov) + Denne grammatikken er for stor + Questa grammatica è troppo grande + Diese Grammatik ist zu groß + Cette grammaire est trop grande + Tämä kielioppi on liian suuri + This grammar is too big + Denne grammatik er for stor +</PRE> +<P> +This method shows the unambiguous ground resource functions and not +the overloaded ones. It uses a precompiled grammar package of the GFCM or GFCC +format; see <a href="#chapeight">the eighth chapter</a> for more information on this. +</P> +<P> +Unfortunately, the Russian grammar uses at the moment a different +character encoding than the rest and is therefore not displayed correctly +in a terminal window. However, the GF syntax editor does display all +examples correctly --- again, using the ground resource: +</P> +<PRE> + % gfeditor $GF_LIB_PATH/alltenses/langs.gfcm +</PRE> +<P> +When you have constructed the tree, you will see the following screen: +</P> +<P> +<center> +</P> +<P> + <IMG ALIGN="right" SRC="10lang-small.png" BORDER="0" ALT=""> +</P> +<P> +</center> +</P> +<P> +<B>Exercise</B>. Find the resource grammar translations for the following +English phrases (parse in the category <CODE>Phr</CODE>). You can first try to +build the terms manually. +</P> +<P> +<I>every man loves a woman</I> +</P> +<P> +<I>this grammar speaks more than ten languages</I> +</P> +<P> +<I>which languages aren't in the grammar</I> +</P> +<P> +<I>which languages did you want to speak</I> +</P> +<A NAME="toc93"></A> +<H2>An extended Foods grammar</H2> +<P> +<a name="secextended"></a> +</P> +<P> +Now that we know how to find information in the resource grammar, +we can easily extend the <CODE>Foods</CODE> fragment considerably. We shall enable +the following new expressions: +</P> +<UL> +<LI>questions: <I>Is this pizza Italian?</I> <I>Which pizza do you want to eat?</I> +<LI>imperatives: <I>Eat that pizza please!</I> +<LI>denials: <I>These pizzas are not Italian.</I> +<LI>verbs: <I>eat</I>, <I>pay</I> +<LI>guests, in addition to food items: <I>I, you, this lady</I> +</UL> + +<A NAME="toc94"></A> +<H3>Abstract syntax</H3> +<P> +Since we don't want to change the already existing <CODE>Foods</CODE> module, +we build an extension of it, <CODE>ExtFoods</CODE>: +</P> +<PRE> + abstract ExtFoods = Foods ** { + + flags startcat=Move ; + + cat + Move ; -- dialogue move: declarative, question, or imperative + Verb ; -- transitive verb + Guest ; -- guest in restaurant + GuestKind ; -- type of guest + + fun + MAssert : Phrase -> Move ; -- This pizza is warm. + MDeny : Phrase -> Move ; -- This pizza isn't warm. + MAsk : Phrase -> Move ; -- Is this pizza warm? + + PVerb : Guest -> Verb -> Item -> Phrase ; -- we eat this pizza + PVerbWant : Guest -> Verb -> Item -> Phrase ; -- we want to eat this pizza + + WhichVerb : + Kind -> Guest -> Verb -> Move ; -- Which pizza do you eat? + WhichVerbWant : + Kind -> Guest -> Verb -> Move ; -- Which pizza do you want to eat? + WhichIs : Kind -> Quality -> Move ; -- Which wine is Italian? + + Do : Verb -> Item -> Move ; -- Pay this wine! + DoPlease : Verb -> Item -> Move ; -- Pay this wine please! + + I, You, We : Guest ; + + GThis, GThat, GThese, GThose : GuestKind -> Guest ; + + Eat, Drink, Pay : Verb ; + + Lady, Gentleman : GuestKind ; + } +</PRE> +<P> +The concrete syntax is implemented by a functor that extends the +already defined functor <CODE>FoodsI</CODE>. +</P> +<PRE> + incomplete concrete ExtFoodsI of ExtFoods = + FoodsI ** open Syntax, LexFoods in { + + flags lexer=text ; unlexer=text ; +</PRE> +<P> +The flags set up a lexer and unlexer that can deal with sentence-initial +capital letters and proper spacing with punctuation (see <a href="#seclexing">here</a> +for more information on lexers and unlexers). +</P> +<A NAME="toc95"></A> +<H3>Linearization types</H3> +<P> +If we look at the resource documentation, we find several categories +that are above the clause level and can thus host different kinds +of dialogue moves: +</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>Category</TH> +<TH>Explanation</TH> +<TH COLSPAN="2">Example</TH> +</TR> +<TR> +<TD><CODE>Text</CODE></TD> +<TD>text consisting of phrases</TD> +<TD><I>He is here. Why?</I></TD> +</TR> +<TR> +<TD><CODE>Phr</CODE></TD> +<TD>phrase in a text</TD> +<TD><I>but be quiet please</I></TD> +</TR> +<TR> +<TD><CODE>Utt</CODE></TD> +<TD>sentence, question, word...</TD> +<TD><I>be quiet</I></TD> +</TR> +<TR> +<TD><CODE>S</CODE></TD> +<TD>declarative sentence</TD> +<TD><I>she lived here</I></TD> +</TR> +<TR> +<TD><CODE>QS</CODE></TD> +<TD>question</TD> +<TD><I>where did she live</I></TD> +</TR> +<TR> +<TD><CODE>Imp</CODE></TD> +<TD>imperative</TD> +<TD><I>look at this</I></TD> +</TR> +<TR> +<TD><CODE>QCl</CODE></TD> +<TD>question clause, with all tenses</TD> +<TD><I>why does she walk</I></TD> +</TR> +</TABLE> + +<P></P> +<P> +We also find that only the category <CODE>Text</CODE> contains punctuation marks. +So we choose this as the linearization type of <CODE>Move</CODE>. The other types +are quite obvious. +</P> +<PRE> + lincat + Move = Text ; + Verb = V2 ; + Guest = NP ; + GuestKind = CN ; +</PRE> +<P> +The category <CODE>V2</CODE> of <B>two-place verbs</B> includes both +<B>transitive verbs</B> that take <B>direct objects</B> (e.g. <I>we watch him</I>) +and verbs that take other kinds of <B>complements</B>, often with +prepositions (<I>we look at him</I>). In a multilingual grammar, it is +not guaranteed that transitive verbs are transitive in all languages, +so the more general notion of two-place verb is more appropriate. +</P> +<A NAME="toc96"></A> +<H3>Linearization rules</H3> +<P> +Now we need to find constructors that combine the new categories in +appropriate ways. To form a text from a clause, we first make it into +a sentence with <CODE>mkS</CODE>, and then apply <CODE>mkText</CODE>: +</P> +<PRE> + lin MAssert p = mkText (mkS p) ; +</PRE> +<P> +The function <CODE>mkS</CODE> has in the resource synopsis been given the type +</P> +<PRE> + mkS : (Tense) -> (Ant) -> (Pol) -> Cl -> S +</PRE> +<P> +Parentheses around type names do not make any difference for the GF compiler, +but in the synopsis notation they indicate <B>optionality</B>: any of the +optional arguments can be omitted, and there is an instance of <CODE>mkS</CODE> +available. For each optional type, it uses the <B>default value</B> for that +type, which for the <B>polarity</B> <CODE>Pol</CODE> is positive i.e. unnegated. +To build a negative sentence, we use an explicit polarity constructor: +</P> +<PRE> + MDeny p = mkText (mkS negativePol p) ; +</PRE> +<P> +Of course, we could have used <CODE>positivePol</CODE> in the first rule, instead of +relying on the default. (The types <CODE>Tense</CODE> and <CODE>Ant</CODE> will be explained +<a href="#sectense">here</a>.) +</P> +<P> +Phrases can be made into <B>question sentences</B>, which in turn can be +made into texts in a similar way as sentences; the default +punctuation mark is not the full stop but the question mark. +</P> +<PRE> + MAsk p = mkText (mkQS p) ; +</PRE> +<P> +There is an <CODE>mkCl</CODE> instance that directly builds a clause from a noun phrase, +a two-place verb, and another noun phrase. +</P> +<PRE> + PVerb = mkCl ; +</PRE> +<P> +The auxiliary verb <I>want</I> requires a <B>verb phrase</B> (<CODE>VP</CODE>) as its complement. It +can be built from a two-place verb and its noun phrase complement. +</P> +<PRE> + PVerbWant guest verb item = mkCl guest want_VV (mkVP verb item) ; +</PRE> +<P> +The <B>interrogative determiner</B> (<CODE>IDet</CODE>) <I>which</I> can be combined with +a common noun to form an <B>interrogative phrase</B> (<CODE>IP</CODE>). This <CODE>IP</CODE> can then +be used as a subject in a <B>question clause</B> (<CODE>QCl</CODE>), which in turn is +made into a <CODE>QS</CODE> and finally to a <CODE>Text</CODE>. +</P> +<PRE> + WhichIs kind quality = + mkText (mkQS (mkQCl (mkIP whichSg_IDet kind) (mkVP quality))) ; +</PRE> +<P> +When interrogative phrases are used as <I>objects</I>, the resource library +uses a category named <CODE>Slash</CODE> of +objectless sentences. The name cames from the <B>slash categories</B> of the +GPSG grammar formalism +(Gazdar & al. 1985). Slashes can be formed from subjects and two-place verbs, +also with an intervening auxiliary verb. +</P> +<PRE> + WhichVerb kind guest verb = + mkText (mkQS (mkQCl (mkIP whichSg_IDet kind) + (mkSlash guest verb))) ; + WhichVerbWant kind guest verb = + mkText (mkQS (mkQCl (mkIP whichSg_IDet kind) + (mkSlash guest want_VV verb))) ; +</PRE> +<P> +Finally, we form the <B>imperative</B> (<CODE>Imp</CODE>) of a transitive verb +and its object. We make it into a <B>polite</B> form utterance, and finally +into a <CODE>Text</CODE> with an exclamation mark. +</P> +<PRE> + Do verb item = + mkText + (mkPhr (mkUtt politeImpForm (mkImp verb item))) exclMarkPunct ; + DoPlease verb item = + mkText + (mkPhr (mkUtt politeImpForm (mkImp verb item)) please_Voc) + exclMarkPunct ; +</PRE> +<P> +The rest of the concrete syntax is straightforward use of structural words, +</P> +<PRE> + I = mkNP i_Pron ; + You = mkNP youPol_Pron ; + We = mkNP we_Pron ; + GThis = mkNP this_QuantSg ; + GThat = mkNP that_QuantSg ; + GThese = mkNP these_QuantPl ; + GThose = mkNP those_QuantPl ; +</PRE> +<P> +and of the food lexicon, +</P> +<PRE> + Eat = eat_V2 ; + Drink = drink_V2 ; + Pay = pay_V2 ; + Lady = lady_N ; + Gentleman = gentleman_N ; + } +</PRE> +<P> +Notice that we have no reason to build an extension of <CODE>LexFoods</CODE>, but we just +add words to the old one. Since <CODE>LexFoods</CODE> instances are resource modules, +the superfluous definitions that they contain have no effect on the +modules that just <CODE>open</CODE> them, and thus the smaller <CODE>Foods</CODE> grammars +don't suffer from the additions we make. +</P> +<P> +<B>Exercise</B>. Port the <CODE>ExtFoods</CODE> grammars to some new languages, building +on the <CODE>Foods</CODE> implementations from previous sections, and using the functor +defined in this section. +</P> +<A NAME="toc97"></A> +<H2>Tenses</H2> +<P> +<a name="sectense"></a> +</P> +<P> +When compiling the <CODE>ExtFoods</CODE> grammars, we have used the path +</P> +<PRE> + --# -path=.:../foods:present:prelude +</PRE> +<P> +where the library subdirectory <CODE>present</CODE> refers to a restricted version +of the resource that covers only the present tense of verbs and sentences. +Having this version available is motivatad by efficiency reasons: tenses +produce in many languages a manifold of forms and combinations, which +multiply the size of the grammar; at the same time, many applications, +both technical ones and spoken dialogues, only need the present tense. +</P> +<P> +But it is easy change the grammars so that they admit of the full set +of tenses. It is enough to change the path to +</P> +<PRE> + --# -path=.:../foods:alltenses:prelude +</PRE> +<P> +and recompile the grammars from source (flag <CODE>-src</CODE>); the libraries are +not recompiled, because their sources cannot be found on the path list. +Then it is possible to see all the tenses of +phrases, by using the <CODE>-all</CODE> flag in linearization: +</P> +<PRE> + > gr -cat=Phrase | l -all + This wine is delicious + Is this wine delicious + This wine isn't delicious + Isn't this wine delicious + This wine is not delicious + Is this wine not delicious + This wine has been delicious + Has this wine been delicious + This wine hasn't been delicious + Hasn't this wine been delicious + This wine has not been delicious + Has this wine not been delicious + This wine was delicious + Was this wine delicious + This wine wasn't delicious + Wasn't this wine delicious + This wine was not delicious + Was this wine not delicious + This wine had been delicious + Had this wine been delicious + This wine hadn't been delicious + Hadn't this wine been delicious + This wine had not been delicious + Had this wine not been delicious + This wine will be delicious + Will this wine be delicious + This wine won't be delicious + Won't this wine be delicious + This wine will not be delicious + Will this wine not be delicious + This wine will have been delicious + Will this wine have been delicious + This wine won't have been delicious + Won't this wine have been delicious + This wine will not have been delicious + Will this wine not have been delicious + This wine would be delicious + Would this wine be delicious + This wine wouldn't be delicious + Wouldn't this wine be delicious + This wine would not be delicious + Would this wine not be delicious + This wine would have been delicious + Would this wine have been delicious + This wine wouldn't have been delicious + Wouldn't this wine have been delicious + This wine would not have been delicious + Would this wine not have been delicious +</PRE> +<P> +In addition to tenses, the linearization writes all parametric +variations --- polarity and word order (direct vs. inverted) --- as +well as the variation between contracted and full negation words. +Of course, the list is even longer in languages that have more +tenses and moods, e.g. the Romance languages. +</P> +<P> +In the <CODE>ExtFoods</CODE> grammar, tenses never find their way to the +top level of <CODE>Move</CODE>s. Therefore it is useless to carry around +the clause and verb tenses given in the <CODE>alltenses</CODE> set of libraries. +But with the library, it is easy to add tenses to <CODE>Move</CODE>s. For +instance, one can add the rules +</P> +<PRE> + fun MAssertFut : Phrase -> Move ; -- I will pay this wine + fun MAssertPastPerf : Phrase -> Move ; -- I had paid that wine + lin MAssertFut p = mkText (mkS futureTense p) ; + lin MAssertPastPerf p = mkText (mkS pastTense anteriorAnt p) ; +</PRE> +<P> +Comparison with <CODE>MAssert</CODE> above shows that the absence of the tense +and anteriority features defaults to present simultaneous tenses. +</P> +<P> +<B>Exercise</B>. Measure the size of the context-free grammar corresponding to +some concrete syntax of <CODE>ExtFoods</CODE> with all tenses. +You can do this by printing the grammar in the context-free format +(<CODE>print_grammar -printer=cfg</CODE>) and counting the lines. +</P> +<A NAME="toc98"></A> +<H2>Summary of GF language features</H2> +<A NAME="toc99"></A> +<H3>Interfaces and instances</H3> +<P> +An <B>interface module</B> (<CODE>interface</CODE> <I>I</I>) is like a <CODE>resource</CODE> module, +the difference being that it does not need to give definitions in +its <CODE>oper</CODE> and <CODE>param</CODE> judgements. Definitions are, however, +allowed, and they may use constants that appear undefined in the +module. For example, here is an interface for predication, which +is parametrized on NP case and agreement features, and on the constituent +order: +</P> +<PRE> + interface Predication = { + param + Case ; + Agreement ; + oper + subject : Case ; + object : Case ; + order : (verb,subj,obj : String) -> String ; + + NP : Type = {s : Case => Str ; a : Agreement} ; + TV : Type = {s : Agreement => Str} ; + + sentence : TV -> NP -> NP -> {s : Str} = \verb,subj,obj -> { + s = order (verb ! subj.a) (subj ! subject) (obj ! object) ; + } +</PRE> +<P> +An <B>instance module</B> (<CODE>instance</CODE> <I>J</I> <CODE>of</CODE> <I>I</I>) is also like a +<CODE>resource</CODE>, but it is compiled in union with the interface that it +is an instance <CODE>of</CODE>. This means that the definitions given in the +instance are type-checked with respect to the types given in the +interface. Moreover, overwriting types or definitions given in the interface +is not allowed. But it is legal for an instance to contain definitions +not included in the corresponding interface. Here is an instance of +<CODE>Predication</CODE>, suitable for languages like English. +</P> +<PRE> + instance PredicationSimpleSVO of Predication = { + param + Case = Nom | Acc | Gen ; + Agreement = Agr Number Person ; + + -- two new types + Number = Sg | Pl ; + Person = P1 | P2 | P3 ; + + oper + subject = Nom ; + object = Acc ; + order = \verb,subj,obj -> subj ++ verb ++ obj ; + + -- the rest of the definitions don't need repetition + } +</PRE> +<P></P> +<A NAME="toc100"></A> +<H3>Grammar reuse</H3> +<P> +<a name="seclock"></a> +</P> +<P> +Abstract syntax modules can be used like interfaces, and concrete syntaxes +as their instances. The following translations then take place: +</P> +<PRE> + cat C ---> oper C : Type + + fun f : A ---> oper f : A* + + lincat C = T ---> oper C : Type = T' + + lin f = t ---> oper f : A* = t' +</PRE> +<P> +This translation is called <B>grammar reuse</B>. It uses a homomorphism +from abstract types and terms to the concrete types and terms. For the +sake of more type safety, the types are not exactly the same. Currently +(GF 2.8), the type <I>T'</I> formed from the linearization type <I>T</I> of +a category <I>C</I> is <I>T</I> extended with a dummy <B>lock field</B>. Thus +</P> +<PRE> + lincat C = T ---> oper C = T ** {lock_C : {}} +</PRE> +<P> +and the linearization terms are lifted correspondingly. The user of +a GF library should never see any lock fields; when they appear in +the compiler's warnings, they indicate that some library category is +constructed improperly by a user program. +</P> +<A NAME="toc101"></A> +<H3>Functors</H3> +<P> +A <B>parametrized module</B>, aka. an <B>incomplete module</B>, or a +<B>functor</B>, is any module that <CODE>open</CODE>s an <CODE>interface</CODE> (or +an <CODE>abstract</CODE>). Several interfaces may be opened by one +functor. The module header must be prefixed by the word <CODE>incomplete</CODE>. +Here is a typical example, using the resource <CODE>Syntax</CODE> and +a domain specific lexicon: +</P> +<PRE> + incomplete concrete DomainI of Domain = open Syntax, Lex in {...} ; +</PRE> +<P> +A <B>functor instantiation</B> is a module that inherits a functor and +provides an instance to each of its open interfaces. Here is an example: +</P> +<PRE> + concrete DomainSwe of Domain = DomainI with + (Syntax = SyntaxSwe), + (Lex = LexSwe) ; +</PRE> +<P></P> +<A NAME="toc102"></A> +<H3>Restricted inheritance</H3> +<P> +A module of any type can make <B>restricted inheritance</B>, which is +either exclusion or inclusion: +</P> +<PRE> + module M = A[f,g], B-[k] ** ... +</PRE> +<P> +A concrete syntax given to an abstract syntax that uses restricted inheritance +must make the corresponding restrictions. In addition, the concrete syntax can +make its own restrictions in order to redefine inherited linearization types and +rules. +</P> +<P> +Overriding old definitions without explicit restrictions is not allowed. +</P> +<A NAME="toc103"></A> +<H1>Refining semantics in abstract syntax</H1> +<P> +<a name="chapsix"></a> +</P> +<P> +While the concrete syntax constructs of GF have been already +covered, there is much more that can be done in the abstract +syntax. The techniques of <B>dependent types</B> and +<B>higher order abstract syntax</B> are introduced in this chapter, +which thereby concludes the presentation of the GF language. +</P> +<P> +Many of the examples in this chapter are somewhat less close to +applications than the ones shown before. Moreover, the tools for +embedded grammars in <a href="#chapeight">the eighth chapter</a> do not yet fully support dependent +types and higher order abstract syntax. +</P> +<A NAME="toc104"></A> +<H2>GF as a logical framework</H2> +<P> +In this chapter, we will show how +to encode advanced semantic concepts in an abstract syntax. +We use concepts inherited from <B>type theory</B>. Type theory +is the basis of many systems known as <B>logical frameworks</B>, which are +used for representing mathematical theorems and their proofs on a computer. +In fact, GF has a logical framework as its proper part: +this part is the abstract syntax. +</P> +<P> +In a logical framework, the formalization of a mathematical theory +is a set of type and function declarations. The following is an example +of such a theory, represented as an <CODE>abstract</CODE> module in GF. +</P> +<PRE> + abstract Arithm = { + cat + Prop ; -- proposition + Nat ; -- natural number + fun + Zero : Nat ; -- 0 + Succ : Nat -> Nat ; -- the successor of x + Even : Nat -> Prop ; -- x is even + And : Prop -> Prop -> Prop ; -- A and B + } +</PRE> +<P> +This example does not show any new type-theoretical constructs yet, but +it could nevertheless be used as a part of a proof system for arithmetic. +</P> +<P> +<B>Exercise</B>. Give a concrete syntax of <CODE>Arithm</CODE>, preferably +by using the resource library. +</P> +<A NAME="toc105"></A> +<H2>Dependent types</H2> +<P> +<a name="secsmarthouse"></a> +</P> +<P> +<B>Dependent types</B> are a characteristic feature of GF, +inherited from the <B>constructive type theory</B> of Martin-Löf and +distinguishing GF from most other grammar formalisms and +functional programming languages. +</P> +<P> +Dependent types can be used for stating stronger +<B>conditions of well-formedness</B> than ordinary types. +A simple example is a "smart house" system, which +defines voice commands for household appliances. This example +is borrowed from the +Regulus Book +(Rayner & al. 2006). +</P> +<P> +One who enters a smart house can use a spoken <CODE>Command</CODE> to dim lights, switch +on the fan, etc. For <CODE>Device</CODE>s of each <CODE>Kind</CODE>, there is a set of +<CODE>Action</CODE>s that can be performed on them; thus one can dim the lights but + not the fan, for example. These dependencies can be expressed +by making the type <CODE>Action</CODE> dependent on <CODE>Kind</CODE>. We express these +dependencies in <CODE>cat</CODE> declarations by attaching argument types to +categories: +</P> +<PRE> + cat + Command ; + Kind ; + Device Kind ; -- argument type Kind + Action Kind ; +</PRE> +<P> +The crucial use of the dependencies is made in the rule for forming commands: +</P> +<PRE> + fun CAction : (k : Kind) -> Action k -> Device k -> Command ; +</PRE> +<P> +In other words: an action and a device can be combined into a command only +if they are of the same <CODE>Kind</CODE> <CODE>k</CODE>. If we have the functions +</P> +<PRE> + DKindOne : (k : Kind) -> Device k ; -- the light + + light, fan : Kind ; + dim : Action light ; +</PRE> +<P> +we can form the syntax tree +</P> +<PRE> + CAction light dim (DKindOne light) +</PRE> +<P> +but we cannot form the trees +</P> +<PRE> + CAction light dim (DKindOne fan) + CAction fan dim (DKindOne light) + CAction fan dim (DKindOne fan) +</PRE> +<P> +Linearization rules are written as usual: the concrete syntax does not +know if a category is a dependent type. In English, one could write as follows: +</P> +<PRE> + lincat Action = {s : Str} ; + lin CAction _ act dev = {s = act.s ++ dev.s} ; +</PRE> +<P> +Notice that the argument for <CODE>Kind</CODE> does not appear in the linearization; +therefore it is good practice to make this clear by +using a wild card for it, rather than a real +variable. +As we will show, +the type checker can reconstruct the kind from the <CODE>dev</CODE> argument. +</P> +<P> +Parsing with dependent types is performed in two phases: +</P> +<OL> +<LI>context-free parsing +<LI>filtering through type checker +</OL> + +<P> +If you just parse in the usual way, you don't enter the second phase, and +the <CODE>kind</CODE> argument is not found: +</P> +<PRE> + > parse "dim the light" + CAction ? dim (DKindOne light) +</PRE> +<P> +Moreover, type-incorrect commands are not rejected: +</P> +<PRE> + > parse "dim the fan" + CAction ? dim (DKindOne fan) +</PRE> +<P> +The question mark <CODE>?</CODE> is a <B>metavariable</B>, and is returned by the parser +for any subtree that is suppressed by a linearization rule. +These are exactly the same kind of metavariables as were used <a href="#secediting">here</a> +to mark incomplete parts of trees in the syntax editor. +</P> +<P> +To get rid of metavariables, we must feed the parse result into the +second phase of <B>solving</B> them. The <CODE>solve</CODE> process uses the dependent +type checker to restore the values of the metavariables. It is invoked by +the command <CODE>put_tree = pt</CODE> with the flag <CODE>-transform=solve</CODE>: +</P> +<PRE> + > parse "dim the light" | put_tree -transform=solve + CAction light dim (DKindOne light) +</PRE> +<P> +The <CODE>solve</CODE> process may fail, in which case no tree is returned: +</P> +<PRE> + > parse "dim the fan" | put_tree -transform=solve + no tree found +</PRE> +<P></P> +<P> +<B>Exercise</B>. Write an abstract syntax module with above contents +and an appropriate English concrete syntax. Try to parse the commands +<I>dim the light</I> and <I>dim the fan</I>, with and without <CODE>solve</CODE> filtering. +</P> +<P> +<B>Exercise</B>. Perform random and exhaustive generation, with and without +<CODE>solve</CODE> filtering. +</P> +<P> +<B>Exercise</B>. Add some device kinds and actions to the grammar. +</P> +<A NAME="toc106"></A> +<H2>Polymorphism</H2> +<P> +<a name="secpolymorphic"></a> +</P> +<P> +Sometimes an action can be performed on all kinds of devices. It would be +possible to introduce separate <CODE>fun</CODE> constants for each kind-action pair, +but this would be tedious. Instead, one can use <B>polymorphic</B> actions, +i.e. actions that take a <CODE>Kind</CODE> as an argument and produce an <CODE>Action</CODE> +for that <CODE>Kind</CODE>: +</P> +<PRE> + fun switchOn, switchOff : (k : Kind) -> Action k ; +</PRE> +<P> +Functions that are not polymorphic are <B>monomorphic</B>. However, the +dichotomy into monomorphism and full polymorphism is not always sufficient +for good semantic modelling: very typically, some actions are defined +for a proper subset of devices, but not just one. For instance, both doors and +windows can be opened, whereas lights cannot. +We will return to this problem by introducing the +concept of <B>restricted polymorphism</B> later, +after a section on proof objects. +</P> +<P> +<B>Exercise</B>. The grammar <CODE>ExtFoods</CODE> <a href="#secextended">here</a> permits the +formation of phrases such as <I>we drink this fish</I> and <I>we eat this wine</I>. +A way to prevent them is to distinguish between eatable and drinkable food items. +Another, related problem is that there is some duplicated code +due to a category distinction between guests and food items, for instance, +two constructors for the determiner <I>this</I>. This problem can also +be solved by dependent types. Rewrite the abstract syntax in <CODE>Foods</CODE> and +<CODE>ExtFoods</CODE> by using such a type system, and also update the concrete syntaxes. +If you do this right, you only have to change the functor modules +<CODE>FoodsI</CODE> and <CODE>ExtFoodsI</CODE> in the concrete syntax. +</P> +<A NAME="toc107"></A> +<H3>Digression: dependent types in concrete syntax</H3> +<P> +The <B>functional fragment</B> of GF +terms and types comprises function types, applications, lambda +abstracts, constants, and variables. This fragment is the same in +abstract and concrete syntax. In particular, +dependent types are also available in concrete syntax. +We have not made use of them yet, +but we will now look at one example of how they +can be used. +</P> +<P> +Those readers who are familiar with functional programming languages +like ML and Haskell, may already have missed <B>polymorphic</B> +functions. For instance, Haskell programmers have access to +the functions +</P> +<PRE> + const :: a -> b -> a + const c _ = c + + flip :: (a -> b -> c) -> b -> a -> c + flip f y x = f x y +</PRE> +<P> +which can be used for any given types <CODE>a</CODE>,<CODE>b</CODE>, and <CODE>c</CODE>. +</P> +<P> +The GF counterpart of polymorphic functions are <B>monomorphic</B> +functions with explicit <B>type variables</B> --- a techniques that we already +used in abstract syntax for modelling actions that can be performed +on all kinds of devices. Thus the above definitions can be written +</P> +<PRE> + oper const :(a,b : Type) -> a -> b -> a = + \_,_,c,_ -> c ; + + oper flip : (a,b,c : Type) -> (a -> b ->c) -> b -> a -> c = + \_,_,_,f,x,y -> f y x ; +</PRE> +<P> +When the operations are used, the type checker requires +them to be equipped with all their arguments; this may be a nuisance +for a Haskell or ML programmer. They have not been used very much, +except in the <CODE>Coordination</CODE> module of the resource library. +</P> +<A NAME="toc108"></A> +<H2>Proof objects</H2> +<P> +Perhaps the most well-known idea in constructive type theory is +the <B>Curry-Howard isomorphism</B>, also known as the +<B>propositions as types principle</B>. Its earliest formulations +were attempts to give semantics to the logical systems of +propositional and predicate calculus. In this section, we will consider +a more elementary example, showing how the notion of proof is useful +outside mathematics, as well. +</P> +<P> +We use the already shown category of unary (also known as Peano-style) +natural numbers: +</P> +<PRE> + cat Nat ; + fun Zero : Nat ; + fun Succ : Nat -> Nat ; +</PRE> +<P> +The <B>successor function</B> <CODE>Succ</CODE> generates an infinite +sequence of natural numbers, beginning from <CODE>Zero</CODE>. +</P> +<P> +We then define what it means for a number <I>x</I> to be <I>less than</I> +a number <I>y</I>. Our definition is based on two axioms: +</P> +<UL> +<LI><CODE>Zero</CODE> is less than <CODE>Succ</CODE> <I>y</I> for any <I>y</I>. +<LI>If <I>x</I> is less than <I>y</I>, then <CODE>Succ</CODE> <I>x</I> is less than <CODE>Succ</CODE> <I>y</I>. +</UL> + +<P> +The most straightforward way of expressing these axioms in type theory +is with a dependent type <CODE>Less</CODE> <I>x y</I>, and two functions constructing +its objects: +</P> +<PRE> + cat Less Nat Nat ; + fun lessZ : (y : Nat) -> Less Zero (Succ y) ; + fun lessS : (x,y : Nat) -> Less x y -> Less (Succ x) (Succ y) ; +</PRE> +<P> +Objects formed by <CODE>lessZ</CODE> and <CODE>lessS</CODE> are +called <B>proof objects</B>: they establish the truth of certain +mathematical propositions. +For instance, the fact that 2 is less that +4 has the proof object +</P> +<PRE> + lessS (Succ Zero) (Succ (Succ (Succ Zero))) + (lessS Zero (Succ (Succ Zero)) (lessZ (Succ Zero))) +</PRE> +<P> +whose type is +</P> +<PRE> + Less (Succ (Succ Zero)) (Succ (Succ (Succ (Succ Zero)))) +</PRE> +<P> +which is the formalization of the proposition that 2 is less than 4. +</P> +<P> +GF grammars can be used to provide a <B>semantic control</B> of +well-formedness of expressions. We have already seen examples of this: +the grammar of well-formed actions on household devices. By introducing proof objects +we have now added an even more powerful technique of expressing semantic conditions. +</P> +<P> +A simple example of the use of proof objects is the definition of +well-formed <I>time spans</I>: a time span is expected to be from an earlier to +a later time: +</P> +<PRE> + from 3 to 8 +</PRE> +<P> +is thus well-formed, whereas +</P> +<PRE> + from 8 to 3 +</PRE> +<P> +is not. The following rules for spans impose this condition +by using the <CODE>Less</CODE> predicate: +</P> +<PRE> + cat Span ; + fun span : (m,n : Nat) -> Less m n -> Span ; +</PRE> +<P></P> +<P> +<B>Exercise</B>. Write an abstract and concrete syntax with the +concepts of this section, and experiment with it in GF. +</P> +<P> +<B>Exercise</B>. Define the notions of "even" and "odd" in terms +of proof objects. <B>Hint</B>. You need one function for proving +that 0 is even, and two other functions for propagating the +properties. +</P> +<A NAME="toc109"></A> +<H3>Proof-carrying documents</H3> +<P> +Another possible application of proof objects is <B>proof-carrying documents</B>: +to be semantically well-formed, the abstract syntax of a document must contain a proof +of some property, although the proof is not shown in the concrete document. +Think, for instance, of small documents describing flight connections: +</P> +<P> +<I>To fly from Gothenburg to Prague, first take LH3043 to Frankfurt, then OK0537 to Prague.</I> +</P> +<P> +The well-formedness of this text is partly expressible by dependent typing: +</P> +<PRE> + cat + City ; + Flight City City ; + fun + Gothenburg, Frankfurt, Prague : City ; + LH3043 : Flight Gothenburg Frankfurt ; + OK0537 : Flight Frankfurt Prague ; +</PRE> +<P> +This rules out texts saying <I>take OK0537 from Gothenburg to Prague</I>. +However, there is a +further condition saying that it must be possible to +change from LH3043 to OK0537 in Frankfurt. +This can be modelled as a proof object of a suitable type, +which is required by the constructor +that connects flights. +</P> +<PRE> + cat + IsPossible (x,y,z : City)(Flight x y)(Flight y z) ; + fun + Connect : (x,y,z : City) -> + (u : Flight x y) -> (v : Flight y z) -> + IsPossible x y z u v -> Flight x z ; +</PRE> +<P></P> +<A NAME="toc110"></A> +<H2>Restricted polymorphism</H2> +<P> +In the first version of the smart house grammar <CODE>Smart</CODE>, +all Actions were either of +</P> +<UL> +<LI><B>monomorphic</B>: defined for one Kind +<LI><B>polymorphic</B>: defined for all Kinds +</UL> + +<P> +To make this scale up for new Kinds, we can refine this to +<B>restricted polymorphism</B>: defined for Kinds of a certain <B>class</B> +</P> +<P> +The notion of class can be expressed in abstract syntax +by using the Curry-Howard isomorphism as follows: +</P> +<UL> +<LI>a class is a <B>predicate</B> of Kinds --- i.e. a type depending of Kinds +<LI>a Kind is in a class if there is a proof object of this type +</UL> + +<P> +Here is an example with switching and dimming. The classes are called +<CODE>switchable</CODE> and <CODE>dimmable</CODE>. +</P> +<PRE> + cat + Switchable Kind ; + Dimmable Kind ; + fun + switchable_light : Switchable light ; + switchable_fan : Switchable fan ; + dimmable_light : Dimmable light ; + + switchOn : (k : Kind) -> Switchable k -> Action k ; + dim : (k : Kind) -> Dimmable k -> Action k ; +</PRE> +<P> +One advantage of this formalization is that classes for new +actions can be added incrementally. +</P> +<P> +<B>Exercise</B>. Write a new version of the <CODE>Smart</CODE> grammar with +classes, and test it in GF. +</P> +<P> +<B>Exercise</B>. Add some actions, kinds, and classes to the grammar. +Try to port the grammar to a new language. You will probably find +out that restricted polymorphism works differently in different languages. +For instance, in Finnish not only doors but also TVs and radios +can be "opened", which means switching them on. +</P> +<A NAME="toc111"></A> +<H2>Variable bindings</H2> +<P> +<a name="secbinding"></a> +</P> +<P> +Mathematical notation and programming languages have +expressions that <B>bind</B> variables. For instance, +a universally quantifier proposition +</P> +<PRE> + (All x)B(x) +</PRE> +<P> +consists of the <B>binding</B> <CODE>(All x)</CODE> of the variable <CODE>x</CODE>, +and the <B>body</B> <CODE>B(x)</CODE>, where the variable <CODE>x</CODE> can have +<B>bound occurrences</B>. +</P> +<P> +Variable bindings appear in informal mathematical language as well, for +instance, +</P> +<PRE> + for all x, x is equal to x + + the function that for any numbers x and y returns the maximum of x+y + and x*y + + Let x be a natural number. Assume that x is even. Then x + 3 is odd. +</PRE> +<P> +In type theory, variable-binding expression forms can be formalized +as functions that take functions as arguments. The universal +quantifier is defined +</P> +<PRE> + fun All : (Ind -> Prop) -> Prop +</PRE> +<P> +where <CODE>Ind</CODE> is the type of individuals and <CODE>Prop</CODE>, +the type of propositions. If we have, for instance, the equality predicate +</P> +<PRE> + fun Eq : Ind -> Ind -> Prop +</PRE> +<P> +we may form the tree +</P> +<PRE> + All (\x -> Eq x x) +</PRE> +<P> +which corresponds to the ordinary notation +</P> +<PRE> + (All x)(x = x). +</PRE> +<P> +An abstract syntax where trees have functions as arguments, as in +the two examples above, has turned out to be precisely the right +thing for the semantics and computer implementation of +variable-binding expressions. The advantage lies in the fact that +only one variable-binding expression form is needed, the lambda abstract +<CODE>\x -> b</CODE>, and all other bindings can be reduced to it. +This makes it easier to implement mathematical theories and reason +about them, since variable binding is tricky to implement and +to reason about. The idea of using functions as arguments of +syntactic constructors is known as <B>higher-order abstract syntax</B>. +</P> +<P> +The question now arises: how to define linearization rules +for variable-binding expressions? +Let us first consider universal quantification, +</P> +<PRE> + fun All : (Ind -> Prop) -> Prop +</PRE> +<P> +In GF, we write +</P> +<PRE> + lin All B = {s = "(" ++ "All" ++ B.$0 ++ ")" ++ B.s} +</PRE> +<P> +to obtain the form shown above. +This linearization rule brings in a new GF concept --- the <CODE>$0</CODE> +field of <CODE>B</CODE> containing a bound variable symbol. +The general rule is that, if an argument type of a function is +itself a function type <CODE>A -> C</CODE>, the linearization type of +this argument is the linearization type of <CODE>C</CODE> +together with a new field <CODE>$0 : Str</CODE>. In the linearization rule +for <CODE>All</CODE>, the argument <CODE>B</CODE> thus has the linearization +type +</P> +<PRE> + {$0 : Str ; s : Str}, +</PRE> +<P> +since the linearization type of <CODE>Prop</CODE> is +</P> +<PRE> + {s : Str} +</PRE> +<P> +In other words, the linearization of a function +consists of a linearization of the body together with a +field for a linearization of the bound variable. +Those familiar with type theory or lambda calculus +should notice that GF requires trees to be in +<B>eta-expanded</B> form in order for this to make sense: +for any function of type +</P> +<PRE> + A -> B +</PRE> +<P> +an eta-expanded syntax tree has the form +</P> +<PRE> + \x -> b +</PRE> +<P> +where <CODE>b : B</CODE> under the assumption <CODE>x : A</CODE>. +It is in this form that an expression can be analysed +as having a bound variable and a body, which can be put into +a linearization record. +</P> +<P> +Given the linearization rule +</P> +<PRE> + lin Eq a b = {s = "(" ++ a.s ++ "=" ++ b.s ++ ")"} +</PRE> +<P> +the linearization of +</P> +<PRE> + \x -> Eq x x +</PRE> +<P> +is the record +</P> +<PRE> + {$0 = "x", s = ["( x = x )"]} +</PRE> +<P> +Thus we can compute the linearization of the formula, +</P> +<PRE> + All (\x -> Eq x x) --> {s = "[( All x ) ( x = x )]"}. +</PRE> +<P> +But how did we get the linearization of the variable <CODE>x</CODE> +into the string <CODE>"x"</CODE>? GF grammars have no rules for +this: it is just hard-wired in GF that variable symbols are +linearized into the same strings that represent them in +the print-out of the abstract syntax. +</P> +<P> +To be able to <I>parse</I> variable symbols, however, GF needs to know what +to look for (instead of e.g. trying to parse <I>any</I> +string as a variable). What strings are parsed as variable symbols +is defined in the lexical analysis part of GF parsing +</P> +<PRE> + > p -cat=Prop -lexer=codevars "(All x)(x = x)" + All (\x -> Eq x x) +</PRE> +<P> +(see more details on lexers <a href="#seclexing">here</a>). If several variables are bound in the +same argument, the labels are <CODE>$0, $1, $2</CODE>, etc. +</P> +<P> +<B>Exercise</B>. Write an abstract syntax of the whole +<B>predicate calculus</B>, with the +<B>connectives</B> "and", "or", "implies", and "not", and the +<B>quantifiers</B> "exists" and "for all". Use higher-order functions +to guarantee that unbounded variables do not occur. +</P> +<P> +<B>Exercise</B>. Write a concrete syntax for your favourite +notation of predicate calculus. Use Latex as target language +if you want nice output. You can also try producing boolean +expressions of some programming language. Use as many parenthesis as you need to +guarantee non-ambiguity. +</P> +<A NAME="toc112"></A> +<H2>Semantic definitions</H2> +<P> +<a name="secdefdef"></a> +</P> +<P> +Just like any functional programming language, abstract syntax in +GF has declarations of functions, telling what the type of a function is. +But we have not yet shown how to <B>compute</B> +these functions: all we can do is provide them with arguments +and linearize the resulting terms. +Since our main interest is the well-formedness of expressions, +this has not yet bothered +us very much. As we will see, however, computation does play a role +even in the well-formedness of expressions when dependent types are +present. +</P> +<P> +GF has a form of judgement for <B>semantic definitions</B>, +marked by the key word <CODE>def</CODE>. At its simplest, it is just +the definition of one constant, e.g. +</P> +<PRE> + fun one : Nat ; + def one = Succ Zero ; +</PRE> +<P> +Notice a <CODE>def</CODE> definition can only be given to names declared by +<CODE>fun</CODE> judgements in the same module; it is not possible to define +an inherited name. +</P> +<P> +We can also define a function with arguments, +</P> +<PRE> + fun twice : Nat -> Nat ; + def twice x = plus x x ; +</PRE> +<P> +which is still a special case of the most general notion of +definition, that of a group of <B>pattern equations</B>: +</P> +<PRE> + fun plus : Nat -> Nat -> Nat ; + def + plus x Zero = x ; + plus x (Succ y) = Succ (Sum x y) ; +</PRE> +<P> +To compute a term is, as in functional programming languages, +simply to follow a chain of reductions until no definition +can be applied. For instance, we compute +</P> +<PRE> + Sum one one --> + Sum (Succ Zero) (Succ Zero) --> + Succ (sum (Succ Zero) Zero) --> + Succ (Succ Zero) +</PRE> +<P> +Computation in GF is performed with the <CODE>pt</CODE> command and the +<CODE>compute</CODE> transformation, e.g. +</P> +<PRE> + > p -tr "1 + 1" | pt -transform=compute -tr | l + sum one one + Succ (Succ Zero) + s(s(0)) +</PRE> +<P></P> +<P> +The <CODE>def</CODE> definitions of a grammar induce a notion of +<B>definitional equality</B> among trees: two trees are +definitionally equal if they compute into the same tree. +Thus, trivially, all trees in a chain of computation +(such as the one above) are definitionally equal to each other. +In general, there can be infinitely many definitionally equal trees. +</P> +<P> +An important property of definitional equality is that it is +<B>extensional</B>, i.e. has to do with the sameness of semantic value. +Linearization, on the other hand, is an <B>intensional</B> operation, +i.e. has to do with the sameness of expression. This means that +<CODE>def</CODE> definitions are <I>not</I> evaluated as linearization steps. +Intensionality is a crucial property of linearization, since we want +to use it for things like tracing a chain of evaluation. +For instance, each of the steps of the computation above +has a different linearization into standard arithmetic notation: +</P> +<PRE> + 1 + 1 + s(0) + s(0) + s(s(0) + 0) + s(s(0)) +</PRE> +<P> +In most programming languages, the operations that can be performed on +expressions are extensional, i.e. give equal values to equal arguments. +But GF has both extensional and intensional operations. +Type checking is extensional: +in the type theory with dependent types, types may depend on terms, +and types depending on definitionally equal terms are +equal types. For instance, +</P> +<PRE> + Less Zero one + Less Zero (Succ Zero)) +</PRE> +<P> +are equal types. Hence, any tree that type checks as a proof that +1 is odd also type checks as a proof that the successor of 0 is odd. +(Recall, in this connection, that the +arguments a category depends on never play any role +in the linearization of trees of that category, +nor in the definition of the linearization type.) +</P> +<P> +When pattern matching is performed with <CODE>def</CODE> equations, it is +crucial to distinguish between <B>constructors</B> and other functions +(cf. <a href="#secmatching">here</a> on pattern matching in concrete syntax). +GF has a judgement form <CODE>data</CODE> to tell that a category has +certain functions as constructors: +</P> +<PRE> + data Nat = Succ | Zero ; +</PRE> +<P> +Unlike in Haskell and ML, new constructors can be added to +a type with new <CODE>data</CODE> judgements. The type signatures of constructors +are given separately, in ordinary <CODE>fun</CODE> judgements. +One can also write directly +</P> +<PRE> + data Succ : Nat -> Nat ; +</PRE> +<P> +which is syntactic sugar for the pair of judgements +</P> +<PRE> + fun Succ : Nat -> Nat ; + data Nat = Succ ; +</PRE> +<P> +If we did not mark <CODE>Zero</CODE> as <CODE>data</CODE>, the definition +</P> +<PRE> + fun isZero : Nat -> Bool ; + def isZero Zero = True ; + def isZero _ = False ; +</PRE> +<P> +would return <CODE>True</CODE> for all arguments, because the pattern <CODE>Zero</CODE> +would be treated as a variable and it would hence match all values! +This is a common pitfall in GF. +</P> +<P> +<B>Exercise</B>. Implement an interpreter of a small functional programming +language with natural numbers, lists, pairs, lambdas, etc. Use higher-order +abstract syntax with semantic definitions. As onject language, use +your favourite programming language. +</P> +<A NAME="toc113"></A> +<H2>Summary of GF language features</H2> +<A NAME="toc114"></A> +<H3>Judgements</H3> +<P> +We have generalized the <CODE>cat</CODE> judgement form and introduced two new forms +for abstract syntax: +</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>form</TH> +<TH COLSPAN="2">reading</TH> +</TR> +<TR> +<TD><CODE>cat</CODE> <I>C</I> <I>G</I></TD> +<TD><I>C</I> is a category in context <I>G</I></TD> +</TR> +<TR> +<TD><CODE>def</CODE> <I>f</I> <I>P1</I> ... <I>Pn</I> <CODE>=</CODE> t</TD> +<TD>function <I>f</I> applied to <I>P1</I>...<I>Pn</I> has value <I>t</I></TD> +</TR> +<TR> +<TD><CODE>data</CODE> <I>C</I> <CODE>=</CODE> <I>C1</I> <CODE>|</CODE> ... <CODE>|</CODE> <I>Cn</I></TD> +<TD>category <I>C</I> has constructors <I>C1</I>...<I>Cn</I></TD> +</TR> +</TABLE> + +<P></P> +<P> +The <B>context</B> in the <CODE>cat</CODE> judgement has the form +</P> +<PRE> + (x1 : T1) ... (xn : Tn) +</PRE> +<P> +where the types <I>T1 ... Tn</I> may be increasingly dependent. To form a +type, <I>C</I> must be equipped with arguments of each type in the +context, satisfying the dependencies. As syntactic sugar, we have +</P> +<PRE> + T G === (x : T) G +</PRE> +<P> +if <I>x</I> does not occur in <I>G</I>. The linearization type definition of a +category does not mention the context. +</P> +<P> +In <CODE>def</CODE> judgements, the arguments <I>P1</I>...<I>Pn</I> can be constructor and +variable patterns as well as wild cards, and the binding and +evaluation rules are the same as <a href="#secmatching">here</a>. +</P> +<P> +A <CODE>data</CODE> judgement states that the names on the right-hand side are constructors +of the category on the left-hand side. The precise types of the constructors are +given in the <CODE>fun</CODE> judgements introducing them; the value type of a constructor +of <I>C</I> must be of the form <I>C a1 ... am</I>. As syntactic sugar, +</P> +<PRE> + data f : A1 ... An -> C a1 ... am === + fun f : A1 ... An -> C a1 ... am ; data C = f ; +</PRE> +<P></P> +<A NAME="toc115"></A> +<H3>Dependent function types</H3> +<P> +A <B>dependent function type</B> has the form +</P> +<PRE> + (x : A) -> B +</PRE> +<P> +where <I>B</I> depends on a variable <I>x</I> of type <I>A</I>. We have the +following syntactic sugar: +</P> +<PRE> + (x,y : A) -> B === (x : A) -> (y : A) -> B + + (_ : A) -> B === (x : A) -> B if B does not depend on x + + A -> B === (_ : A) -> B +</PRE> +<P> +A <CODE>fun</CODE> function in abstract syntax may have function types as +argument types. This is called <B>higher-order abstract syntax</B>. +The linearization of an argument +</P> +<PRE> + \z0, ..., zn -> b : (x0 : A1) -> ... -> (xn : An) -> B +</PRE> +<P> +if formed from the linearization of <I>b*</I> of <I>b</I> by adding +fields that hold the variable symbols: +</P> +<PRE> + b* ** {$0 = "z0" ; ... ; $n = "zn"} +</PRE> +<P> +If an argument function is itself a higher-order function, its +bound variables cannot be reached in linearization. Thus, in a sense, +the higher-order abstract syntax of GF is just <B>second-orde abstract syntax</B>. +</P> +<P> +A <B>syntax tree</B> is a well-typed term in <B>beta-eta normal form</B>, which +means that +</P> +<UL> +<LI>its type is a basic type, i.e. it is not a partial application; +<LI>its arguments are in eta normal form, i.e. either full applications or + lambda abstractions with bodies that are full applications; +<LI>it has no beta redexes, i.e. applications of abstractions. +</UL> + +<P> +Terms that are not in this form may occur as arguments of dependent types +and in <CODE>def</CODE> judgements, but they cannot be linearized. +</P> +<A NAME="toc116"></A> +<H1>Grammars of formal languages</H1> +<P> +<a name="chapseven"></a> +</P> +<P> +In this chapter, we will write a grammar for arithmetic expressions as known +from school mathematics and many programming languages. We will see how to +define precedences in GF, how to include built-in integers in grammars, and +how to deal with spaces between tokens in desired ways. As an alternative concrete +syntax, we will generate code for a JVM-like stack machine. We will conclude +by extending the language with variable declarations and assignments, which +are handled in a type-safe way by using higher-order abstract syntax. +</P> +<P> +To write grammars for formal languages is usually less challenging than for +natural languages. There are standard tools for this task, such as the YACC +family of parser generators. Using GF would be overkill for many projects, +and come with a penalty in efficiency. However, it is still worth while to +look at this task. A typical application of GF are natural-language interfaces +to formal systems: in such applications, the translation between natural and +formal language can be defined as a multilingual grammar. The use of higher-order +abstract syntax, together with dependent types, provides a way to define a +complete compiler in GF. +</P> +<A NAME="toc117"></A> +<H2>Arithmetic expressions</H2> +<A NAME="toc118"></A> +<H3>Abstract syntax</H3> +<P> +We want to write a grammar for what is usually called <B>expressions</B> +in programming languages. The expressions are built from integers by +the binary operations of addition, subtraction, multiplication, and +division. The abstract syntax is easy to write. We call it <CODE>Calculator</CODE>, +since it can be used as the basis of a calculator. +</P> +<PRE> + abstract Calculator = { + + cat Exp ; + + fun + EPlus, EMinus, ETimes, EDiv : Exp -> Exp -> Exp ; + EInt : Int -> Exp ; + } +</PRE> +<P> +Notice the use of the category <CODE>Int</CODE>. It is a built-in category of +integers. Its syntax trees are denoted by <B>integer literals</B>, which are +sequences of digits. For instance, +</P> +<PRE> + 5457455814608954681 : Int +</PRE> +<P> +These are the only objects of type <CODE>Int</CODE>: +grammars are not allowed to declare functions with <CODE>Int</CODE> as value type. +</P> +<A NAME="toc119"></A> +<H3>Concrete syntax: a simple approach</H3> +<P> +Arithmetic expressions should be unambiguous. If we write +</P> +<PRE> + 2 + 3 * 4 +</PRE> +<P> +it should be parsed as one, but not both, of +</P> +<PRE> + EPlus (EInt 2) (ETimes (EInt 3) (EInt 4)) + ETimes (EPlus (EInt 2) (EInt 3)) (EInt 4) +</PRE> +<P> +Under normal conventions, the former is chosen, because +multiplication has <B>higher precedence</B> than addition. +If we want to express the latter tree, we have to use +parentheses: +</P> +<PRE> + (2 + 3) * 4 +</PRE> +<P> +However, it is not completely trivial to decide when to use +parentheses and when not. We will therefore begin with a +concrete syntax that always uses parentheses around binary +operator applications. +</P> +<PRE> + concrete CalculatorP of Calculator = { + + lincat + Exp = SS ; + lin + EPlus = infix "+" ; + EMinus = infix "-" ; + ETimes = infix "*" ; + EDiv = infix "/" ; + EInt i = i ; + + oper + infix : Str -> SS -> SS -> SS = \f,x,y -> + ss ("(" ++ x.s ++ f ++ y.s ++ ")") ; + } +</PRE> +<P> +Now we will obtain +</P> +<PRE> + > linearize EPlus (EInt 2) (ETimes (EInt 3) (EInt 4)) + ( 2 + ( 3 * 4 ) ) +</PRE> +<P> +The first problem, even more urgent than superfluous parentheses, is +to get rid of superfluous spaces and to recognize integer literals +in the parser. +</P> +<A NAME="toc120"></A> +<H2>Lexing and unlexing</H2> +<P> +<a name="seclexing"></a> +</P> +<P> +The input of parsing in GF is not just a string, but a list of +<B>tokens</B>. By default, a list of tokens is obtained from a string +by analysing it into <B>words</B>, which means chunks separated by +spaces. Thus for instance +</P> +<PRE> + "(12 + (3 * 4))" +</PRE> +<P> +is split into the tokens +</P> +<PRE> + "(12", "+", "(3". "*". "4))" +</PRE> +<P> +The parser then tries to find each of these tokens among the terminals +of the grammar, i.e. among the strings that can appear in linearizations. +In our example, only the tokens <CODE>"+"</CODE> and <CODE>"*"</CODE> can be found, and +parsing therefore fails. +</P> +<P> +The proper way to split the above string into tokens would be +</P> +<PRE> + "(", "12", "+", "(", "3", "*", "4", ")", ")" +</PRE> +<P> +Moreover, the tokens <CODE>"12"</CODE>, <CODE>"3"</CODE>, and <CODE>"4"</CODE> should not be sought +among the terminals in the grammar, but treated as integer tokens, which +are defined outside the grammar. Since GF aims to be fully general, such +conventions are not built in: it must be possible for a grammar to have +tokens such as <CODE>"12"</CODE> and <CODE>"12)"</CODE>. Therefore, GF has a way to select +a <B>lexer</B>, a function that splits strings into tokens and classifies +them into terminals, literalts, etc. +</P> +<P> +A lexer can be given as a flag to the parsing command: +</P> +<PRE> + > parse -cat=Exp -lexer=codelit "(2 + (3 * 4))" + EPlus (EInt 2) (ETimes (EInt 3) (EInt 4)) +</PRE> +<P> +Since the lexer is usually a part of the language specification, it +makes sense to put it in the concrete syntax by using the judgement +</P> +<PRE> + flags lexer = codelit ; +</PRE> +<P> +The problem of getting correct spacing after linearization is likewise solved +by an <B>unlexer</B>: +</P> +<PRE> + > l -unlexer=code EPlus (EInt 2) (ETimes (EInt 3) (EInt 4)) + (2 + (3 * 4)) +</PRE> +<P> +Also this flag is usually put into the concrete syntax file. +</P> +<P> +The lexers and unlexers that are available in the GF system can be +seen by +</P> +<PRE> + > help -lexer + > help -unlexer +</PRE> +<P> +A table of the most common lexers and unlexers is given in the Summary +section 7.8. +</P> +<A NAME="toc121"></A> +<H2>Precedence and fixity</H2> +<P> +<a name="secprecedence"></a> +</P> +<P> +Here is a summary of the usual +precedence rules in mathematics and programming languages: +</P> +<UL> +<LI>Integer constants and expressions in parentheses have the highest precedence. +<LI>Multiplication and division have equal precedence, lower than the highest + but higher than addition and subtraction, which are again equal. +<LI>All the four binary operations are <B>left-associative</B>, which means that + e.g. <CODE>1 + 2 + 3</CODE> means the same as <CODE>(1 + 2) + 3</CODE>. +</UL> + +<P> +One way of dealing with precedences in compiler books is by dividing expressions +into three categories: +</P> +<UL> +<LI>expressions: addition and subtraction +<LI>terms: multiplication and division +<LI>factors: constants and expressions in parentheses +</UL> + +<P> +The context-free grammar, also taking care of associativity, is the following: +</P> +<PRE> + Exp ::= Exp "+" Term | Exp "-" Term | Term ; + Term ::= Term "*" Fact | Term "/" Fact | Fact ; + Fact ::= Int | "(" Exp ")" ; +</PRE> +<P> +A compiler, however, does not want to make a semantic distinction between the +three categories. Nor does it want to build syntax trees with the +<B>coercions</B> that enable the use of a higher level expressions on a lower, and +encode the use of parentheses. In compiler tools such as YACC, building abstract +syntax trees is performed as a <B>semantic action</B>. For instance, if the parser +recognizes an expression in parentheses, the action is to return only the +expression, without encoding the parentheses. +</P> +<P> +In GF, semantic actions could be encoded by using <CODE>def</CODE> definitions introduced +<a href="#secdefdef">here</a>. But there is a more straightforward way of thinking about +precedences: we introduce a parameter for precedence, and treat it as +an inherent feature of expressions: +</P> +<PRE> + oper + param Prec = Ints 2 ; + TermPrec : Type = {s : Str ; p : Prec} ; + + mkPrec : Prec -> Str -> TermPrec = \p,s -> {s = s ; p = p} ; + + lincat + Exp = TermPrec ; +</PRE> +<P> +This example shows another way to use built-in integers in GF: +the type <CODE>Ints 2</CODE> is a parameter type, whose values are the integers +<CODE>0,1,2</CODE>. These are the three precedence levels we need. The main idea +is to compare the inherent precedence of an expression with the context +in which it is used. If the precedence is higher than or equal to +the expected, then +no parentheses are needed. Otherwise they are. We encode this rule in +the operation +</P> +<PRE> + oper usePrec : TermPrec -> Prec -> Str = \x,p -> + case lessPrec x.p p of { + True => "(" x.s ")" ; + False => x.s + } ; +</PRE> +<P> +With this operation, we can build another one, that can be used for +defining left-associative infix expressions: +</P> +<PRE> + infixl : Prec -> Str -> (_,_ : TermPrec) -> TermPrec = \p,f,x,y -> + mkPrec p (usePrec x p ++ f ++ usePrec y (nextPrec p)) ; +</PRE> +<P> +Constant-like expressions (the highest level) can be built simply by +</P> +<PRE> + constant : Str -> TermPrec = mkPrec 2 ; +</PRE> +<P> +All these operations can be found in the library module <CODE>lib/prelude/Formal</CODE>, +so we don't have to define them in our own code. Also the auxiliary operations +<CODE>nextPrec</CODE> and <CODE>lessPrec</CODE> used in their definitions are defined there. +The library has 5 levels instead of 3. +</P> +<P> +Now we can express the whole concrete syntax of <CODE>Calculator</CODE> compactly: +</P> +<PRE> + concrete CalculatorC of Calculator = open Formal, Prelude in { + + flags lexer = codelit ; unlexer = code ; startcat = Exp ; + + lincat Exp = TermPrec ; + + lin + EPlus = infixl 0 "+" ; + EMinus = infixl 0 "-" ; + ETimes = infixl 1 "*" ; + EDiv = infixl 1 "/" ; + EInt i = constant i.s ; + } +</PRE> +<P> +Let us just take one more look at the operation <CODE>usePrec</CODE>, which decides whether +to put parentheses around a term or not. The case where parentheses are not needed +around a string was defined as the string itself. +However, this would imply that superfluous parentheses +are never correct. A more liberal grammar is obtained by using the operation +</P> +<PRE> + parenthOpt : Str -> Str = \s -> variants {s ; "(" ++ s ++ ")"} ; +</PRE> +<P> +which is actually used in the <CODE>Formal</CODE> library. +But even in this way, we can only allow one pair of superfluous parentheses. +Thus the parameter-based grammar has not quite reached the goal +of implementing the same language as the expression-term-factor grammar. +But it has the advantage of eliminating precedence distinctions from the +abstract syntax. +</P> +<P> +<B>Exercise</B>. Define non-associative and right-associative infix operations +analogous to <CODE>infixl</CODE>. +</P> +<P> +<B>Exercise</B>. Add a constructor that puts parentheses around expressions +to raise their precedence, but that is eliminated by a <CODE>def</CODE> definition. +Test parsing with and without a pipe to <CODE>pt -transform=compute</CODE>. +</P> +<A NAME="toc122"></A> +<H2>Code generation as linearization</H2> +<P> +The classical use of grammars of programming languages is in <B>compilers</B>, +which translate one language into another. Typically the source language of +a compiler is a high-level language and the target language is a machine +language. The hub of a compiler is abstract syntax: the <B>front end</B> of +the compiler parses source language strings into abstract syntax trees, and +the <B>back end</B> linearizes these trees into the target language. This processing +model is of course what GF uses for natural language translation; the main +difference is that, in GF, the compiler could run in the opposite direction as +well, that is, function as a <B>decompiler</B>. (In full-size compilers, the +abstract syntax is also transformed by several layers of semantic analysis +and optimizations, before the target code is generated; this can destroy +reversibility and hence decompilation.) +</P> +<P> +More for the sake of illustration +than as a serious compiler, let us write a concrete +syntax of <CODE>Calculator</CODE> that generates machine code similar to JVM (Java Virtual +Machine). JVM is a so-called <B>stack machine</B>, whose code follows the +<B>postfix</B> notation, also known as <B>reverse Polish</B> notation. Thus the +expression +</P> +<PRE> + 2 + 3 * 4 +</PRE> +<P> +is translated to +</P> +<PRE> + iconst 2 : iconst 3 ; iconst 4 ; imul ; iadd +</PRE> +<P> +The linearization rules are not difficult to give: +</P> +<PRE> + lin + EPlus = postfix "iadd" ; + EMinus = postfix "isub" ; + ETimes = postfix "imul" ; + EDiv = postfix "idiv" ; + EInt i = ss ("iconst" ++ i.s) ; + oper + postfix : Str -> SS -> SS -> SS = \op,x,y -> + ss (x.s ++ ";" ++ y.s ++ ";" ++ op) ; +</PRE> +<P></P> +<A NAME="toc123"></A> +<H2>Speaking aloud arithmetic expressions</H2> +<P> +Natural languages have sometimes difficulties in expressing mathematical +formulas unambiguously, because they have no universal device of parentheses. +For arithmetic formulas, a solution exists: +</P> +<PRE> + 2 + 3 * 4 +</PRE> +<P> +can be expressed +</P> +<PRE> + the sum of 2 and the product of 3 and 4 +</PRE> +<P> +However, this format is very verbose and unnatural, and becomes +impossible to understand when the complexity of expressions grows. +Fortunately, spoken language +has a very nice way of using <B>pauses</B> for disambiguation. This device was +introduced by Torbjörn Lager (personal communication, 2003) +as an input mechanism to a calculator dialogue +system; it seems to correspond very closely to how we actually speak when we +want to communicate arithmetic expressions. Another application would be as +a part of a programming assistant that reads aloud code. +</P> +<P> +The idea is that, after every completed operation, there is a pause. Try this +by speaking aloud the following lines, making a pause instead of pronouncing the +word <CODE>PAUSE</CODE>: +</P> +<PRE> + 2 plus 3 times 4 PAUSE + 2 plus 3 PAUSE times 4 PAUSE +</PRE> +<P> +A grammar implementing this convention is again simple to write: +</P> +<PRE> + lin + EPlus = infix "plus" ; + EMinus = infix "minus" ; + ETimes = infix "times" ; + EDiv = infix ["divided by"] ; + EInt i = i ; + oper + infix : Str -> SS -> SS -> SS = \op,x,y -> + ss (x.s ++ op ++ y.s ++ "PAUSE") ; +</PRE> +<P> +Intuitively, a pause is taken to give the hearer time to compute an +intermediate result. +</P> +<P> +<B>Exercise</B>. Is the pause-based grammar unambiguous? Test with random examples! +</P> +<A NAME="toc124"></A> +<H2>Programs with variables</H2> +<P> +A useful extension of arithmetic expressions is a <B>straight code</B> programming +language. The programs of this language are <B>assignments</B> of the form <CODE>x = exp</CODE>, +which assign expressions to variables. Expressions can moreover contain variables +that have been given values in previous assignments. +</P> +<P> +In this language, we use two new categories: programs and variables. +A program is a sequence of assignments, where a variable is given a value. +Logically, we want to distinguish <B>initializations</B> from other assignments: +these are the assignments where a variable is given a value for the first time. +Just like in C-like languages, +we prefix an initializing assignment with the type of the variable. +Here is an example of a piece of code written in the language: +</P> +<PRE> + int x = 2 + 3 ; + int y = x + 1 ; + x = x + 9 * y ; +</PRE> +<P> +We define programs by the following constructors: +</P> +<PRE> + fun + PEmpty : Prog ; + PInit : Exp -> (Var -> Prog) -> Prog ; + PAss : Var -> Exp -> Prog -> Prog ; +</PRE> +<P> +The interesting constructor is <CODE>PInit</CODE>, which uses +higher-order abstract syntax for making the initialized variable available in +the <B>continuation</B> of the program. The abstract syntax tree for the above code +is +</P> +<PRE> + PInit (EPlus (EInt 2) (EInt 3)) (\x -> + PInit (EPlus (EVar x) (EInt 1)) (\y -> + PAss x (EPlus (EVar x) (ETimes (EInt 9) (EVar y))) + PEmpty)) +</PRE> +<P> +Since we want to prevent the use of uninitialized variables in programs, we +don't give any constructors for <CODE>Var</CODE>! We just have a rule for using variables +as expressions: +</P> +<PRE> + fun EVar : Var -> Exp ; +</PRE> +<P> +The rest of the grammar is just the same as for arithmetic expressions +<a href="#secprecedence">here</a>. The best way to implement it is perhaps by writing a +module that extends the expression module. The most natural start category +of the extension is <CODE>Prog</CODE>. +</P> +<P> +<B>Exercise</B>. Extend the straight-code language to expressions of type <CODE>float</CODE>. +To guarantee type safety, you can define a category <CODE>Typ</CODE> of types, and +make <CODE>Exp</CODE> and <CODE>Var</CODE> dependent on <CODE>Typ</CODE>. Basic floating point expressions +can be formed from literal of the built-in GF type <CODE>Float</CODE>. The arithmetic +operations should be made polymorphic (as <a href="#secpolymorphic">here</a>). +</P> +<A NAME="toc125"></A> +<H3>The concrete syntax of assignments</H3> +<P> +We can define a C-like concrete syntax by using GF's <CODE>$</CODE> variables, as explained +<a href="#secbinding">here</a>. +</P> +<P> +In a JVM-like syntax, we need two more instructions: <CODE>iload</CODE> <I>x</I>, which +loads (pushes on the stack) the value of the variable <I>x</I>, and <CODE>istore</CODE> <I>x</I>, +which stores the value of the currently topmost expression in the variable <I>x</I>. +Thus the code for the example in the previous section is +</P> +<PRE> + iconst 2 ; iconst 3 ; iadd ; istore x ; + iload x ; iconst 1 ; iadd ; istore y ; + iload x ; iconst 9 ; iload y ; imul ; iadd ; istore x ; +</PRE> +<P> +Those familiar with JVM will notice that we are using <B>symbolic addresses</B>, i.e. +variable names instead of integer offsets in the memory. Neither real JVM nor +our variant makes any distinction between the initialization and reassignment +of a variable. +</P> +<P> +<B>Exercise</B>. Finish the implementation of the +C-to-JVM compiler by extending the expression modules +to straight code programs. +</P> +<P> +<B>Exercise</B>. If you made the exercise of adding floating point numbers to +the language, you can now cash out the main advantage of type checking +for code generation: selecting type-correct JVM instructions. The floating +point instructions are precisely the same as the integer one, except that +the prefix is <CODE>f</CODE> instead of <CODE>i</CODE>, and that <CODE>fconst</CODE> takes floating +point literals as arguments. +</P> +<A NAME="toc126"></A> +<H3>A liberal syntax of variables</H3> +<P> +In many applications, the task of GF is just linearization and parsing; +keeping track of bound variables and other semantic constraints is +the task of other parts of the program. For instance, if we want to +write a natural language interface that reads aloud C code, we can +quite as well use a context-free grammar of C, and leave it to the C +compiler to check that variables make sense. In such a program, we may +want to treat variables as <I>Strings</I>, i.e. to have a constructor +</P> +<PRE> + fun VString : String -> Var ; +</PRE> +<P> +The built-in category <CODE>String</CODE> has as its values <B>string literals</B>, +which are strings in double quotes. The lexer and unlexer <CODE>codelit</CODE> +restore and remove the quotes; when the lexer finds a token that is +neither a terminal in the grammar nor an integer literal, it sends +it to the parser as a string literal. +</P> +<P> +<B>Exercise</B>. Write a grammar for straight code without higher-order +abstract syntax. +</P> +<P> +<B>Exercise</B>. Extend the liberal straight code grammar to <CODE>while</CODE> loops and +some other program constructs, and investigate if you can build a reasonable spoken +language generator for this fragment. +</P> +<A NAME="toc127"></A> +<H2>Conclusion</H2> +<P> +Since formal languages are syntactically simpler than natural languages, it +is no wonder that their grammars can be defined in GF. Some thought is needed +for dealing with precedences and spacing, but much of it is encoded in GF's +libraries and built-in lexers and unlexers. If the sole purpose of a grammar +is to implement a programming language, then the <B>BNF Converter</B> tool +(BNFC) is more appropriate than GF: +<center> +<CODE>www.cs.chalmers.se/~markus/BNFC/</CODE> +</center> +BNFC uses standard YACC-like parser tools. GF has flags for printing +grammars in the BNFC format. +</P> +<P> +The most common applications of GF grammars of formal languages +are in natural-language interfaces of various kinds. +These systems don't usually need semantic control in GF abstract +syntax. However, the situation can be different if the interface also comprises +an interactive syntax editor, as in the GF-Key system +(Beckert & al. 2006, Burke & Johannisson 2005). +In that system, the editor is used for guiding programmers only to write +type-correct code. +</P> +<P> +The technique of continuations in modelling programming languages has recently +been applied to natural language, for processing <B>anaphoric reference</B>, +e.g. pronouns. It may be good to know that GF has the machinery available; +for the time being, however (GF 2.8), dependent types and +higher-order abstract syntax are not supported by the embedded GF implementations +in Haskell and Java. +</P> +<P> +<B>Exercise</B>. The book <I>C programming language</I> by Kernighan and Ritchie +(p. 123, 2nd edition, 1988) describes an English-like syntax for pointer and +array declarations, and a C program for translating between English and C. +The following example pair shows all the expression forms needed: +</P> +<PRE> + char (*(*x[3])())[5] + + x: array[3] of pointer to function returning + pointer to array[5] of char +</PRE> +<P> +Implement these translations by a GF grammar. +</P> +<P> +<B>Exercise</B>. Design a natural-language interface to Unix command lines. +It should be able to express verbally commands such as +<CODE>cat, cd, grep, ls, mv, rm, wc</CODE> and also +pipes built from them. +</P> +<A NAME="toc128"></A> +<H2>Summary of GF language constructs</H2> +<A NAME="toc129"></A> +<H3>Lexers and unlexers</H3> +<P> +Lexers are set by the flag <CODE>lexer</CODE> and unlexers by the flag <CODE>unlexer</CODE>. +</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>lexer</TH> +<TH COLSPAN="2">description</TH> +</TR> +<TR> +<TD><CODE>words</CODE></TD> +<TD>(default) tokens are separated by spaces or newlines</TD> +</TR> +<TR> +<TD><CODE>literals</CODE></TD> +<TD>like words, but integer and string literals recognized</TD> +</TR> +<TR> +<TD><CODE>chars</CODE></TD> +<TD>each character is a token</TD> +</TR> +<TR> +<TD><CODE>code</CODE></TD> +<TD>program code conventions (uses Haskell's lex)</TD> +</TR> +<TR> +<TD><CODE>text</CODE></TD> +<TD>with conventions on punctuation and capital letters</TD> +</TR> +<TR> +<TD><CODE>codelit</CODE></TD> +<TD>like code, but recognize literals (unknown words as strings)</TD> +</TR> +<TR> +<TD><CODE>textlit</CODE></TD> +<TD>like text, but recognize literals (unknown words as strings)</TD> +</TR> +</TABLE> + +<P></P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>unlexer</TH> +<TH COLSPAN="2">description</TH> +</TR> +<TR> +<TD><CODE>unwords</CODE></TD> +<TD>(default) space-separated token list</TD> +</TR> +<TR> +<TD><CODE>text</CODE></TD> +<TD>format as text: punctuation, capitals, paragraph <p></TD> +</TR> +<TR> +<TD><CODE>code</CODE></TD> +<TD>format as code (spacing, indentation)</TD> +</TR> +<TR> +<TD><CODE>textlit</CODE></TD> +<TD>like text, but remove string literal quotes</TD> +</TR> +<TR> +<TD><CODE>codelit</CODE></TD> +<TD>like code, but remove string literal quotes</TD> +</TR> +<TR> +<TD><CODE>concat</CODE></TD> +<TD>remove all spaces</TD> +</TR> +</TABLE> + +<P></P> +<A NAME="toc130"></A> +<H3>Built-in abstract syntax types</H3> +<P> +There are three built-in types. Their syntax trees are literals of corresponding kinds: +</P> +<UL> +<LI><CODE>Int</CODE>, with nonnegative integer literals e.g. <CODE>987031434</CODE> +<LI><CODE>Float</CODE>, with nonnegative floating-point literals e.g. <CODE>907.219807</CODE> +<LI><CODE>String</CODE>, with string literals e.g. <CODE>"foo"</CODE> +</UL> + +<P> +Their linearization type is uniformly <CODE>{s : Str}</CODE>. +</P> +<A NAME="toc131"></A> +<H1>Embedded grammars</H1> +<P> +<a name="chapeight"></a> +</P> +<P> +GF grammars can be used as parts of programs written in other programming +languages. Haskell and Java. +This facility is based on several components: +</P> +<UL> +<LI>a portable format for multilingual GF grammars +<LI>an interpreter for this format written in the host language +<LI>an API that enables reading grammar files and calling the interpreter +<LI>a way to manipulate abstract syntax trees in the host language +</UL> + +<P> +In this chapter, we will show the basic ways of producing such +<B>embedded grammars</B> and using them in Haskell, Java, and JavaScript programs. +We will build a simple example application in each language: +</P> +<UL> +<LI>a question-answering system in Haskell +<LI>a translator GUI in Java +<LI>a multilingual syntax editor in JavaScript +</UL> + +<P> +Moreover, we will use how grammar applications can be extended to +spoken language by generating <B>language models</B> for speech recognition +in various standard formats. +</P> +<A NAME="toc132"></A> +<H2>The portable grammar format</H2> +<P> +The portable format is called GFCC, "GF Canonical Compiled". A file +of this form can be produced from GF by the command +</P> +<PRE> + > print_multi -printer=gfcc | write_file FILE.gfcc +</PRE> +<P> +Files written in this format can also be imported in the GF system, +which recognizes the suffix <CODE>.gfcc</CODE> and builds the multilingual +grammar in memory. +</P> +<P> +<I>This applies to GF version 3 and upwards. Older GF used a format suffixed</I> +<CODE>.gfcm</CODE>. +<I>At the moment of writing, also the Java interpreter still uses the GFCM format.</I> +</P> +<P> +GFCC is, in fact, the recommended format in +which final grammar products are distributed, because they +are stripped from superfluous information and can be started and applied +faster than sets of separate modules. +</P> +<P> +Application programmers have never any need to read or modify GFCC files. +Also in this sense, they play the same role as machine code in +general-purpose programming. +</P> +<A NAME="toc133"></A> +<H2>The embedded interpreter and its API</H2> +<P> +The interpreter is a kind of a miniature GF system, which can parse and +linearize with grammars. But it can only perform a subset of the commands of +the GF system. For instance, it +cannot compile source grammars into the GFCC format; the compiler is the most +heavy-weight component of the GF system, and should not be carried around +in end-user applications. +Since GFCC is much +simpler than source GF, building an interpreter is relatively easy. +Full-scale interpreters currently exist in Haskell and Java, and partial +ones in C++, JavaScript, and Prolog. We will in this chapter focus +on Haskell, Java, and JavaScript. +</P> +<P> +Application programmers never need to read or modify the interpreter. +They only need to access it via its API. +</P> +<A NAME="toc134"></A> +<H2>Embedded GF applications in Haskell</H2> +<P> +Readers unfamiliar with Haskell, or who just want to program in Java, can safely +skip this section. Everything will be repeated in the corresponding Java +section. However, seeing the Haskell code may still be helpful because +Haskell is in many ways closer to GF than Java is. In particular, recursive +types of syntax trees and pattern matching over them are very similar in +Haskell and GF, +but require a complex encoding with classes and visitors in Java. +</P> +<A NAME="toc135"></A> +<H3>The EmbedAPI module</H3> +<P> +The Haskell API contains (among other things) the following types and functions: +</P> +<PRE> + module EmbedAPI where + + type MultiGrammar + type Language + type Category + type Tree + + file2grammar :: FilePath -> IO MultiGrammar + + linearize :: MultiGrammar -> Language -> Tree -> String + parse :: MultiGrammar -> Language -> Category -> String -> [Tree] + + linearizeAll :: MultiGrammar -> Tree -> [String] + linearizeAllLang :: MultiGrammar -> Tree -> [(Language,String)] + + parseAll :: MultiGrammar -> Category -> String -> [[Tree]] + parseAllLang :: MultiGrammar -> Category -> String -> [(Language,[Tree])] + + languages :: MultiGrammar -> [Language] + categories :: MultiGrammar -> [Category] + startCat :: MultiGrammar -> Category +</PRE> +<P> +This is the only module that needs to be imported in the Haskell application. +It is available as a part of the GF distribution, in the file +<CODE>src/GF/GFCC/API.hs</CODE>. +</P> +<A NAME="toc136"></A> +<H3>First application: a translator</H3> +<P> +Let us first build a stand-alone translator, which can translate +in any multilingual grammar between any languages in the grammar. +The whole code for this translator is here: +</P> +<PRE> + module Main where + + import GF.GFCC.API + import System (getArgs) + + main :: IO () + main = do + file:_ <- getArgs + gr <- file2grammar file + interact (translate gr) + + translate :: MultiGrammar -> String -> String + translate gr = case parseAllLang gr (startCat gr) s of + (lg,t:_):_ -> unlines [linearize gr l t | l <- languages gr, l /= lg] + _ -> "NO PARSE" +</PRE> +<P> +To run the translator, first compile it by +</P> +<PRE> + % ghc --make -o trans Translator.hs +</PRE> +<P> +Then produce a GFCC file. For instance, the <CODE>Food</CODE> grammar set can be +compiled as follows: +</P> +<PRE> + % gfc --make FoodEng.gf FoodIta.gf +</PRE> +<P> +This produces the file <CODE>Food.gfcc</CODE> (its name comes from the abstract syntax). +</P> +<P> +<I>The gfc batch compiler program is available in GF 3 and upwards.</I> +<I>In earlier versions, the appropriate command can be piped to gf:</I> +</P> +<PRE> + % echo "pm -printer=gfcc | wf Food.gfcc" | gf FoodEng.gf FoodIta.gf +</PRE> +<P> +Equivalently, the grammars could be read into GF shell and the <CODE>pm</CODE> command +issued from there. But the unix command has the advantage that it can +be put into a <CODE>Makefile</CODE> to automate the compilation of an application. +</P> +<P> +The Haskell library function <CODE>interact</CODE> makes the <CODE>trans</CODE> program work +like a Unix filter, which reads from standard input and writes to standard +output. Therefore it can be a part of a pipe and read and write files. +The simplest way to translate is to <CODE>echo</CODE> input to the program: +</P> +<PRE> + % echo "this wine is delicious" | ./trans Food.gfcc + questo vino è delizioso +</PRE> +<P> +The result is given in all languages except the input language. +</P> +<A NAME="toc137"></A> +<H3>A looping translator</H3> +<P> +If the user wants to translate many expressions in a sequence, it +is cumbersome to have to start the translator over and over again, +because reading the grammar and building the parser always takes +time. The translator of the previous section is easy to modify +to enable this: just change <CODE>interact</CODE> in the main function to +<CODE>loop</CODE>. It is not a standard Haskell function, so its definition has +to be included: +</P> +<PRE> + loop :: (String -> String) -> IO () + loop trans = do + s <- getLine + if s == "quit" then putStrLn "bye" else do + putStrLn $ trans s + loop trans +</PRE> +<P> +The loop keeps on translating line by line until the input line +is <CODE>quit</CODE>. +</P> +<A NAME="toc138"></A> +<H3>A question-answer system</H3> +<P> +<a name="secmathprogram"></a> +</P> +<P> +The next application is also a translator, but it adds a +<B>transfer</B> component to the grammar. Transfer is a function that +takes the input syntax tree into some other syntax tree, which is +then linearized and shown back to the user. The transfer function we +are going to use is one that computes a question into an answer. +The program accepts simple questions about arithmetic and answers +"yes" or "no" in the language in which the question was made: +</P> +<PRE> + Is 123 prime? + No. + 77 est impair ? + Oui. +</PRE> +<P> +The main change that is needed to the pure translator is to give +the type of <CODE>translate</CODE> an extra argument: a transfer function. +</P> +<PRE> + translate :: (Tree -> Tree) -> MultiGrammar -> String -> String +</PRE> +<P> +You can think of ordinary translation as a special case where +transfer is the identity function (<CODE>id</CODE> in Haskell). +</P> +<P> +Also the behaviour of returning the reply in different languages +should be changed so that the reply is returned in the <I>same</I> language. +Here is the complete definition of <CODE>translate</CODE> in the new form. +</P> +<PRE> + translate tr gr = case parseAllLang gr (startCat gr) s of + (lg,t:_):_ -> linearize gr lg (tr t) + _ -> "NO PARSE" +</PRE> +<P> +To complete the system, we have to define the transfer function. +So, how can we write a function from from abstract syntax trees +to abstract syntax trees? The embedded API does not make +the constructors of the type <CODE>Tree</CODE> available for users. Even if it did, it would +be quite complicated to use the type, and programs would be likely +to produce trees that are ill-typed in GF and therefore cannot +be linearized. +</P> +<A NAME="toc139"></A> +<H3>Exporting GF datatypes</H3> +<P> +The way to go in defining transfer is to use GF's tree constructors, that +is, the <CODE>fun</CODE> functions, as if they were Haskell's data constructors. +There is enough resemblance between GF and Haskell to make this possible +in most cases. It is even possible in Java, as we shall see later. +</P> +<P> +Thus every category of GF is translated into a Haskell datatype, where the +functions producing a value of that category are treated as constructors. +The translation is obtained by using the batch compiler with the command +</P> +<PRE> + % gfc -haskell Food.gfcc +</PRE> +<P> +It is also possible to produce the Haskell file together with GFCC, by +</P> +<PRE> + % gfc --make -haskell FoodEng.gf FoodIta.gf +</PRE> +<P> +The result is a file named <CODE>GSyntax.hs</CODE>, containing a +module named <CODE>GSyntax</CODE>. +</P> +<P> +<I>In GF before version 3, the same result is obtained from within GF, by the command</I> +</P> +<PRE> + > print_grammar -printer=gfcc_haskell | write_file GSyntax.hs +</PRE> +<P></P> +<P> +As an example, we take +the grammar we are going to use for queries. The abstract syntax is +</P> +<PRE> + abstract Math = { + + flags startcat = Question ; + + cat Answer ; Question ; Object ; + + fun + Even : Object -> Question ; + Odd : Object -> Question ; + Prime : Object -> Question ; + Number : Int -> Object ; + + Yes : Answer ; + No : Answer ; + } +</PRE> +<P> +It is translated to the following system of datatypes: +</P> +<PRE> + newtype GInt = GInt Integer + + data GAnswer = + GYes + | GNo + + data GObject = GNumber GInt + + data GQuestion = + GPrime GObject + | GOdd GObject + | GEven GObject +</PRE> +<P> +All type and constructor names are prefixed with a <CODE>G</CODE> to prevent clashes. +</P> +<P> +Now it is possible to define functions from and to these datatype, in Haskell. +Haskell's type checker guarantees that the functions are well-typed also with +respect to GF. Here is a question-to-answer function for this language: +</P> +<PRE> + answer :: GQuestion -> GAnswer + answer p = case p of + GOdd x -> test odd x + GEven x -> test even x + GPrime x -> test prime x + + value :: GObject -> Int + value e = case e of + GNumber (GInt i) -> fromInteger i + + test :: (Int -> Bool) -> GObject -> GAnswer + test f x = if f (value x) then GYes else GNo +</PRE> +<P> +So it is the function <CODE>answer</CODE> that we want to apply as transfer. +The only problem is the <I>type</I> of this function: the parsing and +linearization method of <CODE>API</CODE> work with <CODE>Tree</CODE>s and not +with <CODE>GQuestion</CODE>s and <CODE>GAnswers</CODE>. +</P> +<P> +Fortunately the Haskell translation of GF takes care of translating +between trees and the generated datatypes. This is done by using +a class with the required translation methods: +</P> +<PRE> + class Gf a where + gf :: a -> Tree + fg :: Tree -> a +</PRE> +<P> +The Haskell code generator also generates instances of these classes +for each datatype, for example, +</P> +<PRE> + instance Gf GQuestion where + gf (GEven x1) = DTr [] (AC (CId "Even")) [gf x1] + gf (GOdd x1) = DTr [] (AC (CId "Odd")) [gf x1] + gf (GPrime x1) = DTr [] (AC (CId "Prime")) [gf x1] + fg t = + case t of + DTr [] (AC (CId "Even")) [x1] -> GEven (fg x1) + DTr [] (AC (CId "Odd")) [x1] -> GOdd (fg x1) + DTr [] (AC (CId "Prime")) [x1] -> GPrime (fg x1) + _ -> error ("no Question " ++ show t) +</PRE> +<P> +Needless to say, <CODE>GSyntax</CODE> is a module that a programmer +never needs to look into, let alone change: it is enough to know that it +contains a systematic encoding and decoding between an abstract syntax +and Haskell datatypes, where +</P> +<UL> +<LI>all GF names are in Haskell prefixed with <CODE>G</CODE> +<LI><CODE>gf</CODE> translates from Haskell to GF +<LI><CODE>fg</CODE> translates from GF to Haskell +</UL> + +<A NAME="toc140"></A> +<H3>Putting it all together</H3> +<P> +Here is the complete code for the Haskell module <CODE>TransferLoop.hs</CODE>. +</P> +<PRE> + module Main where + + import GF.GFCC.API + import TransferDef (transfer) + + main :: IO () + main = do + gr <- file2grammar "Math.gfcc" + loop (translate transfer gr) + + loop :: (String -> String) -> IO () + loop trans = do + s <- getLine + if s == "quit" then putStrLn "bye" else do + putStrLn $ trans s + loop trans + + translate :: (Tree -> Tree) -> MultiGrammar -> String -> String + translate tr gr = case parseAllLang gr (startCat gr) s of + (lg,t:_):_ -> linearize gr lg (tr t) + _ -> "NO PARSE" +</PRE> +<P> +This is the <CODE>Main</CODE> module, which just needs a function <CODE>transfer</CODE> from +<CODE>TransferDef</CODE> in order to compile. In the current application, this module +looks as follows: +</P> +<PRE> + module TransferDef where + + import GF.GFCC.API (Tree) + import GSyntax + + transfer :: Tree -> Tree + transfer = gf . answer . fg + + answer :: GQuestion -> GAnswer + answer p = case p of + GOdd x -> test odd x + GEven x -> test even x + GPrime x -> test prime x + + value :: GObject -> Int + value e = case e of + GNumber (GInt i) -> fromInteger i + + test :: (Int -> Bool) -> GObject -> GAnswer + test f x = if f (value x) then GYes else GNo + + prime :: Int -> Bool + prime x = elem x primes where + primes = sieve [2 .. x] + sieve (p:xs) = p : sieve [ n | n <- xs, n `mod` p > 0 ] + sieve [] = [] +</PRE> +<P> +This module, in turn, needs <CODE>GSyntax</CODE> to compile, and the main module +needs <CODE>Math.gfcc</CODE> to run. To automate the production of the system, +we write a <CODE>Makefile</CODE> as follows: +</P> +<PRE> + all: + gfc --make -haskell MathEng.gf MathFre.gf + ghc --make -o ./math TransferLoop.hs + strip math +</PRE> +<P> +(Notice that the empty segments starting the command lines in a Makefile must be tabs.) +Now we can compile the whole system by just typing +</P> +<PRE> + make +</PRE> +<P> +Then you can run it by typing +</P> +<PRE> + ./math +</PRE> +<P> +Well --- you will of course need some concrete syntaxes of <CODE>Math</CODE> in order +to succeed. We have defined ours by using the resource functor design pattern, +as explained <a href="#secfunctor">here</a>. +</P> +<P> +Just to summarize, the source of the application consists of the following files: +</P> +<PRE> + Makefile -- a makefile + Math.gf -- abstract syntax + Math???.gf -- concrete syntaxes + TransferDef.hs -- definition of question-to-answer function + TransferLoop.hs -- Haskell Main module +</PRE> +<P></P> +<A NAME="toc141"></A> +<H2>Embedded GF applications in Java</H2> +<P> +When an API for GFCC in Java is available, +we will write the same applications in Java as +were written in Haskell above. Until then, we will +build another kind of application, which does not require +modification of generated Java code. +</P> +<P> +More information on embedded GF grammars in Java can be found in the document +</P> +<PRE> + www.cs.chalmers.se/~bringert/gf/gf-java.html +</PRE> +<P> +by Björn Bringert. +</P> +<A NAME="toc142"></A> +<H3>Translets</H3> +<P> +A Java system needs many more files than a Haskell system. +To get started, one can fetch the package <CODE>gfc2java</CODE> from +</P> +<PRE> + www.cs.chalmers.se/~bringert/darcs/gfc2java/ +</PRE> +<P> +by using the Darcs version control system as described in the <CODE>gf-java</CODE> home page. +</P> +<P> +The <CODE>gfc2java</CODE> package contains a script <CODE>build-translet</CODE>, which can be applied +to any <CODE>.gfcm</CODE> file to create a <B>translet</B>, a small translation GUI. Foor the <CODE>Food</CODE> +grammars of <a href="#chapthree">the third chapter</a>, we first create a file <CODE>food.gfcm</CODE> by +</P> +<PRE> + % echo "pm | wf food.gfcm" | gf FoodEng.gf FoodIta.gf +</PRE> +<P> +and then run +</P> +<PRE> + % build_translet food.gfcm +</PRE> +<P> +The resulting file <CODE>translate-food.jar</CODE> can be run with +</P> +<PRE> + % java -jar translate-food.jar +</PRE> +<P> +The translet looks like this: +</P> +<P> + <IMG ALIGN="right" SRC="food-translet.png" BORDER="0" ALT=""> +</P> +<A NAME="toc143"></A> +<H3>Dialogue systems</H3> +<P> +A question-answer system is a special case of a <B>dialogue system</B>, where the user and +the computer communicate by writing or, even more properly, by speech. The <CODE>gf-java</CODE> +homepage provides an example of a most simple dialogue system imaginable, where two +the conversation has just two rules: +</P> +<UL> +<LI>if the user says <I>here you go</I>, the system says <I>thanks</I> +<LI>if the user says <I>thanks</I>, the system says <I>you are welcome</I> +</UL> + +<P> +The conversation can be made in both English and Swedish; the user's initiative +decides which language the system replies in. Thus the structure is very similar +to the <CODE>math</CODE> program <a href="#secmathprogram">here</a>. The GF and +Java sources of the program can be +found in +</P> +<PRE> + www.cs.chalmers.se/~bringert/darcs/simpledemo +</PRE> +<P> +again accessible with the Darcs version control system. +</P> +<A NAME="toc144"></A> +<H2>Language models for speech recognition</H2> +<P> +The standard way of using GF in speech recognition is by building +<B>grammar-based language models</B>. To this end, GF comes with compilers +into several formats that are used in speech recognition systems. +One such format is GSL, used in the <A HREF="http://www.nuance.com">Nuance speech recognizer</A>. +It is produced from GF simply by printing a grammar with the flag +<CODE>-printer=gsl</CODE>. The following example uses the smart house grammar defined +<a href="#secsmarthouse">here</a>. +</P> +<PRE> + > import -conversion=finite SmartEng.gf + > print_grammar -printer=gsl + + ;GSL2.0 + ; Nuance speech recognition grammar for SmartEng + ; Generated by GF + + .MAIN SmartEng_2 + + SmartEng_0 [("switch" "off") ("switch" "on")] + SmartEng_1 ["dim" ("switch" "off") + ("switch" "on")] + SmartEng_2 [(SmartEng_0 SmartEng_3) + (SmartEng_1 SmartEng_4)] + SmartEng_3 ("the" SmartEng_5) + SmartEng_4 ("the" SmartEng_6) + SmartEng_5 "fan" + SmartEng_6 "light" +</PRE> +<P> +Other formats available via the <CODE>-printer</CODE> flag include: +</P> +<TABLE ALIGN="center" CELLPADDING="4" BORDER="1"> +<TR> +<TH>Format</TH> +<TH COLSPAN="2">Description</TH> +</TR> +<TR> +<TD><CODE>gsl</CODE></TD> +<TD>Nuance GSL speech recognition grammar</TD> +</TR> +<TR> +<TD><CODE>jsgf</CODE></TD> +<TD>Java Speech Grammar Format (JSGF)</TD> +</TR> +<TR> +<TD><CODE>jsgf_sisr_old</CODE></TD> +<TD>JSGF with semantic tags in SISR WD 20030401 format</TD> +</TR> +<TR> +<TD><CODE>srgs_abnf</CODE></TD> +<TD>SRGS ABNF format</TD> +</TR> +<TR> +<TD><CODE>srgs_xml</CODE></TD> +<TD>SRGS XML format</TD> +</TR> +<TR> +<TD><CODE>srgs_xml_prob</CODE></TD> +<TD>SRGS XML format, with weights</TD> +</TR> +<TR> +<TD><CODE>slf</CODE></TD> +<TD>finite automaton in the HTK SLF format</TD> +</TR> +<TR> +<TD><CODE>slf_sub</CODE></TD> +<TD>finite automaton with sub-automata in HTK SLF</TD> +</TR> +</TABLE> + +<P></P> +<P> +All currently available formats can be seen in gf with <CODE>help -printer</CODE>. +</P> +<A NAME="toc145"></A> +<H2>Dependent types and spoken language models</H2> +<P> +We have used dependent types to control semantic well-formedness +in grammars. This is important in traditional type theory +applications such as proof assistants, where only mathematically +meaningful formulas should be constructed. But semantic filtering has +also proved important in speech recognition, because it reduces the +ambiguity of the results. +</P> +<P> +Now, GSL is a context-free format, so how does it cope with dependent types? +In general, dependent types can give rise to infinitely many basic types +(exercise!), whereas a context-free grammar can by definition only have +finitely many nonterminals. +</P> +<P> +This is where the flag <CODE>-conversion=finite</CODE> is needed in the <CODE>import</CODE> +command. Its effect is to convert a GF grammar with dependent types to +one without, so that each instance of a dependent type is replaced by +an atomic type. This can then be used as a nonterminal in a context-free +grammar. The <CODE>finite</CODE> conversion presupposes that every +dependent type has only finitely many instances, which is in fact +the case in the <CODE>Smart</CODE> grammar. +</P> +<P> +<B>Exercise</B>. If you have access to the Nuance speech recognizer, +test it with GF-generated language models for <CODE>SmartEng</CODE>. Do this +both with and without <CODE>-conversion=finite</CODE>. +</P> +<P> +<B>Exercise</B>. Construct an abstract syntax with infinitely many instances +of dependent types. +</P> +<A NAME="toc146"></A> +<H3>Statistical language models</H3> +<P> +An alternative to grammar-based language models are +<B>statistical language models</B> (<B>SLM</B>s). An SLM is +built from a <B>corpus</B>, i.e. a set of utterances. It specifies the +probability of each <B>n-gram</B>, i.e. sequence of <I>n</I> words. The +typical value of <I>n</I> is 2 (bigrams) or 3 (trigrams). +</P> +<P> +One advantage of SLMs over grammar-based models is that they are +<B>robust</B>, i.e. they can be used to recognize sequences that would +be out of the grammar or the corpus. Another advantage is that +an SLM can be built "for free" if a corpus is available. +</P> +<P> +However, collecting a corpus can require a lot of work, and writing +a grammar can be less demanding, especially with tools such as GF or +Regulus. This advantage of grammars can be combined with robustness +by creating a back-up SLM from a <B>synthesized corpus</B>. This means +simply that the grammar is used for generating such a corpus. +In GF, this can be done with the <CODE>generate_trees</CODE> command. +As with grammar-based models, the quality of the SLM is better +if meaningless utterances are excluded from the corpus. Thus +a good way to generate an SLM from a GF grammar is by using +dependent types and filter the results through the type checker: +</P> +<PRE> + > generate_trees | put_trees -transform=solve | linearize +</PRE> +<P> +The method of creating statistical language model from corpora synthesized +from GF grammars is applied and evaluated in (Jonson 2006). +</P> +<P> +<B>Exercise</B>. Measure the size of the corpus generated from +<CODE>SmartEng</CODE> (defined <a href="#secsmarthouse">here</a>), with and without type checker filtering. +</P> + +<!-- html code generated by txt2tags 2.3 (http://txt2tags.sf.net) --> +<!-- cmdline: txt2tags -thtml -\-toc gf-tutorial.txt --> +</BODY></HTML> |
