moved parts of doc to deprecated/doc

author: aarne <aarne@chalmers.se> 2010-12-22 14:08:42 +0000
committer: aarne <aarne@chalmers.se> 2010-12-22 14:08:42 +0000
commit: ce15ec7b787479ca4c7295863ea7fa5cfdd16755 (patch)
tree: f47d9227ab535781d44d00e6232b8c62902167df /deprecated/doc/compiling-gf.txt
parent: fb722fe8e2cedee3b42d7fb0c9da61ace74f3e22 (diff)
1 files changed, 750 insertions, 0 deletions
diff --git a/deprecated/doc/compiling-gf.txt b/deprecated/doc/compiling-gf.txt
new file mode 100644
index 000000000..9e438f40f
--- /dev/null
+++ b/deprecated/doc/compiling-gf.txt
@@ -0,0 +1,750 @@
+Compiling GF
+Aarne Ranta
+Proglog meeting, 1 November 2006
+
+% to compile: txt2tags -thtml compiling-gf.txt ; htmls compiling-gf.html
+
+%!target:html
+%!postproc(html): #NEW <!-- NEW -->
+
+#NEW
+
+==The compilation task==
+
+GF is a grammar formalism, i.e. a special purpose programming language
+for writing grammars.
+
+Other grammar formalisms: 
+- BNF, YACC, Happy (grammars for programming languages); 
+- PATR, HPSG, LFG (grammars for natural languages).
+
+
+The grammar compiler prepares a GF grammar for two computational tasks:
+- linearization: take syntax trees to strings
+- parsing: take strings to syntax trees
+
+
+The grammar gives a declarative description of these functionalities,
+on a high abstraction level that improves grammar writing
+productivity.
+
+For efficiency, the grammar is compiled to lower-level formats.
+
+Type checking is another essential compilation phase. Its purpose is
+twofold, as usual:
+- checking the correctness of the grammar
+- type-annotating expressions for code generation
+
+
+#NEW
+
+==Characteristics of GF language==
+
+Functional language with types, both built-in and user-defined.
+```
+  Str : Type
+
+  param Number = Sg | Pl
+
+  param AdjForm = ASg Gender | APl
+
+  Noun : Type = {s : Number => Str ; g : Gender}
+```
+Pattern matching.
+```
+  svart_A = table {
+    ASg _ => "svart" ;
+    _ => "svarta"
+    }
+```
+Higher-order functions.
+
+Dependent types.
+```
+  flip : (a, b, c : Type) -> (a -> b -> c) -> b -> a -> c = 
+    \_,_,_,f,y,x -> f x y ;
+```
+
+
+#NEW
+
+==The module system of GF==
+
+Main division: abstract syntax and concrete syntax
+```
+  abstract Greeting = {
+    cat Greet ;
+    fun Hello : Greet ;
+  }
+
+  concrete GreetingEng of Greeting = {
+    lincat Greet = {s : Str} ;
+    lin Hello = {s = "hello"} ;
+  }
+
+  concrete GreetingIta of Greeting = {
+    param Politeness = Familiar | Polite ;
+    lincat Greet = {s : Politeness => Str} ;
+    lin Hello = {s = table {
+      Familiar => "ciao" ;
+      Polite => "buongiorno"
+    } ;
+  }
+```
+Other features of the module system:
+- extension and opening
+- parametrized modules (cf. ML: signatures, structures, functors)
+
+
+
+
+#NEW
+
+==GF vs. Haskell==
+
+Some things that (standard) Haskell hasn't:
+- records and record subtyping
+- regular expression patterns
+- dependent types
+- ML-style modules 
+
+
+Some things that GF hasn't:
+- infinite (recursive) data types
+- recursive functions
+- classes, polymorphism
+
+
+#NEW
+
+==GF vs. most linguistic grammar formalisms==
+
+GF separates abstract syntax from concrete syntax.
+
+GF has a module system with separate compilation.
+
+GF is generation-oriented (as opposed to parsing).
+
+GF has unidirectional matching (as opposed to unification).
+
+GF has a static type system (as opposed to a type-free universe).
+
+"I was - and I still am - firmly convinced that a program composed
+out of statically type-checked parts is more likely to faithfully
+express a well-thought-out design than a program relying on
+weakly-typed interfaces or dynamically-checked interfaces."
+(B. Stroustrup, 1994, p. 107)
+
+
+
+#NEW
+
+==The computation model: abstract syntax==
+
+An abstract syntax defines a free algebra of trees (using
+dependent types, recursion, higher-order abstract syntax: 
+GF includes a complete Logical Framework).
+```
+  cat C (x_1 : A_1)...(x_n : A_n)  
+  a_1 : A_1 
+  ... 
+  a_n : A_n{x_1 : A_1,...,x_n-1 : A_n-1}
+  ----------------------------------------------------
+  (C a_1 ... a_n) : Type 
+
+
+  fun f : (x_1 : A_1) -> ... -> (x_n : A_n) -> A
+  a_1 : A_1 
+  ... 
+  a_n : A_n{x_1 : A_1,...,x_n-1 : A_n-1}
+  ----------------------------------------------------
+  (f a_1 ... a_n) : A{x_1 : A_1,...,x_n : A_n}
+
+
+ A : Type  x : A |- B : Type       x : A |- b : B        f : (x : A) -> B   a : A
+ ----------------------------   ----------------------   ------------------------
+     (x : A) -> B : Type        \x -> b : (x : A) -> B       f a : B{x := A}
+```
+Notice that all syntax trees are in eta-long form.
+
+
+#NEW
+
+==The computation model: concrete syntax==
+
+A concrete syntax defines a homomorphism (compositional mapping)
+from the abstract syntax to a system of concrete syntax objects.
+```
+  cat C _
+  --------------------
+  lincat C = C* : Type
+
+  fun f : (x_1 : A_1) -> ... -> (x_n : A_n) -> A
+  -----------------------------------------------
+  lin f = f* : A_1* -> ... -> A_n* -> A*
+
+  (f a_1 ... a_n)*  =  f* a_1* ... a_n*
+```
+The homomorphism can as such be used as linearization function.
+
+It is a functional program, but a restricted one, since it works 
+in the end on finite data structures only.
+
+But a more efficient program is obtained via compilation to 
+GFC = Canonical GF: the "machine code" of GF.
+
+The parsing problem of GFC can be reduced to that of MPCFG (Multiple
+Parallel Context Free Grammars), see P. Ljunglöf's thesis (2004).
+
+
+
+#NEW
+
+==The core type system of concrete syntax: basic types==
+
+```
+                   param P      P : PType
+  PType : Type    ---------     ---------
+                  P : PType      P : Type
+
+                                          s : Str  t : Str
+  Str : type     "foo" : Str   [] : Str   ----------------
+                                            s ++ t : Str
+```
+
+
+#NEW
+
+==The core type system of concrete syntax: functions and tables==
+
+```
+ A : Type  x : A |- B : Type       x : A |- b : B        f : (x : A) -> B   a : A
+ ----------------------------   ----------------------   ------------------------
+     (x : A) -> B : Type        \x -> b : (x : A) -> B       f a : B{x := A}
+
+
+        P : PType   A : Type      t : P => A  p : p
+        --------------------      -----------------
+            P => A : Type            t ! p : A
+
+                   v_1,...,v_n : A
+       ---------------------------------------------- P = {C_1,...,C_n}
+       table {C_1 => v_1 ; ... ; C_n => v_n} : P => A 
+```
+Pattern matching is treated as an abbreviation for tables. Notice that
+```
+  case e of {...}  ==  table {...} ! e
+```
+
+
+#NEW
+
+==The core type system of concrete syntax: records==
+
+```
+              A_1,...,A_n : Type
+     ------------------------------------ n >= 0
+     {r_1 : A_1 ; ... ; r_n : A_n} : Type
+
+
+                 a_1 : A_1  ...  a_n : A_n
+     ------------------------------------------------------------
+     {r_1 = a_1 ; ... ; r_n = a_n} : {r_1 : A_1 ; ... ; r_n : A_n}
+
+
+             r : {r_1 : A_1 ; ... ; r_n : A_n}
+            ----------------------------------- i = 1,...,n
+                          r.r_1 : A_1
+```
+Subtyping: if ``r : R`` then ``r : R ** {r : A}``
+
+
+
+#NEW
+
+==Computation rules==
+
+```
+  (\x -> b) a = b{x := a}
+
+  (table {C_1 => v_1 ; ... ; C_n => v_n} : P => A) ! C_i  =  v_i
+
+  {r_1 = a_1 ; ... ; r_n = a_n}.r_i  =  a_i
+```
+
+
+
+#NEW
+
+==Canonical GF==
+
+Concrete syntax type system:
+```
+                             A_1 : Type ... A_n : Type
+  Str : Type   Int : Type    -------------------------   $i : A
+                               [A_1, ..., A_n] : Type
+
+
+     a_1 : A_1  ...  a_n : A_n         t : [A_1, ..., A_n]
+  ---------------------------------    ------------------- i = 1,..,n
+  [a_1, ..., a_n] : [A_1, ..., A_n]        t ! i : A_i 
+```
+Tuples represent both records and tables.
+
+There are no functions.
+
+Linearization:
+```
+  lin f = f*
+
+  (f a_1 ... a_n)*  =  f*{$1 = a_1*, ..., $n = a_n*} 
+```
+
+
+#NEW
+
+==The compilation task, again==
+
+1. From a GF source grammar, derive a canonical GF grammar.
+
+2. From the canonical GF grammar derive an MPCFG grammar
+
+The canonical GF grammar can be used for linearization, with
+linear time complexity (w.r.t. the size of the tree).
+
+The MPCFG grammar can be used for parsing, with (unbounded)
+polynomial time complexity (w.r.t. the size of the string).
+
+For these target formats, we have also built interpreters in
+different programming languages (C, C++, Haskell, Java, Prolog).
+
+Moreover, we generate supplementary formats such as grammars
+required by various speech recognition systems.
+
+
+#NEW
+
+==An overview of compilation phases==
+
+Legend:
+- ellipse node: representation saved in a file
+- plain text node: internal representation
+- solid arrow or ellipse: essential phare or format
+- dashed arrow or ellipse: optional phase or format
+- arrow label: the module implementing the phase
+
+
+[gf-compiler.png]
+
+
+#NEW
+
+==Using the compiler==
+
+Batch mode (cf. GHC).
+
+Interactive mode, building the grammar incrementally from
+different files, with the possibility of testing them
+(cf. GHCI).
+
+The interactive mode was first, built on the model of ALF-2
+(L. Magnusson), and there was no file output of compiled
+grammars.
+
+
+#NEW
+
+==Modules and separate compilation==
+
+The above diagram shows what happens to each module.
+(But not quite, since some of the back-end formats must be
+built for sets of modules: GFCC and the parser formats.)
+
+When the grammar compiler is called, it has a main module as its
+argument. It then builds recursively a dependency graph with all
+the other modules, and decides which ones must be recompiled.
+The behaviour is rather similar to GHC.
+
+Separate compilation is //extremely important// when developing
+big grammars, especially when using grammar libraries. Example: compiling
+the GF resource grammar library takes 5 minutes, whereas reading
+in the compiled image takes 10 seconds.
+
+
+#NEW
+
+==Module dependencies and recompilation==
+
+(For later use, not for the Proglog talk)
+
+For each module M, there are 3 kinds of files:
+- M.gf, source file
+- M.gfc, compiled file ("object file")
+- M.gfr, type-checked and optimized source file (for resource modules only)
+
+
+The compiler reads gf files and writes gfc files (and gfr files if appropriate)
+
+The Main module is the one used as argument when calling GF.
+
+A module M (immediately) depends on the module K, if either
+- M is a concrete of K
+- M is an instance of K
+- M extends K
+- M opens K
+- M is a completion of K with something
+- M is a completion of some module with K instantiated with something
+
+
+A module M (transitively) depends on the module K, if either
+- M immediately depends on K
+- M depends on some L such that L immediately depends on K
+
+
+Immediate dependence is readable from the module header without parsing
+the whole module.
+
+The compiler reads recursively the headers of all modules that Main depends on.
+
+These modules are arranged in a dependency graph, which is checked to be acyclic.
+
+To decide whether a module M has to be compiled, do:
++ Get the time stamps t() of M.gf and M.gfc (if a file doesn't exist, its
+  time is minus infinity).
++ If t(M.gf) > t(M.gfc), M must be compiled.
++ If M depends on K and K must be compiled, then M must be compiled.
++ If M depends on K and t(K.gf) > t(M.gfc), then M must be compiled.
+
+
+Decorate the dependency graph by information on whether the gf or the gfc (and gfr)
+format is to be read.
+
+Topologically sort the decorated graph, and read each file in the chosen format.
+
+The gfr file is generated for these module types only:
+- resource
+- instance
+
+
+When reading K.gfc, also K.gfr is read if some M depending on K has to be compiled.
+In other cases, it is enough to read K.gfc.
+
+In an interactive GF session, some modules may be in memory already.
+When read to the memory, each module M is given time stamp t(M.m).
+The additional rule now is:
+- If M.gfc is to be read, and t(M.m) > t(M.gfc), don't read M.gfc.
+
+
+
+
+#NEW
+
+==Techniques used==
+
+The compiler is written in Haskell, with some C foreign function calls
+in the interactive version (readline, killing threads).
+
+BNFC is used for generating both the parsers and printers.
+This has helped to make the formats portable.
+
+"Almost compositional functions" (``composOp``) are used in
+many compiler passes, making them easier to write and understand. 
+A ``grep`` on the sources reveals 40 uses (outside the definition 
+of ``composOp`` itself).
+
+The key algorithmic ideas are
+- type-driven partial evaluation in GF-to-GFC generation
+- common subexpression elimination as back-end optimization
+- some ideas in GFC-to-MCFG encoding
+
+
+#NEW
+
+==Type-driven partial evaluation==
+
+Each abstract syntax category in GF has a corresponding linearization type:
+```
+  cat C
+  lincat C = T
+```
+The general form of a GF rule pair is
+```
+  fun f : C1 -> ... -> Cn -> C
+  lin f = t
+```
+with the typing condition following the ``lincat`` definitions
+```
+  t : T1 -> ... -> Tn -> T
+```
+The term ``t`` is in general built by using abstraction methods such
+as pattern matching, higher-order functions, local definitions,
+and library functions.
+
+The compilation technique proceeds as follows:
+- use eta-expansion on ``t`` to determine the canonical form of the term
+```
+  \ $C1, ...., $Cn -> (t $C1 .... $Cn)
+```
+with unique variables ``$C1 .... $Cn`` for the arguments; repeat this
+inside the term for records and tables
+- evaluate the resulting term using the computation rules of GF
+- what remains is a canonical term with ``$C1 .... $Cn`` the only
+variables (the run-time input of the linearization function)
+
+
+#NEW
+
+==Eta-expanding records and tables==
+
+For records that are valied via subtyping, eta expansion 
+eliminates superfluous fields:
+```
+  {r1 = t1 ; r2 = t2} : {r1 : T1}  ---->  {r1 = t1}
+```
+For tables, the effect is always expansion, since
+pattern matching can be used to represent tables
+compactly:
+```
+  table {n => "fish"} : Number => Str   --->
+
+  table {
+    Sg => "fish" ;
+    Pl => "fish"
+    }
+```
+This can be helped by back-end optimizations (see below).
+
+
+#NEW
+
+==Eliminating functions==
+
+"Everything is finite": parameter types, records, tables;
+finite number of string tokens per grammar.
+
+But "inifinite types" such as function types are useful when
+writing grammars, to enable abstractions.
+
+Since function types do not appear in linearization types,
+we want functions to be eliminated from linearization terms.
+
+This is similar to the **subformula property** in logic.
+Also the main problem is similar: function depending on
+a run-time variable,
+```
+  (table {P => f ; Q = g} ! x) a
+```
+This is not a redex, but we can make it closer to one by moving
+the application inside the table,
+```
+  table {P => f a ; Q = g a} ! x
+```
+This transformation is the same as Prawitz's (1965) elimination
+of maximal segments in natural deduction:
+```
+                                            A           B
+                                          C -> D  C   C -> D  C
+           A       B                      ---------   ---------
+  A v B  C -> D  C -> D            A v B       D          D
+  ---------------------     ===>   -------------------------
+      C -> D             C                    D
+      --------------------
+              D
+```
+
+
+
+#NEW
+
+==Size effects of partial evaluation==
+
+Irrelevant table branches are thrown away, which can reduce the size.
+
+But, since tables are expanded and auxiliary functions are inlined,
+the size can grow exponentially.
+
+How can we keep the first property and eliminate the second?
+
+
+#NEW
+
+==Parametrization of tables==
+
+Algorithm: for each branch in a table, consider replacing the
+argument by a variable:
+```
+  table {             table {
+    P => t ;    --->    x => t[P->x] ;
+    Q => u              x => u[Q->x]
+    }                   }
+```
+If the resulting branches are all equal, you can replace the table
+by a lambda abstract
+```
+  \\x => t[P->x]
+```
+If each created variable ``x`` is unique in the grammar, computation
+with the lambda abstract is efficient.
+
+
+
+#NEW
+
+==Course-of-values tables==
+
+By maintaining a canonical order of parameters in a type, we can
+eliminate the left hand sides of branches.
+```
+  table {              table T [
+    P => t ;    --->     t ;
+    Q => u               u
+    }                    ]
+```
+The treatment is similar to ``Enum`` instances in Haskell.
+
+In the end, all parameter types can be translated to
+initial segments of integers.
+
+
+#NEW
+
+==Common subexpression elimination==
+
+Algorithm: 
++ Go through all terms and subterms in a module, creating
+  a symbol table mapping terms to the number of occurrences.
++ For each subterm appearing at least twice, create a fresh
+  constant defined as that subterm.
++ Go through all rules (incl. rules for the new constants),
+  replacing largest possible subterms with such new constants.
+
+
+This algorithm, in a way, creates the strongest possible abstractions.
+
+In general, the new constants have open terms as definitions.
+But since all variables (and constants) are unique, they can
+be computed by simple replacement.
+
+
+
+#NEW
+
+==Size effects of optimizations==
+
+Example: the German resource grammar
+``LangGer``
+
+|| optimization |  lines |  characters |  size % | blow-up |
+| none       |  5394 |  3208435 |   100 |  25 |
+| all        |  5394 |   750277 |    23 |   6 |
+| none_subs  |  5772 |  1290866 |    40 |  10 |
+| all_subs   |  5644 |   414119 |    13 |   3 |
+| gfcc       |  3279 |   190004 |     6 |   1.5 |
+| gf source  |  3976 |   121939 |     4 |   1 |
+
+
+Optimization "all" means parametrization + course-of-values.
+
+The source code size is an estimate, since it includes
+potentially irrelevant library modules, and comments.
+
+The GFCC format is not reusable in separate compilation.
+
+
+
+#NEW
+
+==The shared prefix optimization==
+
+This is currently performed in GFCC only.
+
+The idea works for languages that have a rich morphology
+based on suffixes. Then we can replace a course of values
+with a pair of a prefix and a suffix set:
+```
+  ["apa", "apan", "apor", "aporna"] ---> 
+  ("ap" + ["a", "an", "or", "orna"])
+```
+The real gain comes via common subexpression elimination:
+```
+  _34 = ["a", "an", "or", "orna"]
+  apa = ("ap" + _34)
+  blomma = ("blomm" + _34)
+  flicka = ("flick" + _34)
+```
+Notice that it now matters a lot how grammars are written.
+For instance, if German verbs are treated as a one-dimensional
+table,
+```
+  ["lieben", "liebe", "liebst", ...., "geliebt", "geliebter",...]
+```
+no shared prefix optimization is possible. A better form is
+separate tables for non-"ge" and "ge" forms:
+```
+  [["lieben", "liebe", "liebst", ....], ["geliebt", "geliebter",...]]
+```
+
+
+#NEW
+
+==Reuse of grammars as libraries==
+
+The idea of resource grammars: take care of all aspects of
+surface grammaticality (inflection, agreement, word order).
+
+Reuse in application grammar: via translations
+```
+  cat C          --->  oper C : Type = T
+  lincat C = T 
+
+  fun f : A      --->  oper f : A* = t
+  lin f = t 
+```
+The user only needs to know the type signatures (abstract syntax).
+
+However, this does not quite guarantee grammaticality, because
+different categories can have the same lincat:
+```
+  lincat Conj = {s : Str}
+  lincat Adv  = {s : Str}
+```
+Thus someone may by accident use "and" as an adverb!
+
+
+#NEW
+
+==Forcing the type checker to act as a grammar checker==
+
+We just have to make linearization types unique for each category.
+
+The technique is reminiscent of Haskell's ``newtype`` but uses
+records instead: we add **lock fields** e.g.
+```
+  lincat Conj = {s : Str ; lock_Conj : {}}
+  lincat Adv  = {s : Str ; lock_Adv  : {}}
+```
+Thanks to record subtyping, the translation is simple:
+```
+  fun f : C1 -> ... -> Cn -> C        
+  lin f = t
+
+                  --->
+
+  oper f : C1* -> ... -> Cn* -> C* = 
+    \x1,...,xn -> (t x1 ... xn) ** {lock_C = {}}
+```
+
+#NEW
+
+==Things to do==
+
+Better compression of gfc file format.
+
+Type checking of dependent-type pattern matching in abstract syntax.
+
+Compilation-related modules that need rewriting
+- ``ReadFiles``: clarify the logic of dependencies
+- ``Compile``: clarify the logic of what to do with each module
+- ``Compute``: make the evaluation more efficient
+- ``Parsing/*``, ``OldParsing/*``, ``Conversion/*``: reduce the number
+  of parser formats and algorithms
author	aarne <aarne@chalmers.se>	2010-12-22 14:08:42 +0000
committer	aarne <aarne@chalmers.se>	2010-12-22 14:08:42 +0000
commit	ce15ec7b787479ca4c7295863ea7fa5cfdd16755 (patch)
tree	f47d9227ab535781d44d00e6232b8c62902167df /deprecated/doc/compiling-gf.txt
parent	fb722fe8e2cedee3b42d7fb0c9da61ace74f3e22 (diff)