summaryrefslogtreecommitdiff
path: root/deprecated/doc/compiling-gf.txt
diff options
context:
space:
mode:
Diffstat (limited to 'deprecated/doc/compiling-gf.txt')
-rw-r--r--deprecated/doc/compiling-gf.txt750
1 files changed, 750 insertions, 0 deletions
diff --git a/deprecated/doc/compiling-gf.txt b/deprecated/doc/compiling-gf.txt
new file mode 100644
index 000000000..9e438f40f
--- /dev/null
+++ b/deprecated/doc/compiling-gf.txt
@@ -0,0 +1,750 @@
+Compiling GF
+Aarne Ranta
+Proglog meeting, 1 November 2006
+
+% to compile: txt2tags -thtml compiling-gf.txt ; htmls compiling-gf.html
+
+%!target:html
+%!postproc(html): #NEW <!-- NEW -->
+
+#NEW
+
+==The compilation task==
+
+GF is a grammar formalism, i.e. a special purpose programming language
+for writing grammars.
+
+Other grammar formalisms:
+- BNF, YACC, Happy (grammars for programming languages);
+- PATR, HPSG, LFG (grammars for natural languages).
+
+
+The grammar compiler prepares a GF grammar for two computational tasks:
+- linearization: take syntax trees to strings
+- parsing: take strings to syntax trees
+
+
+The grammar gives a declarative description of these functionalities,
+on a high abstraction level that improves grammar writing
+productivity.
+
+For efficiency, the grammar is compiled to lower-level formats.
+
+Type checking is another essential compilation phase. Its purpose is
+twofold, as usual:
+- checking the correctness of the grammar
+- type-annotating expressions for code generation
+
+
+#NEW
+
+==Characteristics of GF language==
+
+Functional language with types, both built-in and user-defined.
+```
+ Str : Type
+
+ param Number = Sg | Pl
+
+ param AdjForm = ASg Gender | APl
+
+ Noun : Type = {s : Number => Str ; g : Gender}
+```
+Pattern matching.
+```
+ svart_A = table {
+ ASg _ => "svart" ;
+ _ => "svarta"
+ }
+```
+Higher-order functions.
+
+Dependent types.
+```
+ flip : (a, b, c : Type) -> (a -> b -> c) -> b -> a -> c =
+ \_,_,_,f,y,x -> f x y ;
+```
+
+
+#NEW
+
+==The module system of GF==
+
+Main division: abstract syntax and concrete syntax
+```
+ abstract Greeting = {
+ cat Greet ;
+ fun Hello : Greet ;
+ }
+
+ concrete GreetingEng of Greeting = {
+ lincat Greet = {s : Str} ;
+ lin Hello = {s = "hello"} ;
+ }
+
+ concrete GreetingIta of Greeting = {
+ param Politeness = Familiar | Polite ;
+ lincat Greet = {s : Politeness => Str} ;
+ lin Hello = {s = table {
+ Familiar => "ciao" ;
+ Polite => "buongiorno"
+ } ;
+ }
+```
+Other features of the module system:
+- extension and opening
+- parametrized modules (cf. ML: signatures, structures, functors)
+
+
+
+
+#NEW
+
+==GF vs. Haskell==
+
+Some things that (standard) Haskell hasn't:
+- records and record subtyping
+- regular expression patterns
+- dependent types
+- ML-style modules
+
+
+Some things that GF hasn't:
+- infinite (recursive) data types
+- recursive functions
+- classes, polymorphism
+
+
+#NEW
+
+==GF vs. most linguistic grammar formalisms==
+
+GF separates abstract syntax from concrete syntax.
+
+GF has a module system with separate compilation.
+
+GF is generation-oriented (as opposed to parsing).
+
+GF has unidirectional matching (as opposed to unification).
+
+GF has a static type system (as opposed to a type-free universe).
+
+"I was - and I still am - firmly convinced that a program composed
+out of statically type-checked parts is more likely to faithfully
+express a well-thought-out design than a program relying on
+weakly-typed interfaces or dynamically-checked interfaces."
+(B. Stroustrup, 1994, p. 107)
+
+
+
+#NEW
+
+==The computation model: abstract syntax==
+
+An abstract syntax defines a free algebra of trees (using
+dependent types, recursion, higher-order abstract syntax:
+GF includes a complete Logical Framework).
+```
+ cat C (x_1 : A_1)...(x_n : A_n)
+ a_1 : A_1
+ ...
+ a_n : A_n{x_1 : A_1,...,x_n-1 : A_n-1}
+ ----------------------------------------------------
+ (C a_1 ... a_n) : Type
+
+
+ fun f : (x_1 : A_1) -> ... -> (x_n : A_n) -> A
+ a_1 : A_1
+ ...
+ a_n : A_n{x_1 : A_1,...,x_n-1 : A_n-1}
+ ----------------------------------------------------
+ (f a_1 ... a_n) : A{x_1 : A_1,...,x_n : A_n}
+
+
+ A : Type x : A |- B : Type x : A |- b : B f : (x : A) -> B a : A
+ ---------------------------- ---------------------- ------------------------
+ (x : A) -> B : Type \x -> b : (x : A) -> B f a : B{x := A}
+```
+Notice that all syntax trees are in eta-long form.
+
+
+#NEW
+
+==The computation model: concrete syntax==
+
+A concrete syntax defines a homomorphism (compositional mapping)
+from the abstract syntax to a system of concrete syntax objects.
+```
+ cat C _
+ --------------------
+ lincat C = C* : Type
+
+ fun f : (x_1 : A_1) -> ... -> (x_n : A_n) -> A
+ -----------------------------------------------
+ lin f = f* : A_1* -> ... -> A_n* -> A*
+
+ (f a_1 ... a_n)* = f* a_1* ... a_n*
+```
+The homomorphism can as such be used as linearization function.
+
+It is a functional program, but a restricted one, since it works
+in the end on finite data structures only.
+
+But a more efficient program is obtained via compilation to
+GFC = Canonical GF: the "machine code" of GF.
+
+The parsing problem of GFC can be reduced to that of MPCFG (Multiple
+Parallel Context Free Grammars), see P. Ljunglöf's thesis (2004).
+
+
+
+#NEW
+
+==The core type system of concrete syntax: basic types==
+
+```
+ param P P : PType
+ PType : Type --------- ---------
+ P : PType P : Type
+
+ s : Str t : Str
+ Str : type "foo" : Str [] : Str ----------------
+ s ++ t : Str
+```
+
+
+#NEW
+
+==The core type system of concrete syntax: functions and tables==
+
+```
+ A : Type x : A |- B : Type x : A |- b : B f : (x : A) -> B a : A
+ ---------------------------- ---------------------- ------------------------
+ (x : A) -> B : Type \x -> b : (x : A) -> B f a : B{x := A}
+
+
+ P : PType A : Type t : P => A p : p
+ -------------------- -----------------
+ P => A : Type t ! p : A
+
+ v_1,...,v_n : A
+ ---------------------------------------------- P = {C_1,...,C_n}
+ table {C_1 => v_1 ; ... ; C_n => v_n} : P => A
+```
+Pattern matching is treated as an abbreviation for tables. Notice that
+```
+ case e of {...} == table {...} ! e
+```
+
+
+#NEW
+
+==The core type system of concrete syntax: records==
+
+```
+ A_1,...,A_n : Type
+ ------------------------------------ n >= 0
+ {r_1 : A_1 ; ... ; r_n : A_n} : Type
+
+
+ a_1 : A_1 ... a_n : A_n
+ ------------------------------------------------------------
+ {r_1 = a_1 ; ... ; r_n = a_n} : {r_1 : A_1 ; ... ; r_n : A_n}
+
+
+ r : {r_1 : A_1 ; ... ; r_n : A_n}
+ ----------------------------------- i = 1,...,n
+ r.r_1 : A_1
+```
+Subtyping: if ``r : R`` then ``r : R ** {r : A}``
+
+
+
+#NEW
+
+==Computation rules==
+
+```
+ (\x -> b) a = b{x := a}
+
+ (table {C_1 => v_1 ; ... ; C_n => v_n} : P => A) ! C_i = v_i
+
+ {r_1 = a_1 ; ... ; r_n = a_n}.r_i = a_i
+```
+
+
+
+#NEW
+
+==Canonical GF==
+
+Concrete syntax type system:
+```
+ A_1 : Type ... A_n : Type
+ Str : Type Int : Type ------------------------- $i : A
+ [A_1, ..., A_n] : Type
+
+
+ a_1 : A_1 ... a_n : A_n t : [A_1, ..., A_n]
+ --------------------------------- ------------------- i = 1,..,n
+ [a_1, ..., a_n] : [A_1, ..., A_n] t ! i : A_i
+```
+Tuples represent both records and tables.
+
+There are no functions.
+
+Linearization:
+```
+ lin f = f*
+
+ (f a_1 ... a_n)* = f*{$1 = a_1*, ..., $n = a_n*}
+```
+
+
+#NEW
+
+==The compilation task, again==
+
+1. From a GF source grammar, derive a canonical GF grammar.
+
+2. From the canonical GF grammar derive an MPCFG grammar
+
+The canonical GF grammar can be used for linearization, with
+linear time complexity (w.r.t. the size of the tree).
+
+The MPCFG grammar can be used for parsing, with (unbounded)
+polynomial time complexity (w.r.t. the size of the string).
+
+For these target formats, we have also built interpreters in
+different programming languages (C, C++, Haskell, Java, Prolog).
+
+Moreover, we generate supplementary formats such as grammars
+required by various speech recognition systems.
+
+
+#NEW
+
+==An overview of compilation phases==
+
+Legend:
+- ellipse node: representation saved in a file
+- plain text node: internal representation
+- solid arrow or ellipse: essential phare or format
+- dashed arrow or ellipse: optional phase or format
+- arrow label: the module implementing the phase
+
+
+[gf-compiler.png]
+
+
+#NEW
+
+==Using the compiler==
+
+Batch mode (cf. GHC).
+
+Interactive mode, building the grammar incrementally from
+different files, with the possibility of testing them
+(cf. GHCI).
+
+The interactive mode was first, built on the model of ALF-2
+(L. Magnusson), and there was no file output of compiled
+grammars.
+
+
+#NEW
+
+==Modules and separate compilation==
+
+The above diagram shows what happens to each module.
+(But not quite, since some of the back-end formats must be
+built for sets of modules: GFCC and the parser formats.)
+
+When the grammar compiler is called, it has a main module as its
+argument. It then builds recursively a dependency graph with all
+the other modules, and decides which ones must be recompiled.
+The behaviour is rather similar to GHC.
+
+Separate compilation is //extremely important// when developing
+big grammars, especially when using grammar libraries. Example: compiling
+the GF resource grammar library takes 5 minutes, whereas reading
+in the compiled image takes 10 seconds.
+
+
+#NEW
+
+==Module dependencies and recompilation==
+
+(For later use, not for the Proglog talk)
+
+For each module M, there are 3 kinds of files:
+- M.gf, source file
+- M.gfc, compiled file ("object file")
+- M.gfr, type-checked and optimized source file (for resource modules only)
+
+
+The compiler reads gf files and writes gfc files (and gfr files if appropriate)
+
+The Main module is the one used as argument when calling GF.
+
+A module M (immediately) depends on the module K, if either
+- M is a concrete of K
+- M is an instance of K
+- M extends K
+- M opens K
+- M is a completion of K with something
+- M is a completion of some module with K instantiated with something
+
+
+A module M (transitively) depends on the module K, if either
+- M immediately depends on K
+- M depends on some L such that L immediately depends on K
+
+
+Immediate dependence is readable from the module header without parsing
+the whole module.
+
+The compiler reads recursively the headers of all modules that Main depends on.
+
+These modules are arranged in a dependency graph, which is checked to be acyclic.
+
+To decide whether a module M has to be compiled, do:
++ Get the time stamps t() of M.gf and M.gfc (if a file doesn't exist, its
+ time is minus infinity).
++ If t(M.gf) > t(M.gfc), M must be compiled.
++ If M depends on K and K must be compiled, then M must be compiled.
++ If M depends on K and t(K.gf) > t(M.gfc), then M must be compiled.
+
+
+Decorate the dependency graph by information on whether the gf or the gfc (and gfr)
+format is to be read.
+
+Topologically sort the decorated graph, and read each file in the chosen format.
+
+The gfr file is generated for these module types only:
+- resource
+- instance
+
+
+When reading K.gfc, also K.gfr is read if some M depending on K has to be compiled.
+In other cases, it is enough to read K.gfc.
+
+In an interactive GF session, some modules may be in memory already.
+When read to the memory, each module M is given time stamp t(M.m).
+The additional rule now is:
+- If M.gfc is to be read, and t(M.m) > t(M.gfc), don't read M.gfc.
+
+
+
+
+#NEW
+
+==Techniques used==
+
+The compiler is written in Haskell, with some C foreign function calls
+in the interactive version (readline, killing threads).
+
+BNFC is used for generating both the parsers and printers.
+This has helped to make the formats portable.
+
+"Almost compositional functions" (``composOp``) are used in
+many compiler passes, making them easier to write and understand.
+A ``grep`` on the sources reveals 40 uses (outside the definition
+of ``composOp`` itself).
+
+The key algorithmic ideas are
+- type-driven partial evaluation in GF-to-GFC generation
+- common subexpression elimination as back-end optimization
+- some ideas in GFC-to-MCFG encoding
+
+
+#NEW
+
+==Type-driven partial evaluation==
+
+Each abstract syntax category in GF has a corresponding linearization type:
+```
+ cat C
+ lincat C = T
+```
+The general form of a GF rule pair is
+```
+ fun f : C1 -> ... -> Cn -> C
+ lin f = t
+```
+with the typing condition following the ``lincat`` definitions
+```
+ t : T1 -> ... -> Tn -> T
+```
+The term ``t`` is in general built by using abstraction methods such
+as pattern matching, higher-order functions, local definitions,
+and library functions.
+
+The compilation technique proceeds as follows:
+- use eta-expansion on ``t`` to determine the canonical form of the term
+```
+ \ $C1, ...., $Cn -> (t $C1 .... $Cn)
+```
+with unique variables ``$C1 .... $Cn`` for the arguments; repeat this
+inside the term for records and tables
+- evaluate the resulting term using the computation rules of GF
+- what remains is a canonical term with ``$C1 .... $Cn`` the only
+variables (the run-time input of the linearization function)
+
+
+#NEW
+
+==Eta-expanding records and tables==
+
+For records that are valied via subtyping, eta expansion
+eliminates superfluous fields:
+```
+ {r1 = t1 ; r2 = t2} : {r1 : T1} ----> {r1 = t1}
+```
+For tables, the effect is always expansion, since
+pattern matching can be used to represent tables
+compactly:
+```
+ table {n => "fish"} : Number => Str --->
+
+ table {
+ Sg => "fish" ;
+ Pl => "fish"
+ }
+```
+This can be helped by back-end optimizations (see below).
+
+
+#NEW
+
+==Eliminating functions==
+
+"Everything is finite": parameter types, records, tables;
+finite number of string tokens per grammar.
+
+But "inifinite types" such as function types are useful when
+writing grammars, to enable abstractions.
+
+Since function types do not appear in linearization types,
+we want functions to be eliminated from linearization terms.
+
+This is similar to the **subformula property** in logic.
+Also the main problem is similar: function depending on
+a run-time variable,
+```
+ (table {P => f ; Q = g} ! x) a
+```
+This is not a redex, but we can make it closer to one by moving
+the application inside the table,
+```
+ table {P => f a ; Q = g a} ! x
+```
+This transformation is the same as Prawitz's (1965) elimination
+of maximal segments in natural deduction:
+```
+ A B
+ C -> D C C -> D C
+ A B --------- ---------
+ A v B C -> D C -> D A v B D D
+ --------------------- ===> -------------------------
+ C -> D C D
+ --------------------
+ D
+```
+
+
+
+#NEW
+
+==Size effects of partial evaluation==
+
+Irrelevant table branches are thrown away, which can reduce the size.
+
+But, since tables are expanded and auxiliary functions are inlined,
+the size can grow exponentially.
+
+How can we keep the first property and eliminate the second?
+
+
+#NEW
+
+==Parametrization of tables==
+
+Algorithm: for each branch in a table, consider replacing the
+argument by a variable:
+```
+ table { table {
+ P => t ; ---> x => t[P->x] ;
+ Q => u x => u[Q->x]
+ } }
+```
+If the resulting branches are all equal, you can replace the table
+by a lambda abstract
+```
+ \\x => t[P->x]
+```
+If each created variable ``x`` is unique in the grammar, computation
+with the lambda abstract is efficient.
+
+
+
+#NEW
+
+==Course-of-values tables==
+
+By maintaining a canonical order of parameters in a type, we can
+eliminate the left hand sides of branches.
+```
+ table { table T [
+ P => t ; ---> t ;
+ Q => u u
+ } ]
+```
+The treatment is similar to ``Enum`` instances in Haskell.
+
+In the end, all parameter types can be translated to
+initial segments of integers.
+
+
+#NEW
+
+==Common subexpression elimination==
+
+Algorithm:
++ Go through all terms and subterms in a module, creating
+ a symbol table mapping terms to the number of occurrences.
++ For each subterm appearing at least twice, create a fresh
+ constant defined as that subterm.
++ Go through all rules (incl. rules for the new constants),
+ replacing largest possible subterms with such new constants.
+
+
+This algorithm, in a way, creates the strongest possible abstractions.
+
+In general, the new constants have open terms as definitions.
+But since all variables (and constants) are unique, they can
+be computed by simple replacement.
+
+
+
+#NEW
+
+==Size effects of optimizations==
+
+Example: the German resource grammar
+``LangGer``
+
+|| optimization | lines | characters | size % | blow-up |
+| none | 5394 | 3208435 | 100 | 25 |
+| all | 5394 | 750277 | 23 | 6 |
+| none_subs | 5772 | 1290866 | 40 | 10 |
+| all_subs | 5644 | 414119 | 13 | 3 |
+| gfcc | 3279 | 190004 | 6 | 1.5 |
+| gf source | 3976 | 121939 | 4 | 1 |
+
+
+Optimization "all" means parametrization + course-of-values.
+
+The source code size is an estimate, since it includes
+potentially irrelevant library modules, and comments.
+
+The GFCC format is not reusable in separate compilation.
+
+
+
+#NEW
+
+==The shared prefix optimization==
+
+This is currently performed in GFCC only.
+
+The idea works for languages that have a rich morphology
+based on suffixes. Then we can replace a course of values
+with a pair of a prefix and a suffix set:
+```
+ ["apa", "apan", "apor", "aporna"] --->
+ ("ap" + ["a", "an", "or", "orna"])
+```
+The real gain comes via common subexpression elimination:
+```
+ _34 = ["a", "an", "or", "orna"]
+ apa = ("ap" + _34)
+ blomma = ("blomm" + _34)
+ flicka = ("flick" + _34)
+```
+Notice that it now matters a lot how grammars are written.
+For instance, if German verbs are treated as a one-dimensional
+table,
+```
+ ["lieben", "liebe", "liebst", ...., "geliebt", "geliebter",...]
+```
+no shared prefix optimization is possible. A better form is
+separate tables for non-"ge" and "ge" forms:
+```
+ [["lieben", "liebe", "liebst", ....], ["geliebt", "geliebter",...]]
+```
+
+
+#NEW
+
+==Reuse of grammars as libraries==
+
+The idea of resource grammars: take care of all aspects of
+surface grammaticality (inflection, agreement, word order).
+
+Reuse in application grammar: via translations
+```
+ cat C ---> oper C : Type = T
+ lincat C = T
+
+ fun f : A ---> oper f : A* = t
+ lin f = t
+```
+The user only needs to know the type signatures (abstract syntax).
+
+However, this does not quite guarantee grammaticality, because
+different categories can have the same lincat:
+```
+ lincat Conj = {s : Str}
+ lincat Adv = {s : Str}
+```
+Thus someone may by accident use "and" as an adverb!
+
+
+#NEW
+
+==Forcing the type checker to act as a grammar checker==
+
+We just have to make linearization types unique for each category.
+
+The technique is reminiscent of Haskell's ``newtype`` but uses
+records instead: we add **lock fields** e.g.
+```
+ lincat Conj = {s : Str ; lock_Conj : {}}
+ lincat Adv = {s : Str ; lock_Adv : {}}
+```
+Thanks to record subtyping, the translation is simple:
+```
+ fun f : C1 -> ... -> Cn -> C
+ lin f = t
+
+ --->
+
+ oper f : C1* -> ... -> Cn* -> C* =
+ \x1,...,xn -> (t x1 ... xn) ** {lock_C = {}}
+```
+
+#NEW
+
+==Things to do==
+
+Better compression of gfc file format.
+
+Type checking of dependent-type pattern matching in abstract syntax.
+
+Compilation-related modules that need rewriting
+- ``ReadFiles``: clarify the logic of dependencies
+- ``Compile``: clarify the logic of what to do with each module
+- ``Compute``: make the evaluation more efficient
+- ``Parsing/*``, ``OldParsing/*``, ``Conversion/*``: reduce the number
+ of parser formats and algorithms