diff options
| author | aarne <aarne@chalmers.se> | 2010-12-22 14:08:42 +0000 |
|---|---|---|
| committer | aarne <aarne@chalmers.se> | 2010-12-22 14:08:42 +0000 |
| commit | ce15ec7b787479ca4c7295863ea7fa5cfdd16755 (patch) | |
| tree | f47d9227ab535781d44d00e6232b8c62902167df /deprecated/doc/compiling-gf.txt | |
| parent | fb722fe8e2cedee3b42d7fb0c9da61ace74f3e22 (diff) | |
moved parts of doc to deprecated/doc
Diffstat (limited to 'deprecated/doc/compiling-gf.txt')
| -rw-r--r-- | deprecated/doc/compiling-gf.txt | 750 |
1 files changed, 750 insertions, 0 deletions
diff --git a/deprecated/doc/compiling-gf.txt b/deprecated/doc/compiling-gf.txt new file mode 100644 index 000000000..9e438f40f --- /dev/null +++ b/deprecated/doc/compiling-gf.txt @@ -0,0 +1,750 @@ +Compiling GF +Aarne Ranta +Proglog meeting, 1 November 2006 + +% to compile: txt2tags -thtml compiling-gf.txt ; htmls compiling-gf.html + +%!target:html +%!postproc(html): #NEW <!-- NEW --> + +#NEW + +==The compilation task== + +GF is a grammar formalism, i.e. a special purpose programming language +for writing grammars. + +Other grammar formalisms: +- BNF, YACC, Happy (grammars for programming languages); +- PATR, HPSG, LFG (grammars for natural languages). + + +The grammar compiler prepares a GF grammar for two computational tasks: +- linearization: take syntax trees to strings +- parsing: take strings to syntax trees + + +The grammar gives a declarative description of these functionalities, +on a high abstraction level that improves grammar writing +productivity. + +For efficiency, the grammar is compiled to lower-level formats. + +Type checking is another essential compilation phase. Its purpose is +twofold, as usual: +- checking the correctness of the grammar +- type-annotating expressions for code generation + + +#NEW + +==Characteristics of GF language== + +Functional language with types, both built-in and user-defined. +``` + Str : Type + + param Number = Sg | Pl + + param AdjForm = ASg Gender | APl + + Noun : Type = {s : Number => Str ; g : Gender} +``` +Pattern matching. +``` + svart_A = table { + ASg _ => "svart" ; + _ => "svarta" + } +``` +Higher-order functions. + +Dependent types. +``` + flip : (a, b, c : Type) -> (a -> b -> c) -> b -> a -> c = + \_,_,_,f,y,x -> f x y ; +``` + + +#NEW + +==The module system of GF== + +Main division: abstract syntax and concrete syntax +``` + abstract Greeting = { + cat Greet ; + fun Hello : Greet ; + } + + concrete GreetingEng of Greeting = { + lincat Greet = {s : Str} ; + lin Hello = {s = "hello"} ; + } + + concrete GreetingIta of Greeting = { + param Politeness = Familiar | Polite ; + lincat Greet = {s : Politeness => Str} ; + lin Hello = {s = table { + Familiar => "ciao" ; + Polite => "buongiorno" + } ; + } +``` +Other features of the module system: +- extension and opening +- parametrized modules (cf. ML: signatures, structures, functors) + + + + +#NEW + +==GF vs. Haskell== + +Some things that (standard) Haskell hasn't: +- records and record subtyping +- regular expression patterns +- dependent types +- ML-style modules + + +Some things that GF hasn't: +- infinite (recursive) data types +- recursive functions +- classes, polymorphism + + +#NEW + +==GF vs. most linguistic grammar formalisms== + +GF separates abstract syntax from concrete syntax. + +GF has a module system with separate compilation. + +GF is generation-oriented (as opposed to parsing). + +GF has unidirectional matching (as opposed to unification). + +GF has a static type system (as opposed to a type-free universe). + +"I was - and I still am - firmly convinced that a program composed +out of statically type-checked parts is more likely to faithfully +express a well-thought-out design than a program relying on +weakly-typed interfaces or dynamically-checked interfaces." +(B. Stroustrup, 1994, p. 107) + + + +#NEW + +==The computation model: abstract syntax== + +An abstract syntax defines a free algebra of trees (using +dependent types, recursion, higher-order abstract syntax: +GF includes a complete Logical Framework). +``` + cat C (x_1 : A_1)...(x_n : A_n) + a_1 : A_1 + ... + a_n : A_n{x_1 : A_1,...,x_n-1 : A_n-1} + ---------------------------------------------------- + (C a_1 ... a_n) : Type + + + fun f : (x_1 : A_1) -> ... -> (x_n : A_n) -> A + a_1 : A_1 + ... + a_n : A_n{x_1 : A_1,...,x_n-1 : A_n-1} + ---------------------------------------------------- + (f a_1 ... a_n) : A{x_1 : A_1,...,x_n : A_n} + + + A : Type x : A |- B : Type x : A |- b : B f : (x : A) -> B a : A + ---------------------------- ---------------------- ------------------------ + (x : A) -> B : Type \x -> b : (x : A) -> B f a : B{x := A} +``` +Notice that all syntax trees are in eta-long form. + + +#NEW + +==The computation model: concrete syntax== + +A concrete syntax defines a homomorphism (compositional mapping) +from the abstract syntax to a system of concrete syntax objects. +``` + cat C _ + -------------------- + lincat C = C* : Type + + fun f : (x_1 : A_1) -> ... -> (x_n : A_n) -> A + ----------------------------------------------- + lin f = f* : A_1* -> ... -> A_n* -> A* + + (f a_1 ... a_n)* = f* a_1* ... a_n* +``` +The homomorphism can as such be used as linearization function. + +It is a functional program, but a restricted one, since it works +in the end on finite data structures only. + +But a more efficient program is obtained via compilation to +GFC = Canonical GF: the "machine code" of GF. + +The parsing problem of GFC can be reduced to that of MPCFG (Multiple +Parallel Context Free Grammars), see P. Ljunglöf's thesis (2004). + + + +#NEW + +==The core type system of concrete syntax: basic types== + +``` + param P P : PType + PType : Type --------- --------- + P : PType P : Type + + s : Str t : Str + Str : type "foo" : Str [] : Str ---------------- + s ++ t : Str +``` + + +#NEW + +==The core type system of concrete syntax: functions and tables== + +``` + A : Type x : A |- B : Type x : A |- b : B f : (x : A) -> B a : A + ---------------------------- ---------------------- ------------------------ + (x : A) -> B : Type \x -> b : (x : A) -> B f a : B{x := A} + + + P : PType A : Type t : P => A p : p + -------------------- ----------------- + P => A : Type t ! p : A + + v_1,...,v_n : A + ---------------------------------------------- P = {C_1,...,C_n} + table {C_1 => v_1 ; ... ; C_n => v_n} : P => A +``` +Pattern matching is treated as an abbreviation for tables. Notice that +``` + case e of {...} == table {...} ! e +``` + + +#NEW + +==The core type system of concrete syntax: records== + +``` + A_1,...,A_n : Type + ------------------------------------ n >= 0 + {r_1 : A_1 ; ... ; r_n : A_n} : Type + + + a_1 : A_1 ... a_n : A_n + ------------------------------------------------------------ + {r_1 = a_1 ; ... ; r_n = a_n} : {r_1 : A_1 ; ... ; r_n : A_n} + + + r : {r_1 : A_1 ; ... ; r_n : A_n} + ----------------------------------- i = 1,...,n + r.r_1 : A_1 +``` +Subtyping: if ``r : R`` then ``r : R ** {r : A}`` + + + +#NEW + +==Computation rules== + +``` + (\x -> b) a = b{x := a} + + (table {C_1 => v_1 ; ... ; C_n => v_n} : P => A) ! C_i = v_i + + {r_1 = a_1 ; ... ; r_n = a_n}.r_i = a_i +``` + + + +#NEW + +==Canonical GF== + +Concrete syntax type system: +``` + A_1 : Type ... A_n : Type + Str : Type Int : Type ------------------------- $i : A + [A_1, ..., A_n] : Type + + + a_1 : A_1 ... a_n : A_n t : [A_1, ..., A_n] + --------------------------------- ------------------- i = 1,..,n + [a_1, ..., a_n] : [A_1, ..., A_n] t ! i : A_i +``` +Tuples represent both records and tables. + +There are no functions. + +Linearization: +``` + lin f = f* + + (f a_1 ... a_n)* = f*{$1 = a_1*, ..., $n = a_n*} +``` + + +#NEW + +==The compilation task, again== + +1. From a GF source grammar, derive a canonical GF grammar. + +2. From the canonical GF grammar derive an MPCFG grammar + +The canonical GF grammar can be used for linearization, with +linear time complexity (w.r.t. the size of the tree). + +The MPCFG grammar can be used for parsing, with (unbounded) +polynomial time complexity (w.r.t. the size of the string). + +For these target formats, we have also built interpreters in +different programming languages (C, C++, Haskell, Java, Prolog). + +Moreover, we generate supplementary formats such as grammars +required by various speech recognition systems. + + +#NEW + +==An overview of compilation phases== + +Legend: +- ellipse node: representation saved in a file +- plain text node: internal representation +- solid arrow or ellipse: essential phare or format +- dashed arrow or ellipse: optional phase or format +- arrow label: the module implementing the phase + + +[gf-compiler.png] + + +#NEW + +==Using the compiler== + +Batch mode (cf. GHC). + +Interactive mode, building the grammar incrementally from +different files, with the possibility of testing them +(cf. GHCI). + +The interactive mode was first, built on the model of ALF-2 +(L. Magnusson), and there was no file output of compiled +grammars. + + +#NEW + +==Modules and separate compilation== + +The above diagram shows what happens to each module. +(But not quite, since some of the back-end formats must be +built for sets of modules: GFCC and the parser formats.) + +When the grammar compiler is called, it has a main module as its +argument. It then builds recursively a dependency graph with all +the other modules, and decides which ones must be recompiled. +The behaviour is rather similar to GHC. + +Separate compilation is //extremely important// when developing +big grammars, especially when using grammar libraries. Example: compiling +the GF resource grammar library takes 5 minutes, whereas reading +in the compiled image takes 10 seconds. + + +#NEW + +==Module dependencies and recompilation== + +(For later use, not for the Proglog talk) + +For each module M, there are 3 kinds of files: +- M.gf, source file +- M.gfc, compiled file ("object file") +- M.gfr, type-checked and optimized source file (for resource modules only) + + +The compiler reads gf files and writes gfc files (and gfr files if appropriate) + +The Main module is the one used as argument when calling GF. + +A module M (immediately) depends on the module K, if either +- M is a concrete of K +- M is an instance of K +- M extends K +- M opens K +- M is a completion of K with something +- M is a completion of some module with K instantiated with something + + +A module M (transitively) depends on the module K, if either +- M immediately depends on K +- M depends on some L such that L immediately depends on K + + +Immediate dependence is readable from the module header without parsing +the whole module. + +The compiler reads recursively the headers of all modules that Main depends on. + +These modules are arranged in a dependency graph, which is checked to be acyclic. + +To decide whether a module M has to be compiled, do: ++ Get the time stamps t() of M.gf and M.gfc (if a file doesn't exist, its + time is minus infinity). ++ If t(M.gf) > t(M.gfc), M must be compiled. ++ If M depends on K and K must be compiled, then M must be compiled. ++ If M depends on K and t(K.gf) > t(M.gfc), then M must be compiled. + + +Decorate the dependency graph by information on whether the gf or the gfc (and gfr) +format is to be read. + +Topologically sort the decorated graph, and read each file in the chosen format. + +The gfr file is generated for these module types only: +- resource +- instance + + +When reading K.gfc, also K.gfr is read if some M depending on K has to be compiled. +In other cases, it is enough to read K.gfc. + +In an interactive GF session, some modules may be in memory already. +When read to the memory, each module M is given time stamp t(M.m). +The additional rule now is: +- If M.gfc is to be read, and t(M.m) > t(M.gfc), don't read M.gfc. + + + + +#NEW + +==Techniques used== + +The compiler is written in Haskell, with some C foreign function calls +in the interactive version (readline, killing threads). + +BNFC is used for generating both the parsers and printers. +This has helped to make the formats portable. + +"Almost compositional functions" (``composOp``) are used in +many compiler passes, making them easier to write and understand. +A ``grep`` on the sources reveals 40 uses (outside the definition +of ``composOp`` itself). + +The key algorithmic ideas are +- type-driven partial evaluation in GF-to-GFC generation +- common subexpression elimination as back-end optimization +- some ideas in GFC-to-MCFG encoding + + +#NEW + +==Type-driven partial evaluation== + +Each abstract syntax category in GF has a corresponding linearization type: +``` + cat C + lincat C = T +``` +The general form of a GF rule pair is +``` + fun f : C1 -> ... -> Cn -> C + lin f = t +``` +with the typing condition following the ``lincat`` definitions +``` + t : T1 -> ... -> Tn -> T +``` +The term ``t`` is in general built by using abstraction methods such +as pattern matching, higher-order functions, local definitions, +and library functions. + +The compilation technique proceeds as follows: +- use eta-expansion on ``t`` to determine the canonical form of the term +``` + \ $C1, ...., $Cn -> (t $C1 .... $Cn) +``` +with unique variables ``$C1 .... $Cn`` for the arguments; repeat this +inside the term for records and tables +- evaluate the resulting term using the computation rules of GF +- what remains is a canonical term with ``$C1 .... $Cn`` the only +variables (the run-time input of the linearization function) + + +#NEW + +==Eta-expanding records and tables== + +For records that are valied via subtyping, eta expansion +eliminates superfluous fields: +``` + {r1 = t1 ; r2 = t2} : {r1 : T1} ----> {r1 = t1} +``` +For tables, the effect is always expansion, since +pattern matching can be used to represent tables +compactly: +``` + table {n => "fish"} : Number => Str ---> + + table { + Sg => "fish" ; + Pl => "fish" + } +``` +This can be helped by back-end optimizations (see below). + + +#NEW + +==Eliminating functions== + +"Everything is finite": parameter types, records, tables; +finite number of string tokens per grammar. + +But "inifinite types" such as function types are useful when +writing grammars, to enable abstractions. + +Since function types do not appear in linearization types, +we want functions to be eliminated from linearization terms. + +This is similar to the **subformula property** in logic. +Also the main problem is similar: function depending on +a run-time variable, +``` + (table {P => f ; Q = g} ! x) a +``` +This is not a redex, but we can make it closer to one by moving +the application inside the table, +``` + table {P => f a ; Q = g a} ! x +``` +This transformation is the same as Prawitz's (1965) elimination +of maximal segments in natural deduction: +``` + A B + C -> D C C -> D C + A B --------- --------- + A v B C -> D C -> D A v B D D + --------------------- ===> ------------------------- + C -> D C D + -------------------- + D +``` + + + +#NEW + +==Size effects of partial evaluation== + +Irrelevant table branches are thrown away, which can reduce the size. + +But, since tables are expanded and auxiliary functions are inlined, +the size can grow exponentially. + +How can we keep the first property and eliminate the second? + + +#NEW + +==Parametrization of tables== + +Algorithm: for each branch in a table, consider replacing the +argument by a variable: +``` + table { table { + P => t ; ---> x => t[P->x] ; + Q => u x => u[Q->x] + } } +``` +If the resulting branches are all equal, you can replace the table +by a lambda abstract +``` + \\x => t[P->x] +``` +If each created variable ``x`` is unique in the grammar, computation +with the lambda abstract is efficient. + + + +#NEW + +==Course-of-values tables== + +By maintaining a canonical order of parameters in a type, we can +eliminate the left hand sides of branches. +``` + table { table T [ + P => t ; ---> t ; + Q => u u + } ] +``` +The treatment is similar to ``Enum`` instances in Haskell. + +In the end, all parameter types can be translated to +initial segments of integers. + + +#NEW + +==Common subexpression elimination== + +Algorithm: ++ Go through all terms and subterms in a module, creating + a symbol table mapping terms to the number of occurrences. ++ For each subterm appearing at least twice, create a fresh + constant defined as that subterm. ++ Go through all rules (incl. rules for the new constants), + replacing largest possible subterms with such new constants. + + +This algorithm, in a way, creates the strongest possible abstractions. + +In general, the new constants have open terms as definitions. +But since all variables (and constants) are unique, they can +be computed by simple replacement. + + + +#NEW + +==Size effects of optimizations== + +Example: the German resource grammar +``LangGer`` + +|| optimization | lines | characters | size % | blow-up | +| none | 5394 | 3208435 | 100 | 25 | +| all | 5394 | 750277 | 23 | 6 | +| none_subs | 5772 | 1290866 | 40 | 10 | +| all_subs | 5644 | 414119 | 13 | 3 | +| gfcc | 3279 | 190004 | 6 | 1.5 | +| gf source | 3976 | 121939 | 4 | 1 | + + +Optimization "all" means parametrization + course-of-values. + +The source code size is an estimate, since it includes +potentially irrelevant library modules, and comments. + +The GFCC format is not reusable in separate compilation. + + + +#NEW + +==The shared prefix optimization== + +This is currently performed in GFCC only. + +The idea works for languages that have a rich morphology +based on suffixes. Then we can replace a course of values +with a pair of a prefix and a suffix set: +``` + ["apa", "apan", "apor", "aporna"] ---> + ("ap" + ["a", "an", "or", "orna"]) +``` +The real gain comes via common subexpression elimination: +``` + _34 = ["a", "an", "or", "orna"] + apa = ("ap" + _34) + blomma = ("blomm" + _34) + flicka = ("flick" + _34) +``` +Notice that it now matters a lot how grammars are written. +For instance, if German verbs are treated as a one-dimensional +table, +``` + ["lieben", "liebe", "liebst", ...., "geliebt", "geliebter",...] +``` +no shared prefix optimization is possible. A better form is +separate tables for non-"ge" and "ge" forms: +``` + [["lieben", "liebe", "liebst", ....], ["geliebt", "geliebter",...]] +``` + + +#NEW + +==Reuse of grammars as libraries== + +The idea of resource grammars: take care of all aspects of +surface grammaticality (inflection, agreement, word order). + +Reuse in application grammar: via translations +``` + cat C ---> oper C : Type = T + lincat C = T + + fun f : A ---> oper f : A* = t + lin f = t +``` +The user only needs to know the type signatures (abstract syntax). + +However, this does not quite guarantee grammaticality, because +different categories can have the same lincat: +``` + lincat Conj = {s : Str} + lincat Adv = {s : Str} +``` +Thus someone may by accident use "and" as an adverb! + + +#NEW + +==Forcing the type checker to act as a grammar checker== + +We just have to make linearization types unique for each category. + +The technique is reminiscent of Haskell's ``newtype`` but uses +records instead: we add **lock fields** e.g. +``` + lincat Conj = {s : Str ; lock_Conj : {}} + lincat Adv = {s : Str ; lock_Adv : {}} +``` +Thanks to record subtyping, the translation is simple: +``` + fun f : C1 -> ... -> Cn -> C + lin f = t + + ---> + + oper f : C1* -> ... -> Cn* -> C* = + \x1,...,xn -> (t x1 ... xn) ** {lock_C = {}} +``` + +#NEW + +==Things to do== + +Better compression of gfc file format. + +Type checking of dependent-type pattern matching in abstract syntax. + +Compilation-related modules that need rewriting +- ``ReadFiles``: clarify the logic of dependencies +- ``Compile``: clarify the logic of what to do with each module +- ``Compute``: make the evaluation more efficient +- ``Parsing/*``, ``OldParsing/*``, ``Conversion/*``: reduce the number + of parser formats and algorithms |
