diff options
Diffstat (limited to 'doc/compiling-gf.txt')
| -rw-r--r-- | doc/compiling-gf.txt | 750 |
1 files changed, 0 insertions, 750 deletions
diff --git a/doc/compiling-gf.txt b/doc/compiling-gf.txt deleted file mode 100644 index 9e438f40f..000000000 --- a/doc/compiling-gf.txt +++ /dev/null @@ -1,750 +0,0 @@ -Compiling GF -Aarne Ranta -Proglog meeting, 1 November 2006 - -% to compile: txt2tags -thtml compiling-gf.txt ; htmls compiling-gf.html - -%!target:html -%!postproc(html): #NEW <!-- NEW --> - -#NEW - -==The compilation task== - -GF is a grammar formalism, i.e. a special purpose programming language -for writing grammars. - -Other grammar formalisms: -- BNF, YACC, Happy (grammars for programming languages); -- PATR, HPSG, LFG (grammars for natural languages). - - -The grammar compiler prepares a GF grammar for two computational tasks: -- linearization: take syntax trees to strings -- parsing: take strings to syntax trees - - -The grammar gives a declarative description of these functionalities, -on a high abstraction level that improves grammar writing -productivity. - -For efficiency, the grammar is compiled to lower-level formats. - -Type checking is another essential compilation phase. Its purpose is -twofold, as usual: -- checking the correctness of the grammar -- type-annotating expressions for code generation - - -#NEW - -==Characteristics of GF language== - -Functional language with types, both built-in and user-defined. -``` - Str : Type - - param Number = Sg | Pl - - param AdjForm = ASg Gender | APl - - Noun : Type = {s : Number => Str ; g : Gender} -``` -Pattern matching. -``` - svart_A = table { - ASg _ => "svart" ; - _ => "svarta" - } -``` -Higher-order functions. - -Dependent types. -``` - flip : (a, b, c : Type) -> (a -> b -> c) -> b -> a -> c = - \_,_,_,f,y,x -> f x y ; -``` - - -#NEW - -==The module system of GF== - -Main division: abstract syntax and concrete syntax -``` - abstract Greeting = { - cat Greet ; - fun Hello : Greet ; - } - - concrete GreetingEng of Greeting = { - lincat Greet = {s : Str} ; - lin Hello = {s = "hello"} ; - } - - concrete GreetingIta of Greeting = { - param Politeness = Familiar | Polite ; - lincat Greet = {s : Politeness => Str} ; - lin Hello = {s = table { - Familiar => "ciao" ; - Polite => "buongiorno" - } ; - } -``` -Other features of the module system: -- extension and opening -- parametrized modules (cf. ML: signatures, structures, functors) - - - - -#NEW - -==GF vs. Haskell== - -Some things that (standard) Haskell hasn't: -- records and record subtyping -- regular expression patterns -- dependent types -- ML-style modules - - -Some things that GF hasn't: -- infinite (recursive) data types -- recursive functions -- classes, polymorphism - - -#NEW - -==GF vs. most linguistic grammar formalisms== - -GF separates abstract syntax from concrete syntax. - -GF has a module system with separate compilation. - -GF is generation-oriented (as opposed to parsing). - -GF has unidirectional matching (as opposed to unification). - -GF has a static type system (as opposed to a type-free universe). - -"I was - and I still am - firmly convinced that a program composed -out of statically type-checked parts is more likely to faithfully -express a well-thought-out design than a program relying on -weakly-typed interfaces or dynamically-checked interfaces." -(B. Stroustrup, 1994, p. 107) - - - -#NEW - -==The computation model: abstract syntax== - -An abstract syntax defines a free algebra of trees (using -dependent types, recursion, higher-order abstract syntax: -GF includes a complete Logical Framework). -``` - cat C (x_1 : A_1)...(x_n : A_n) - a_1 : A_1 - ... - a_n : A_n{x_1 : A_1,...,x_n-1 : A_n-1} - ---------------------------------------------------- - (C a_1 ... a_n) : Type - - - fun f : (x_1 : A_1) -> ... -> (x_n : A_n) -> A - a_1 : A_1 - ... - a_n : A_n{x_1 : A_1,...,x_n-1 : A_n-1} - ---------------------------------------------------- - (f a_1 ... a_n) : A{x_1 : A_1,...,x_n : A_n} - - - A : Type x : A |- B : Type x : A |- b : B f : (x : A) -> B a : A - ---------------------------- ---------------------- ------------------------ - (x : A) -> B : Type \x -> b : (x : A) -> B f a : B{x := A} -``` -Notice that all syntax trees are in eta-long form. - - -#NEW - -==The computation model: concrete syntax== - -A concrete syntax defines a homomorphism (compositional mapping) -from the abstract syntax to a system of concrete syntax objects. -``` - cat C _ - -------------------- - lincat C = C* : Type - - fun f : (x_1 : A_1) -> ... -> (x_n : A_n) -> A - ----------------------------------------------- - lin f = f* : A_1* -> ... -> A_n* -> A* - - (f a_1 ... a_n)* = f* a_1* ... a_n* -``` -The homomorphism can as such be used as linearization function. - -It is a functional program, but a restricted one, since it works -in the end on finite data structures only. - -But a more efficient program is obtained via compilation to -GFC = Canonical GF: the "machine code" of GF. - -The parsing problem of GFC can be reduced to that of MPCFG (Multiple -Parallel Context Free Grammars), see P. Ljunglöf's thesis (2004). - - - -#NEW - -==The core type system of concrete syntax: basic types== - -``` - param P P : PType - PType : Type --------- --------- - P : PType P : Type - - s : Str t : Str - Str : type "foo" : Str [] : Str ---------------- - s ++ t : Str -``` - - -#NEW - -==The core type system of concrete syntax: functions and tables== - -``` - A : Type x : A |- B : Type x : A |- b : B f : (x : A) -> B a : A - ---------------------------- ---------------------- ------------------------ - (x : A) -> B : Type \x -> b : (x : A) -> B f a : B{x := A} - - - P : PType A : Type t : P => A p : p - -------------------- ----------------- - P => A : Type t ! p : A - - v_1,...,v_n : A - ---------------------------------------------- P = {C_1,...,C_n} - table {C_1 => v_1 ; ... ; C_n => v_n} : P => A -``` -Pattern matching is treated as an abbreviation for tables. Notice that -``` - case e of {...} == table {...} ! e -``` - - -#NEW - -==The core type system of concrete syntax: records== - -``` - A_1,...,A_n : Type - ------------------------------------ n >= 0 - {r_1 : A_1 ; ... ; r_n : A_n} : Type - - - a_1 : A_1 ... a_n : A_n - ------------------------------------------------------------ - {r_1 = a_1 ; ... ; r_n = a_n} : {r_1 : A_1 ; ... ; r_n : A_n} - - - r : {r_1 : A_1 ; ... ; r_n : A_n} - ----------------------------------- i = 1,...,n - r.r_1 : A_1 -``` -Subtyping: if ``r : R`` then ``r : R ** {r : A}`` - - - -#NEW - -==Computation rules== - -``` - (\x -> b) a = b{x := a} - - (table {C_1 => v_1 ; ... ; C_n => v_n} : P => A) ! C_i = v_i - - {r_1 = a_1 ; ... ; r_n = a_n}.r_i = a_i -``` - - - -#NEW - -==Canonical GF== - -Concrete syntax type system: -``` - A_1 : Type ... A_n : Type - Str : Type Int : Type ------------------------- $i : A - [A_1, ..., A_n] : Type - - - a_1 : A_1 ... a_n : A_n t : [A_1, ..., A_n] - --------------------------------- ------------------- i = 1,..,n - [a_1, ..., a_n] : [A_1, ..., A_n] t ! i : A_i -``` -Tuples represent both records and tables. - -There are no functions. - -Linearization: -``` - lin f = f* - - (f a_1 ... a_n)* = f*{$1 = a_1*, ..., $n = a_n*} -``` - - -#NEW - -==The compilation task, again== - -1. From a GF source grammar, derive a canonical GF grammar. - -2. From the canonical GF grammar derive an MPCFG grammar - -The canonical GF grammar can be used for linearization, with -linear time complexity (w.r.t. the size of the tree). - -The MPCFG grammar can be used for parsing, with (unbounded) -polynomial time complexity (w.r.t. the size of the string). - -For these target formats, we have also built interpreters in -different programming languages (C, C++, Haskell, Java, Prolog). - -Moreover, we generate supplementary formats such as grammars -required by various speech recognition systems. - - -#NEW - -==An overview of compilation phases== - -Legend: -- ellipse node: representation saved in a file -- plain text node: internal representation -- solid arrow or ellipse: essential phare or format -- dashed arrow or ellipse: optional phase or format -- arrow label: the module implementing the phase - - -[gf-compiler.png] - - -#NEW - -==Using the compiler== - -Batch mode (cf. GHC). - -Interactive mode, building the grammar incrementally from -different files, with the possibility of testing them -(cf. GHCI). - -The interactive mode was first, built on the model of ALF-2 -(L. Magnusson), and there was no file output of compiled -grammars. - - -#NEW - -==Modules and separate compilation== - -The above diagram shows what happens to each module. -(But not quite, since some of the back-end formats must be -built for sets of modules: GFCC and the parser formats.) - -When the grammar compiler is called, it has a main module as its -argument. It then builds recursively a dependency graph with all -the other modules, and decides which ones must be recompiled. -The behaviour is rather similar to GHC. - -Separate compilation is //extremely important// when developing -big grammars, especially when using grammar libraries. Example: compiling -the GF resource grammar library takes 5 minutes, whereas reading -in the compiled image takes 10 seconds. - - -#NEW - -==Module dependencies and recompilation== - -(For later use, not for the Proglog talk) - -For each module M, there are 3 kinds of files: -- M.gf, source file -- M.gfc, compiled file ("object file") -- M.gfr, type-checked and optimized source file (for resource modules only) - - -The compiler reads gf files and writes gfc files (and gfr files if appropriate) - -The Main module is the one used as argument when calling GF. - -A module M (immediately) depends on the module K, if either -- M is a concrete of K -- M is an instance of K -- M extends K -- M opens K -- M is a completion of K with something -- M is a completion of some module with K instantiated with something - - -A module M (transitively) depends on the module K, if either -- M immediately depends on K -- M depends on some L such that L immediately depends on K - - -Immediate dependence is readable from the module header without parsing -the whole module. - -The compiler reads recursively the headers of all modules that Main depends on. - -These modules are arranged in a dependency graph, which is checked to be acyclic. - -To decide whether a module M has to be compiled, do: -+ Get the time stamps t() of M.gf and M.gfc (if a file doesn't exist, its - time is minus infinity). -+ If t(M.gf) > t(M.gfc), M must be compiled. -+ If M depends on K and K must be compiled, then M must be compiled. -+ If M depends on K and t(K.gf) > t(M.gfc), then M must be compiled. - - -Decorate the dependency graph by information on whether the gf or the gfc (and gfr) -format is to be read. - -Topologically sort the decorated graph, and read each file in the chosen format. - -The gfr file is generated for these module types only: -- resource -- instance - - -When reading K.gfc, also K.gfr is read if some M depending on K has to be compiled. -In other cases, it is enough to read K.gfc. - -In an interactive GF session, some modules may be in memory already. -When read to the memory, each module M is given time stamp t(M.m). -The additional rule now is: -- If M.gfc is to be read, and t(M.m) > t(M.gfc), don't read M.gfc. - - - - -#NEW - -==Techniques used== - -The compiler is written in Haskell, with some C foreign function calls -in the interactive version (readline, killing threads). - -BNFC is used for generating both the parsers and printers. -This has helped to make the formats portable. - -"Almost compositional functions" (``composOp``) are used in -many compiler passes, making them easier to write and understand. -A ``grep`` on the sources reveals 40 uses (outside the definition -of ``composOp`` itself). - -The key algorithmic ideas are -- type-driven partial evaluation in GF-to-GFC generation -- common subexpression elimination as back-end optimization -- some ideas in GFC-to-MCFG encoding - - -#NEW - -==Type-driven partial evaluation== - -Each abstract syntax category in GF has a corresponding linearization type: -``` - cat C - lincat C = T -``` -The general form of a GF rule pair is -``` - fun f : C1 -> ... -> Cn -> C - lin f = t -``` -with the typing condition following the ``lincat`` definitions -``` - t : T1 -> ... -> Tn -> T -``` -The term ``t`` is in general built by using abstraction methods such -as pattern matching, higher-order functions, local definitions, -and library functions. - -The compilation technique proceeds as follows: -- use eta-expansion on ``t`` to determine the canonical form of the term -``` - \ $C1, ...., $Cn -> (t $C1 .... $Cn) -``` -with unique variables ``$C1 .... $Cn`` for the arguments; repeat this -inside the term for records and tables -- evaluate the resulting term using the computation rules of GF -- what remains is a canonical term with ``$C1 .... $Cn`` the only -variables (the run-time input of the linearization function) - - -#NEW - -==Eta-expanding records and tables== - -For records that are valied via subtyping, eta expansion -eliminates superfluous fields: -``` - {r1 = t1 ; r2 = t2} : {r1 : T1} ----> {r1 = t1} -``` -For tables, the effect is always expansion, since -pattern matching can be used to represent tables -compactly: -``` - table {n => "fish"} : Number => Str ---> - - table { - Sg => "fish" ; - Pl => "fish" - } -``` -This can be helped by back-end optimizations (see below). - - -#NEW - -==Eliminating functions== - -"Everything is finite": parameter types, records, tables; -finite number of string tokens per grammar. - -But "inifinite types" such as function types are useful when -writing grammars, to enable abstractions. - -Since function types do not appear in linearization types, -we want functions to be eliminated from linearization terms. - -This is similar to the **subformula property** in logic. -Also the main problem is similar: function depending on -a run-time variable, -``` - (table {P => f ; Q = g} ! x) a -``` -This is not a redex, but we can make it closer to one by moving -the application inside the table, -``` - table {P => f a ; Q = g a} ! x -``` -This transformation is the same as Prawitz's (1965) elimination -of maximal segments in natural deduction: -``` - A B - C -> D C C -> D C - A B --------- --------- - A v B C -> D C -> D A v B D D - --------------------- ===> ------------------------- - C -> D C D - -------------------- - D -``` - - - -#NEW - -==Size effects of partial evaluation== - -Irrelevant table branches are thrown away, which can reduce the size. - -But, since tables are expanded and auxiliary functions are inlined, -the size can grow exponentially. - -How can we keep the first property and eliminate the second? - - -#NEW - -==Parametrization of tables== - -Algorithm: for each branch in a table, consider replacing the -argument by a variable: -``` - table { table { - P => t ; ---> x => t[P->x] ; - Q => u x => u[Q->x] - } } -``` -If the resulting branches are all equal, you can replace the table -by a lambda abstract -``` - \\x => t[P->x] -``` -If each created variable ``x`` is unique in the grammar, computation -with the lambda abstract is efficient. - - - -#NEW - -==Course-of-values tables== - -By maintaining a canonical order of parameters in a type, we can -eliminate the left hand sides of branches. -``` - table { table T [ - P => t ; ---> t ; - Q => u u - } ] -``` -The treatment is similar to ``Enum`` instances in Haskell. - -In the end, all parameter types can be translated to -initial segments of integers. - - -#NEW - -==Common subexpression elimination== - -Algorithm: -+ Go through all terms and subterms in a module, creating - a symbol table mapping terms to the number of occurrences. -+ For each subterm appearing at least twice, create a fresh - constant defined as that subterm. -+ Go through all rules (incl. rules for the new constants), - replacing largest possible subterms with such new constants. - - -This algorithm, in a way, creates the strongest possible abstractions. - -In general, the new constants have open terms as definitions. -But since all variables (and constants) are unique, they can -be computed by simple replacement. - - - -#NEW - -==Size effects of optimizations== - -Example: the German resource grammar -``LangGer`` - -|| optimization | lines | characters | size % | blow-up | -| none | 5394 | 3208435 | 100 | 25 | -| all | 5394 | 750277 | 23 | 6 | -| none_subs | 5772 | 1290866 | 40 | 10 | -| all_subs | 5644 | 414119 | 13 | 3 | -| gfcc | 3279 | 190004 | 6 | 1.5 | -| gf source | 3976 | 121939 | 4 | 1 | - - -Optimization "all" means parametrization + course-of-values. - -The source code size is an estimate, since it includes -potentially irrelevant library modules, and comments. - -The GFCC format is not reusable in separate compilation. - - - -#NEW - -==The shared prefix optimization== - -This is currently performed in GFCC only. - -The idea works for languages that have a rich morphology -based on suffixes. Then we can replace a course of values -with a pair of a prefix and a suffix set: -``` - ["apa", "apan", "apor", "aporna"] ---> - ("ap" + ["a", "an", "or", "orna"]) -``` -The real gain comes via common subexpression elimination: -``` - _34 = ["a", "an", "or", "orna"] - apa = ("ap" + _34) - blomma = ("blomm" + _34) - flicka = ("flick" + _34) -``` -Notice that it now matters a lot how grammars are written. -For instance, if German verbs are treated as a one-dimensional -table, -``` - ["lieben", "liebe", "liebst", ...., "geliebt", "geliebter",...] -``` -no shared prefix optimization is possible. A better form is -separate tables for non-"ge" and "ge" forms: -``` - [["lieben", "liebe", "liebst", ....], ["geliebt", "geliebter",...]] -``` - - -#NEW - -==Reuse of grammars as libraries== - -The idea of resource grammars: take care of all aspects of -surface grammaticality (inflection, agreement, word order). - -Reuse in application grammar: via translations -``` - cat C ---> oper C : Type = T - lincat C = T - - fun f : A ---> oper f : A* = t - lin f = t -``` -The user only needs to know the type signatures (abstract syntax). - -However, this does not quite guarantee grammaticality, because -different categories can have the same lincat: -``` - lincat Conj = {s : Str} - lincat Adv = {s : Str} -``` -Thus someone may by accident use "and" as an adverb! - - -#NEW - -==Forcing the type checker to act as a grammar checker== - -We just have to make linearization types unique for each category. - -The technique is reminiscent of Haskell's ``newtype`` but uses -records instead: we add **lock fields** e.g. -``` - lincat Conj = {s : Str ; lock_Conj : {}} - lincat Adv = {s : Str ; lock_Adv : {}} -``` -Thanks to record subtyping, the translation is simple: -``` - fun f : C1 -> ... -> Cn -> C - lin f = t - - ---> - - oper f : C1* -> ... -> Cn* -> C* = - \x1,...,xn -> (t x1 ... xn) ** {lock_C = {}} -``` - -#NEW - -==Things to do== - -Better compression of gfc file format. - -Type checking of dependent-type pattern matching in abstract syntax. - -Compilation-related modules that need rewriting -- ``ReadFiles``: clarify the logic of dependencies -- ``Compile``: clarify the logic of what to do with each module -- ``Compute``: make the evaluation more efficient -- ``Parsing/*``, ``OldParsing/*``, ``Conversion/*``: reduce the number - of parser formats and algorithms |
