summaryrefslogtreecommitdiff
path: root/doc/compiling-gf.txt
diff options
context:
space:
mode:
Diffstat (limited to 'doc/compiling-gf.txt')
-rw-r--r--doc/compiling-gf.txt750
1 files changed, 0 insertions, 750 deletions
diff --git a/doc/compiling-gf.txt b/doc/compiling-gf.txt
deleted file mode 100644
index 9e438f40f..000000000
--- a/doc/compiling-gf.txt
+++ /dev/null
@@ -1,750 +0,0 @@
-Compiling GF
-Aarne Ranta
-Proglog meeting, 1 November 2006
-
-% to compile: txt2tags -thtml compiling-gf.txt ; htmls compiling-gf.html
-
-%!target:html
-%!postproc(html): #NEW <!-- NEW -->
-
-#NEW
-
-==The compilation task==
-
-GF is a grammar formalism, i.e. a special purpose programming language
-for writing grammars.
-
-Other grammar formalisms:
-- BNF, YACC, Happy (grammars for programming languages);
-- PATR, HPSG, LFG (grammars for natural languages).
-
-
-The grammar compiler prepares a GF grammar for two computational tasks:
-- linearization: take syntax trees to strings
-- parsing: take strings to syntax trees
-
-
-The grammar gives a declarative description of these functionalities,
-on a high abstraction level that improves grammar writing
-productivity.
-
-For efficiency, the grammar is compiled to lower-level formats.
-
-Type checking is another essential compilation phase. Its purpose is
-twofold, as usual:
-- checking the correctness of the grammar
-- type-annotating expressions for code generation
-
-
-#NEW
-
-==Characteristics of GF language==
-
-Functional language with types, both built-in and user-defined.
-```
- Str : Type
-
- param Number = Sg | Pl
-
- param AdjForm = ASg Gender | APl
-
- Noun : Type = {s : Number => Str ; g : Gender}
-```
-Pattern matching.
-```
- svart_A = table {
- ASg _ => "svart" ;
- _ => "svarta"
- }
-```
-Higher-order functions.
-
-Dependent types.
-```
- flip : (a, b, c : Type) -> (a -> b -> c) -> b -> a -> c =
- \_,_,_,f,y,x -> f x y ;
-```
-
-
-#NEW
-
-==The module system of GF==
-
-Main division: abstract syntax and concrete syntax
-```
- abstract Greeting = {
- cat Greet ;
- fun Hello : Greet ;
- }
-
- concrete GreetingEng of Greeting = {
- lincat Greet = {s : Str} ;
- lin Hello = {s = "hello"} ;
- }
-
- concrete GreetingIta of Greeting = {
- param Politeness = Familiar | Polite ;
- lincat Greet = {s : Politeness => Str} ;
- lin Hello = {s = table {
- Familiar => "ciao" ;
- Polite => "buongiorno"
- } ;
- }
-```
-Other features of the module system:
-- extension and opening
-- parametrized modules (cf. ML: signatures, structures, functors)
-
-
-
-
-#NEW
-
-==GF vs. Haskell==
-
-Some things that (standard) Haskell hasn't:
-- records and record subtyping
-- regular expression patterns
-- dependent types
-- ML-style modules
-
-
-Some things that GF hasn't:
-- infinite (recursive) data types
-- recursive functions
-- classes, polymorphism
-
-
-#NEW
-
-==GF vs. most linguistic grammar formalisms==
-
-GF separates abstract syntax from concrete syntax.
-
-GF has a module system with separate compilation.
-
-GF is generation-oriented (as opposed to parsing).
-
-GF has unidirectional matching (as opposed to unification).
-
-GF has a static type system (as opposed to a type-free universe).
-
-"I was - and I still am - firmly convinced that a program composed
-out of statically type-checked parts is more likely to faithfully
-express a well-thought-out design than a program relying on
-weakly-typed interfaces or dynamically-checked interfaces."
-(B. Stroustrup, 1994, p. 107)
-
-
-
-#NEW
-
-==The computation model: abstract syntax==
-
-An abstract syntax defines a free algebra of trees (using
-dependent types, recursion, higher-order abstract syntax:
-GF includes a complete Logical Framework).
-```
- cat C (x_1 : A_1)...(x_n : A_n)
- a_1 : A_1
- ...
- a_n : A_n{x_1 : A_1,...,x_n-1 : A_n-1}
- ----------------------------------------------------
- (C a_1 ... a_n) : Type
-
-
- fun f : (x_1 : A_1) -> ... -> (x_n : A_n) -> A
- a_1 : A_1
- ...
- a_n : A_n{x_1 : A_1,...,x_n-1 : A_n-1}
- ----------------------------------------------------
- (f a_1 ... a_n) : A{x_1 : A_1,...,x_n : A_n}
-
-
- A : Type x : A |- B : Type x : A |- b : B f : (x : A) -> B a : A
- ---------------------------- ---------------------- ------------------------
- (x : A) -> B : Type \x -> b : (x : A) -> B f a : B{x := A}
-```
-Notice that all syntax trees are in eta-long form.
-
-
-#NEW
-
-==The computation model: concrete syntax==
-
-A concrete syntax defines a homomorphism (compositional mapping)
-from the abstract syntax to a system of concrete syntax objects.
-```
- cat C _
- --------------------
- lincat C = C* : Type
-
- fun f : (x_1 : A_1) -> ... -> (x_n : A_n) -> A
- -----------------------------------------------
- lin f = f* : A_1* -> ... -> A_n* -> A*
-
- (f a_1 ... a_n)* = f* a_1* ... a_n*
-```
-The homomorphism can as such be used as linearization function.
-
-It is a functional program, but a restricted one, since it works
-in the end on finite data structures only.
-
-But a more efficient program is obtained via compilation to
-GFC = Canonical GF: the "machine code" of GF.
-
-The parsing problem of GFC can be reduced to that of MPCFG (Multiple
-Parallel Context Free Grammars), see P. Ljunglöf's thesis (2004).
-
-
-
-#NEW
-
-==The core type system of concrete syntax: basic types==
-
-```
- param P P : PType
- PType : Type --------- ---------
- P : PType P : Type
-
- s : Str t : Str
- Str : type "foo" : Str [] : Str ----------------
- s ++ t : Str
-```
-
-
-#NEW
-
-==The core type system of concrete syntax: functions and tables==
-
-```
- A : Type x : A |- B : Type x : A |- b : B f : (x : A) -> B a : A
- ---------------------------- ---------------------- ------------------------
- (x : A) -> B : Type \x -> b : (x : A) -> B f a : B{x := A}
-
-
- P : PType A : Type t : P => A p : p
- -------------------- -----------------
- P => A : Type t ! p : A
-
- v_1,...,v_n : A
- ---------------------------------------------- P = {C_1,...,C_n}
- table {C_1 => v_1 ; ... ; C_n => v_n} : P => A
-```
-Pattern matching is treated as an abbreviation for tables. Notice that
-```
- case e of {...} == table {...} ! e
-```
-
-
-#NEW
-
-==The core type system of concrete syntax: records==
-
-```
- A_1,...,A_n : Type
- ------------------------------------ n >= 0
- {r_1 : A_1 ; ... ; r_n : A_n} : Type
-
-
- a_1 : A_1 ... a_n : A_n
- ------------------------------------------------------------
- {r_1 = a_1 ; ... ; r_n = a_n} : {r_1 : A_1 ; ... ; r_n : A_n}
-
-
- r : {r_1 : A_1 ; ... ; r_n : A_n}
- ----------------------------------- i = 1,...,n
- r.r_1 : A_1
-```
-Subtyping: if ``r : R`` then ``r : R ** {r : A}``
-
-
-
-#NEW
-
-==Computation rules==
-
-```
- (\x -> b) a = b{x := a}
-
- (table {C_1 => v_1 ; ... ; C_n => v_n} : P => A) ! C_i = v_i
-
- {r_1 = a_1 ; ... ; r_n = a_n}.r_i = a_i
-```
-
-
-
-#NEW
-
-==Canonical GF==
-
-Concrete syntax type system:
-```
- A_1 : Type ... A_n : Type
- Str : Type Int : Type ------------------------- $i : A
- [A_1, ..., A_n] : Type
-
-
- a_1 : A_1 ... a_n : A_n t : [A_1, ..., A_n]
- --------------------------------- ------------------- i = 1,..,n
- [a_1, ..., a_n] : [A_1, ..., A_n] t ! i : A_i
-```
-Tuples represent both records and tables.
-
-There are no functions.
-
-Linearization:
-```
- lin f = f*
-
- (f a_1 ... a_n)* = f*{$1 = a_1*, ..., $n = a_n*}
-```
-
-
-#NEW
-
-==The compilation task, again==
-
-1. From a GF source grammar, derive a canonical GF grammar.
-
-2. From the canonical GF grammar derive an MPCFG grammar
-
-The canonical GF grammar can be used for linearization, with
-linear time complexity (w.r.t. the size of the tree).
-
-The MPCFG grammar can be used for parsing, with (unbounded)
-polynomial time complexity (w.r.t. the size of the string).
-
-For these target formats, we have also built interpreters in
-different programming languages (C, C++, Haskell, Java, Prolog).
-
-Moreover, we generate supplementary formats such as grammars
-required by various speech recognition systems.
-
-
-#NEW
-
-==An overview of compilation phases==
-
-Legend:
-- ellipse node: representation saved in a file
-- plain text node: internal representation
-- solid arrow or ellipse: essential phare or format
-- dashed arrow or ellipse: optional phase or format
-- arrow label: the module implementing the phase
-
-
-[gf-compiler.png]
-
-
-#NEW
-
-==Using the compiler==
-
-Batch mode (cf. GHC).
-
-Interactive mode, building the grammar incrementally from
-different files, with the possibility of testing them
-(cf. GHCI).
-
-The interactive mode was first, built on the model of ALF-2
-(L. Magnusson), and there was no file output of compiled
-grammars.
-
-
-#NEW
-
-==Modules and separate compilation==
-
-The above diagram shows what happens to each module.
-(But not quite, since some of the back-end formats must be
-built for sets of modules: GFCC and the parser formats.)
-
-When the grammar compiler is called, it has a main module as its
-argument. It then builds recursively a dependency graph with all
-the other modules, and decides which ones must be recompiled.
-The behaviour is rather similar to GHC.
-
-Separate compilation is //extremely important// when developing
-big grammars, especially when using grammar libraries. Example: compiling
-the GF resource grammar library takes 5 minutes, whereas reading
-in the compiled image takes 10 seconds.
-
-
-#NEW
-
-==Module dependencies and recompilation==
-
-(For later use, not for the Proglog talk)
-
-For each module M, there are 3 kinds of files:
-- M.gf, source file
-- M.gfc, compiled file ("object file")
-- M.gfr, type-checked and optimized source file (for resource modules only)
-
-
-The compiler reads gf files and writes gfc files (and gfr files if appropriate)
-
-The Main module is the one used as argument when calling GF.
-
-A module M (immediately) depends on the module K, if either
-- M is a concrete of K
-- M is an instance of K
-- M extends K
-- M opens K
-- M is a completion of K with something
-- M is a completion of some module with K instantiated with something
-
-
-A module M (transitively) depends on the module K, if either
-- M immediately depends on K
-- M depends on some L such that L immediately depends on K
-
-
-Immediate dependence is readable from the module header without parsing
-the whole module.
-
-The compiler reads recursively the headers of all modules that Main depends on.
-
-These modules are arranged in a dependency graph, which is checked to be acyclic.
-
-To decide whether a module M has to be compiled, do:
-+ Get the time stamps t() of M.gf and M.gfc (if a file doesn't exist, its
- time is minus infinity).
-+ If t(M.gf) > t(M.gfc), M must be compiled.
-+ If M depends on K and K must be compiled, then M must be compiled.
-+ If M depends on K and t(K.gf) > t(M.gfc), then M must be compiled.
-
-
-Decorate the dependency graph by information on whether the gf or the gfc (and gfr)
-format is to be read.
-
-Topologically sort the decorated graph, and read each file in the chosen format.
-
-The gfr file is generated for these module types only:
-- resource
-- instance
-
-
-When reading K.gfc, also K.gfr is read if some M depending on K has to be compiled.
-In other cases, it is enough to read K.gfc.
-
-In an interactive GF session, some modules may be in memory already.
-When read to the memory, each module M is given time stamp t(M.m).
-The additional rule now is:
-- If M.gfc is to be read, and t(M.m) > t(M.gfc), don't read M.gfc.
-
-
-
-
-#NEW
-
-==Techniques used==
-
-The compiler is written in Haskell, with some C foreign function calls
-in the interactive version (readline, killing threads).
-
-BNFC is used for generating both the parsers and printers.
-This has helped to make the formats portable.
-
-"Almost compositional functions" (``composOp``) are used in
-many compiler passes, making them easier to write and understand.
-A ``grep`` on the sources reveals 40 uses (outside the definition
-of ``composOp`` itself).
-
-The key algorithmic ideas are
-- type-driven partial evaluation in GF-to-GFC generation
-- common subexpression elimination as back-end optimization
-- some ideas in GFC-to-MCFG encoding
-
-
-#NEW
-
-==Type-driven partial evaluation==
-
-Each abstract syntax category in GF has a corresponding linearization type:
-```
- cat C
- lincat C = T
-```
-The general form of a GF rule pair is
-```
- fun f : C1 -> ... -> Cn -> C
- lin f = t
-```
-with the typing condition following the ``lincat`` definitions
-```
- t : T1 -> ... -> Tn -> T
-```
-The term ``t`` is in general built by using abstraction methods such
-as pattern matching, higher-order functions, local definitions,
-and library functions.
-
-The compilation technique proceeds as follows:
-- use eta-expansion on ``t`` to determine the canonical form of the term
-```
- \ $C1, ...., $Cn -> (t $C1 .... $Cn)
-```
-with unique variables ``$C1 .... $Cn`` for the arguments; repeat this
-inside the term for records and tables
-- evaluate the resulting term using the computation rules of GF
-- what remains is a canonical term with ``$C1 .... $Cn`` the only
-variables (the run-time input of the linearization function)
-
-
-#NEW
-
-==Eta-expanding records and tables==
-
-For records that are valied via subtyping, eta expansion
-eliminates superfluous fields:
-```
- {r1 = t1 ; r2 = t2} : {r1 : T1} ----> {r1 = t1}
-```
-For tables, the effect is always expansion, since
-pattern matching can be used to represent tables
-compactly:
-```
- table {n => "fish"} : Number => Str --->
-
- table {
- Sg => "fish" ;
- Pl => "fish"
- }
-```
-This can be helped by back-end optimizations (see below).
-
-
-#NEW
-
-==Eliminating functions==
-
-"Everything is finite": parameter types, records, tables;
-finite number of string tokens per grammar.
-
-But "inifinite types" such as function types are useful when
-writing grammars, to enable abstractions.
-
-Since function types do not appear in linearization types,
-we want functions to be eliminated from linearization terms.
-
-This is similar to the **subformula property** in logic.
-Also the main problem is similar: function depending on
-a run-time variable,
-```
- (table {P => f ; Q = g} ! x) a
-```
-This is not a redex, but we can make it closer to one by moving
-the application inside the table,
-```
- table {P => f a ; Q = g a} ! x
-```
-This transformation is the same as Prawitz's (1965) elimination
-of maximal segments in natural deduction:
-```
- A B
- C -> D C C -> D C
- A B --------- ---------
- A v B C -> D C -> D A v B D D
- --------------------- ===> -------------------------
- C -> D C D
- --------------------
- D
-```
-
-
-
-#NEW
-
-==Size effects of partial evaluation==
-
-Irrelevant table branches are thrown away, which can reduce the size.
-
-But, since tables are expanded and auxiliary functions are inlined,
-the size can grow exponentially.
-
-How can we keep the first property and eliminate the second?
-
-
-#NEW
-
-==Parametrization of tables==
-
-Algorithm: for each branch in a table, consider replacing the
-argument by a variable:
-```
- table { table {
- P => t ; ---> x => t[P->x] ;
- Q => u x => u[Q->x]
- } }
-```
-If the resulting branches are all equal, you can replace the table
-by a lambda abstract
-```
- \\x => t[P->x]
-```
-If each created variable ``x`` is unique in the grammar, computation
-with the lambda abstract is efficient.
-
-
-
-#NEW
-
-==Course-of-values tables==
-
-By maintaining a canonical order of parameters in a type, we can
-eliminate the left hand sides of branches.
-```
- table { table T [
- P => t ; ---> t ;
- Q => u u
- } ]
-```
-The treatment is similar to ``Enum`` instances in Haskell.
-
-In the end, all parameter types can be translated to
-initial segments of integers.
-
-
-#NEW
-
-==Common subexpression elimination==
-
-Algorithm:
-+ Go through all terms and subterms in a module, creating
- a symbol table mapping terms to the number of occurrences.
-+ For each subterm appearing at least twice, create a fresh
- constant defined as that subterm.
-+ Go through all rules (incl. rules for the new constants),
- replacing largest possible subterms with such new constants.
-
-
-This algorithm, in a way, creates the strongest possible abstractions.
-
-In general, the new constants have open terms as definitions.
-But since all variables (and constants) are unique, they can
-be computed by simple replacement.
-
-
-
-#NEW
-
-==Size effects of optimizations==
-
-Example: the German resource grammar
-``LangGer``
-
-|| optimization | lines | characters | size % | blow-up |
-| none | 5394 | 3208435 | 100 | 25 |
-| all | 5394 | 750277 | 23 | 6 |
-| none_subs | 5772 | 1290866 | 40 | 10 |
-| all_subs | 5644 | 414119 | 13 | 3 |
-| gfcc | 3279 | 190004 | 6 | 1.5 |
-| gf source | 3976 | 121939 | 4 | 1 |
-
-
-Optimization "all" means parametrization + course-of-values.
-
-The source code size is an estimate, since it includes
-potentially irrelevant library modules, and comments.
-
-The GFCC format is not reusable in separate compilation.
-
-
-
-#NEW
-
-==The shared prefix optimization==
-
-This is currently performed in GFCC only.
-
-The idea works for languages that have a rich morphology
-based on suffixes. Then we can replace a course of values
-with a pair of a prefix and a suffix set:
-```
- ["apa", "apan", "apor", "aporna"] --->
- ("ap" + ["a", "an", "or", "orna"])
-```
-The real gain comes via common subexpression elimination:
-```
- _34 = ["a", "an", "or", "orna"]
- apa = ("ap" + _34)
- blomma = ("blomm" + _34)
- flicka = ("flick" + _34)
-```
-Notice that it now matters a lot how grammars are written.
-For instance, if German verbs are treated as a one-dimensional
-table,
-```
- ["lieben", "liebe", "liebst", ...., "geliebt", "geliebter",...]
-```
-no shared prefix optimization is possible. A better form is
-separate tables for non-"ge" and "ge" forms:
-```
- [["lieben", "liebe", "liebst", ....], ["geliebt", "geliebter",...]]
-```
-
-
-#NEW
-
-==Reuse of grammars as libraries==
-
-The idea of resource grammars: take care of all aspects of
-surface grammaticality (inflection, agreement, word order).
-
-Reuse in application grammar: via translations
-```
- cat C ---> oper C : Type = T
- lincat C = T
-
- fun f : A ---> oper f : A* = t
- lin f = t
-```
-The user only needs to know the type signatures (abstract syntax).
-
-However, this does not quite guarantee grammaticality, because
-different categories can have the same lincat:
-```
- lincat Conj = {s : Str}
- lincat Adv = {s : Str}
-```
-Thus someone may by accident use "and" as an adverb!
-
-
-#NEW
-
-==Forcing the type checker to act as a grammar checker==
-
-We just have to make linearization types unique for each category.
-
-The technique is reminiscent of Haskell's ``newtype`` but uses
-records instead: we add **lock fields** e.g.
-```
- lincat Conj = {s : Str ; lock_Conj : {}}
- lincat Adv = {s : Str ; lock_Adv : {}}
-```
-Thanks to record subtyping, the translation is simple:
-```
- fun f : C1 -> ... -> Cn -> C
- lin f = t
-
- --->
-
- oper f : C1* -> ... -> Cn* -> C* =
- \x1,...,xn -> (t x1 ... xn) ** {lock_C = {}}
-```
-
-#NEW
-
-==Things to do==
-
-Better compression of gfc file format.
-
-Type checking of dependent-type pattern matching in abstract syntax.
-
-Compilation-related modules that need rewriting
-- ``ReadFiles``: clarify the logic of dependencies
-- ``Compile``: clarify the logic of what to do with each module
-- ``Compute``: make the evaluation more efficient
-- ``Parsing/*``, ``OldParsing/*``, ``Conversion/*``: reduce the number
- of parser formats and algorithms