<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.sf.net">
<TITLE>Resource grammar writing HOWTO</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<P ALIGN="center"><CENTER><H1>Resource grammar writing HOWTO</H1>
<FONT SIZE="4">
<I>Author: Aarne Ranta &lt;aarne (at) cs.chalmers.se&gt;</I><BR>
Last update: Mon Sep 22 14:28:01 2008
</FONT></CENTER>
<P></P>
<HR NOSHADE SIZE=1>
<P></P>
<UL>
<LI><A HREF="#toc1">The resource grammar structure</A>
<UL>
<LI><A HREF="#toc2">Library API modules</A>
<LI><A HREF="#toc3">Phrase category modules</A>
<LI><A HREF="#toc4">Infrastructure modules</A>
<LI><A HREF="#toc5">Lexical modules</A>
</UL>
<LI><A HREF="#toc6">Language-dependent syntax modules</A>
<UL>
<LI><A HREF="#toc7">The present-tense fragment</A>
</UL>
<LI><A HREF="#toc8">Phases of the work</A>
<UL>
<LI><A HREF="#toc9">Putting up a directory</A>
<LI><A HREF="#toc10">Direction of work</A>
<LI><A HREF="#toc11">The develop-test cycle</A>
<LI><A HREF="#toc12">Auxiliary modules</A>
<LI><A HREF="#toc13">Morphology and lexicon</A>
<LI><A HREF="#toc14">Lock fields</A>
<LI><A HREF="#toc15">Lexicon construction</A>
</UL>
<LI><A HREF="#toc16">Lexicon extension</A>
<UL>
<LI><A HREF="#toc17">The irregularity lexicon</A>
<LI><A HREF="#toc18">Lexicon extraction from a word list</A>
<LI><A HREF="#toc19">Lexicon extraction from raw text data</A>
<LI><A HREF="#toc20">Bootstrapping with smart paradigms</A>
</UL>
<LI><A HREF="#toc21">Extending the resource grammar API</A>
<LI><A HREF="#toc22">Using parametrized modules</A>
<UL>
<LI><A HREF="#toc23">Writing an instance of parametrized resource grammar implementation</A>
<LI><A HREF="#toc24">Parametrizing a resource grammar implementation</A>
</UL>
<LI><A HREF="#toc25">Character encoding and transliterations</A>
<LI><A HREF="#toc26">Coding conventions in GF</A>
<LI><A HREF="#toc27">Transliterations</A>
</UL>
<P></P>
<HR NOSHADE SIZE=1>
<P></P>
<P>
<B>History</B>
</P>
<P>
September 2008: updated for Version 1.5.
</P>
<P>
October 2007: updated for Version 1.2.
</P>
<P>
January 2006: first version.
</P>
<P>
The purpose of this document is to explain how to implement the GF
resource grammar API for a new language. We will <I>not</I> cover how
to use the resource grammar, nor how to change the API. But we
will give some hints on how to extend the API.
</P>
<P>
A manual for using the resource grammar is found in
</P>
<P>
<A HREF="../lib/resource/doc/synopsis.html"><CODE>www.cs.chalmers.se/Cs/Research/Language-technology/GF/lib/resource/doc/synopsis.html</CODE></A>.
</P>
<P>
A tutorial on GF, also introducing the idea of resource grammars, is found in
</P>
<P>
<A HREF="./gf-tutorial.html"><CODE>www.cs.chalmers.se/Cs/Research/Language-technology/GF/doc/gf-tutorial.html</CODE></A>.
</P>
<P>
This document concerns the API v. 1.5, while the current stable release is 1.4.
You can find the code for the stable release in
</P>
<P>
<A HREF="../lib/resource"><CODE>www.cs.chalmers.se/Cs/Research/Language-technology/GF/lib/resource/</CODE></A>
</P>
<P>
and the next release in
</P>
<P>
<A HREF="../next-lib/src"><CODE>www.cs.chalmers.se/Cs/Research/Language-technology/GF/next-lib/src/</CODE></A>
</P>
<P>
It is recommended to build new grammars to match the next release.
</P>
<A NAME="toc1"></A>
<H2>The resource grammar structure</H2>
<P>
The library is divided into a bunch of modules, whose dependencies
are given in the following figure.
</P>
<P>
<IMG ALIGN="left" SRC="Syntax.png" BORDER="0" ALT="">
</P>
<P>
Modules of different kinds are distinguished as follows:
</P>
<UL>
<LI>solid contours: module seen by end users
<LI>dashed contours: internal module
<LI>ellipse: abstract/concrete pair of modules
<LI>rectangle: resource or instance
<LI>diamond: interface
</UL>
<P>
Put in another way:
</P>
<UL>
<LI>solid rectangles and diamonds: user-accessible library API
<LI>solid ellipses: user-accessible top-level grammar for parsing and linearization
<LI>dashed contours: not visible to users
</UL>
<P>
The dashed ellipses form the main parts of the implementation, which the resource
grammar programmer has to work on. She also has to work on the <CODE>Paradigms</CODE>
module. The rest of the modules can be produced mechanically from corresponding
modules for other languages, by just changing the language codes appearing in
their module headers.
</P>
<P>
The module structure is rather flat: most modules are direct
parents of <CODE>Grammar</CODE>. The idea
is that the implementors can concentrate on one linguistic aspect at a time, or
also distribute the work among several authors. The module <CODE>Cat</CODE>
defines the "glue" that ties the aspects together - a type system
to which all the other modules conform, so that e.g. <CODE>NP</CODE> means
the same thing in those modules that use <CODE>NP</CODE>s and those that
construct them.
</P>
<A NAME="toc2"></A>
<H3>Library API modules</H3>
<P>
For the user of the library, these modules are the most important ones.
In a typical application, it is enough to open <CODE>Paradigms</CODE> and <CODE>Syntax</CODE>.
The module <CODE>Try</CODE> combines these two, making it possible to experiment
with combinations of syntactic and lexical constructors by using the
<CODE>cc</CODE> command in the GF shell. Here are short explanations of each API module:
</P>
<UL>
<LI><CODE>Try</CODE>: the whole resource library for a language (<CODE>Paradigms</CODE>, <CODE>Syntax</CODE>,
<CODE>Irreg</CODE>, and <CODE>Extra</CODE>);
produced mechanically as a collection of modules
<LI><CODE>Syntax</CODE>: language-independent categories, syntax functions, and structural words;
produced mechanically as a collection of modules
<LI><CODE>Constructors</CODE>: language-independent syntax functions and structural words;
produced mechanically via functor instantiation
<LI><CODE>Paradigms</CODE>: language-dependent morphological paradigms
</UL>
<A NAME="toc3"></A>
<H3>Phrase category modules</H3>
<P>
The immediate parents of <CODE>Grammar</CODE> will be called <B>phrase category modules</B>,
since each of them concentrates on a particular phrase category (nouns, verbs,
adjectives, sentences,...). A phrase category module tells
<I>how to construct phrases in that category</I>. You will find out that
all functions in any of these modules have the same value type (or maybe
one of a small number of different types). Thus we have
</P>
<UL>
<LI><CODE>Noun</CODE>: construction of nouns and noun phrases
<LI><CODE>Adjective</CODE>: construction of adjectival phrases
<LI><CODE>Verb</CODE>: construction of verb phrases
<LI><CODE>Adverb</CODE>: construction of adverbial phrases
<LI><CODE>Numeral</CODE>: construction of cardinal and ordinal numerals
<LI><CODE>Sentence</CODE>: construction of sentences and imperatives
<LI><CODE>Question</CODE>: construction of questions
<LI><CODE>Relative</CODE>: construction of relative clauses
<LI><CODE>Conjunction</CODE>: coordination of phrases
<LI><CODE>Phrase</CODE>: construction of the major units of text and speech
<LI><CODE>Text</CODE>: construction of texts as sequences of phrases
<LI><CODE>Idiom</CODE>: idiomatic expressions such as existentials
</UL>
<A NAME="toc4"></A>
<H3>Infrastructure modules</H3>
<P>
Expressions of each phrase category are constructed in the corresponding
phrase category module. But their <I>use</I> mostly takes place in other modules.
For instance, noun phrases, which are constructed in <CODE>Noun</CODE>, are
used as arguments of functions of almost all other phrase category modules.
How can we build all these modules independently of each other?
</P>
<P>
As usual in typeful programming, the <I>only</I> thing you need to know
about an object you use is its type. When writing a linearization rule
for a GF abstract syntax function, the only thing you need to know is
the linearization types of its value and argument categories. To achieve
the division of the resource grammar into several parallel phrase category modules,
what we need is an underlying definition of the linearization types. This
definition is given as the implementation of
</P>
<UL>
<LI><CODE>Cat</CODE>: syntactic categories of the resource grammar
</UL>
<P>
Any resource grammar implementation has first to agree on how to implement
<CODE>Cat</CODE>. Luckily enough, even this can be done incrementally: you
can skip the <CODE>lincat</CODE> definition of a category and use the default
<CODE>{s : Str}</CODE> until you need to change it to something else. In
English, for instance, many categories do have this linearization type.
</P>
<A NAME="toc5"></A>
<H3>Lexical modules</H3>
<P>
What is lexical and what is syntactic is not as clear-cut in GF as in
some other grammar formalisms. Logically, lexical means atom, i.e. a
<CODE>fun</CODE> with no arguments. Linguistically, one may add to this
that the <CODE>lin</CODE> consists of only one token (or of a table whose values
are single tokens). Even in the restricted lexicon included in the resource
API, the latter rule is sometimes violated in some languages. For instance,
<CODE>Structural.both7and_DConj</CODE> is an atom, but its linearization is
two words, e.g. <I>both - and</I>.
</P>
<P>
Another characterization of lexical is that lexical units can be added
almost <I>ad libitum</I>, and they cannot be defined in terms of already
given rules. The lexical modules of the resource API are thus more like
samples than complete lists. There are two such modules:
</P>
<UL>
<LI><CODE>Structural</CODE>: structural words (determiners, conjunctions,...)
<LI><CODE>Lexicon</CODE>: basic everyday content words (nouns, verbs,...)
</UL>
<P>
The module <CODE>Structural</CODE> aims for completeness, and is likely to
be extended in future releases of the resource. The module <CODE>Lexicon</CODE>
gives a "random" list of words, which enables testing the syntax.
It also provides a check list for morphology, since those words are likely to include
most morphological patterns of the language.
</P>
<P>
In the case of <CODE>Lexicon</CODE> it may come out clearer than anywhere else
in the API that it is impossible to give exact translation equivalents in
different languages on the level of a resource grammar. This is no problem,
since application grammars can use the resource in different ways for
different languages.
</P>
<A NAME="toc6"></A>
<H2>Language-dependent syntax modules</H2>
<P>
In addition to the common API, there is room for language-dependent extensions
of the resource. The top level of each language looks as follows (with German
as example):
</P>
<PRE>
abstract AllGerAbs = Lang, ExtraGerAbs, IrregGerAbs
</PRE>
<P>
where <CODE>ExtraGerAbs</CODE> is a collection of syntactic structures specific to German,
and <CODE>IrregGerAbs</CODE> is a dictionary of irregular words of German
(at the moment, just verbs). Each of these language-specific grammars has
the potential to grow into a full-scale grammar of the language. These grammars
can also be used as libraries, but the possibility of using functors is lost.
</P>
<P>
To give a better overview of language-specific structures,
modules like <CODE>ExtraGerAbs</CODE>
are built from a language-independent module <CODE>ExtraAbs</CODE>
by restricted inheritance:
</P>
<PRE>
abstract ExtraGerAbs = Extra [f,g,...]
</PRE>
<P>
Thus any category and function in <CODE>Extra</CODE> may be shared by a subset of all
languages. One can see this set-up as a matrix, which tells
what <CODE>Extra</CODE> structures
are implemented in what languages. For the common API in <CODE>Grammar</CODE>, the matrix
is filled with 1's (everything is implemented in every language).
</P>
<P>
In a minimal resource grammar implementation, the language-dependent
extensions are just empty modules, but it is good to provide them for
the sake of uniformity.
</P>
<A NAME="toc7"></A>
<H3>The present-tense fragment</H3>
<P>
Some lines in the resource library are suffixed with the comment
</P>
<PRE>
--# notpresent
</PRE>
<P>
which is used by a preprocessor to exclude those lines from
a reduced version of the full resource. This present-tense-only
version is useful for applications in most technical text, since
it reduces the grammar size and compilation time. It can also
be useful to exclude those lines in a first version of a resource
implementation. To compile a present-tense-only grammar, use
</P>
<PRE>
make Present
</PRE>
<P>
with <CODE>resource/Makefile</CODE>.
</P>
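<P>
As a rough illustration of what such a preprocessor does, the following shell sketch
drops every line that carries the marker comment. It is a stand-in written under
assumptions, not the script shipped with the resource: the file <CODE>Demo.gf</CODE>,
its parameter constructors, and the function name <CODE>present_only</CODE> are made up
for the example.
</P>

```shell
#!/usr/bin/env bash
# Sketch of a "notpresent" filter (hypothetical names, illustration only).
set -eu

# Demo input in a scratch directory: one constructor is marked as non-present.
workdir=$(mktemp -d)
printf '%s\n' \
  'param VForm =' \
  '    VPresInd Number Person' \
  '  | VImpfInd Number Person --# notpresent' \
  '  ;' > "$workdir/Demo.gf"

# Print the file without the lines that carry the marker comment.
# The -e option lets the pattern start with "--".
present_only() {
    grep -v -e '--# notpresent' "$1"
}

present_only "$workdir/Demo.gf"
```

<P>
Because the filter works line by line, a marked rule has to fit on a single line
for this kind of exclusion to leave a syntactically valid module behind.
</P>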
<A NAME="toc8"></A>
<H2>Phases of the work</H2>
<A NAME="toc9"></A>
<H3>Putting up a directory</H3>
<P>
Unless you are writing an instance of a parametrized implementation
(Romance or Scandinavian), which will be covered later, the
simplest way is to follow roughly the following procedure. Assume you
are building a grammar for the German language. Here are the first steps,
which we actually followed ourselves when building the German implementation
of resource v. 1.0 on Ubuntu Linux. We have slightly modified them to
match resource v. 1.5 and GF v. 3.0.
</P>
<OL>
<LI>Create a sister directory for <CODE>GF/lib/resource/english</CODE>, named
<CODE>german</CODE>.
<PRE>
cd GF/lib/resource/
mkdir german
cd german
</PRE>
<P></P>
<LI>Look up the
<A HREF="http://www.w3.org/WAI/ER/IG/ert/iso639.htm">ISO 639 3-letter language code</A>
for German: both <CODE>Ger</CODE> and <CODE>Deu</CODE> are given, and we pick <CODE>Ger</CODE>.
(We use the 3-letter codes rather than the more common 2-letter codes,
since they will suffice for many more languages!)
<P></P>
<LI>Copy the <CODE>*Eng.gf</CODE> files from <CODE>english</CODE> to <CODE>german</CODE>,
and rename them:
<PRE>
cp ../english/*Eng.gf .
rename 's/Eng/Ger/' *Eng.gf
</PRE>
If you don't have the <CODE>rename</CODE> command, you can use a bash script with <CODE>mv</CODE>.
</OL>
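<P>
If the <CODE>rename</CODE> command is not available, the renaming can be sketched with
a plain loop over <CODE>mv</CODE>, as below. The three file names created in the scratch
directory are only there to make the example self-contained; in real use the loop
would run in the <CODE>german</CODE> directory after the copy step.
</P>

```shell
#!/usr/bin/env bash
# Sketch: rename *Eng.gf to *Ger.gf with mv, for systems without `rename`.
set -eu

# Demo setup in a scratch directory (file names chosen for illustration).
workdir=$(mktemp -d)
cd "$workdir"
touch CatEng.gf NounEng.gf LangEng.gf

# Strip the "Eng.gf" suffix from each name and append "Ger.gf" instead.
for f in *Eng.gf; do
    mv -- "$f" "${f%Eng.gf}Ger.gf"
done

ls
```

<P>
The suffix pattern only touches the language code at the end of the file name,
so an <CODE>Eng</CODE> occurring earlier in a name would be left alone.
</P>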
<OL>
<LI>Change the <CODE>Eng</CODE> module references to <CODE>Ger</CODE> references
in all files:
<PRE>
sed -i 's/English/German/g' *Ger.gf
sed -i 's/Eng/Ger/g' *Ger.gf
</PRE>
The first line prevents changing the word <CODE>English</CODE>, which appears
here and there in comments, to <CODE>Gerlish</CODE>. The <CODE>sed</CODE> command syntax
may vary depending on your operating system.
<P></P>
<LI>This may of course change unwanted occurrences of the
string <CODE>Eng</CODE> - verify this by
<PRE>
grep Ger *.gf
</PRE>
But you will have to make lots of manual changes in all files anyway!
<P></P>
<LI>Comment out the contents of these files:
<PRE>
sed -i 's/^/--/' *Ger.gf
</PRE>
This will give you a set of templates out of which the grammar
will grow as you uncomment and modify the files rule by rule.
<P></P>
<LI>In all <CODE>.gf</CODE> files, uncomment the module headers and brackets,
leaving the module bodies commented. Unfortunately, there is no
simple way to do this automatically (or to avoid commenting these
lines in the previous step) - but uncommenting the first
and the last lines will actually do the job for many of the files.
<P></P>
<LI>Uncomment the contents of the main grammar file:
<PRE>
sed -i 's/^--//' LangGer.gf
</PRE>
<P></P>
<LI>Now you can open the grammar <CODE>LangGer</CODE> in GF:
<PRE>
gf LangGer.gf
</PRE>
You will get lots of warnings on missing rules, but the grammar will compile.
<P></P>
<LI>At all the following steps you will now have a valid, but incomplete
GF grammar. The GF command
<PRE>
pg -missing
</PRE>
tells you what exactly is missing.
</OL>
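<P>
The step that uncomments the module headers and closing brackets can be approximated,
for files whose header fits on the first line and whose closing bracket is on the last
line, by the sketch below. The sample module content is hypothetical, and the sketch
writes to a temporary file instead of using <CODE>sed -i</CODE>, whose syntax differs
between systems. Files with multi-line headers still need manual editing.
</P>

```shell
#!/usr/bin/env bash
# Sketch: remove the leading "--" from the first and last line of each file.
set -eu

# Demo setup: a fully commented-out template (hypothetical content).
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' \
  '--concrete AdverbGer of Adverb = CatGer ** open ResGer, Prelude in {' \
  '--  lin' \
  '--    PositAdvAdj a = {s = a.s ! AAdv} ;' \
  '--}' > AdverbGer.gf

for f in *Ger.gf; do
    # Address 1 is the first line, address $ is the last line.
    sed -e '1s/^--//' -e '$s/^--//' "$f" > "$f.tmp"
    mv -- "$f.tmp" "$f"
done

cat AdverbGer.gf
```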
<P>
Here is the module structure of <CODE>LangGer</CODE>. It has been simplified by leaving out
the majority of the phrase category modules. Each of them has the same dependencies
as <CODE>VerbGer</CODE>, whose complete dependencies are shown as an example.
</P>
<P>
<IMG ALIGN="middle" SRC="German.png" BORDER="0" ALT="">
</P>
<A NAME="toc10"></A>
<H3>Direction of work</H3>
<P>
The real work starts now. There are many ways to proceed, the most obvious ones being
</P>
<UL>
<LI>Top-down: start from the module <CODE>Phrase</CODE> and go down to <CODE>Sentence</CODE>, then
<CODE>Verb</CODE>, <CODE>Noun</CODE>, and in the end <CODE>Lexicon</CODE>. In this way, you are all the time
building complete phrases, and adding more content to them as you proceed.
<B>This approach is not recommended</B>. It is impossible to test the rules if
you have no words to apply the constructions to.
<P></P>
<LI>Bottom-up: set as your first goal to implement <CODE>Lexicon</CODE>. To this end, you
need to write <CODE>ParadigmsGer</CODE>, which in turn needs parts of
<CODE>MorphoGer</CODE> and <CODE>ResGer</CODE>.
<B>This approach is not recommended</B>. You can get stuck on details of
morphology such as irregular words, and you don't have enough grasp of
the type system to decide what forms to cover in morphology.
</UL>
<P>
The practical working direction is thus a saw-like motion between the morphological
and top-level modules. Here is a possible course of the work that gives enough
test data and enough general view at any point:
</P>
<OL>
<LI>Define <CODE>Cat.N</CODE> and the required parameter types in <CODE>ResGer</CODE>. As we define
<PRE>
lincat N = {s : Number => Case => Str ; g : Gender} ;
</PRE>
we need the parameter types <CODE>Number</CODE>, <CODE>Case</CODE>, and <CODE>Gender</CODE>. The definition
of <CODE>Number</CODE> in <A HREF="../lib/resource/common/ParamX.gf"><CODE>common/ParamX</CODE></A>
works for German, so we
use it and just define <CODE>Case</CODE> and <CODE>Gender</CODE> in <CODE>ResGer</CODE>.
<P></P>
<LI>Define some cases of <CODE>mkN</CODE> in <CODE>ParadigmsGer</CODE>. In this way you can
already implement a large number of nouns correctly in <CODE>LexiconGer</CODE>. Actually
just adding the worst-case instance of <CODE>mkN</CODE> (the one taking the most
arguments) should suffice for every noun - but,
since it is tedious to use, you
might proceed to the next step before returning to morphology and defining the
real work horse, <CODE>mkN</CODE> taking two forms and a gender.
<P></P>
<LI>While doing this, you may want to test the resource independently. Do this by
starting the GF shell in the <CODE>resource</CODE> directory, by the commands
<PRE>
> i -retain german/ParadigmsGer
> cc -table mkN "Kirche"
</PRE>
<P></P>
<LI>Proceed to determiners and pronouns in
<CODE>NounGer</CODE> (<CODE>DetCN UsePron DetQuant NumSg DefArt IndefArt UseN</CODE>) and
<CODE>StructuralGer</CODE> (<CODE>i_Pron this_Quant</CODE>). You also need some categories and
parameter types. At this point, it is maybe not possible to find out the final
linearization types of <CODE>CN</CODE>, <CODE>NP</CODE>, <CODE>Det</CODE>, and <CODE>Quant</CODE>, but at least you should
be able to correctly inflect noun phrases such as <I>every airplane</I>:
<PRE>
> i german/LangGer.gf
> l -table DetCN every_Det (UseN airplane_N)
Nom: jedes Flugzeug
Acc: jedes Flugzeug
Dat: jedem Flugzeug
Gen: jedes Flugzeugs
</PRE>
<P></P>
<LI>Proceed to verbs: define <CODE>CatGer.V</CODE>, <CODE>ResGer.VForm</CODE>, and
<CODE>ParadigmsGer.mkV</CODE>. You may choose to exclude <CODE>notpresent</CODE>
cases at this point. But anyway, you will be able to inflect a good
number of verbs in <CODE>Lexicon</CODE>, such as
<CODE>live_V</CODE> (<CODE>mkV "leben"</CODE>).
<P></P>
<LI>Now you can soon form your first sentences: define <CODE>VP</CODE> and
<CODE>Cl</CODE> in <CODE>CatGer</CODE>, <CODE>VerbGer.UseV</CODE>, and <CODE>SentenceGer.PredVP</CODE>.
Even if you have excluded the tenses, you will be able to produce
<PRE>
> i -preproc=./mkPresent german/LangGer.gf
> l -table PredVP (UsePron i_Pron) (UseV live_V)
Pres Simul Pos Main: ich lebe
Pres Simul Pos Inv: lebe ich
Pres Simul Pos Sub: ich lebe
Pres Simul Neg Main: ich lebe nicht
Pres Simul Neg Inv: lebe ich nicht
Pres Simul Neg Sub: ich nicht lebe
</PRE>
You should also be able to parse:
<PRE>
> p -cat=Cl "ich lebe"
PredVP (UsePron i_Pron) (UseV live_V)
</PRE>
<P></P>
<LI>Transitive verbs
(<CODE>CatGer.V2 CatGer.VPSlash ParadigmsGer.mkV2 VerbGer.ComplSlash VerbGer.SlashV2a</CODE>)
are a natural next step, so that you can
produce <CODE>ich liebe dich</CODE> ("I love you").
<P></P>
<LI>Adjectives (<CODE>CatGer.A ParadigmsGer.mkA NounGer.AdjCN AdjectiveGer.PositA</CODE>)
will force you to think about strong and weak declensions, so that you can
correctly inflect <I>mein neuer Wagen, dieser neue Wagen</I>
("my new car, this new car").
<P></P>
<LI>Once you have implemented the set
(<CODE>Noun.DetCN Noun.AdjCN Verb.UseV Verb.ComplSlash Verb.SlashV2a Sentence.PredVP</CODE>),
you have overcome most of the difficulties. You know roughly what parameters
and dependencies there are in your language, and you can now proceed very
much in the order you please.
</OL>
<A NAME="toc11"></A>
<H3>The develop-test cycle</H3>
<P>
The following develop-test cycle will
be applied most of the time, both in the first steps described above
and in later steps where you are more on your own.
</P>
<OL>
<LI>Select a phrase category module, e.g. <CODE>NounGer</CODE>, and uncomment some
linearization rules (for instance, <CODE>DetCN</CODE>, as above).
<P></P>
<LI>Write down some German examples of this rule, for instance translations
of "the dog", "the house", "the big house", etc. Write these in all their
different forms (two numbers and four cases).
<P></P>
<LI>Think about the categories involved (<CODE>CN, NP, N, Det</CODE>) and the
variations they have. Encode this in the lincats of <CODE>CatGer</CODE>.
You may have to define some new parameter types in <CODE>ResGer</CODE>.
<P></P>
<LI>To be able to test the construction,
define some words you need to instantiate it
in <CODE>LexiconGer</CODE>. You will also need some regular inflection patterns
in <CODE>ParadigmsGer</CODE>.
<P></P>
<LI>Test by parsing, linearization,
and random generation. In particular, linearization to a table should
be used so that you see all forms produced; the <CODE>treebank</CODE> option
preserves the tree:
<PRE>
> gr -cat=NP -number=20 | l -table -treebank
</PRE>
<P></P>
<LI>Save some tree-linearization pairs for later regression testing. You can save
a gold standard treebank and use the Unix <CODE>diff</CODE> command to compare later
linearizations produced from the same list of trees. If you save the trees
in a file <CODE>trees</CODE>, you can do as follows:
<PRE>
> rf -file=trees -tree -lines | l -table -treebank | wf -file=treebank
</PRE>
<P></P>
<LI>A file with trees testing all resource functions is included in the resource,
entitled <CODE>resource/exx-resource.gft</CODE>. A treebank can be created from this by
the Unix command
<PRE>
% runghc Make.hs test langs=Ger
</PRE>
</OL>
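<P>
The regression-testing step of the cycle can be sketched as a small comparison helper
around <CODE>diff</CODE>. The function name <CODE>check_treebank</CODE> and the file
names <CODE>treebank.gold</CODE> and <CODE>treebank.new</CODE> are assumptions made for
this example, and the demo data is written by hand here; in real use the new treebank
would be produced by the GF pipeline shown in the list above.
</P>

```shell
#!/usr/bin/env bash
# Sketch: compare a freshly produced treebank with a saved gold standard.
set -u

# Prints a unified diff and returns non-zero when the two files differ.
check_treebank() {
    if diff -u "$1" "$2"; then
        echo "treebank unchanged"
    else
        echo "treebank differs from gold standard"
        return 1
    fi
}

# Demo setup with hand-written data (in real use, GF writes these files).
workdir=$(mktemp -d)
printf '%s\n' \
  'PredVP (UsePron i_Pron) (UseV live_V)' \
  'Pres Simul Pos Main: ich lebe' > "$workdir/treebank.gold"
cp "$workdir/treebank.gold" "$workdir/treebank.new"

check_treebank "$workdir/treebank.gold" "$workdir/treebank.new"
```

<P>
When a change in the grammar is intended, the gold standard is simply replaced by
the new treebank after the differences have been inspected.
</P>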
<P>
You are likely to run this cycle a few times for each linearization rule
you implement, and some hundreds of times altogether. There are roughly
70 <CODE>cat</CODE>s and
600 <CODE>fun</CODE>s in <CODE>Lang</CODE> at the moment; 170 of the <CODE>fun</CODE>s are outside the two
lexicon modules.
</P>
<A NAME="toc12"></A>
<H3>Auxiliary modules</H3>
<P>
These auxiliary <CODE>resource</CODE> modules will be written by you.
</P>
<UL>
<LI><CODE>ResGer</CODE>: parameter types and auxiliary operations
(a resource for the resource grammar!)
<LI><CODE>ParadigmsGer</CODE>: complete inflection engine and most important regular paradigms
<LI><CODE>MorphoGer</CODE>: auxiliaries for <CODE>ParadigmsGer</CODE> and <CODE>StructuralGer</CODE>. This need
not be separate from <CODE>ResGer</CODE>.
</UL>
<P>
These modules are language-independent and provided by the existing resource
package.
</P>
<UL>
<LI><CODE>ParamX</CODE>: parameter types used in many languages
<LI><CODE>CommonX</CODE>: implementation of language-uniform categories
such as <CODE>Text</CODE> and <CODE>Phr</CODE>, as well as of
the logical tense, anteriority, and polarity parameters
<LI><CODE>Coordination</CODE>: operations to deal with lists and coordination
<LI><CODE>Prelude</CODE>: general-purpose operations on strings, records,
truth values, etc.
<LI><CODE>Predef</CODE>: general-purpose operations with hard-coded definitions
</UL>
<P>
An important decision is what rules to implement in terms of operations in
<CODE>ResGer</CODE>. The <B>golden rule of functional programming</B> says:
</P>
<UL>
<LI><I>Whenever you find yourself programming by copy and paste, write a function instead!</I>
</UL>
<P>
This rule suggests that an operation should be created if it is to be
used at least twice. At the same time, a sound principle of <B>vicinity</B> says:
</P>
<UL>
<LI><I>It should not require too much browsing to understand what a piece of code does.</I>
</UL>
<P>
From these two principles, we have derived the following practice:
</P>
<UL>
<LI>If an operation is needed <I>in two different modules</I>,
it should be created as an <CODE>oper</CODE> in <CODE>ResGer</CODE>. An example is <CODE>mkClause</CODE>,
used in <CODE>Sentence</CODE>, <CODE>Question</CODE>, and <CODE>Relative</CODE>.
<LI>If an operation is needed <I>twice in the same module</I>, but never
outside, it should be created in the same module. Many examples are
found in <CODE>Numeral</CODE>.
<LI>If an operation is needed <I>twice in the same judgement</I>, but never
outside, it should be created by a <CODE>let</CODE> definition.
<LI>If an operation is only needed once, it should not be created as an <CODE>oper</CODE>,
but rather inlined. However, a <CODE>let</CODE> definition may well be in place just
to make the code readable.
Most functions in phrase category modules
are implemented in this way.
</UL>
<P>
This discipline is very different from the one followed in early
versions of the library (up to 0.9). We then valued the principle of
abstraction more than vicinity, creating layers of abstraction for
almost everything. This led in practice to the duplication of almost
all code on the <CODE>lin</CODE> and <CODE>oper</CODE> levels, and made the code
hard to understand and maintain.
</P>
<A NAME="toc13"></A>
<H3>Morphology and lexicon</H3>
<P>
The paradigms needed to implement
<CODE>LexiconGer</CODE> are defined in
<CODE>ParadigmsGer</CODE>.
This module provides high-level ways to define the linearization of
lexical items, of categories <CODE>N, A, V</CODE> and their complement-taking
variants.
</P>
<P>
For ease of use, the <CODE>Paradigms</CODE> modules follow a certain
naming convention. Thus they provide, for each lexical category such as <CODE>N</CODE>,
an overloaded function such as <CODE>mkN</CODE>, with the following cases:
</P>
<UL>
<LI>the worst-case construction of <CODE>N</CODE>. Its type signature
has the form
<PRE>
mkN : Str -> ... -> Str -> P -> ... -> Q -> N
</PRE>
with as many string and parameter arguments as can ever be needed to
construct an <CODE>N</CODE>.
<LI>the most regular cases, with just one string argument:
<PRE>
mkN : Str -> N
</PRE>
<LI>A language-dependent (small) set of functions to handle mild irregularities
and common exceptions.
</UL>
<P>
For the complement-taking variants, such as <CODE>V2</CODE>, we provide
</P>
<UL>
<LI>a case that takes a <CODE>V</CODE> and all necessary arguments, such
as case and preposition:
<PRE>
mkV2 : V -> Case -> Str -> V2 ;
</PRE>
<LI>a case that takes a <CODE>Str</CODE> and produces a transitive verb with the direct
object case:
<PRE>
mkV2 : Str -> V2 ;
</PRE>
<LI>A language-dependent (small) set of functions to handle common special cases,
such as transitive verbs that are not regular:
<PRE>
mkV2 : V -> V2 ;
</PRE>
</UL>
<P>
The golden rule for the design of paradigms is that
</P>
<UL>
<LI><I>The user of the library will only need function applications with constants and strings, never any records or tables.</I>
</UL>
<P>
The discipline of data abstraction moreover requires that the user of the resource
is not given access to parameter constructors, but only to constants that denote
them. This gives the resource grammarian the freedom to change the underlying
data representation if needed. It means that the <CODE>ParadigmsGer</CODE> module has
to define constants for those parameter types and constructors that
the application grammarian may need to use, e.g.
</P>
<PRE>
oper
Case : Type ;
nominative, accusative, genitive, dative : Case ;
</PRE>
<P>
These constants are defined in terms of parameter types and constructors
in <CODE>ResGer</CODE> and <CODE>MorphoGer</CODE>, which modules are not
visible to the application grammarian.
</P>
<A NAME="toc14"></A>
<H3>Lock fields</H3>
<P>
An important difference between <CODE>MorphoGer</CODE> and
<CODE>ParadigmsGer</CODE> is that the former uses "raw" record types
for word classes, whereas the latter uses category symbols defined in
<CODE>CatGer</CODE>. When these category symbols are used to denote
record types in a resource module, such as <CODE>ParadigmsGer</CODE>,
a <B>lock field</B> is added to the record, so that categories
with the same implementation are not confused with each other.
(This is inspired by the <CODE>newtype</CODE> discipline in Haskell.)
For instance, the lincats of adverbs and conjunctions are the same
in <CODE>CommonX</CODE> (and therefore in <CODE>CatGer</CODE>, which inherits it):
</P>
<PRE>
lincat Adv = {s : Str} ;
lincat Conj = {s : Str} ;
</PRE>
<P>
But when these category symbols are used to denote their linearization
types in a resource module, these definitions are translated to
</P>
<PRE>
oper Adv : Type = {s : Str ; lock_Adv : {}} ;
oper Conj : Type = {s : Str ; lock_Conj : {}} ;
</PRE>
<P>
In this way, the user of a resource grammar cannot confuse adverbs with
conjunctions. In other words, the lock fields force the type checker
to function as a grammaticality checker.
</P>
<P>
When the resource grammar is <CODE>open</CODE>ed in an application grammar, the
lock fields are never seen (except possibly in type error messages),
and the application grammarian should never write them herself. If she
has to do this, it is a sign that the resource grammar is incomplete, and
the proper way to proceed is to fix the resource grammar.
</P>
<P>
The resource grammarian has to provide the dummy lock field values
in her hidden definitions of constants in <CODE>Paradigms</CODE>. For instance,
</P>
<PRE>
mkAdv : Str -> Adv ;
-- mkAdv s = {s = s ; lock_Adv = <>} ;
</PRE>
<P></P>
<A NAME="toc15"></A>
<H3>Lexicon construction</H3>
<P>
The lexicon belonging to <CODE>LangGer</CODE> consists of two modules:
</P>
<UL>
<LI><CODE>StructuralGer</CODE>, structural words, built by using both
<CODE>ParadigmsGer</CODE> and <CODE>MorphoGer</CODE>.
<LI><CODE>LexiconGer</CODE>, content words, built by using <CODE>ParadigmsGer</CODE> only.
</UL>
<P>
The reason why <CODE>MorphoGer</CODE> has to be used in <CODE>StructuralGer</CODE>
is that <CODE>ParadigmsGer</CODE> does not contain constructors for closed
word classes such as pronouns and determiners. The reason why we
recommend <CODE>ParadigmsGer</CODE> for building <CODE>LexiconGer</CODE> is that
the coverage of the paradigms is thereby tested and that the
use of the paradigms in <CODE>LexiconGer</CODE> gives a good set of examples for
those who want to build new lexica.
</P>
<A NAME="toc16"></A>
<H2>Lexicon extension</H2>
<A NAME="toc17"></A>
<H3>The irregularity lexicon</H3>
<P>
It is useful in most languages to provide a separate module of irregular
verbs and other words which are difficult for a lexicographer
to handle. There is usually a limited number of such words, a
few hundred perhaps. Building such a lexicon separately also
makes it less important to cover <I>everything</I> by the
worst-case variants of the paradigms <CODE>mkV</CODE> etc.
</P>
<A NAME="toc18"></A>
<H3>Lexicon extraction from a word list</H3>
<P>
You can often find resources such as lists of
irregular verbs on the internet. For instance, the
Irregular German Verb page
previously found at
<CODE>http://www.iee.et.tu-dresden.de/~wernerr/grammar/verben_dt.html</CODE>
gives a list of verbs in the
traditional tabular format, which begins as follows:
</P>
<PRE>
backen (du bäckst, er bäckt) backte [buk] gebacken
befehlen (du befiehlst, er befiehlt; befiehl!) befahl (beföhle; befähle) befohlen
beginnen begann (begönne; begänne) begonnen
beißen biß gebissen
</PRE>
<P>
All you have to do is to write a suitable verb paradigm
</P>
<PRE>
irregV : (x1,_,_,_,_,x6 : Str) -> V ;
</PRE>
<P>
and a Perl or Python or Haskell script that transforms
the table to
</P>
<PRE>
backen_V = irregV "backen" "bäckt" "back" "backte" "backte" "gebacken" ;
befehlen_V = irregV "befehlen" "befiehlt" "befiehl" "befahl" "beföhle" "befohlen" ;
</PRE>
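<P>
As a minimal sketch of such a script, the Python function below converts one
row of the table into an <CODE>irregV</CODE> entry. The function name
<CODE>table_row_to_gf</CODE> is hypothetical, and the argument order
(infinitive, third person singular present, imperative, past, past subjunctive,
participle) is inferred from the two entries above. The defaults used when a row
lists no irregular present or subjunctive forms are assumptions, so the
generated entries still need manual review.
</P>
<PRE>
import re

def table_row_to_gf(row):
    """Turn one row of the verb table into a GF irregV entry (heuristic sketch)."""
    # Bracketed forms such as [buk] are alternatives that this sketch drops.
    row = re.sub(r"\[[^\]]*\]", " ", row)
    # A token is either a whole parenthesised group or a single word.
    tokens = re.findall(r"\([^)]*\)|\S+", row)
    infinitive, rest = tokens[0], tokens[1:]

    # Assumed defaults for rows that list no irregular present forms.
    stem = infinitive[:-2] if infinitive.endswith("en") else infinitive[:-1]
    present3, imperative = stem + "t", stem

    # Optional first group: "er ..." is the 3rd person, "...!" the imperative.
    if rest[0].startswith("("):
        for part in re.split(r"[,;]", rest.pop(0)[1:-1]):
            part = part.strip()
            if part.startswith("er "):
                present3 = part[3:]
            elif part.endswith("!"):
                imperative = part[:-1]

    past = rest.pop(0)

    # Optional second group: take the first listed subjunctive form.
    if rest[0].startswith("("):
        subjunctive = rest.pop(0)[1:-1].split(";")[0].strip()
    else:
        # Assumed default; strong verbs such as "beißen" need manual correction.
        subjunctive = past if past.endswith("e") else past + "e"

    participle = rest[0]
    forms = [infinitive, present3, imperative, past, subjunctive, participle]
    quoted = " ".join('"{}"'.format(form) for form in forms)
    return "{}_V = irregV {} ;".format(infinitive, quoted)


if __name__ == "__main__":
    print(table_row_to_gf("backen (du bäckst, er bäckt) backte [buk] gebacken"))
</PRE>
<P>
Applied to the first two rows of the table, the function reproduces the
<CODE>backen_V</CODE> and <CODE>befehlen_V</CODE> entries shown above.
</P>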
<P></P>
<P>
When using ready-made word lists, you should think about
copyright issues. All resource grammar material should
be provided under the GNU Lesser General Public License (LGPL).
</P>
<A NAME="toc19"></A>
<H3>Lexicon extraction from raw text data</H3>
<P>
This is a cheap technique to build a lexicon of thousands
of words, if text data is available in digital format.
See the <A HREF="http://www.cs.chalmers.se/~markus/extract/">Extract Homepage</A>
for details.
</P>
<A NAME="toc20"></A>
<H3>Bootstrapping with smart paradigms</H3>
<P>
This is another cheap technique, where you need as input a list of words with
part-of-speech marking. You initialize the lexicon by using the one-argument
<CODE>mkN</CODE> etc. paradigms, and add forms to those words that do not come out right.
This procedure is described in the paper
</P>
<P>
A. Ranta.
How predictable is Finnish morphology? An experiment on lexicon construction.
In J. Nivre, M. Dahllöf and B. Megyesi (eds),
<I>Resourceful Language Technology: Festschrift in Honor of Anna Sågvall Hein</I>,
University of Uppsala,
2008.
Available from the <A HREF="http://publications.uu.se/abstract.xsql?dbid=8933">series homepage</A>.
</P>
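<P>
The initialization step can be sketched in Python as follows. This is only an
illustration: the input format (one word and one part-of-speech tag per line,
separated by whitespace), the tag names, and the mapping in
<CODE>PARADIGMS</CODE> are assumptions, and the paradigm names other than
<CODE>mkN</CODE> must be checked against the <CODE>Paradigms</CODE> module of
the language in question.
</P>
<PRE>
# Hypothetical mapping from part-of-speech tags to one-argument paradigms.
PARADIGMS = {"N": "mkN", "A": "mkA", "V": "mkV"}

def lexicon_entry(word, tag):
    """Build one GF lexicon entry that applies the one-argument paradigm."""
    return '{0}_{1} = {2} "{0}" ;'.format(word, tag, PARADIGMS[tag])

def bootstrap(lines):
    """Convert lines of the form 'word TAG' into GF lexicon entries."""
    entries = []
    for line in lines:
        if not line.strip():
            continue  # skip empty lines in the word list
        word, tag = line.split()
        entries.append(lexicon_entry(word, tag))
    return entries


if __name__ == "__main__":
    for entry in bootstrap(["Hund N", "gehen V"]):
        print(entry)
</PRE>
<P>
The generated entries are then compiled and inspected, and only the words whose
inflection comes out wrong are rewritten by hand with more arguments.
</P>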
<A NAME="toc21"></A>
<H2>Extending the resource grammar API</H2>
<P>
Sooner or later it will happen that the resource grammar API
does not suffice for all applications. A common reason is
that it does not include idiomatic expressions in a given language.
The solution then is in the first place to build language-specific
extension modules, like <CODE>ExtraGer</CODE>.
</P>
<A NAME="toc22"></A>
<H2>Using parametrized modules</H2>
<A NAME="toc23"></A>
<H3>Writing an instance of a parametrized resource grammar implementation</H3>
<P>
Above we have looked at how a resource implementation is built by
the copy and paste method (from English to German), that is, formally
speaking, from scratch. A more elegant solution available for
families of languages such as Romance and Scandinavian is to
use parametrized modules. The advantages are
</P>
<UL>
<LI>theoretical: linguistic generalizations and insights
<LI>practical: maintainability improves with fewer components
</UL>
<P>
Here is a set of
<A HREF="http://www.cs.chalmers.se/~aarne/geocal2006.pdf">slides</A>
on the topic.
</P>
<A NAME="toc24"></A>
<H3>Parametrizing a resource grammar implementation</H3>
<P>
This is the most demanding form of resource grammar writing.
We do <I>not</I> recommend the method of parametrizing from the
beginning: it is easier to have one language first implemented
in the conventional way and then add another language of the
same family by parametrization. This means that the copy and
paste method is still used, but this time the differences
are put into an <CODE>interface</CODE> module.
</P>
<A NAME="toc25"></A>
<H2>Character encoding and transliterations</H2>
<P>
This section is relevant for languages using a non-ASCII character set.
</P>
<A NAME="toc26"></A>
<H2>Coding conventions in GF</H2>
<P>
From version 3.0, GF follows a simple encoding convention:
</P>
<UL>
<LI>GF source files may follow any encoding, such as isolatin-1 or UTF-8;
the default is isolatin-1, and UTF-8 must be indicated by the judgement
<PRE>
flags coding = utf8 ;
</PRE>
in each source module.
<LI>for internal processing, all characters are converted to 16-bit Unicode,
as the first step of grammar compilation guided by the <CODE>coding</CODE> flag
<LI>as the last step of compilation, all characters are converted to UTF-8
<LI>thus, GF object files (<CODE>gfo</CODE>) and the Portable Grammar Format (<CODE>pgf</CODE>)
are in UTF-8
</UL>
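<P>
The first step of this convention can be illustrated with a small Python
sketch. It is not GF's actual implementation: the function name
<CODE>decode_gf_source</CODE> is hypothetical, and the sketch only assumes what
is stated above, namely that a module is read as isolatin-1 unless it contains
the <CODE>coding = utf8</CODE> flag.
</P>
<PRE>
import re

def decode_gf_source(raw):
    """Decode the raw bytes of a GF module according to its coding flag."""
    # The flag is pure ASCII, so it can be searched for before decoding.
    flag = re.search(rb"flags\s+coding\s*=\s*utf8\s*;", raw)
    # Without the flag, fall back to the isolatin-1 default.
    return raw.decode("utf-8" if flag is not None else "latin-1")


if __name__ == "__main__":
    utf8_module = 'flags coding = utf8 ;\nlin x = "bäckt" ;'.encode("utf-8")
    latin_module = 'lin x = "bäckt" ;'.encode("latin-1")
    # Both sources decode to the same characters internally.
    print(decode_gf_source(utf8_module))
    print(decode_gf_source(latin_module))
</PRE>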
<P>
Most current resource grammars use isolatin-1 in the source, but this does
not affect their use in parallel with grammars written in other encodings.
In fact, a grammar can be put together from modules using different codings.
</P>
<P>
<B>Warning</B>. While string literals may contain any characters, identifiers
must be isolatin-1 letters (or digits, underscores, or dashes). This has to
do with the restrictions of the lexer tool that is used.
</P>
<A NAME="toc27"></A>
<H2>Transliterations</H2>
<P>
While UTF-8 is well supported by most web browsers, its use in terminals and
text editors may cause disappointment. Many grammarians therefore prefer to
use ASCII transliterations. GF 3.0beta2 provides the following built-in
transliterations:
</P>
<UL>
<LI>Arabic
<LI>Devanagari (Hindi)
<LI>Thai
</UL>
<P>
New transliterations can be defined in the GF source file
<A HREF="../src/GF/Text/Transliterations.hs"><CODE>GF/Text/Transliterations.hs</CODE></A>.
This file also gives instructions on how new ones are added.
</P>
<!-- html code generated by txt2tags 2.4 (http://txt2tags.sf.net) -->
<!-- cmdline: txt2tags -\-toc Resource-HOWTO.txt -->
</BODY></HTML>