996
|
1 INTERCAL IMPLEMENTOR'S NOTES
|
|
2 by ESR (and AIS where noted)
|
|
3
|
|
4 The C-INTERCAL compiler has a very conventional implementation using YACC and
|
|
5 LEX (latterly, bison and flex). It generates C, which is then passed to your C
|
|
6 compiler.
|
|
7
|
|
8 Lexical issues
|
|
9
|
|
10 Note that the spectacular ugliness of INTERCAL syntax requires that the lexical
|
|
11 analyzer have two levels. One, embedded in the input() function, handles the
|
|
12 backquote and bang constructs, and stashes the input line away in a buffer for
|
|
13 the splat construct's benefit. The upper level is generated by lex(1) and does
|
|
14 normal tokenizing for YACC.
|
|
15
|
|
16 In order to splatter erroneous statements correctly, the generated code has to
|
|
17 track both logical and physical lines. That's the reason for the `lineno'
|
|
18 variable in the generated C, it's actually tracking physical lines.
|
|
19
|
|
20 Numeral tokens for input are defined in a symbol table (numerals.c) that is
|
|
21 directly included in the run-time library module (cesspool.c). This avoids
|
|
22 having to put the the size of the numerals array in an extern. To add new
|
|
23 numeral tokens, simply put them in the numerals initializer.
|
|
24
|
|
25 AIS: Now computed COME FROM is in the language, the lexer is called forth to do
|
|
26 a bit of syntax analysis, by distinguishing it from an ordinary COME FROM
|
|
27 and passing different symbols to the parser. The lexer is also responsible
|
|
28 for helping the parser to match sparks and ears in complicated array
|
|
29 situations, by distinguishing possibly closing sparkears from definitely
|
|
30 opening sparkears.
|
|
31
|
|
32 Compilation
|
|
33
|
|
34 The parser builds an array of tuples, one for each INTERCAL statement. Most
|
|
35 tuples have node trees attached. Once all tuples have been generated, the
|
|
36 compile-time checker and optimizer phases can do consistency checks and
|
|
37 expression-tree rewrites. Finally, the tuples are ground out as C code by the
|
|
38 emit() function.
|
|
39
|
|
40 Calculations are fully type-checked at compile time; they have to be because
|
|
41 (as I read the manual) the 16- and 32-bit versions of the unary ops do
|
|
42 different things. The only potential problem here is that the typechecker has
|
|
43 to assume that :m ~ :n has the type of :n (32-bit) even though the result might
|
|
44 fit in 16 bits. At run-time everything is calculated in 32 bits. When
|
|
45 INTERCAL-72 was designed 32 bits was expensive; now it's cheap. Really, the
|
|
46 only reason for retaining a 16-bit type at all is for the irritation value of
|
|
47 it (yes, C-INTERCAL *does* enforce the 16-bit limit on constants).
|
|
48
|
|
49 Labels are mapped to tuple indices (logical line numbers) in the code checker,
|
|
50 just before optimization.
|
|
51
|
|
52 The optimizer does full recursive folding of all constant expressions at
|
|
53 compile time (evaluating away all the irritating little kluges you have to use
|
|
54 to generate 32-bit constants). It also checks for known INTERCAL idioms for
|
|
55 `test for equality', `test for nonzeroness', and the C logical operators &, |,
|
|
56 ^, and ~.
|
|
57
|
|
58 AIS: The optimizer has now been expanded. See below.
|
|
59
|
|
60 Code Generation
|
|
61
|
|
62 Each line of INTERCAL is translated into a C if()-then; the guard part is used
|
|
63 to implement abstentions and RESUMES, and the arm part translates the `body' of
|
|
64 the corresponding INTERCAL statement.
|
|
65
|
|
66 The generated C code is plugged into the template file ick-wrap.c inside
|
|
67 main(). It needs to be linked with cesspool.o, fiddle.o and lose.o (these are
|
|
68 in libick.a, with the support for runtime switches, arrgghh.o). Cesspool.o is
|
|
69 the code that implements the storage manager; fiddle.o implements the INTERCAL
|
|
70 operators; and lose.o is the code that generates INTERCAL's error messages.
|
|
71 The routine arrgghh.o parses the runtime command line arguments.
|
|
72
|
|
73 The abstain[] array in the generated C is used to track line and label
|
|
74 abstentions; if member i is on, the statement on line i is being abstained
|
|
75 from. If gerund abstentions/reinstatements are present in the code, a second
|
|
76 array recording the type of each statement in generated into the runtime, and
|
|
77 used to ensure that these operations are translated into abstention-guard
|
|
78 changes on all appropriate line numbers.
|
|
79
|
|
80 RESUMES are implemented with a branch to a generated switch statement that
|
|
81 executes a goto to the appropriate label. If there are no RESUMES, no such
|
|
82 switch is generated.
|
|
83
|
|
84 The compiler places a simple label at the location of each COME FROM in the
|
|
85 program, while all of the machinery for checking for abstention and conditional
|
|
86 execution and actually performing the jump is placed immediately after the code
|
|
87 for the target statement.
|
|
88
|
|
89 Credits
|
|
90
|
|
91 I wrote the first version of this compiler over a weekend using a pre-ANSI C
|
|
92 compiler. It worked, but it wasn't pretty. Louis Howell added the array
|
|
93 support later; he also torture-tested the COME FROM implementation by actually
|
|
94 using it for the life2.i program included in this distribution, and fixed some
|
|
95 bugs. Brian Raiter did a much-needed delinting while porting it for ANSI C,
|
|
96 flex and Linux. He also improved the lexical analyzer's line tracking.
|
|
97
|
|
98 Code Generation 2 (by AIS)
|
|
99
|
|
100 The recent additions to the language have made code generation potentially more
|
|
101 complicated, depending on which switches are on. Each INTERCAL command is still
|
|
102 translated into an if()-then, but with extra guards around it. COME FROM is
|
|
103 still handled the same way, but only in singlethreaded programs with no
|
|
104 computed COME FROMs. For anything more complicated than that, a general COMING
|
|
105 FROM mechanism is invoked. C's most confusing control-flow identifiers, setjmp
|
|
106 and longjmp, are used. When computed COME FROMs are involved, after every
|
|
107 labeled statement, a set of gotos are done to check each in turn. A jmp_buf,
|
|
108 cjb, is used to store the point in the program where the check started. Each
|
|
109 COME FROM leading to that line and computed COME FROM anywhere in the program
|
|
110 is checked, and modifies cjb to aim for its suckpoint if neccesary. In the case
|
|
111 of a multithreaded program, one cjb will be stored in a linked ring of threads,
|
|
112 and the other will be the one in the main program. Extra guards are also added
|
|
113 for ONCE/AGAIN; the position of this guard depends on whether the command is a
|
|
114 NEXT or something else. ONCE/AGAIN are quite legal even in singlethreaded
|
|
115 programs. Computed ABSTAIN is dealt with by allowing the abstain[] array to
|
|
116 hold values other than 0 or 1. Multithreading is supported by a new library,
|
|
117 libickmt.a, to hold the multithread versions of libick.a functions. See
|
|
118 unravel.c for an explanation of the full gory details of the multithread
|
|
119 implementation (it has many comments explaining the implementation at the top).
|
|
120
|
|
121 Optimisation (by AIS)
|
|
122
|
|
123 When I got my C-INTERCAL distribution, it only had a rudimentary
|
|
124 optimiser. Constant folding had already been done, and the following
|
|
125 optimisations. The following is mixed INTERCAL and C, sometimes in the same
|
|
126 statement. It should be clear from context and whether I have used INTERCAL
|
|
127 spark-ears or C brackets which language is which, although in some places I've
|
|
128 mixed them. Note that the optimizer is only run in binary.
|
|
129
|
|
130 Optimizations available with -O
|
|
131
|
|
132 Optimisations that had already been done
|
|
133
|
|
134 Constant folding
|
|
135 'Vx$y'~"#0x55555555" into (x | y)
|
|
136 '?x$y'~"#0x55555555" into (x ^ y)
|
|
137 '&x$y'~"#0x55555555" into (x & y)
|
|
138 (x ^ 0xFFFF) or (x ^ 0xFFFFFFFF) into ~x
|
|
139
|
|
140 I added:
|
|
141
|
|
142 [This list has been removed as it got out of date; see src/idiotism.oil for the
|
|
143 new list. Most of the idioms are due to me, a few are by Joris Huizer.]
|
|
144
|
|
145 Optimizations available with -f (I wrote all these)
|
|
146
|
|
147 Guard removal (if a line can't be abstained, the if(!abstained[x]) line is
|
|
148 removed when possible)
|
|
149
|
|
150 Ignorance checks (if a variable can't be ignored and it can't overflow,
|
|
151 assignment is done with C's = rather than a function)
|
|
152
|
|
153 Recognition of NEXT...NEXT...computed RESUME as an idiom for if() (currently
|
|
154 disabled because the rest of the optimiser can't handle it)
|
|
155
|
|
156 Optimizations available with -F (I wrote this one)
|
|
157
|
|
158 This optimization should mean that INTERCAL suddenly becomes the best language
|
|
159 for various benchmark programs. Basically, the file is analysed to try to
|
|
160 determine whether it takes no input and always produces the same output. If
|
|
161 that is the case, the output is simply a shell-script (with executable
|
|
162 permissions set and a #! line) that outputs the correct output! Therefore, the
|
|
163 time required for one run of the program is taken up during compilation, and
|
|
164 whenever the program is executed, it just copies out the predetermined
|
|
165 output. This means (for instance) that primes.i runs in only a fraction of a
|
|
166 second when optimized with -F. Of course, this tends to slow down the compiler
|
|
167 and often produces a large object file, which is why it has a command-line
|
|
168 option of its own.
|