1[comment {-*- tcl -*- doctools manpage}]
2[manpage_begin grammar::me_vm n 0.1]
3[copyright {2005 Andreas Kupries <andreas_kupries@users.sourceforge.net>}]
4[moddesc   {Grammar operations and usage}]
5[titledesc {Virtual machine for parsing token streams}]
6[category  {Grammars and finite automata}]
7[description]
8[keywords {virtual machine}]
9[keywords parsing grammar]
10
11Please go and read the document [syscmd grammar::me_intro] first for
12an overview of the various documents and their relations.
13
14[para]
15
16This document specifies a virtual machine for the controlled matching
17and parsing of token streams, creating an
18
19[term {abstract syntax tree}] (short [term AST]) reflecting the
20structure of the input. Special machine features are the caching and
21reuse of partial results, caching of the encountered input, and the
22ability to backtrack in both input and AST creation.
23
24[para]
25
26These features make the specified virtual machine especially useful to
27packrat parsers based on parsing expression grammars. It is however
28not restricted to this type of parser. Normal LL and LR parsers can be
29implemented with it as well.
30
31[para]
32
33The following sections will discuss first the abstract state kept by
34ME virtual machines, and then their instruction set.
35
36
37[section {MACHINE STATE}]
38
39A ME virtual machine manages the following state:
40
41[list_begin definitions]
42[def "[term {Current token}] CT"]
43
44The token from the input under consideration by the machine.
45
46[para]
47
48This information is used and modified by the instructions defined in
49the section
50
51[sectref {TERMINAL MATCHING}].
52
53
54[def "[term {Current location}] CL"]
55
56The location of the [term {current token}] in the input stream, as
57offset relative to the beginning of the stream. The first token is
58considered to be at offset [const 0].
59
60[para]
61
62This information is implicitly used and modified by the instructions
63defined in the sections
64
65[sectref {TERMINAL MATCHING}] and
66[sectref {NONTERMINAL MATCHING}],
67
68and can be directly queried and modified by the instructions defined
69in section
70
71[sectref {INPUT LOCATION HANDLING}].
72
73
74[def "[term {Location stack}] LS"]
75
76In addition to the above a stack of locations, for backtracking.
77Locations can put on the stack, removed from it, and removed with
78setting the current location.
79
80[para]
81
82This information is implicitly used and modified by the instructions
83defined in the sections
84
85[sectref {TERMINAL MATCHING}] and
86[sectref {NONTERMINAL MATCHING}],
87
88and can be directly queried and modified by the instructions defined
89in section
90
91[sectref {INPUT LOCATION HANDLING}].
92
93
94[def "[term {Match status}] OK"]
95
96A boolean value, the result of the last attempt at matching input.
97It is set to [const true] if that attempt was successful, and
98[const false] otherwise.
99
100[para]
101
102This information is influenced by the instructions defined in the
103sections
104
105[sectref {TERMINAL MATCHING}],
106[sectref {NONTERMINAL MATCHING}], and
107[sectref {UNCONDITIONAL MATCHING}].
108
109It is queried by the instructions defined in the section
110
111[sectref {CONTROL FLOW}].
112
113
114[def "[term {Semantic value}] SV"]
115
116The semantic value associated with (generated by) the last attempt at
117matching input. Contains either the empty string or a node for the
118abstract syntax tree constructed from the input.
119
120[para]
121
122This information is influenced by the instructions defined in the
123sections
124
125[sectref {SEMANTIC VALUES}], and
126[sectref {AST STACK HANDLING}].
127
128
129[def "[term {AST stack}] AS"]
130
131A stack of partial abstract syntax trees constructed by the machine
132during matching.
133
134[para]
135
136This information is influenced by the instructions defined in the
137sections
138
139[sectref {SEMANTIC VALUES}], and
140[sectref {AST STACK HANDLING}].
141
142
143[def "[term {AST Marker stack}] MS"]
144
145In addition to the above a stack of stacks, for backtracking. This is
146actually a stack of markers into the AST stack, thus implicitly
147snapshooting the state of the AST stack at some point in time. Markers
148can be put on the stack, dropped from it, and used to roll back the
149AST stack to an earlier state.
150
151[para]
152
153This information is influenced by the instructions defined in the
154sections
155
156[sectref {SEMANTIC VALUES}], and
157[sectref {AST STACK HANDLING}].
158
159
160[def "[term {Error status}] ER"]
161
162Error information associated with the last attempt at matching
163input. Contains either the empty string or a list of 2 elements, a
164location in the input and a list of error messages associated with
165it, in this order.
166
167[para]
168
169[emph Note] that error information can be set even if the last attempt
170at matching input was successful. For example the *-operator (matching
171a sub-expression zero or more times) in a parsing expression grammar
172is always successful, even if it encounters a problem further in the
173input and has to backtrack. Such problems must not be forgotten when
174continuing to match.
175
176[para]
177
178This information is queried and influenced by the instructions defined
179in the sections
180
181[sectref {TERMINAL MATCHING}],
182[sectref {NONTERMINAL MATCHING}], and
183[sectref {ERROR HANDLING}].
184
185
186[def "[term {Error stack}] ES"]
187
188In addition to the above a stack of error information, to allow the
189merging of current and older error information when performing
190backtracking in choices after an unsucessful match.
191
192[para]
193
194This information is queried and influenced by the instructions defined
195in the sections
196
197[sectref {TERMINAL MATCHING}],
198[sectref {NONTERMINAL MATCHING}], and
199[sectref {ERROR HANDLING}].
200
201
202[def "[term {Return stack}] RS"]
203
204A stack of program counter values, i.e. locations in the code
205controlling the virtual machine, for the management of subroutine
206calls, i.e. the matching of nonterminal symbols.
207
208[para]
209
210This information is queried and influenced by the instructions defined
211in the section
212
213[sectref {NONTERMINAL MATCHING}].
214
215
216[def "[term {Nonterminal cache}] NC"]
217
218A cache of machine states (A 4-tuple containing a location in the
219input, match status [term OK], semantic value [term SV], and error
220status [term ER]) keyed by name of nonterminal symbol and location in
221the input stream.
222
223[para]
224
225The key location is where machine started the attempt to match the
226named nonterminal symbol, and the location in the value is where
227machine ended up after the attempt completed, independent of the
228success of the attempt.
229
230[para]
231
232This status is queried and influenced by the instructions defined in
233the section
234
235[sectref {NONTERMINAL MATCHING}].
236
237[list_end]
238
239
240[section {MACHINE INSTRUCTIONS}]
241
242With the machine state specified it is now possible to explain the
243instruction set of ME virtual machines. They are grouped roughly by
244the machine state they influence and/or query.
245
246
247[subsection {TERMINAL MATCHING}]
248
249First the instructions to match tokens from the input stream, and
250by extension all terminal symbols.
251
252[para]
253
254These instructions are the only ones which may retrieve a new token
255from the input stream. This is a [emph may] and not a [emph will]
256because the instructions will a retrieve new token if, and only if the
257current location [term CL] is at the head of the stream.
258
259If the machine has backtracked (see [cmd icl_rewind]) the instructions
260will retrieve the token to compare against from the internal cache.
261
262[para]
263[list_begin definitions]
264
265[def "[cmd ict_advance] [arg message]"]
266
267This instruction tries to advance to the next token in the input
268stream, i.e. the one after the current location [term CL]. The
269instruction will fail if, and only if the end of the input stream is
270reached, i.e. if there is no next token.
271
272[para]
273
274The sucess/failure of the instruction is remembered in the match
275status [term OK]. In the case of failure the error status [term ER] is
276set to the current location and the message [arg message].
277
278In the case of success the error status [term ER] is cleared, the new
279token is made the current token [term CT], and the new location is
280made the current location [term CL].
281
282[para]
283
284The argument [arg message] is a reference to the string to put into
285the error status [term ER], if such is needed.
286
287
288[def "[cmd ict_match_token] [arg tok] [arg message]"]
289
290This instruction tests the current token [term CT] for equality
291with the argument [arg tok] and records the result in the match
292status [term OK]. The instruction fails if the current token is
293not equal to [arg tok].
294
295[para]
296
297In case of failure the error status [term ER] is set to the current
298location [term CL] and the message [arg message], and the
299current location [term CL] is moved one token backwards.
300
301Otherwise, i.e. upon success, the error status [term ER] is cleared
302and the current location [term CL] is not touched.
303
304
305[def "[cmd ict_match_tokrange] [arg tokbegin] [arg tokend] [arg message]"]
306
307This instruction tests the current token [term CT] for being in
308the range of tokens from [arg tokbegin] to [arg tokend]
309(inclusive) and records the result in the match status [term OK]. The
310instruction fails if the current token is not inside the range.
311
312[para]
313
314In case of failure the error status [term ER] is set to the current
315location [term CL] and the message [arg message], and the current location
316[term CL] is moved one token backwards.
317
318Otherwise, i.e. upon success, the error status [term ER] is cleared
319and the current location [term CL] is not touched.
320
321
322[def "[cmd ict_match_tokclass] [arg code] [arg message]"]
323
324This instruction tests the current token [term CT] for being a member
325of the token class [arg code] and records the result in the match
326status [term OK]. The instruction fails if the current token is not a
327member of the specified class.
328
329[para]
330
331In case of failure the error status [term ER] is set to the current
332location [term CL] and the message [arg message], and the
333current location [term CL] is moved one token backwards.
334
335Otherwise, i.e. upon success, the error status [term ER] is cleared
336and the current location [term CL] is not touched.
337
338[para]
339
340Currently the following classes are legal:
341
342[list_begin definitions]
343[def alnum]
344A token is accepted if it is a unicode alphabetical character, or a digit.
345[def alpha]
346A token is accepted if it is a unicode alphabetical character.
347[def digit]
348A token is accepted if it is a unicode digit character.
349[def xdigit]
350A token is accepted if it is a hexadecimal digit character.
351[def punct]
352A token is accepted if it is a unicode punctuation character.
353[def space]
354A token is accepted if it is a unicode space character.
355[list_end]
356
357[list_end]
358[para]
359
360
361[subsection {NONTERMINAL MATCHING}]
362
363The instructions in this section handle the matching of nonterminal
364symbols. They query the nonterminal cache [term NC] for saved
365information, and put such information into the cache.
366
367[para]
368
369The usage of the cache is a performance aid for backtracking parsers,
370allowing them to avoid an expensive rematch of complex nonterminal
371symbols if they have been encountered before.
372
373[para]
374
375[list_begin definitions]
376
377[def "[cmd inc_restore] [arg branchlabel] [arg nt]"]
378
379This instruction checks if the nonterminal cache [term NC] contains
380information about the nonterminal symbol [arg nt], at the current
381location [term CL]. If that is the case the instruction will update
382the machine state (current location [term CL], match status [term OK],
383semantic value [term SV], and error status [term ER]) with the found
384information and continue execution at the instruction refered to by
385the [arg branchlabel]. The new current location [term CL] will be the
386last token matched by the nonterminal symbol, i.e. belonging to it.
387
388[para]
389
390If no information was found the instruction will continue execution at
391the next instruction.
392
393[para]
394
395Together with [cmd icf_ntcall] it is possible to generate code for
396memoized and non-memoized matching of nonterminal symbols, either as
397subroutine calls, or inlined in the caller.
398
399
400[def "[cmd inc_save] [arg nt]"]
401
402This instruction saves the current state of the machine (current
403location [term CL], match status [term OK], semantic value [term SV],
404and error status [term ER]), to the nonterminal cache [term NC]. It
405will also pop an entry from the location stack [term LS] and save it
406as the start location of the match.
407
408[para]
409
410It is expected to be called at the end of matching a nonterminal
411symbol, with [arg nt] the name of the nonterminal symbol the code was
412working on. This allows the instruction [cmd inc_restore] to check for
413and retrieve the data, should we have to match this nonterminal symbol
414at the same location again, during backtracking.
415
416
417[def "[cmd icf_ntcall] [arg branchlabel]"]
418
419This instruction invokes the code for matching the nonterminal symbol
420[arg nt] as a subroutine. To this end it stores the current program
421counter [term PC] on the return stack [term RS], the current location
422[term CL] on the location stack [term LS], and then continues
423execution at the address [arg branchlabel].
424
425[para]
426
427The next matching [cmd icf_ntreturn] will cause the execution to
428continue at the instruction coming after the call.
429
430
431[def [cmd icf_ntreturn]]
432
433This instruction will pop an entry from the return stack [term RS],
434assign it to the program counter [term PC], and then continue
435execution at the new address.
436
437[list_end]
438[para]
439
440
441[subsection {UNCONDITIONAL MATCHING}]
442
443The instructions in this section are the remaining match
444operators. They change the match status [term OK] directly and
445unconditionally.
446
447[list_begin definitions]
448
449[def [cmd iok_ok]]
450
451This instruction sets the match status [term OK] to [const true],
452indicating a successful match.
453
454
455[def [cmd iok_fail]]
456
457This instruction sets the match status [term OK] to [const false],
458indicating a failed match.
459
460
461[def [cmd iok_negate]]
462
463This instruction negates the match status [term OK], turning a failure
464into a success and vice versa.
465
466[list_end]
467[para]
468
469
470[subsection {CONTROL FLOW}]
471
472The instructions in this section implement both conditional and
473unconditional control flow. The conditional jumps query the match
474status [term OK].
475
476[list_begin definitions]
477
478[def "[cmd icf_jalways] [arg branchlabel]"]
479
480This instruction sets the program counter [term PC] to the address
481specified by [arg branchlabel] and then continues execution from
482there. This is an unconditional jump.
483
484
485[def "[cmd icf_jok] [arg branchlabel]"]
486
487This instruction sets the program counter [term PC] to the address
488specified by [arg branchlabel]. This happens if, and only if the match
489status [term OK] indicates a success. Otherwise it simply continues
490execution at the next instruction. This is a conditional jump.
491
492
493[def "[cmd icf_jfail] [arg branchlabel]"]
494
495This instruction sets the program counter [term PC] to the address
496specified by [arg branchlabel]. This happens if, and only if the match
497status [term OK] indicates a failure. Otherwise it simply continues
498execution at the next instruction. This is a conditional jump.
499
500
501[def [cmd icf_halt]]
502
503This instruction halts the machine and blocks any further execution.
504
505[list_end]
506
507
508[subsection {INPUT LOCATION HANDLING}]
509
510The instructions in this section are for backtracking, they manipulate
511the current location [term CL] of the machine state.
512
513They allow a user of the machine to query and save locations in the
514input, and to rewind the current location [term CL] to saved
515locations, making them one of the components enabling the
516implementation of backtracking parsers.
517
518[list_begin definitions]
519
520[def [cmd icl_push]]
521
522This instruction pushes a copy of the current location [term CL] on
523the location stack [term LS].
524
525
526[def [cmd icl_rewind]]
527
528This instruction pops an entry from the location stack [term LS] and
529then moves the current location [term CL] back to this point in the
530input.
531
532
533[def [cmd icl_pop]]
534
535This instruction pops an entry from the location stack [term LS] and
536discards it.
537
538[list_end]
539[para]
540
541
542[subsection {ERROR HANDLING}]
543
544The instructions in this section provide read and write access to the
545error status [term ER] of the machine.
546
547[list_begin definitions]
548
549[def [cmd ier_push]]
550
551This instruction pushes a copy of the current error status [term ER]
552on the error stack [term ES].
553
554
555[def [cmd ier_clear]]
556
557This instruction clears the error status [term ER].
558
559
560[def "[cmd ier_nonterminal] [arg message]"]
561
562This instruction checks if the error status [term ER] contains an
563error whose location is just past the location found in the top entry
564of the location stack [term LS].
565
566Nothing happens if no such error is found.
567
568Otherwise the found error is replaced by an error at the location
569found on the stack, having the message [arg message].
570
571
572[def [cmd ier_merge]]
573
574This instruction pops an entry from the error stack [term ES], merges
575it with the current error status [term ER] and stores the result of
576the merge as the new error status [term ER].
577
578[para]
579
580The merge is performed as described below:
581
582[para]
583
584If one of the two error states is empty the other is chosen. If
585neither error state is empty, and refering to different locations,
586then the error state with the location further in the input is
587chosen. If both error states refer to the same location their messages
588are merged (with removing duplicates).
589
590[list_end]
591
592
593[subsection {SEMANTIC VALUES}]
594
595The instructions in this section manipulate the semantic value
596[term SV].
597
598[list_begin definitions]
599
600[def [cmd isv_clear]]
601
602This instruction clears the semantic value [term SV].
603
604
605[def [cmd isv_terminal]]
606
607This instruction creates a terminal AST node for the current token
608[term CT], makes it the semantic value [term SV], and also pushes the
609node on the AST stack [term AS].
610
611
612[def "[cmd isv_nonterminal_leaf] [arg nt]"]
613
614This instruction creates a nonterminal AST node without any children
615for the nonterminal [arg nt], and makes it the semantic value
616[term SV].
617
618[para]
619
620This instruction should be executed if, and only if the match status
621[term OK] indicates a success.
622
623In the case of a failure [cmd isv_clear] should be called.
624
625
626[def "[cmd isv_nonterminal_range] [arg nt]"]
627
628This instruction creates a nonterminal AST node for the nonterminal
629
630[arg nt], with a single terminal node as its child, and makes this AST
631the semantic value [term SV]. The terminal node refers to the input
632string from the location found on top of the location stack [term LS]
633to the current location [term CL] (both inclusive).
634
635[para]
636
637This instruction should be executed if, and only if the match status
638[term OK] indicates a success.
639
640In the case of a failure [cmd isv_clear] should be called.
641
642
643[def "[cmd isv_nonterminal_reduce] [arg nt]"]
644
645This instruction creates a nonterminal AST node for the nonterminal
646[arg nt] and makes it the semantic value [term SV].
647
648[para]
649
650All entries on the AST stack [term AS] above the marker found in the
651top entry of the AST Marker stack [term MS] become children of the new
652node, with the entry at the stack top becoming the rightmost child. If
653the AST Marker stack [term MS] is empty the whole stack is used. The
654AST marker stack [term MS] is left unchanged.
655
656[para]
657
658This instruction should be executed if, and only if the match status
659[term OK] indicates a success.
660
661In the case of a failure [cmd isv_clear] should be called.
662
663[list_end]
664[para]
665
666
667[subsection {AST STACK HANDLING}]
668
669The instructions in this section manipulate the AST stack [term AS],
670and the AST Marker stack [term MS].
671
672[list_begin definitions]
673
674[def [cmd ias_push]]
675
676This instruction pushes the semantic value [term SV] on the AST stack
677[term AS].
678
679
680[def [cmd ias_mark]]
681
682This instruction pushes a marker for the current state of the AST
683stack [term AS] on the AST Marker stack [term MS].
684
685
686[def [cmd ias_mrewind]]
687
688This instruction pops an entry from the AST Marker stack [term MS] and
689then proceeds to pop entries from the AST stack [term AS] until the
690state represented by the popped marker has been reached again.
691
692Nothing is done if the AST stack [term AS] is already smaller than
693indicated by the popped marker.
694
695
696[def [cmd ias_mpop]]
697
698This instruction pops an entry from the AST Marker stack [term MS] and
699discards it.
700
701[list_end]
702
703[section {BUGS, IDEAS, FEEDBACK}]
704
705This document, and the package it describes, will undoubtedly contain
706bugs and other problems.
707
708Please report such in the category [emph grammar_me] of the
709[uri {http://sourceforge.net/tracker/?group_id=12883} {Tcllib SF Trackers}].
710
711Please also report any ideas for enhancements you may have for either
712package and/or documentation.
713
714
715[manpage_end]
716