1[comment {-*- tcl -*- doctools manpage}] 2[manpage_begin grammar::me_vm n 0.1] 3[copyright {2005 Andreas Kupries <andreas_kupries@users.sourceforge.net>}] 4[moddesc {Grammar operations and usage}] 5[titledesc {Virtual machine for parsing token streams}] 6[category {Grammars and finite automata}] 7[description] 8[keywords {virtual machine}] 9[keywords parsing grammar] 10 11Please go and read the document [syscmd grammar::me_intro] first for 12an overview of the various documents and their relations. 13 14[para] 15 16This document specifies a virtual machine for the controlled matching 17and parsing of token streams, creating an 18 19[term {abstract syntax tree}] (short [term AST]) reflecting the 20structure of the input. Special machine features are the caching and 21reuse of partial results, caching of the encountered input, and the 22ability to backtrack in both input and AST creation. 23 24[para] 25 26These features make the specified virtual machine especially useful to 27packrat parsers based on parsing expression grammars. It is however 28not restricted to this type of parser. Normal LL and LR parsers can be 29implemented with it as well. 30 31[para] 32 33The following sections will discuss first the abstract state kept by 34ME virtual machines, and then their instruction set. 35 36 37[section {MACHINE STATE}] 38 39A ME virtual machine manages the following state: 40 41[list_begin definitions] 42[def "[term {Current token}] CT"] 43 44The token from the input under consideration by the machine. 45 46[para] 47 48This information is used and modified by the instructions defined in 49the section 50 51[sectref {TERMINAL MATCHING}]. 52 53 54[def "[term {Current location}] CL"] 55 56The location of the [term {current token}] in the input stream, as 57offset relative to the beginning of the stream. The first token is 58considered to be at offset [const 0]. 59 60[para] 61 62This information is implicitly used and modified by the instructions 63defined in the sections 64 65[sectref {TERMINAL MATCHING}] and 66[sectref {NONTERMINAL MATCHING}], 67 68and can be directly queried and modified by the instructions defined 69in section 70 71[sectref {INPUT LOCATION HANDLING}]. 72 73 74[def "[term {Location stack}] LS"] 75 76In addition to the above a stack of locations, for backtracking. 77Locations can put on the stack, removed from it, and removed with 78setting the current location. 79 80[para] 81 82This information is implicitly used and modified by the instructions 83defined in the sections 84 85[sectref {TERMINAL MATCHING}] and 86[sectref {NONTERMINAL MATCHING}], 87 88and can be directly queried and modified by the instructions defined 89in section 90 91[sectref {INPUT LOCATION HANDLING}]. 92 93 94[def "[term {Match status}] OK"] 95 96A boolean value, the result of the last attempt at matching input. 97It is set to [const true] if that attempt was successful, and 98[const false] otherwise. 99 100[para] 101 102This information is influenced by the instructions defined in the 103sections 104 105[sectref {TERMINAL MATCHING}], 106[sectref {NONTERMINAL MATCHING}], and 107[sectref {UNCONDITIONAL MATCHING}]. 108 109It is queried by the instructions defined in the section 110 111[sectref {CONTROL FLOW}]. 112 113 114[def "[term {Semantic value}] SV"] 115 116The semantic value associated with (generated by) the last attempt at 117matching input. Contains either the empty string or a node for the 118abstract syntax tree constructed from the input. 119 120[para] 121 122This information is influenced by the instructions defined in the 123sections 124 125[sectref {SEMANTIC VALUES}], and 126[sectref {AST STACK HANDLING}]. 127 128 129[def "[term {AST stack}] AS"] 130 131A stack of partial abstract syntax trees constructed by the machine 132during matching. 133 134[para] 135 136This information is influenced by the instructions defined in the 137sections 138 139[sectref {SEMANTIC VALUES}], and 140[sectref {AST STACK HANDLING}]. 141 142 143[def "[term {AST Marker stack}] MS"] 144 145In addition to the above a stack of stacks, for backtracking. This is 146actually a stack of markers into the AST stack, thus implicitly 147snapshooting the state of the AST stack at some point in time. Markers 148can be put on the stack, dropped from it, and used to roll back the 149AST stack to an earlier state. 150 151[para] 152 153This information is influenced by the instructions defined in the 154sections 155 156[sectref {SEMANTIC VALUES}], and 157[sectref {AST STACK HANDLING}]. 158 159 160[def "[term {Error status}] ER"] 161 162Error information associated with the last attempt at matching 163input. Contains either the empty string or a list of 2 elements, a 164location in the input and a list of error messages associated with 165it, in this order. 166 167[para] 168 169[emph Note] that error information can be set even if the last attempt 170at matching input was successful. For example the *-operator (matching 171a sub-expression zero or more times) in a parsing expression grammar 172is always successful, even if it encounters a problem further in the 173input and has to backtrack. Such problems must not be forgotten when 174continuing to match. 175 176[para] 177 178This information is queried and influenced by the instructions defined 179in the sections 180 181[sectref {TERMINAL MATCHING}], 182[sectref {NONTERMINAL MATCHING}], and 183[sectref {ERROR HANDLING}]. 184 185 186[def "[term {Error stack}] ES"] 187 188In addition to the above a stack of error information, to allow the 189merging of current and older error information when performing 190backtracking in choices after an unsucessful match. 191 192[para] 193 194This information is queried and influenced by the instructions defined 195in the sections 196 197[sectref {TERMINAL MATCHING}], 198[sectref {NONTERMINAL MATCHING}], and 199[sectref {ERROR HANDLING}]. 200 201 202[def "[term {Return stack}] RS"] 203 204A stack of program counter values, i.e. locations in the code 205controlling the virtual machine, for the management of subroutine 206calls, i.e. the matching of nonterminal symbols. 207 208[para] 209 210This information is queried and influenced by the instructions defined 211in the section 212 213[sectref {NONTERMINAL MATCHING}]. 214 215 216[def "[term {Nonterminal cache}] NC"] 217 218A cache of machine states (A 4-tuple containing a location in the 219input, match status [term OK], semantic value [term SV], and error 220status [term ER]) keyed by name of nonterminal symbol and location in 221the input stream. 222 223[para] 224 225The key location is where machine started the attempt to match the 226named nonterminal symbol, and the location in the value is where 227machine ended up after the attempt completed, independent of the 228success of the attempt. 229 230[para] 231 232This status is queried and influenced by the instructions defined in 233the section 234 235[sectref {NONTERMINAL MATCHING}]. 236 237[list_end] 238 239 240[section {MACHINE INSTRUCTIONS}] 241 242With the machine state specified it is now possible to explain the 243instruction set of ME virtual machines. They are grouped roughly by 244the machine state they influence and/or query. 245 246 247[subsection {TERMINAL MATCHING}] 248 249First the instructions to match tokens from the input stream, and 250by extension all terminal symbols. 251 252[para] 253 254These instructions are the only ones which may retrieve a new token 255from the input stream. This is a [emph may] and not a [emph will] 256because the instructions will a retrieve new token if, and only if the 257current location [term CL] is at the head of the stream. 258 259If the machine has backtracked (see [cmd icl_rewind]) the instructions 260will retrieve the token to compare against from the internal cache. 261 262[para] 263[list_begin definitions] 264 265[def "[cmd ict_advance] [arg message]"] 266 267This instruction tries to advance to the next token in the input 268stream, i.e. the one after the current location [term CL]. The 269instruction will fail if, and only if the end of the input stream is 270reached, i.e. if there is no next token. 271 272[para] 273 274The sucess/failure of the instruction is remembered in the match 275status [term OK]. In the case of failure the error status [term ER] is 276set to the current location and the message [arg message]. 277 278In the case of success the error status [term ER] is cleared, the new 279token is made the current token [term CT], and the new location is 280made the current location [term CL]. 281 282[para] 283 284The argument [arg message] is a reference to the string to put into 285the error status [term ER], if such is needed. 286 287 288[def "[cmd ict_match_token] [arg tok] [arg message]"] 289 290This instruction tests the current token [term CT] for equality 291with the argument [arg tok] and records the result in the match 292status [term OK]. The instruction fails if the current token is 293not equal to [arg tok]. 294 295[para] 296 297In case of failure the error status [term ER] is set to the current 298location [term CL] and the message [arg message], and the 299current location [term CL] is moved one token backwards. 300 301Otherwise, i.e. upon success, the error status [term ER] is cleared 302and the current location [term CL] is not touched. 303 304 305[def "[cmd ict_match_tokrange] [arg tokbegin] [arg tokend] [arg message]"] 306 307This instruction tests the current token [term CT] for being in 308the range of tokens from [arg tokbegin] to [arg tokend] 309(inclusive) and records the result in the match status [term OK]. The 310instruction fails if the current token is not inside the range. 311 312[para] 313 314In case of failure the error status [term ER] is set to the current 315location [term CL] and the message [arg message], and the current location 316[term CL] is moved one token backwards. 317 318Otherwise, i.e. upon success, the error status [term ER] is cleared 319and the current location [term CL] is not touched. 320 321 322[def "[cmd ict_match_tokclass] [arg code] [arg message]"] 323 324This instruction tests the current token [term CT] for being a member 325of the token class [arg code] and records the result in the match 326status [term OK]. The instruction fails if the current token is not a 327member of the specified class. 328 329[para] 330 331In case of failure the error status [term ER] is set to the current 332location [term CL] and the message [arg message], and the 333current location [term CL] is moved one token backwards. 334 335Otherwise, i.e. upon success, the error status [term ER] is cleared 336and the current location [term CL] is not touched. 337 338[para] 339 340Currently the following classes are legal: 341 342[list_begin definitions] 343[def alnum] 344A token is accepted if it is a unicode alphabetical character, or a digit. 345[def alpha] 346A token is accepted if it is a unicode alphabetical character. 347[def digit] 348A token is accepted if it is a unicode digit character. 349[def xdigit] 350A token is accepted if it is a hexadecimal digit character. 351[def punct] 352A token is accepted if it is a unicode punctuation character. 353[def space] 354A token is accepted if it is a unicode space character. 355[list_end] 356 357[list_end] 358[para] 359 360 361[subsection {NONTERMINAL MATCHING}] 362 363The instructions in this section handle the matching of nonterminal 364symbols. They query the nonterminal cache [term NC] for saved 365information, and put such information into the cache. 366 367[para] 368 369The usage of the cache is a performance aid for backtracking parsers, 370allowing them to avoid an expensive rematch of complex nonterminal 371symbols if they have been encountered before. 372 373[para] 374 375[list_begin definitions] 376 377[def "[cmd inc_restore] [arg branchlabel] [arg nt]"] 378 379This instruction checks if the nonterminal cache [term NC] contains 380information about the nonterminal symbol [arg nt], at the current 381location [term CL]. If that is the case the instruction will update 382the machine state (current location [term CL], match status [term OK], 383semantic value [term SV], and error status [term ER]) with the found 384information and continue execution at the instruction refered to by 385the [arg branchlabel]. The new current location [term CL] will be the 386last token matched by the nonterminal symbol, i.e. belonging to it. 387 388[para] 389 390If no information was found the instruction will continue execution at 391the next instruction. 392 393[para] 394 395Together with [cmd icf_ntcall] it is possible to generate code for 396memoized and non-memoized matching of nonterminal symbols, either as 397subroutine calls, or inlined in the caller. 398 399 400[def "[cmd inc_save] [arg nt]"] 401 402This instruction saves the current state of the machine (current 403location [term CL], match status [term OK], semantic value [term SV], 404and error status [term ER]), to the nonterminal cache [term NC]. It 405will also pop an entry from the location stack [term LS] and save it 406as the start location of the match. 407 408[para] 409 410It is expected to be called at the end of matching a nonterminal 411symbol, with [arg nt] the name of the nonterminal symbol the code was 412working on. This allows the instruction [cmd inc_restore] to check for 413and retrieve the data, should we have to match this nonterminal symbol 414at the same location again, during backtracking. 415 416 417[def "[cmd icf_ntcall] [arg branchlabel]"] 418 419This instruction invokes the code for matching the nonterminal symbol 420[arg nt] as a subroutine. To this end it stores the current program 421counter [term PC] on the return stack [term RS], the current location 422[term CL] on the location stack [term LS], and then continues 423execution at the address [arg branchlabel]. 424 425[para] 426 427The next matching [cmd icf_ntreturn] will cause the execution to 428continue at the instruction coming after the call. 429 430 431[def [cmd icf_ntreturn]] 432 433This instruction will pop an entry from the return stack [term RS], 434assign it to the program counter [term PC], and then continue 435execution at the new address. 436 437[list_end] 438[para] 439 440 441[subsection {UNCONDITIONAL MATCHING}] 442 443The instructions in this section are the remaining match 444operators. They change the match status [term OK] directly and 445unconditionally. 446 447[list_begin definitions] 448 449[def [cmd iok_ok]] 450 451This instruction sets the match status [term OK] to [const true], 452indicating a successful match. 453 454 455[def [cmd iok_fail]] 456 457This instruction sets the match status [term OK] to [const false], 458indicating a failed match. 459 460 461[def [cmd iok_negate]] 462 463This instruction negates the match status [term OK], turning a failure 464into a success and vice versa. 465 466[list_end] 467[para] 468 469 470[subsection {CONTROL FLOW}] 471 472The instructions in this section implement both conditional and 473unconditional control flow. The conditional jumps query the match 474status [term OK]. 475 476[list_begin definitions] 477 478[def "[cmd icf_jalways] [arg branchlabel]"] 479 480This instruction sets the program counter [term PC] to the address 481specified by [arg branchlabel] and then continues execution from 482there. This is an unconditional jump. 483 484 485[def "[cmd icf_jok] [arg branchlabel]"] 486 487This instruction sets the program counter [term PC] to the address 488specified by [arg branchlabel]. This happens if, and only if the match 489status [term OK] indicates a success. Otherwise it simply continues 490execution at the next instruction. This is a conditional jump. 491 492 493[def "[cmd icf_jfail] [arg branchlabel]"] 494 495This instruction sets the program counter [term PC] to the address 496specified by [arg branchlabel]. This happens if, and only if the match 497status [term OK] indicates a failure. Otherwise it simply continues 498execution at the next instruction. This is a conditional jump. 499 500 501[def [cmd icf_halt]] 502 503This instruction halts the machine and blocks any further execution. 504 505[list_end] 506 507 508[subsection {INPUT LOCATION HANDLING}] 509 510The instructions in this section are for backtracking, they manipulate 511the current location [term CL] of the machine state. 512 513They allow a user of the machine to query and save locations in the 514input, and to rewind the current location [term CL] to saved 515locations, making them one of the components enabling the 516implementation of backtracking parsers. 517 518[list_begin definitions] 519 520[def [cmd icl_push]] 521 522This instruction pushes a copy of the current location [term CL] on 523the location stack [term LS]. 524 525 526[def [cmd icl_rewind]] 527 528This instruction pops an entry from the location stack [term LS] and 529then moves the current location [term CL] back to this point in the 530input. 531 532 533[def [cmd icl_pop]] 534 535This instruction pops an entry from the location stack [term LS] and 536discards it. 537 538[list_end] 539[para] 540 541 542[subsection {ERROR HANDLING}] 543 544The instructions in this section provide read and write access to the 545error status [term ER] of the machine. 546 547[list_begin definitions] 548 549[def [cmd ier_push]] 550 551This instruction pushes a copy of the current error status [term ER] 552on the error stack [term ES]. 553 554 555[def [cmd ier_clear]] 556 557This instruction clears the error status [term ER]. 558 559 560[def "[cmd ier_nonterminal] [arg message]"] 561 562This instruction checks if the error status [term ER] contains an 563error whose location is just past the location found in the top entry 564of the location stack [term LS]. 565 566Nothing happens if no such error is found. 567 568Otherwise the found error is replaced by an error at the location 569found on the stack, having the message [arg message]. 570 571 572[def [cmd ier_merge]] 573 574This instruction pops an entry from the error stack [term ES], merges 575it with the current error status [term ER] and stores the result of 576the merge as the new error status [term ER]. 577 578[para] 579 580The merge is performed as described below: 581 582[para] 583 584If one of the two error states is empty the other is chosen. If 585neither error state is empty, and refering to different locations, 586then the error state with the location further in the input is 587chosen. If both error states refer to the same location their messages 588are merged (with removing duplicates). 589 590[list_end] 591 592 593[subsection {SEMANTIC VALUES}] 594 595The instructions in this section manipulate the semantic value 596[term SV]. 597 598[list_begin definitions] 599 600[def [cmd isv_clear]] 601 602This instruction clears the semantic value [term SV]. 603 604 605[def [cmd isv_terminal]] 606 607This instruction creates a terminal AST node for the current token 608[term CT], makes it the semantic value [term SV], and also pushes the 609node on the AST stack [term AS]. 610 611 612[def "[cmd isv_nonterminal_leaf] [arg nt]"] 613 614This instruction creates a nonterminal AST node without any children 615for the nonterminal [arg nt], and makes it the semantic value 616[term SV]. 617 618[para] 619 620This instruction should be executed if, and only if the match status 621[term OK] indicates a success. 622 623In the case of a failure [cmd isv_clear] should be called. 624 625 626[def "[cmd isv_nonterminal_range] [arg nt]"] 627 628This instruction creates a nonterminal AST node for the nonterminal 629 630[arg nt], with a single terminal node as its child, and makes this AST 631the semantic value [term SV]. The terminal node refers to the input 632string from the location found on top of the location stack [term LS] 633to the current location [term CL] (both inclusive). 634 635[para] 636 637This instruction should be executed if, and only if the match status 638[term OK] indicates a success. 639 640In the case of a failure [cmd isv_clear] should be called. 641 642 643[def "[cmd isv_nonterminal_reduce] [arg nt]"] 644 645This instruction creates a nonterminal AST node for the nonterminal 646[arg nt] and makes it the semantic value [term SV]. 647 648[para] 649 650All entries on the AST stack [term AS] above the marker found in the 651top entry of the AST Marker stack [term MS] become children of the new 652node, with the entry at the stack top becoming the rightmost child. If 653the AST Marker stack [term MS] is empty the whole stack is used. The 654AST marker stack [term MS] is left unchanged. 655 656[para] 657 658This instruction should be executed if, and only if the match status 659[term OK] indicates a success. 660 661In the case of a failure [cmd isv_clear] should be called. 662 663[list_end] 664[para] 665 666 667[subsection {AST STACK HANDLING}] 668 669The instructions in this section manipulate the AST stack [term AS], 670and the AST Marker stack [term MS]. 671 672[list_begin definitions] 673 674[def [cmd ias_push]] 675 676This instruction pushes the semantic value [term SV] on the AST stack 677[term AS]. 678 679 680[def [cmd ias_mark]] 681 682This instruction pushes a marker for the current state of the AST 683stack [term AS] on the AST Marker stack [term MS]. 684 685 686[def [cmd ias_mrewind]] 687 688This instruction pops an entry from the AST Marker stack [term MS] and 689then proceeds to pop entries from the AST stack [term AS] until the 690state represented by the popped marker has been reached again. 691 692Nothing is done if the AST stack [term AS] is already smaller than 693indicated by the popped marker. 694 695 696[def [cmd ias_mpop]] 697 698This instruction pops an entry from the AST Marker stack [term MS] and 699discards it. 700 701[list_end] 702 703[section {BUGS, IDEAS, FEEDBACK}] 704 705This document, and the package it describes, will undoubtedly contain 706bugs and other problems. 707 708Please report such in the category [emph grammar_me] of the 709[uri {http://sourceforge.net/tracker/?group_id=12883} {Tcllib SF Trackers}]. 710 711Please also report any ideas for enhancements you may have for either 712package and/or documentation. 713 714 715[manpage_end] 716