.\"
.TH FLEX 1 "April 1995" "Version 2.5"
.SH NAME
flex \- fast lexical analyzer generator
.SH SYNOPSIS
.B flex
.B [\-bcdfhilnpstvwBFILTV78+? \-C[aefFmr] \-ooutput \-Pprefix \-Sskeleton]
.B [\-\-help \-\-version]
.I [filename ...]
.SH OVERVIEW
This manual describes
.I flex,
a tool for generating programs that perform pattern-matching on text. The
manual includes both tutorial and reference sections:
.nf

    Description
        a brief overview of the tool

    Some Simple Examples

    Format Of The Input File

    Patterns
        the extended regular expressions used by flex

    How The Input Is Matched
        the rules for determining what has been matched

    Actions
        how to specify what to do when a pattern is matched

    The Generated Scanner
        details regarding the scanner that flex produces;
        how to control the input source

    Start Conditions
        introducing context into your scanners, and
        managing "mini-scanners"

    Multiple Input Buffers
        how to manipulate multiple input sources; how to
        scan from strings instead of files

    End-of-file Rules
        special rules for matching the end of the input

    Miscellaneous Macros
        a summary of macros available to the actions

    Values Available To The User
        a summary of values available to the actions

    Interfacing With Yacc
        connecting flex scanners together with yacc parsers

    Options
        flex command-line options, and the "%option"
        directive

    Performance Considerations
        how to make your scanner go as fast as possible

    Generating C++ Scanners
        the (experimental) facility for generating C++
        scanner classes

    Incompatibilities With Lex And POSIX
        how flex differs from AT&T lex and the POSIX lex
        standard

    Diagnostics
        those error messages produced by flex (or scanners
        it generates) whose meanings might not be apparent

    Files
        files used by flex

    Deficiencies / Bugs
        known problems with flex

    See Also
        other documentation, related tools

    Author
        includes contact information

.fi
.SH DESCRIPTION
.I flex
is a tool for generating
.I scanners:
programs which recognize lexical patterns in text.
.I flex
reads
the given input files, or its standard input if no file names are given,
for a description of a scanner to generate. The description is in
the form of pairs
of regular expressions and C code, called
.I rules.
.I flex
generates as output a C source file,
.B lex.yy.c,
which defines a routine
.B yylex().
This file is compiled and linked with the
.B \-ll
library to produce an executable. When the executable is run,
it analyzes its input for occurrences
of the regular expressions. Whenever it finds one, it executes
the corresponding C code.
.SH SOME SIMPLE EXAMPLES
First some simple examples to get the flavor of how one uses
.I flex.
The following
.I flex
input specifies a scanner which, whenever it encounters the string
"username", replaces it with the user's login name:
.nf

    %%
    username    printf( "%s", getlogin() );

.fi
By default, any text not matched by a
.I flex
scanner
is copied to the output, so the net effect of this scanner is
to copy its input file to its output with each occurrence
of "username" expanded.
In this input, there is just one rule. "username" is the
.I pattern
and the "printf" is the
.I action.
The "%%" marks the beginning of the rules.
.PP
Here's another simple example:
.nf

    %{
            int num_lines = 0, num_chars = 0;
    %}

    %%
    \\n      ++num_lines; ++num_chars;
    .       ++num_chars;

    %%
    int main( void )
            {
            yylex();
            printf( "# of lines = %d, # of chars = %d\\n",
                    num_lines, num_chars );
            return 0;
            }

.fi
This scanner counts the number of characters and the number
of lines in its input (it produces no output other than the
final report on the counts).
The first line
declares two globals, "num_lines" and "num_chars", which are accessible
both inside
.B yylex()
and in the
.B main()
routine declared after the second "%%". There are two rules, one
which matches a newline ("\\n") and increments both the line count and
the character count, and one which matches any character other than
a newline (indicated by the "." regular expression).
.PP
A somewhat more complicated example:
.nf

    /* scanner for a toy Pascal-like language */

    %{
    /* need this for the calls to atoi() and atof() below */
    #include <stdlib.h>
    %}

    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*

    %%

    {DIGIT}+    {
                printf( "An integer: %s (%d)\\n", yytext,
                        atoi( yytext ) );
                }

    {DIGIT}+"."{DIGIT}*        {
                printf( "A float: %s (%g)\\n", yytext,
                        atof( yytext ) );
                }

    if|then|begin|end|procedure|function        {
                printf( "A keyword: %s\\n", yytext );
                }

    {ID}        printf( "An identifier: %s\\n", yytext );

    "+"|"-"|"*"|"/"   printf( "An operator: %s\\n", yytext );

    "{"[^}\\n]*"}"     /* eat up one-line comments */

    [ \\t\\n]+          /* eat up whitespace */

    .           printf( "Unrecognized character: %s\\n", yytext );

    %%

    int main( int argc, char **argv )
        {
        ++argv, --argc;  /* skip over program name */
        if ( argc > 0 )
                yyin = fopen( argv[0], "r" );
        else
                yyin = stdin;

        yylex();
        return 0;
        }

.fi
This is the beginning of a simple scanner for a language like
Pascal. It identifies different types of
.I tokens
and reports on what it has seen.
.PP
The details of this example will be explained in the following
sections.
.SH FORMAT OF THE INPUT FILE
The
.I flex
input file consists of three sections, separated by a line with just
.B %%
in it:
.nf

    definitions
    %%
    rules
    %%
    user code

.fi
The
.I definitions
section contains declarations of simple
.I name
definitions to simplify the scanner specification, and declarations of
.I start conditions,
which are explained in a later section.
.PP
Name definitions have the form:
.nf

    name definition

.fi
The "name" is a word beginning with a letter or an underscore ('_')
followed by zero or more letters, digits, '_', or '-' (dash).
The definition is taken to begin at the first non-white-space character
following the name and to continue to the end of the line.
The definition can subsequently be referred to using "{name}", which
will expand to "(definition)". For example,
.nf

    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*

.fi
defines "DIGIT" to be a regular expression which matches a
single digit, and
"ID" to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits.
A subsequent reference to
.nf

    {DIGIT}+"."{DIGIT}*

.fi
is identical to
.nf

    ([0-9])+"."([0-9])*

.fi
and matches one-or-more digits followed by a '.' followed
by zero-or-more digits.
.PP
The
.I rules
section of the
.I flex
input contains a series of rules of the form:
.nf

    pattern   action

.fi
where the pattern must be unindented and the action must begin
on the same line.
.PP
See below for a further description of patterns and actions.
.PP
Finally, the user code section is simply copied to
.B lex.yy.c
verbatim.
It is used for companion routines which call or are called
by the scanner. The presence of this section is optional;
if it is missing, the second
.B %%
in the input file may be skipped, too.
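.PP
The "{name}" expansion described above can be pictured as a plain
textual substitution. The following C fragment is only a sketch under
stated assumptions, not a description of how
.I flex
itself is implemented: the helper expand_names(), the struct name_def
table, and the caller-supplied output buffer are hypothetical names
introduced purely for illustration, and quoting and escapes in the
pattern are ignored.
.nf

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: one "name definition" pair. */
struct name_def {
    const char *name;
    const char *definition;
};

/*
 * Sketch: copy `pattern` into `out`, replacing each {name} found in
 * `defs` with (definition).  Braces that do not enclose a known name,
 * such as the repetition r{2,5}, are copied unchanged.
 * Returns 0 on success, -1 if `out` is too small.
 */
static int expand_names(const char *pattern,
                        const struct name_def *defs, size_t ndefs,
                        char *out, size_t outsize)
{
    size_t used = 0;

    if (outsize == 0)
        return -1;

    while (*pattern != 0) {
        const char *close = (*pattern == '{') ? strchr(pattern, '}') : NULL;
        const struct name_def *hit = NULL;
        size_t i;

        if (close != NULL) {
            /* length of the text between the braces */
            size_t len = (size_t)(close - pattern) - 1;
            for (i = 0; i < ndefs && hit == NULL; i++)
                if (strlen(defs[i].name) == len &&
                    strncmp(defs[i].name, pattern + 1, len) == 0)
                    hit = &defs[i];
        }

        if (hit != NULL) {
            size_t dlen = strlen(hit->definition);
            /* room for '(' + definition + ')' + terminating NUL */
            if (used + dlen + 3 > outsize)
                return -1;
            out[used++] = '(';
            memcpy(out + used, hit->definition, dlen);
            used += dlen;
            out[used++] = ')';
            pattern = close + 1;
        } else {
            /* room for one character + terminating NUL */
            if (used + 2 > outsize)
                return -1;
            out[used++] = *pattern++;
        }
    }
    out[used] = 0;
    return 0;
}
```

.fi
The parentheses added around each definition are what keep a following
operator such as '+' or '*' applied to the whole definition rather than
to its last element.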
.PP
In the definitions and rules sections, any
.I indented
text or text enclosed in
.B %{
and
.B %}
is copied verbatim to the output (with the %{}'s removed).
The %{}'s must appear unindented on lines by themselves.
.PP
In the rules section,
any indented or %{} text appearing before the
first rule may be used to declare variables
which are local to the scanning routine and (after the declarations)
code which is to be executed whenever the scanning routine is entered.
Other indented or %{} text in the rules section is still copied to the output,
but its meaning is not well-defined and it may well cause compile-time
errors (this feature is present for
.I POSIX
compliance; see below for other such features).
.PP
In the definitions section (but not in the rules section),
an unindented comment (i.e., a line
beginning with "/*") is also copied verbatim to the output up
to the next "*/".
.SH PATTERNS
The patterns in the input are written using an extended set of regular
expressions. These are:
.nf

    x          match the character 'x'
    .          any character (byte) except newline
    [xyz]      a "character class"; in this case, the pattern
                 matches either an 'x', a 'y', or a 'z'
    [abj-oZ]   a "character class" with a range in it; matches
                 an 'a', a 'b', any letter from 'j' through 'o',
                 or a 'Z'
    [^A-Z]     a "negated character class", i.e., any character
                 but those in the class. In this case, any
                 character EXCEPT an uppercase letter.
    [^A-Z\\n]   any character EXCEPT an uppercase letter or
                 a newline
    r*         zero or more r's, where r is any regular expression
    r+         one or more r's
    r?
               zero or one r's (that is, "an optional r")
    r{2,5}     anywhere from two to five r's
    r{2,}      two or more r's
    r{4}       exactly 4 r's
    {name}     the expansion of the "name" definition
               (see above)
    "[xyz]\\"foo"
               the literal string: [xyz]"foo
    \\X         if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
                 then the ANSI-C interpretation of \\X.
                 Otherwise, a literal 'X' (used to escape
                 operators such as '*')
    \\0         a NUL character (ASCII code 0)
    \\123       the character with octal value 123
    \\x2a       the character with hexadecimal value 2a
    (r)        match an r; parentheses are used to override
                 precedence (see below)


    rs         the regular expression r followed by the
                 regular expression s; called "concatenation"


    r|s        either an r or an s


    r/s        an r but only if it is followed by an s. The
                 text matched by s is included when determining
                 whether this rule is the "longest match",
                 but is then returned to the input before
                 the action is executed. So the action only
                 sees the text matched by r. This type
                 of pattern is called "trailing context".
                 (There are some combinations of r/s that flex
                 cannot match correctly; see notes in the
                 Deficiencies / Bugs section below regarding
                 "dangerous trailing context".)
    ^r         an r, but only at the beginning of a line (i.e.,
                 when just starting to scan, or right after a
                 newline has been scanned).
    r$         an r, but only at the end of a line (i.e., just
                 before a newline). Equivalent to "r/\\n".

               Note that flex's notion of "newline" is exactly
               whatever the C compiler used to compile flex
               interprets '\\n' as; in particular, on some DOS
               systems you must either filter out \\r's in the
               input yourself, or explicitly use r/\\r\\n for "r$".


    <s>r       an r, but only in start condition s (see
                 below for discussion of start conditions)
    <s1,s2,s3>r
               same, but in any of start conditions s1,
                 s2, or s3
    <*>r       an r in any start condition, even an exclusive one.


    <<EOF>>    an end-of-file
    <s1,s2><<EOF>>
               an end-of-file when in start condition s1 or s2

.fi
Note that inside of a character class, all regular expression operators
lose their special meaning except escape ('\\') and the character class
operators, '-', ']', and, at the beginning of the class, '^'.
.PP
The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence. For example,
.nf

    foo|bar*

.fi
is the same as
.nf

    (foo)|(ba(r*))

.fi
since the '*' operator has higher precedence than concatenation,
and concatenation higher than alternation ('|'). This pattern
therefore matches
.I either
the string "foo"
.I or
the string "ba" followed by zero-or-more r's.
To match "foo" or zero-or-more "bar"'s, use:
.nf

    foo|(bar)*

.fi
and to match zero-or-more "foo"'s-or-"bar"'s:
.nf

    (foo|bar)*

.fi
.PP
In addition to characters and ranges of characters, character classes
can also contain character class
.I expressions.
These are expressions enclosed inside
.B [:
and
.B :]
delimiters (which themselves must appear between the '[' and ']' of the
character class; other elements may occur inside the character class, too).
The valid expressions are:
.nf

    [:alnum:] [:alpha:] [:blank:]
    [:cntrl:] [:digit:] [:graph:]
    [:lower:] [:print:] [:punct:]
    [:space:] [:upper:] [:xdigit:]

.fi
These expressions all designate a set of characters equivalent to
the corresponding standard C
.B isXXX
function.
For example,
.B [:alnum:]
designates those characters for which
.B isalnum()
returns true - i.e., any alphabetic or numeric character.
Some systems don't provide
.B isblank(),
so flex defines
.B [:blank:]
as a blank or a tab.
.PP
For example, the following character classes are all equivalent:
.nf

    [[:alnum:]]
    [[:alpha:][:digit:]]
    [[:alpha:]0-9]
    [a-zA-Z0-9]

.fi
If your scanner is case-insensitive (the
.B \-i
flag), then
.B [:upper:]
and
.B [:lower:]
are equivalent to
.B [:alpha:].
.PP
Some notes on patterns:
.IP -
A negated character class such as the example "[^A-Z]"
above
.I will match a newline
unless "\\n" (or an equivalent escape sequence) is one of the
characters explicitly present in the negated character class
(e.g., "[^A-Z\\n]"). This is unlike how many other regular
expression tools treat negated character classes, but unfortunately
the inconsistency is historically entrenched.
Matching newlines means that a pattern like [^"]* can match the entire
input unless there's another quote in the input.
.IP -
A rule can have at most one instance of trailing context (the '/' operator
or the '$' operator). The start condition, '^', and "<<EOF>>" patterns
can only occur at the beginning of a pattern, and, like '/' and '$',
cannot be grouped inside parentheses. A '^' which does not occur at
the beginning of a rule or a '$' which does not occur at the end of
a rule loses its special properties and is treated as a normal character.
.IP
The following are illegal:
.nf

    foo/bar$
    <sc1>foo<sc2>bar

.fi
Note that the first of these can be written "foo/bar\\n".
.IP
The following will result in '$' or '^' being treated as a normal character:
.nf

    foo|(bar$)
    foo|^bar

.fi
If what's wanted is a "foo" or a bar-followed-by-a-newline, the following
could be used (the special '|' action is explained below):
.nf

    foo      |
    bar$     /* action goes here */

.fi
A similar trick will work for matching a foo or a
bar-at-the-beginning-of-a-line.
.SH HOW THE INPUT IS MATCHED
When the generated scanner is run, it analyzes its input looking
for strings which match any of its patterns. If it finds more than
one match, it takes the one matching the most text (for trailing
context rules, this includes the length of the trailing part, even
though it will then be returned to the input). If it finds two
or more matches of the same length, the
rule listed first in the
.I flex
input file is chosen.
.PP
Once the match is determined, the text corresponding to the match
(called the
.I token)
is made available in the global character pointer
.B yytext,
and its length in the global integer
.B yyleng.
The
.I action
corresponding to the matched pattern is then executed (a more
detailed description of actions follows), and then the remaining
input is scanned for another match.
.PP
If no match is found, then the
.I default rule
is executed: the next character in the input is considered matched and
copied to the standard output. Thus, the simplest legal
.I flex
input is:
.nf

    %%

.fi
which generates a scanner that simply copies its input (one character
at a time) to its output.
.PP
Note that
.B yytext
can be defined in two different ways: either as a character
.I pointer
or as a character
.I array.
You can control which definition
.I flex
uses by including one of the special directives
.B %pointer
or
.B %array
in the first (definitions) section of your flex input. The default is
.B %pointer,
unless you use the
.B -l
lex compatibility option, in which case
.B yytext
will be an array.
The advantage of using
.B %pointer
is substantially faster scanning and no buffer overflow when matching
very large tokens (unless you run out of dynamic memory). The disadvantage
is that you are restricted in how your actions can modify
.B yytext
(see the next section), and calls to the
.B unput()
function destroy the present contents of
.B yytext,
which can be a considerable porting headache when moving between different
.I lex
versions.
.PP
The advantage of
.B %array
is that you can then modify
.B yytext
to your heart's content, and calls to
.B unput()
do not destroy
.B yytext
(see below). Furthermore, existing
.I lex
programs sometimes access
.B yytext
externally using declarations of the form:
.nf
    extern char yytext[];
.fi
This definition is erroneous when used with
.B %pointer,
but correct for
.B %array.
.PP
.B %array
defines
.B yytext
to be an array of
.B YYLMAX
characters, which defaults to a fairly large value. You can change
the size by simply #define'ing
.B YYLMAX
to a different value in the first section of your
.I flex
input. As mentioned above, with
.B %pointer
yytext grows dynamically to accommodate large tokens. While this means your
.B %pointer
scanner can accommodate very large tokens (such as matching entire blocks
of comments), bear in mind that each time the scanner must resize
.B yytext
it also must rescan the entire token from the beginning, so matching such
tokens can prove slow.
.B yytext
presently does
.I not
dynamically grow if a call to
.B unput()
results in too much text being pushed back; instead, a run-time error results.
.PP
Also note that you cannot use
.B %array
with C++ scanner classes
(the
.B c++
option; see below).
.SH ACTIONS
Each pattern in a rule has a corresponding action, which can be any
arbitrary C statement. The pattern ends at the first non-escaped
whitespace character; the remainder of the line is its action. If the
action is empty, then when the pattern is matched the input token
is simply discarded. For example, here is the specification for a program
which deletes all occurrences of "zap me" from its input:
.nf

    %%
    "zap me"

.fi
(It will copy all other characters in the input to the output since
they will be matched by the default rule.)
.PP
Here is a program which compresses multiple blanks and tabs down to
a single blank, and throws away whitespace found at the end of a line:
.nf

    %%
    [ \\t]+        putchar( ' ' );
    [ \\t]+$       /* ignore this token */

.fi
.PP
If the action contains a '{', then the action spans till the balancing '}'
is found, and the action may cross multiple lines.
.I flex
knows about C strings and comments and won't be fooled by braces found
within them, but also allows actions to begin with
.B %{
and will consider the action to be all the text up to the next
.B %}
(regardless of ordinary braces inside the action).
.PP
An action consisting solely of a vertical bar ('|') means "same as
the action for the next rule." See below for an illustration.
.PP
Actions can include arbitrary C code, including
.B return
statements to return a value to whatever routine called
.B yylex().
Each time
.B yylex()
is called it continues processing tokens from where it last left
off until it either reaches
the end of the file or executes a return.
.PP
Actions are free to modify
.B yytext
except for lengthening it (adding
characters to its end--these will overwrite later characters in the
input stream). This however does not apply when using
.B %array
(see above); in that case,
.B yytext
may be freely modified in any way.
.PP
Actions are free to modify
.B yyleng
except they should not do so if the action also includes use of
.B yymore()
(see below).
.PP
There are a number of special directives which can be included within
an action:
.IP -
.B ECHO
copies yytext to the scanner's output.
.IP -
.B BEGIN
followed by the name of a start condition places the scanner in the
corresponding start condition (see below).
.IP -
.B REJECT
directs the scanner to proceed on to the "second best" rule which matched the
input (or a prefix of the input). The rule is chosen as described
above in "How the Input is Matched", and
.B yytext
and
.B yyleng
are set up appropriately.
It may either be one which matched as much text
as the originally chosen rule but came later in the
.I flex
input file, or one which matched less text.
For example, the following will both count the
words in the input and call the routine special() whenever "frob" is seen:
.nf

            int word_count = 0;
    %%

    frob        special(); REJECT;
    [^ \\t\\n]+   ++word_count;

.fi
Without the
.B REJECT,
any "frob"'s in the input would not be counted as words, since the
scanner normally executes only one action per token.
Multiple
.B REJECT's
are allowed, each one finding the next best choice to the currently
active rule.
For example, when the following scanner scans the token
"abcd", it will write "abcdabcaba" to the output:
.nf

    %%
    a        |
    ab       |
    abc      |
    abcd     ECHO; REJECT;
    .|\\n     /* eat up any unmatched character */

.fi
(The first three rules share the fourth's action since they use
the special '|' action.)
.B REJECT
is a particularly expensive feature in terms of scanner performance;
if it is used in
.I any
of the scanner's actions it will slow down
.I all
of the scanner's matching. Furthermore,
.B REJECT
cannot be used with the
.I -Cf
or
.I -CF
options (see below).
.IP
Note also that unlike the other special actions,
.B REJECT
is a
.I branch;
code immediately following it in the action will
.I not
be executed.
.IP -
.B yymore()
tells the scanner that the next time it matches a rule, the corresponding
token should be
.I appended
onto the current value of
.B yytext
rather than replacing it. For example, given the input "mega-kludge"
the following will write "mega-mega-kludge" to the output:
.nf

    %%
    mega-    ECHO; yymore();
    kludge   ECHO;

.fi
First "mega-" is matched and echoed to the output. Then "kludge"
is matched, but the previous "mega-" is still hanging around at the
beginning of
.B yytext
so the
.B ECHO
for the "kludge" rule will actually write "mega-kludge".
.PP
Two notes regarding use of
.B yymore().
First,
.B yymore()
depends on the value of
.I yyleng
correctly reflecting the size of the current token, so you must not
modify
.I yyleng
if you are using
.B yymore().
Second, the presence of
.B yymore()
in the scanner's action entails a minor performance penalty in the
scanner's matching speed.
.IP -
.B yyless(n)
returns all but the first
.I n
characters of the current token back to the input stream, where they
will be rescanned when the scanner looks for the next match.
.B yytext
and
.B yyleng
are adjusted appropriately (e.g.,
.B yyleng
will now be equal to
.I n
). For example, on the input "foobar" the following will write out
"foobarbar":
.nf

    %%
    foobar    ECHO; yyless(3);
    [a-z]+    ECHO;

.fi
An argument of 0 to
.B yyless
will cause the entire current input string to be scanned again. Unless you've
changed how the scanner will subsequently process its input (using
.B BEGIN,
for example), this will result in an endless loop.
.PP
Note that
.B yyless
is a macro and can only be used in the flex input file, not from
other source files.
.IP -
.B unput(c)
puts the character
.I c
back onto the input stream. It will be the next character scanned.
The following action will take the current token and cause it
to be rescanned enclosed in parentheses.
.nf

    {
    int i;
    /* Copy yytext because unput() trashes yytext */
    char *yycopy = strdup( yytext );
    unput( ')' );
    for ( i = yyleng - 1; i >= 0; --i )
        unput( yycopy[i] );
    unput( '(' );
    free( yycopy );
    }

.fi
Note that since each
.B unput()
puts the given character back at the
.I beginning
of the input stream, pushing back strings must be done back-to-front.
.PP
An important potential problem when using
.B unput()
is that if you are using
.B %pointer
(the default), a call to
.B unput()
.I destroys
the contents of
.I yytext,
starting with its rightmost character and devouring one character to
the left with each call.
If you need the value of yytext preserved
after a call to
.B unput()
(as in the above example),
you must either first copy it elsewhere, or build your scanner using
.B %array
instead (see How The Input Is Matched).
.PP
Finally, note that you cannot put back
.B EOF
to attempt to mark the input stream with an end-of-file.
.IP -
.B input()
reads the next character from the input stream. For example,
the following is one way to eat up C comments:
.nf

    %%
    "/*"        {
                register int c;

                for ( ; ; )
                    {
                    while ( (c = input()) != '*' &&
                            c != EOF )
                        ;    /* eat up text of comment */

                    if ( c == '*' )
                        {
                        while ( (c = input()) == '*' )
                            ;
                        if ( c == '/' )
                            break;    /* found the end */
                        }

                    if ( c == EOF )
                        {
                        error( "EOF in comment" );
                        break;
                        }
                    }
                }

.fi
(Note that if the scanner is compiled using
.B C++,
then
.B input()
is instead referred to as
.B yyinput(),
in order to avoid a name clash with the
.B C++
stream by the name of
.I input.)
.IP -
.B YY_FLUSH_BUFFER
flushes the scanner's internal buffer
so that the next time the scanner attempts to match a token, it will
first refill the buffer using
.B YY_INPUT
(see The Generated Scanner, below). This action is a special case
of the more general
.B yy_flush_buffer()
function, described below in the section Multiple Input Buffers.
.IP -
.B yyterminate()
can be used in lieu of a return statement in an action. It terminates
the scanner and returns a 0 to the scanner's caller, indicating "all done".
By default,
.B yyterminate()
is also called when an end-of-file is encountered. It is a macro and
may be redefined.
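.PP
The back-to-front rule for
.B unput()
described above can be seen in a small stand-alone model. This is only
a sketch, assuming a simple stack of pushed-back characters in front of
the unread input; struct pushback, pb_unput(), pb_input(), and
pb_unput_parenthesized() are hypothetical names and are not part of
.I flex
or of the scanners it generates.
.nf

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PB_CAPACITY 64   /* arbitrary size chosen for this sketch */

/* Hypothetical model of an input stream that supports pushback. */
struct pushback {
    char        stack[PB_CAPACITY]; /* pushed-back characters, newest on top */
    size_t      depth;              /* how many characters are pushed back */
    const char *rest;               /* unread remainder of the real input */
};

/* Put c back so that it is the next character read; -1 if full. */
static int pb_unput(struct pushback *pb, char c)
{
    if (pb->depth == PB_CAPACITY)
        return -1;
    pb->stack[pb->depth++] = c;
    return 0;
}

/* Read the next character, or -1 at end of input. */
static int pb_input(struct pushback *pb)
{
    if (pb->depth > 0)
        return (unsigned char)pb->stack[--pb->depth];
    if (*pb->rest == 0)
        return -1;
    return (unsigned char)*pb->rest++;
}

/* Push token back enclosed in parentheses, working back-to-front. */
static int pb_unput_parenthesized(struct pushback *pb, const char *token)
{
    size_t i = strlen(token);

    if (pb_unput(pb, ')') != 0)
        return -1;
    while (i > 0)
        if (pb_unput(pb, token[--i]) != 0)
            return -1;
    return pb_unput(pb, '(');
}
```

.fi
Because each pushed-back character becomes the next one read, the
closing parenthesis has to be pushed first and the opening parenthesis
last for the reader to see them in the natural order.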
.SH THE GENERATED SCANNER
The output of
.I flex
is the file
.B lex.yy.c,
which contains the scanning routine
.B yylex(),
a number of tables used by it for matching tokens, and a number
of auxiliary routines and macros. By default,
.B yylex()
is declared as follows:
.nf

    int yylex()
        {
        ... various definitions and the actions in here ...
        }

.fi
(If your environment supports function prototypes, then it will
be "int yylex( void )".) This definition may be changed by defining
the "YY_DECL" macro. For example, you could use:
.nf

    #define YY_DECL float lexscan( a, b ) float a, b;

.fi
to give the scanning routine the name
.I lexscan,
returning a float, and taking two floats as arguments. Note that
if you give arguments to the scanning routine using a
K&R-style/non-prototyped function declaration, you must terminate
the definition with a semi-colon (;).
.PP
Whenever
.B yylex()
is called, it scans tokens from the global input file
.I yyin
(which defaults to stdin). It continues until it either reaches
an end-of-file (at which point it returns the value 0) or
one of its actions executes a
.I return
statement.
.PP
If the scanner reaches an end-of-file, subsequent calls are undefined
unless either
.I yyin
is pointed at a new input file (in which case scanning continues from
that file), or
.B yyrestart()
is called.
.B yyrestart()
takes one argument, a
.B FILE *
pointer (which can be NULL, if you've set up
.B YY_INPUT
to scan from a source other than
.I yyin),
and initializes
.I yyin
for scanning from that file.
Essentially there is no difference between
just assigning
.I yyin
to a new input file or using
.B yyrestart()
to do so; the latter is available for compatibility with previous versions
of
.I flex,
and because it can be used to switch input files in the middle of scanning.
It can also be used to throw away the current input buffer, by calling
it with an argument of
.I yyin;
but it is better to use
.B YY_FLUSH_BUFFER
(see above).
Note that
.B yyrestart()
does
.I not
reset the start condition to
.B INITIAL
(see Start Conditions, below).
.PP
If
.B yylex()
stops scanning due to executing a
.I return
statement in one of the actions, the scanner may then be called again and it
will resume scanning where it left off.
.PP
By default (and for purposes of efficiency), the scanner uses
block-reads rather than simple
.I getc()
calls to read characters from
.I yyin.
The nature of how it gets its input can be controlled by defining the
.B YY_INPUT
macro.
YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its
action is to place up to
.I max_size
characters in the character array
.I buf
and return in the integer variable
.I result
either the
number of characters read or the constant YY_NULL (0 on Unix systems)
to indicate EOF. The default YY_INPUT reads from the
global file-pointer "yyin".
.PP
A sample definition of YY_INPUT (in the definitions
section of the input file):
.nf

    %{
    #define YY_INPUT(buf,result,max_size) \\
        { \\
        int c = getchar(); \\
        result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \\
        }
    %}

.fi
This definition will change the input processing to occur
one character at a time.
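.PP
The contract that YY_INPUT must honor (fill the buffer with at most
max_size characters and report either the count or an end-of-file
indication) can also be illustrated outside of a scanner. The function
below is a sketch with the same shape that reads from an in-memory
string. It is not the macro itself: struct string_source,
string_source_read(), and SRC_EOF are hypothetical stand-ins, with
SRC_EOF playing the role that YY_NULL plays in a real definition.
.nf

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SRC_EOF 0   /* stand-in for the end-of-file result in this sketch */

/* Hypothetical input source: a NUL-terminated string and a read position. */
struct string_source {
    const char *text;
    size_t      pos;
};

/*
 * Copy up to max_size characters into buf and return how many were
 * copied, or SRC_EOF once the source is exhausted.  The caller is
 * assumed to pass max_size > 0, so that a return of 0 always means
 * end-of-file.
 */
static size_t string_source_read(struct string_source *src,
                                 char *buf, size_t max_size)
{
    size_t left = strlen(src->text) - src->pos;
    size_t n = (left < max_size) ? left : max_size;

    if (n == 0)
        return SRC_EOF;
    memcpy(buf, src->text + src->pos, n);
    src->pos += n;
    return n;
}
```

.fi
Returning fewer characters than requested is not an error under this
contract; only a result of zero signals that the input has ended.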
.PP
When the scanner receives an end-of-file indication from YY_INPUT,
it then checks the
.B yywrap()
function. If
.B yywrap()
returns false (zero), then it is assumed that the
function has gone ahead and set up
.I yyin
to point to another input file, and scanning continues. If it returns
true (non-zero), then the scanner terminates, returning 0 to its
caller. Note that in either case, the start condition remains unchanged;
it does
.I not
revert to
.B INITIAL.
.PP
If you do not supply your own version of
.B yywrap(),
then you must either use
.B %option noyywrap
(in which case the scanner behaves as though
.B yywrap()
returned 1), or you must link with
.B \-ll
to obtain the default version of the routine, which always returns 1.
.PP
Three routines are available for scanning from in-memory buffers rather
than files:
.B yy_scan_string(), yy_scan_bytes(),
and
.B yy_scan_buffer().
See the discussion of them below in the section Multiple Input Buffers.
.PP
The scanner writes its
.B ECHO
output to the
.I yyout
global (default, stdout), which may be redefined by the user simply
by assigning it to some other
.B FILE
pointer.
.SH START CONDITIONS
.I flex
provides a mechanism for conditionally activating rules. Any rule
whose pattern is prefixed with "<sc>" will only be active when
the scanner is in the start condition named "sc". For example,
.nf

    <STRING>[^"]*        { /* eat up the string body ... */
                ...
                }

.fi
will be active only when the scanner is in the "STRING" start
condition, and
.nf

    <INITIAL,STRING,QUOTE>\\.        { /* handle an escape ... */
                ...
                }

.fi
will be active only when the current start condition is
either "INITIAL", "STRING", or "QUOTE".
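.PP
One way to picture the "<sc>" qualifier is as a set of start conditions
attached to each rule. The C fragment below is a sketch of that idea
only and makes no claim about the tables
.I flex
actually generates: the SC_ constants, the SC_BIT() macro, struct
rule_model, and rule_is_active() are hypothetical, and the model covers
only rules that carry an explicit qualifier.
.nf

```c
#include <assert.h>

/* Hypothetical start-condition numbers; 0 plays the role of INITIAL. */
enum { SC_INITIAL = 0, SC_STRING = 1, SC_QUOTE = 2 };

/* A rule's qualifier modelled as a bit set, one bit per start condition. */
#define SC_BIT(sc) (1u << (sc))

struct rule_model {
    unsigned conditions;   /* bit set built from SC_BIT() values */
};

/* Nonzero if the rule is active while the scanner is in `current`. */
static int rule_is_active(const struct rule_model *rule, int current)
{
    return (rule->conditions & SC_BIT(current)) != 0;
}
```

.fi
In this model the first rule above would carry SC_BIT(SC_STRING) alone,
while the second would carry the three bits for INITIAL, STRING, and
QUOTE.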
.PP
Start conditions
are declared in the definitions (first) section of the input
using unindented lines beginning with either
.B %s
or
.B %x
followed by a list of names.
The former declares
.I inclusive
start conditions, the latter
.I exclusive
start conditions. A start condition is activated using the
.B BEGIN
action. Until the next
.B BEGIN
action is executed, rules with the given start
condition will be active and
rules with other start conditions will be inactive.
If the start condition is
.I inclusive,
then rules with no start conditions at all will also be active.
If it is
.I exclusive,
then
.I only
rules qualified with the start condition will be active.
A set of rules contingent on the same exclusive start condition
describes a scanner which is independent of any of the other rules in the
.I flex
input. Because of this,
exclusive start conditions make it easy to specify "mini-scanners"
which scan portions of the input that are syntactically different
from the rest (e.g., comments).
.PP
If the distinction between inclusive and exclusive start conditions
is still a little vague, here's a simple example illustrating the
connection between the two. The set of rules:
.nf

    %s example
    %%

    <example>foo   do_something();

    bar            something_else();

.fi
is equivalent to
.nf

    %x example
    %%

    <example>foo   do_something();

    <INITIAL,example>bar    something_else();

.fi
Without the
.B <INITIAL,example>
qualifier, the
.I bar
pattern in the second example wouldn't be active (i.e., couldn't match)
when in start condition
.B example.
If we just used
.B <example>
to qualify
.I bar,
though, then it would only be active in
.B example
and not in
.B INITIAL,
while in the first example it's active in both, because in the first
example the
.B example
start condition is an
.I inclusive
.B (%s)
start condition.
.PP
Also note that the special start-condition specifier
.B <*>
matches every start condition. Thus, the above example could also
have been written:
.nf

    %x example
    %%

    <example>foo   do_something();

    <*>bar    something_else();

.fi
.PP
The default rule (to
.B ECHO
any unmatched character) remains active in start conditions. It
is equivalent to:
.nf

    <*>.|\\n     ECHO;

.fi
.PP
.B BEGIN(0)
returns to the original state where only the rules with
no start conditions are active. This state can also be
referred to as the start-condition "INITIAL", so
.B BEGIN(INITIAL)
is equivalent to
.B BEGIN(0).
(The parentheses around the start condition name are not required but
are considered good style.)
.PP
.B BEGIN
actions can also be given as indented code at the beginning
of the rules section. For example, the following will cause
the scanner to enter the "SPECIAL" start condition whenever
.B yylex()
is called and the global variable
.I enter_special
is true:
.nf

            int enter_special;

    %x SPECIAL
    %%
            if ( enter_special )
                BEGIN(SPECIAL);

    <SPECIAL>blahblahblah
    ...more rules follow...

.fi
.PP
To illustrate the uses of start conditions,
here is a scanner which provides two different interpretations
of a string like "123.456". By default it will treat it as
three tokens, the integer "123", a dot ('.'), and the integer "456".
But if the string is preceded earlier in the line by the string
"expect-floats"
it will treat it as a single token, the floating-point number
123.456:
.nf

    %{
    #include <stdlib.h>    /* atof(), atoi() */
    %}
    %s expect

    %%
    expect-floats        BEGIN(expect);

    <expect>[0-9]+"."[0-9]+      {
                printf( "found a float, = %f\\n",
                        atof( yytext ) );
                }
    <expect>\\n           {
                /* that's the end of the line, so
                 * we need another "expect-floats"
                 * before we'll recognize any more
                 * floating-point numbers
                 */
                BEGIN(INITIAL);
                }

    [0-9]+      {
                printf( "found an integer, = %d\\n",
                        atoi( yytext ) );
                }

    "."         printf( "found a dot\\n" );

.fi
Here is a scanner which recognizes (and discards) C comments while
maintaining a count of the current input line.
.nf

    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\\n]*        /* eat anything that's not a '*' */
    <comment>"*"+[^*/\\n]*   /* eat up '*'s not followed by '/'s */
    <comment>\\n             ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);

.fi
This scanner goes to a bit of trouble to match as much
text as possible with each rule. In general, when attempting to write
a high-speed scanner, try to match as much as possible in each rule, as
doing so is a big performance win.
.PP
Note that start condition names are really integer values and
can be stored as such. Thus, the above could be extended in the
following fashion:
.nf

    %x comment foo
    %%
            int line_num = 1;
            int comment_caller;

    "/*"         {
                 comment_caller = INITIAL;
                 BEGIN(comment);
                 }

    ...

    <foo>"/*"    {
                 comment_caller = foo;
                 BEGIN(comment);
                 }

    <comment>[^*\\n]*        /* eat anything that's not a '*' */
    <comment>"*"+[^*/\\n]*   /* eat up '*'s not followed by '/'s */
    <comment>\\n             ++line_num;
    <comment>"*"+"/"        BEGIN(comment_caller);

.fi
Furthermore, you can access the current start condition using
the integer-valued
.B YY_START
macro. For example, the above assignments to
.I comment_caller
could instead be written
.nf

    comment_caller = YY_START;

.fi
Flex provides
.B YYSTATE
as an alias for
.B YY_START
(since that is what's used by AT&T
.I lex).
.PP
Note that start conditions do not have their own name-space; %s's and %x's
declare names in the same fashion as #define's.
.PP
Finally, here's an example of how to match C-style quoted strings using
exclusive start conditions, including expanded escape sequences (but
not including checking for a string that's too long):
.nf

    %x str

    %%
            char string_buf[MAX_STR_CONST];
            char *string_buf_ptr;


    \\"      string_buf_ptr = string_buf; BEGIN(str);

    <str>\\"        { /* saw closing quote - all done */
            BEGIN(INITIAL);
            *string_buf_ptr = '\\0';
            /* return string constant token type and
             * value to parser
             */
            }

    <str>\\n        {
            /* error - unterminated string constant */
            /* generate error message */
            }

    <str>\\\\[0-7]{1,3} {
            /* octal escape sequence */
            unsigned int result;

            (void) sscanf( yytext + 1, "%o", &result );

            if ( result > 0xff )
                    {
                    /* error, constant is out-of-bounds */
                    }
            else
                    *string_buf_ptr++ = result;
            }

    <str>\\\\[0-9]+ {
            /* generate error - bad escape sequence; something
             * like '\\48' or '\\0777777'
             */
            }

    <str>\\\\n  *string_buf_ptr++ = '\\n';
    <str>\\\\t  *string_buf_ptr++ = '\\t';
    <str>\\\\r  *string_buf_ptr++ = '\\r';
    <str>\\\\b  *string_buf_ptr++ = '\\b';
    <str>\\\\f  *string_buf_ptr++ = '\\f';

    <str>\\\\(.|\\n)  *string_buf_ptr++ = yytext[1];

    <str>[^\\\\\\n\\"]+        {
            char *yptr = yytext;

            while ( *yptr )
                    *string_buf_ptr++ = *yptr++;
            }

.fi
.PP
Often, such as in some of the examples above, you wind up writing a
whole bunch of rules all preceded by the same start condition(s). Flex
makes this a little easier and cleaner by introducing a notion of
start condition
.I scope.
A start condition scope is begun with:
.nf

    <SCs>{

.fi
where
.I SCs
is a list of one or more start conditions. Inside the start condition
scope, every rule automatically has the prefix
.I <SCs>
applied to it, until a
.I '}'
which matches the initial
.I '{'.
So, for example,
.nf

    <ESC>{
        "\\\\n"   return '\\n';
        "\\\\r"   return '\\r';
        "\\\\f"   return '\\f';
        "\\\\0"   return '\\0';
    }

.fi
is equivalent to:
.nf

    <ESC>"\\\\n"  return '\\n';
    <ESC>"\\\\r"  return '\\r';
    <ESC>"\\\\f"  return '\\f';
    <ESC>"\\\\0"  return '\\0';

.fi
Start condition scopes may be nested.
.PP
Three routines are available for manipulating stacks of start conditions:
.TP
.B void yy_push_state(int new_state)
pushes the current start condition onto the top of the start condition
stack and switches to
.I new_state
as though you had used
.B BEGIN new_state
(recall that start condition names are also integers).
.TP
.B void yy_pop_state()
pops the top of the stack and switches to it via
.B BEGIN.
.TP
.B int yy_top_state()
returns the top of the stack without altering the stack's contents.
.PP
The start condition stack grows dynamically and so has no built-in
size limitation. If memory is exhausted, program execution aborts.
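.PP
The behavior of the three routines can be pictured with a small C model.
This is a sketch under stated assumptions, not the code that
.I flex
generates: the
.I model_
names and the variables are hypothetical, the initial capacity of 4 is
arbitrary, and plain assignment to
.I cur_state
stands in for
.B BEGIN.
.nf

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model only: these names are hypothetical and this is not the
 * code that flex generates. */
static int  cur_state = 0;    /* current start condition; 0 is INITIAL */
static int *stack = NULL;     /* saved start conditions */
static int  depth = 0;        /* number of entries in use */
static int  capacity = 0;     /* number of entries allocated */

/* Save the current start condition, then switch to new_state. */
static void model_push_state( int new_state )
    {
    if ( depth == capacity )
        {
        int new_capacity = capacity ? 2 * capacity : 4;
        int *grown = realloc( stack, new_capacity * sizeof( int ) );

        if ( ! grown )
            abort();    /* out of memory: give up, as the text describes */

        stack = grown;
        capacity = new_capacity;
        }

    stack[depth++] = cur_state;
    cur_state = new_state;
    }

/* Switch back to the most recently saved start condition. */
static void model_pop_state( void )
    {
    assert( depth > 0 );
    cur_state = stack[--depth];
    }

/* Report the most recently saved start condition without popping it. */
static int model_top_state( void )
    {
    assert( depth > 0 );
    return stack[depth - 1];
    }
```

.fi
In the model, a push followed by a pop leaves
.I cur_state
where it started, which is the property that makes the stack convenient
for "mini-scanners" entered from more than one place.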
.PP
To use start condition stacks, your scanner must include a
.B %option stack
directive (see Options below).
.SH MULTIPLE INPUT BUFFERS
Some scanners (such as those which support "include" files)
require reading from several input streams. As
.I flex
scanners do a large amount of buffering, one cannot control
where the next input will be read from by simply writing a
.B YY_INPUT
which is sensitive to the scanning context.
.B YY_INPUT
is only called when the scanner reaches the end of its buffer, which
may be a long time after scanning a statement such as an "include"
which requires switching the input source.
.PP
To negotiate these sorts of problems,
.I flex
provides a mechanism for creating and switching between multiple
input buffers. An input buffer is created by using:
.nf

    YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )

.fi
which takes a
.I FILE
pointer and a size and creates a buffer associated with the given
file and large enough to hold
.I size
characters (when in doubt, use
.B YY_BUF_SIZE
for the size). It returns a
.B YY_BUFFER_STATE
handle, which may then be passed to other routines (see below). The
.B YY_BUFFER_STATE
type is a pointer to an opaque
.B struct yy_buffer_state
structure, so you may safely initialize YY_BUFFER_STATE variables to
.B ((YY_BUFFER_STATE) 0)
if you wish, and also refer to the opaque structure in order to
correctly declare input buffers in source files other than that
of your scanner. Note that the
.I FILE
pointer in the call to
.B yy_create_buffer
is only used as the value of
.I yyin
seen by
.B YY_INPUT;
if you redefine
.B YY_INPUT
so it no longer uses
.I yyin,
then you can safely pass a nil
.I FILE
pointer to
.B yy_create_buffer.
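.PP
For instance, a source file other than the scanner might declare the
handle type and a record that stores a handle as sketched below.
The two declarations of the opaque type follow the description above;
.I struct include_entry, empty_entry
and
.I entry_has_buffer()
are hypothetical names used only for illustration.
.nf

```c
#include <assert.h>
#include <stddef.h>

/* Forward declaration of the opaque structure, followed by the handle
 * type as the text above describes it. */
struct yy_buffer_state;
typedef struct yy_buffer_state *YY_BUFFER_STATE;

/* Hypothetical bookkeeping record for one open include file. */
struct include_entry
    {
    const char *name;          /* file name, for error messages */
    YY_BUFFER_STATE buffer;    /* handle returned by yy_create_buffer() */
    };

/* An entry that does not yet refer to any buffer. */
static struct include_entry empty_entry = { NULL, (YY_BUFFER_STATE) 0 };

/* True if the entry has been given a buffer. */
static int entry_has_buffer( const struct include_entry *entry )
    {
    return entry->buffer != (YY_BUFFER_STATE) 0;
    }
```

.fi
Because the structure is opaque, such a file can store and compare
handles but never looks inside them.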
You select a particular buffer to scan from using:
.nf

    void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )

.fi
which switches the scanner's input buffer so subsequent tokens will
come from
.I new_buffer.
Note that
.B yy_switch_to_buffer()
may be used by yywrap() to set things up for continued scanning, instead
of opening a new file and pointing
.I yyin
at it. Note also that switching input sources via either
.B yy_switch_to_buffer()
or
.B yywrap()
does
.I not
change the start condition.
.nf

    void yy_delete_buffer( YY_BUFFER_STATE buffer )

.fi
is used to reclaim the storage associated with a buffer.
.RB ( buffer
can be nil, in which case the routine does nothing.)
You can also clear the current contents of a buffer using:
.nf

    void yy_flush_buffer( YY_BUFFER_STATE buffer )

.fi
This function discards the buffer's contents,
so the next time the scanner attempts to match a token from the
buffer, it will first fill the buffer anew using
.B YY_INPUT.
.PP
.B yy_new_buffer()
is an alias for
.B yy_create_buffer(),
provided for compatibility with the C++ use of
.I new
and
.I delete
for creating and destroying dynamic objects.
.PP
Finally, the
.B YY_CURRENT_BUFFER
macro returns a
.B YY_BUFFER_STATE
handle to the current buffer.
.PP
Here is an example of using these features for writing a scanner
which expands include files (the
.B <<EOF>>
feature is discussed below):
.nf

    /* the "incl" state is used for picking up the name
     * of an include file
     */
    %x incl

    %{
    #define MAX_INCLUDE_DEPTH 10
    YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
    int include_stack_ptr = 0;
    %}

    %%
    include             BEGIN(incl);

    [a-z]+              ECHO;
    [^a-z\\n]*\\n?        ECHO;

    <incl>[ \\t]*      /* eat the whitespace */
    <incl>[^ \\t\\n]+   { /* got the include file name */
            if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
                {
                fprintf( stderr, "Includes nested too deeply\\n" );
                exit( 1 );
                }

            include_stack[include_stack_ptr++] =
                YY_CURRENT_BUFFER;

            yyin = fopen( yytext, "r" );

            if ( ! yyin )
                error( ... );

            yy_switch_to_buffer(
                yy_create_buffer( yyin, YY_BUF_SIZE ) );

            BEGIN(INITIAL);
            }

    <<EOF>> {
            if ( --include_stack_ptr < 0 )
                {
                yyterminate();
                }

            else
                {
                yy_delete_buffer( YY_CURRENT_BUFFER );
                yy_switch_to_buffer(
                     include_stack[include_stack_ptr] );
                }
            }

.fi
Three routines are available for setting up input buffers for
scanning in-memory strings instead of files. All of them create
a new input buffer for scanning the string, and return a corresponding
.B YY_BUFFER_STATE
handle (which you should delete with
.B yy_delete_buffer()
when done with it). They also switch to the new buffer using
.B yy_switch_to_buffer(),
so the next call to
.B yylex()
will start scanning the string.
.TP
.B yy_scan_string(const char *str)
scans a NUL-terminated string.
.TP
.B yy_scan_bytes(const char *bytes, int len)
scans
.I len
bytes (including possibly NUL's)
starting at location
.I bytes.
.PP
Note that both of these functions create and scan a
.I copy
of the string or bytes. (This may be desirable, since
.B yylex()
modifies the contents of the buffer it is scanning.) You can avoid the
copy by using:
.TP
.B yy_scan_buffer(char *base, yy_size_t size)
which scans in place the buffer starting at
.I base,
consisting of
.I size
bytes, the last two bytes of which
.I must
be
.B YY_END_OF_BUFFER_CHAR
(ASCII NUL).
These last two bytes are not scanned; thus, scanning
consists of
.B base[0]
through
.B base[size-3],
inclusive.
.IP
If you fail to set up
.I base
in this manner (i.e., forget the final two
.B YY_END_OF_BUFFER_CHAR
bytes), then
.B yy_scan_buffer()
returns a nil pointer instead of creating a new input buffer.
.IP
The type
.B yy_size_t
is an integral type to which you can cast an integer expression
reflecting the size of the buffer.
.SH END-OF-FILE RULES
The special rule "<<EOF>>" indicates
actions which are to be taken when an end-of-file is
encountered and yywrap() returns non-zero (i.e., indicates
no further files to process). The action must finish
by doing one of four things:
.IP -
assigning
.I yyin
to a new input file (in previous versions of flex, after doing the
assignment you had to call the special action
.B YY_NEW_FILE;
this is no longer necessary);
.IP -
executing a
.I return
statement;
.IP -
executing the special
.B yyterminate()
action;
.IP -
or, switching to a new buffer using
.B yy_switch_to_buffer()
as shown in the example above.
.PP
<<EOF>> rules may not be used with other
patterns; they may only be qualified with a list of start
conditions. If an unqualified <<EOF>> rule is given, it
applies to
.I all
start conditions which do not already have <<EOF>> actions. To
specify an <<EOF>> rule for only the initial start condition, use
.nf

    <INITIAL><<EOF>>

.fi
.PP
These rules are useful for catching things like unclosed comments.
An example:
.nf

    %x quote
    %%

    ...other rules for dealing with quotes...

    <quote><<EOF>>   {
             error( "unterminated quote" );
             yyterminate();
             }
    <<EOF>>  {
             if ( *++filelist )
                 yyin = fopen( *filelist, "r" );
             else
                yyterminate();
             }

.fi
.SH MISCELLANEOUS MACROS
The macro
.B YY_USER_ACTION
can be defined to provide an action
which is always executed prior to the matched rule's action. For example,
it could be #define'd to call a routine to convert yytext to lower-case.
When
.B YY_USER_ACTION
is invoked, the variable
.I yy_act
gives the number of the matched rule (rules are numbered starting with 1).
Suppose you want to profile how often each of your rules is matched. The
following would do the trick:
.nf

    #define YY_USER_ACTION ++ctr[yy_act]

.fi
where
.I ctr
is an array to hold the counts for the different rules. Note that
the macro
.B YY_NUM_RULES
gives the total number of rules (including the default rule, even if
you use
.B \-s).
Because rule numbers start at 1, a declaration for
.I ctr
that leaves room for every rule number (element 0 goes unused) is:
.nf

    int ctr[YY_NUM_RULES + 1];

.fi
.PP
The macro
.B YY_USER_INIT
may be defined to provide an action which is always executed before
the first scan (and before the scanner's internal initializations are done).
For example, it could be used to call a routine to read
in a data table or open a logging file.
.PP
The macro
.B yy_set_interactive(is_interactive)
can be used to control whether the current buffer is considered
.I interactive.
An interactive buffer is processed more slowly,
but must be used when the scanner's input source is indeed
interactive to avoid problems due to waiting to fill buffers
(see the discussion of the
.B \-I
flag below). A non-zero value
in the macro invocation marks the buffer as interactive, a zero
value as non-interactive.
Note that use of this macro overrides
.B %option interactive,
.B %option always-interactive
or
.B %option never-interactive
(see Options below).
.B yy_set_interactive()
must be invoked prior to beginning to scan the buffer that is
(or is not) to be considered interactive.
.PP
The macro
.B yy_set_bol(at_bol)
can be used to control whether the next token match in the current
buffer is done as though at the
beginning of a line. A non-zero macro argument makes rules anchored with
\&'^' active, while a zero argument makes '^' rules inactive.
.PP
The macro
.B YY_AT_BOL()
returns true if the next token scanned from the current buffer
will have '^' rules active, false otherwise.
.PP
In the generated scanner, the actions are all gathered in one large
switch statement and separated using
.B YY_BREAK,
which may be redefined. By default, it is simply a "break", to separate
each rule's action from the following rule's.
Redefining
.B YY_BREAK
allows, for example, C++ users to
#define YY_BREAK to do nothing (while being very careful that every
rule ends with a "break" or a "return"!) to avoid
unreachable statement warnings in cases where a rule's action ends with
"return", which leaves the
.B YY_BREAK
unreachable.
.SH VALUES AVAILABLE TO THE USER
This section summarizes the various values available to the user
in the rule actions.
.IP -
.B char *yytext
holds the text of the current token. It may be modified but not lengthened
(you cannot append characters to the end).
.IP
If the special directive
.B %array
appears in the first section of the scanner description, then
.B yytext
is instead declared
.B char yytext[YYLMAX],
where
.B YYLMAX
is a macro definition that you can redefine in the first section
if you don't like the default value (generally 8KB). Using
.B %array
results in somewhat slower scanners, but the value of
.B yytext
becomes immune to calls to
.I input()
and
.I unput(),
which potentially destroy its value when
.B yytext
is a character pointer. The opposite of
.B %array
is
.B %pointer,
which is the default.
.IP
You cannot use
.B %array
when generating C++ scanner classes
(the
.B \-+
flag).
.IP -
.B int yyleng
holds the length of the current token.
.IP -
.B FILE *yyin
is the file which by default
.I flex
reads from. It may be redefined but doing so only makes sense before
scanning begins or after an EOF has been encountered. Changing it in
the midst of scanning will have unexpected results since
.I flex
buffers its input; use
.B yyrestart()
instead.
Once scanning terminates because an end-of-file
has been seen, you can point
.I yyin
at the new input file and then call the scanner again to continue scanning.
.IP -
.B void yyrestart( FILE *new_file )
may be called to point
.I yyin
at the new input file. The switch-over to the new file is immediate
(any previously buffered-up input is lost). Note that calling
.B yyrestart()
with
.I yyin
as an argument thus throws away the current input buffer and continues
scanning the same input file.
.IP -
.B FILE *yyout
is the file to which
.B ECHO
actions are done. It can be reassigned by the user.
.IP -
.B YY_CURRENT_BUFFER
returns a
.B YY_BUFFER_STATE
handle to the current buffer.
.IP -
.B YY_START
returns an integer value corresponding to the current start
condition. You can subsequently use this value with
.B BEGIN
to return to that start condition.
.SH INTERFACING WITH YACC
One of the main uses of
.I flex
is as a companion to the
.I yacc
parser-generator.
.I yacc
parsers expect to call a routine named
.B yylex()
to find the next input token. The routine is supposed to
return the type of the next token as well as putting any associated
value in the global
.B yylval.
To use
.I flex
with
.I yacc,
one specifies the
.B \-d
option to
.I yacc
to instruct it to generate the file
.B y.tab.h
containing definitions of all the
.B %tokens
appearing in the
.I yacc
input. This file is then included in the
.I flex
scanner. For example, if one of the tokens is "TOK_NUMBER",
part of the scanner might look like:
.nf

    %{
    #include "y.tab.h"
    %}

    %%

    [0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;

.fi
.SH OPTIONS
.I flex
has the following options:
.TP
.B \-b
Generate backing-up information to
.I lex.backup.
This is a list of scanner states which require backing up
and the input characters on which they do so. By adding rules one
can remove backing-up states. If
.I all
backing-up states are eliminated and
.B \-Cf
or
.B \-CF
is used, the generated scanner will run faster (see the
.B \-p
flag). Only users who wish to squeeze every last cycle out of their
scanners need worry about this option. (See the section on Performance
Considerations below.)
.TP
.B \-c
is a do-nothing, deprecated option included for POSIX compliance.
.TP
.B \-d
makes the generated scanner run in
.I debug
mode.
Whenever a pattern is recognized and the global
.B yy_flex_debug
is non-zero (which is the default),
the scanner will write to
.I stderr
a line of the form:
.nf

    --accepting rule at line 53 ("the matched text")

.fi
The line number refers to the location of the rule in the file
defining the scanner (i.e., the file that was fed to flex). Messages
are also generated when the scanner backs up, accepts the
default rule, reaches the end of its input buffer (or encounters
a NUL; at this point, the two look the same as far as the scanner's concerned),
or reaches an end-of-file.
.TP
.B \-f
specifies
.I fast scanner.
No table compression is done and stdio is bypassed.
The result is large but fast. This option is equivalent to
.B \-Cfr
(see below).
.TP
.B \-h
generates a "help" summary of
.I flex's
options to
.I stdout
and then exits.
.B \-?
and
.B \-\-help
are synonyms for
.B \-h.
.TP
.B \-i
instructs
.I flex
to generate a
.I case-insensitive
scanner. The case of letters given in the
.I flex
input patterns will
be ignored, and tokens in the input will be matched regardless of case. The
matched text given in
.I yytext
will have its case preserved (i.e., it will not be folded).
.TP
.B \-l
turns on maximum compatibility with the original AT&T
.I lex
implementation. Note that this does not mean
.I full
compatibility. Use of this option costs a considerable amount of
performance, and it cannot be used with the
.B \-+, \-f, \-F, \-Cf,
or
.B \-CF
options. For details on the compatibilities it provides, see the section
"Incompatibilities With Lex And POSIX" below. This option also results
in the name
.B YY_FLEX_LEX_COMPAT
being #define'd in the generated scanner.
.TP
.B \-n
is another do-nothing, deprecated option included only for
POSIX compliance.
.TP
.B \-p
generates a performance report to stderr. The report
consists of comments regarding features of the
.I flex
input file which will cause a serious loss of performance in the resulting
scanner. If you give the flag twice, you will also get comments regarding
features that lead to minor performance losses.
.IP
Note that the use of
.B REJECT,
.B %option yylineno,
and variable trailing context (see the Deficiencies / Bugs section below)
entails a substantial performance penalty; use of
.I yymore(),
the
.B ^
operator,
and the
.B \-I
flag entails minor performance penalties.
.TP
.B \-s
causes the
.I default rule
(that unmatched scanner input is echoed to
.I stdout)
to be suppressed. If the scanner encounters input that does not
match any of its rules, it aborts with an error. This option is
useful for finding holes in a scanner's rule set.
.TP
.B \-t
instructs
.I flex
to write the scanner it generates to standard output instead
of
.B lex.yy.c.
.TP
.B \-v
specifies that
.I flex
should write to
.I stderr
a summary of statistics regarding the scanner it generates.
Most of the statistics are meaningless to the casual
.I flex
user, but the first line identifies the version of
.I flex
(same as reported by
.B \-V),
and the next line the flags used when generating the scanner, including
those that are on by default.
.TP
.B \-w
suppresses warning messages.
.TP
.B \-B
instructs
.I flex
to generate a
.I batch
scanner, the opposite of
.I interactive
scanners generated by
.B \-I
(see below).
In general, you use
.B \-B
when you are
.I certain
that your scanner will never be used interactively, and you want to
squeeze a
.I little
more performance out of it. If your goal is instead to squeeze out a
.I lot
more performance, you should be using the
.B \-Cf
or
.B \-CF
options (discussed below), which turn on
.B \-B
automatically anyway.
.TP
.B \-F
specifies that the
.ul
fast
scanner table representation should be used (and stdio
bypassed). This representation is
about as fast as the full table representation
.B (\-f),
and for some sets of patterns will be considerably smaller (and for
others, larger). In general, if the pattern set contains both "keywords"
and a catch-all, "identifier" rule, such as in the set:
.nf

    "case"    return TOK_CASE;
    "switch"  return TOK_SWITCH;
    ...
    "default" return TOK_DEFAULT;
    [a-z]+    return TOK_ID;

.fi
then you're better off using the full table representation. If only
the "identifier" rule is present and you then use a hash table or some such
to detect the keywords, you're better off using
.B \-F.
.IP
This option is equivalent to
.B \-CFr
(see below). It cannot be used with
.B \-+.
.TP
.B \-I
instructs
.I flex
to generate an
.I interactive
scanner. An interactive scanner is one that only looks ahead to decide
what token has been matched if it absolutely must. It turns out that
always looking one extra character ahead, even if the scanner has already
seen enough text to disambiguate the current token, is a bit faster than
only looking ahead when necessary. But scanners that always look ahead
give dreadful interactive performance; for example, when a user types
a newline, it is not recognized as a newline token until they enter
.I another
token, which often means typing in another whole line.
.IP
.I Flex
scanners default to
.I interactive
unless you use the
.B \-Cf
or
.B \-CF
table-compression options (see below). That's because if you're looking
for high-performance you should be using one of these options, so if you
didn't,
.I flex
assumes you'd rather trade off a bit of run-time performance for intuitive
interactive behavior. Note also that you
.I cannot
use
.B \-I
in conjunction with
.B \-Cf
or
.B \-CF.
Thus, this option is not really needed; it is on by default for all those
cases in which it is allowed.
.IP
Note that if
.B isatty()
returns false for the scanner input, flex will revert to batch mode, even if
.B \-I
was specified. To force interactive mode no matter what, use
.B %option always-interactive
(see Options below).
.IP
You can force a scanner to
.I not
be interactive by using
.B \-B
(see above).
.TP
.B \-L
instructs
.I flex
not to generate
.B #line
directives. Without this option,
.I flex
peppers the generated scanner
with #line directives so error messages in the actions will be correctly
located with respect to either the original
.I flex
input file (if the errors are due to code in the input file), or
.B lex.yy.c
(if the errors are
.I flex's
fault -- you should report these sorts of errors to the email address
given below).
.TP
.B \-T
makes
.I flex
run in
.I trace
mode. It will generate a lot of messages to
.I stderr
concerning
the form of the input and the resultant non-deterministic and deterministic
finite automata. This option is mostly for use in maintaining
.I flex.
.TP
.B \-V
prints the version number to
.I stdout
and exits.
.B \-\-version
is a synonym for
.B \-V.
.TP
.B \-7
instructs
.I flex
to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
characters in its input.  The advantage of using
.B \-7
is that the scanner's tables can be up to half the size of those generated
using the
.B \-8
option (see below).  The disadvantage is that such scanners often hang
or crash if their input contains an 8-bit character.
.IP
Note, however, that unless you generate your scanner using the
.B \-Cf
or
.B \-CF
table compression options, use of
.B \-7
will save only a small amount of table space, and make your scanner
considerably less portable.
.I Flex's
default behavior is to generate an 8-bit scanner unless you use the
.B \-Cf
or
.B \-CF
options, in which case
.I flex
defaults to generating 7-bit scanners unless your site was always
configured to generate 8-bit scanners (as will often be the case
with non-USA sites).  You can tell whether flex generated a 7-bit
or an 8-bit scanner by inspecting the flag summary in the
.B \-v
output as described above.
.IP
Note that if you use
.B \-Cfe
or
.B \-CFe
(those table compression options, but also using equivalence classes as
discussed below), flex still defaults to generating an 8-bit
scanner, since usually with these compression options full 8-bit tables
are not much more expensive than 7-bit tables.
.TP
.B \-8
instructs
.I flex
to generate an 8-bit scanner, i.e., one which can recognize 8-bit
characters.  This flag is only needed for scanners generated using
.B \-Cf
or
.B \-CF,
as otherwise flex defaults to generating an 8-bit scanner anyway.
.IP
See the discussion of
.B \-7
above for flex's default behavior and the tradeoffs between 7-bit
and 8-bit scanners.
.TP
.B \-+
specifies that you want flex to generate a C++
scanner class.
See the section on Generating C++ Scanners below for
details.
.TP
.B \-C[aefFmr]
controls the degree of table compression and, more generally, trade-offs
between small scanners and fast scanners.
.IP
.B \-Ca
("align") instructs flex to trade off larger tables in the
generated scanner for faster performance because the elements of
the tables are better aligned for memory access and computation.  On some
RISC architectures, fetching and manipulating longwords is more efficient
than with smaller-sized units such as shortwords.  This option can
double the size of the tables used by your scanner.
.IP
.B \-Ce
directs
.I flex
to construct
.I equivalence classes,
i.e., sets of characters
which have identical lexical properties (for example, if the only
appearance of digits in the
.I flex
input is in the character class
"[0-9]" then the digits '0', '1', ..., '9' will all be put
in the same equivalence class).  Equivalence classes usually give
dramatic reductions in the final table/object file sizes (typically
a factor of 2-5) and are pretty cheap performance-wise (one array
look-up per character scanned).
.IP
.B \-Cf
specifies that the
.I full
scanner tables should be generated -
.I flex
should not compress the
tables by taking advantage of similar transition functions for
different states.
.IP
.B \-CF
specifies that the alternate fast scanner representation (described
above under the
.B \-F
flag)
should be used.  This option cannot be used with
.B \-+.
.IP
.B \-Cm
directs
.I flex
to construct
.I meta-equivalence classes,
which are sets of equivalence classes (or characters, if equivalence
classes are not being used) that are commonly used together.
Meta-equivalence 2483classes are often a big win when using compressed tables, but they 2484have a moderate performance impact (one or two "if" tests and one 2485array look-up per character scanned). 2486.IP 2487.B \-Cr 2488causes the generated scanner to 2489.I bypass 2490use of the standard I/O library (stdio) for input. Instead of calling 2491.B fread() 2492or 2493.B getc(), 2494the scanner will use the 2495.B read() 2496system call, resulting in a performance gain which varies from system 2497to system, but in general is probably negligible unless you are also using 2498.B \-Cf 2499or 2500.B \-CF. 2501Using 2502.B \-Cr 2503can cause strange behavior if, for example, you read from 2504.I yyin 2505using stdio prior to calling the scanner (because the scanner will miss 2506whatever text your previous reads left in the stdio input buffer). 2507.IP 2508.B \-Cr 2509has no effect if you define 2510.B YY_INPUT 2511(see The Generated Scanner above). 2512.IP 2513A lone 2514.B \-C 2515specifies that the scanner tables should be compressed but neither 2516equivalence classes nor meta-equivalence classes should be used. 2517.IP 2518The options 2519.B \-Cf 2520or 2521.B \-CF 2522and 2523.B \-Cm 2524do not make sense together - there is no opportunity for meta-equivalence 2525classes if the table is not being compressed. Otherwise the options 2526may be freely mixed, and are cumulative. 2527.IP 2528The default setting is 2529.B \-Cem, 2530which specifies that 2531.I flex 2532should generate equivalence classes 2533and meta-equivalence classes. This setting provides the highest 2534degree of table compression. 
You can trade off 2535faster-executing scanners at the cost of larger tables with 2536the following generally being true: 2537.nf 2538 2539 slowest & smallest 2540 -Cem 2541 -Cm 2542 -Ce 2543 -C 2544 -C{f,F}e 2545 -C{f,F} 2546 -C{f,F}a 2547 fastest & largest 2548 2549.fi 2550Note that scanners with the smallest tables are usually generated and 2551compiled the quickest, so 2552during development you will usually want to use the default, maximal 2553compression. 2554.IP 2555.B \-Cfe 2556is often a good compromise between speed and size for production 2557scanners. 2558.TP 2559.B \-ooutput 2560directs flex to write the scanner to the file 2561.B output 2562instead of 2563.B lex.yy.c. 2564If you combine 2565.B \-o 2566with the 2567.B \-t 2568option, then the scanner is written to 2569.I stdout 2570but its 2571.B #line 2572directives (see the 2573.B \\-L 2574option above) refer to the file 2575.B output. 2576.TP 2577.B \-Pprefix 2578changes the default 2579.I "yy" 2580prefix used by 2581.I flex 2582for all globally-visible variable and function names to instead be 2583.I prefix. 2584For example, 2585.B \-Pfoo 2586changes the name of 2587.B yytext 2588to 2589.B footext. 2590It also changes the name of the default output file from 2591.B lex.yy.c 2592to 2593.B lex.foo.c. 2594Here are all of the names affected: 2595.nf 2596 2597 yy_create_buffer 2598 yy_delete_buffer 2599 yy_flex_debug 2600 yy_init_buffer 2601 yy_flush_buffer 2602 yy_load_buffer_state 2603 yy_switch_to_buffer 2604 yyin 2605 yyleng 2606 yylex 2607 yylineno 2608 yyout 2609 yyrestart 2610 yytext 2611 yywrap 2612 2613.fi 2614(If you are using a C++ scanner, then only 2615.B yywrap 2616and 2617.B yyFlexLexer 2618are affected.) 2619Within your scanner itself, you can still refer to the global variables 2620and functions using either version of their name; but externally, they 2621have the modified name. 
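.IP
The renaming can be pictured with ordinary C macros.  The sketch below is
an illustration only: it assumes the prefix is applied with
.B #define
lines of this general shape, and the scanner bodies, the return values and
the strings are invented stand-ins for what two generated scanners, built
with
.B \-Pfoo
and
.B \-Pbar,
would contain.  In practice each scanner lives in its own generated file;
they share one file here only to keep the sketch self-contained.
.nf

```c
/* Stand-in for a scanner built with -Pfoo (hypothetical contents). */
#define yylex  foolex
#define yytext footext
const char *yytext = "from foo";   /* really defines the global footext */
int yylex(void) { return 1; }      /* really defines the function foolex() */
#undef yylex
#undef yytext

/* Stand-in for a second scanner built with -Pbar (hypothetical contents). */
#define yylex  barlex
#define yytext bartext
const char *yytext = "from bar";   /* really defines the global bartext */
int yylex(void) { return 2; }      /* really defines the function barlex() */
#undef yylex
#undef yytext

/*
 * Code outside the scanners sees only the prefixed names, so both
 * scanners can be linked into one program without clashing.
 */
int run_both(void)
{
    return foolex() + barlex();
}
```

.fi
Inside each stand-in scanner the code is still written in terms of
.B yylex
and
.B yytext;
only the names visible to the linker differ.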
2622.IP 2623This option lets you easily link together multiple 2624.I flex 2625programs into the same executable. Note, though, that using this 2626option also renames 2627.B yywrap(), 2628so you now 2629.I must 2630either 2631provide your own (appropriately-named) version of the routine for your 2632scanner, or use 2633.B %option noyywrap, 2634as linking with 2635.B \-ll 2636no longer provides one for you by default. 2637.TP 2638.B \-Sskeleton_file 2639overrides the default skeleton file from which 2640.I flex 2641constructs its scanners. You'll never need this option unless you are doing 2642.I flex 2643maintenance or development. 2644.PP 2645.I flex 2646also provides a mechanism for controlling options within the 2647scanner specification itself, rather than from the flex command-line. 2648This is done by including 2649.B %option 2650directives in the first section of the scanner specification. 2651You can specify multiple options with a single 2652.B %option 2653directive, and multiple directives in the first section of your flex input 2654file. 2655.PP 2656Most options are given simply as names, optionally preceded by the 2657word "no" (with no intervening whitespace) to negate their meaning. 
2658A number are equivalent to flex flags or their negation: 2659.nf 2660 2661 7bit -7 option 2662 8bit -8 option 2663 align -Ca option 2664 backup -b option 2665 batch -B option 2666 c++ -+ option 2667 2668 caseful or 2669 case-sensitive opposite of -i (default) 2670 2671 case-insensitive or 2672 caseless -i option 2673 2674 debug -d option 2675 default opposite of -s option 2676 ecs -Ce option 2677 fast -F option 2678 full -f option 2679 interactive -I option 2680 lex-compat -l option 2681 meta-ecs -Cm option 2682 perf-report -p option 2683 read -Cr option 2684 stdout -t option 2685 verbose -v option 2686 warn opposite of -w option 2687 (use "%option nowarn" for -w) 2688 2689 array equivalent to "%array" 2690 pointer equivalent to "%pointer" (default) 2691 2692.fi 2693Some 2694.B %option's 2695provide features otherwise not available: 2696.TP 2697.B always-interactive 2698instructs flex to generate a scanner which always considers its input 2699"interactive". Normally, on each new input file the scanner calls 2700.B isatty() 2701in an attempt to determine whether 2702the scanner's input source is interactive and thus should be read a 2703character at a time. When this option is used, however, then no 2704such call is made. 2705.TP 2706.B main 2707directs flex to provide a default 2708.B main() 2709program for the scanner, which simply calls 2710.B yylex(). 2711This option implies 2712.B noyywrap 2713(see below). 2714.TP 2715.B never-interactive 2716instructs flex to generate a scanner which never considers its input 2717"interactive" (again, no call made to 2718.B isatty()). 2719This is the opposite of 2720.B always-interactive. 2721.TP 2722.B stack 2723enables the use of start condition stacks (see Start Conditions above). 2724.TP 2725.B stdinit 2726if set (i.e., 2727.B %option stdinit) 2728initializes 2729.I yyin 2730and 2731.I yyout 2732to 2733.I stdin 2734and 2735.I stdout, 2736instead of the default of 2737.I nil. 
2738Some existing 2739.I lex 2740programs depend on this behavior, even though it is not compliant with 2741ANSI C, which does not require 2742.I stdin 2743and 2744.I stdout 2745to be compile-time constant. 2746.TP 2747.B yylineno 2748directs 2749.I flex 2750to generate a scanner that maintains the number of the current line 2751read from its input in the global variable 2752.B yylineno. 2753This option is implied by 2754.B %option lex-compat. 2755.TP 2756.B yywrap 2757if unset (i.e., 2758.B %option noyywrap), 2759makes the scanner not call 2760.B yywrap() 2761upon an end-of-file, but simply assume that there are no more 2762files to scan (until the user points 2763.I yyin 2764at a new file and calls 2765.B yylex() 2766again). 2767.PP 2768.I flex 2769scans your rule actions to determine whether you use the 2770.B REJECT 2771or 2772.B yymore() 2773features. The 2774.B reject 2775and 2776.B yymore 2777options are available to override its decision as to whether you use the 2778options, either by setting them (e.g., 2779.B %option reject) 2780to indicate the feature is indeed used, or 2781unsetting them to indicate it actually is not used 2782(e.g., 2783.B %option noyymore). 2784.PP 2785Three options take string-delimited values, offset with '=': 2786.nf 2787 2788 %option outfile="ABC" 2789 2790.fi 2791is equivalent to 2792.B -oABC, 2793and 2794.nf 2795 2796 %option prefix="XYZ" 2797 2798.fi 2799is equivalent to 2800.B -PXYZ. 2801Finally, 2802.nf 2803 2804 %option yyclass="foo" 2805 2806.fi 2807only applies when generating a C++ scanner ( 2808.B \-+ 2809option). It informs 2810.I flex 2811that you have derived 2812.B foo 2813as a subclass of 2814.B yyFlexLexer, 2815so 2816.I flex 2817will place your actions in the member function 2818.B foo::yylex() 2819instead of 2820.B yyFlexLexer::yylex(). 2821It also generates a 2822.B yyFlexLexer::yylex() 2823member function that emits a run-time error (by invoking 2824.B yyFlexLexer::LexerError()) 2825if called. 
See Generating C++ Scanners, below, for additional information.
.PP
A number of options are available for lint purists who want to suppress
the appearance of unneeded routines in the generated scanner.  Each of the
following, if unset
(e.g.,
.B %option nounput
), results in the corresponding routine not appearing in
the generated scanner:
.nf

    input, unput
    yy_push_state, yy_pop_state, yy_top_state
    yy_scan_buffer, yy_scan_bytes, yy_scan_string

.fi
(though
.B yy_push_state()
and friends won't appear anyway unless you use
.B %option stack).
.SH PERFORMANCE CONSIDERATIONS
The main design goal of
.I flex
is that it generate high-performance scanners.  It has been optimized
for dealing well with large sets of rules.  Aside from the effects on
scanner speed of the table compression
.B \-C
options outlined above,
there are a number of options/actions which degrade performance.  These
are, from most expensive to least:
.nf

    REJECT
    %option yylineno
    arbitrary trailing context

    pattern sets that require backing up
    %array
    %option interactive
    %option always-interactive

    '^' beginning-of-line operator
    yymore()

.fi
with the first three all being quite expensive and the last two
being quite cheap.  Note also that
.B unput()
is implemented as a routine call that potentially does quite a bit of
work, while
.B yyless()
is a quite-cheap macro; so if just putting back some excess text you
scanned, use
.B yyless().
.PP
.B REJECT
should be avoided at all costs when performance is important.
It is a particularly expensive option.
.PP
Getting rid of backing up is messy and often may be an enormous
amount of work for a complicated scanner.  In principle, one begins
by using the
.B \-b
flag to generate a
.I lex.backup
file.
For example, on the input 2892.nf 2893 2894 %% 2895 foo return TOK_KEYWORD; 2896 foobar return TOK_KEYWORD; 2897 2898.fi 2899the file looks like: 2900.nf 2901 2902 State #6 is non-accepting - 2903 associated rule line numbers: 2904 2 3 2905 out-transitions: [ o ] 2906 jam-transitions: EOF [ \\001-n p-\\177 ] 2907 2908 State #8 is non-accepting - 2909 associated rule line numbers: 2910 3 2911 out-transitions: [ a ] 2912 jam-transitions: EOF [ \\001-` b-\\177 ] 2913 2914 State #9 is non-accepting - 2915 associated rule line numbers: 2916 3 2917 out-transitions: [ r ] 2918 jam-transitions: EOF [ \\001-q s-\\177 ] 2919 2920 Compressed tables always back up. 2921 2922.fi 2923The first few lines tell us that there's a scanner state in 2924which it can make a transition on an 'o' but not on any other 2925character, and that in that state the currently scanned text does not match 2926any rule. The state occurs when trying to match the rules found 2927at lines 2 and 3 in the input file. 2928If the scanner is in that state and then reads 2929something other than an 'o', it will have to back up to find 2930a rule which is matched. With 2931a bit of headscratching one can see that this must be the 2932state it's in when it has seen "fo". When this has happened, 2933if anything other than another 'o' is seen, the scanner will 2934have to back up to simply match the 'f' (by the default rule). 2935.PP 2936The comment regarding State #8 indicates there's a problem 2937when "foob" has been scanned. Indeed, on any character other 2938than an 'a', the scanner will have to back up to accept "foo". 2939Similarly, the comment for State #9 concerns when "fooba" has 2940been scanned and an 'r' does not follow. 2941.PP 2942The final comment reminds us that there's no point going to 2943all the trouble of removing backing up from the rules unless 2944we're using 2945.B \-Cf 2946or 2947.B \-CF, 2948since there's no performance gain doing so with compressed scanners. 
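.PP
The three non-accepting states reported above can be reproduced with a
small hand-written matcher in plain C.  This is a sketch for illustration
only and does not reflect how
.I flex
builds or stores its tables; the function name
.B scan_token
is hypothetical.  It performs a longest-match scan for the rules "foo" and
"foobar", falls back on the default rule (match one character), and reports
how many characters it had already matched beyond the text it finally
accepts, that is, how far it has to back up.
.nf

```c
#include <stddef.h>

/*
 * Hypothetical model of the two-rule scanner above.  Scans one token
 * starting at `s`, stores its length in *len, and returns the number
 * of matched characters that must be given back (the backing up).
 *
 * Accepting points: after "f" (default rule), after "foo", and after
 * "foobar".  After "fo", "foob" and "fooba" the matcher is in a
 * non-accepting state, matching States #6, #8 and #9 in lex.backup.
 */
size_t scan_token(const char *s, size_t *len)
{
    static const char longest[] = "foobar";
    size_t matched = 0;    /* characters of "foobar" matched so far */
    size_t accepted = 0;   /* length at the last accepting point */

    while (longest[matched] != '\0' && s[matched] == longest[matched]) {
        ++matched;
        if (matched == 1 || matched == 3 || matched == 6)
            accepted = matched;
    }

    if (accepted == 0) {
        /* Not an 'f': the default rule takes one character, if any. */
        *len = (s[0] != '\0') ? 1 : 0;
        return 0;
    }

    *len = accepted;
    return matched - accepted;
}
```

.fi
On "fooba!" the matcher gets five characters in before failing, accepts
"foo", and gives two characters back; on "foobar" or on "foo " nothing is
given back.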
2949.PP 2950The way to remove the backing up is to add "error" rules: 2951.nf 2952 2953 %% 2954 foo return TOK_KEYWORD; 2955 foobar return TOK_KEYWORD; 2956 2957 fooba | 2958 foob | 2959 fo { 2960 /* false alarm, not really a keyword */ 2961 return TOK_ID; 2962 } 2963 2964.fi 2965.PP 2966Eliminating backing up among a list of keywords can also be 2967done using a "catch-all" rule: 2968.nf 2969 2970 %% 2971 foo return TOK_KEYWORD; 2972 foobar return TOK_KEYWORD; 2973 2974 [a-z]+ return TOK_ID; 2975 2976.fi 2977This is usually the best solution when appropriate. 2978.PP 2979Backing up messages tend to cascade. 2980With a complicated set of rules it's not uncommon to get hundreds 2981of messages. If one can decipher them, though, it often 2982only takes a dozen or so rules to eliminate the backing up (though 2983it's easy to make a mistake and have an error rule accidentally match 2984a valid token. A possible future 2985.I flex 2986feature will be to automatically add rules to eliminate backing up). 2987.PP 2988It's important to keep in mind that you gain the benefits of eliminating 2989backing up only if you eliminate 2990.I every 2991instance of backing up. Leaving just one means you gain nothing. 2992.PP 2993.I Variable 2994trailing context (where both the leading and trailing parts do not have 2995a fixed length) entails almost the same performance loss as 2996.B REJECT 2997(i.e., substantial). So when possible a rule like: 2998.nf 2999 3000 %% 3001 mouse|rat/(cat|dog) run(); 3002 3003.fi 3004is better written: 3005.nf 3006 3007 %% 3008 mouse/cat|dog run(); 3009 rat/cat|dog run(); 3010 3011.fi 3012or as 3013.nf 3014 3015 %% 3016 mouse|rat/cat run(); 3017 mouse|rat/dog run(); 3018 3019.fi 3020Note that here the special '|' action does 3021.I not 3022provide any savings, and can even make things worse (see 3023Deficiencies / Bugs below). 
3024.LP 3025Another area where the user can increase a scanner's performance 3026(and one that's easier to implement) arises from the fact that 3027the longer the tokens matched, the faster the scanner will run. 3028This is because with long tokens the processing of most input 3029characters takes place in the (short) inner scanning loop, and 3030does not often have to go through the additional work of setting up 3031the scanning environment (e.g., 3032.B yytext) 3033for the action. Recall the scanner for C comments: 3034.nf 3035 3036 %x comment 3037 %% 3038 int line_num = 1; 3039 3040 "/*" BEGIN(comment); 3041 3042 <comment>[^*\\n]* 3043 <comment>"*"+[^*/\\n]* 3044 <comment>\\n ++line_num; 3045 <comment>"*"+"/" BEGIN(INITIAL); 3046 3047.fi 3048This could be sped up by writing it as: 3049.nf 3050 3051 %x comment 3052 %% 3053 int line_num = 1; 3054 3055 "/*" BEGIN(comment); 3056 3057 <comment>[^*\\n]* 3058 <comment>[^*\\n]*\\n ++line_num; 3059 <comment>"*"+[^*/\\n]* 3060 <comment>"*"+[^*/\\n]*\\n ++line_num; 3061 <comment>"*"+"/" BEGIN(INITIAL); 3062 3063.fi 3064Now instead of each newline requiring the processing of another 3065action, recognizing the newlines is "distributed" over the other rules 3066to keep the matched text as long as possible. Note that 3067.I adding 3068rules does 3069.I not 3070slow down the scanner! The speed of the scanner is independent 3071of the number of rules or (modulo the considerations given at the 3072beginning of this section) how complicated the rules are with 3073regard to operators such as '*' and '|'. 3074.PP 3075A final example in speeding up a scanner: suppose you want to scan 3076through a file containing identifiers and keywords, one per line 3077and with no other extraneous characters, and recognize all the 3078keywords. A natural first approach is: 3079.nf 3080 3081 %% 3082 asm | 3083 auto | 3084 break | 3085 ... etc ... 
    volatile |
    while    /* it's a keyword */

    .|\\n     /* it's not a keyword */

.fi
To eliminate the backing up, introduce a catch-all rule:
.nf

    %%
    asm      |
    auto     |
    break    |
    ... etc ...
    volatile |
    while    /* it's a keyword */

    [a-z]+   |
    .|\\n     /* it's not a keyword */

.fi
Now, if it's guaranteed that there's exactly one word per line,
then we can reduce the total number of matches by a half by
merging in the recognition of newlines with that of the other
tokens:
.nf

    %%
    asm\\n    |
    auto\\n   |
    break\\n  |
    ... etc ...
    volatile\\n |
    while\\n  /* it's a keyword */

    [a-z]+\\n |
    .|\\n     /* it's not a keyword */

.fi
One has to be careful here, as we have now reintroduced backing up
into the scanner.  In particular, while
.I we
know that there will never be any characters in the input stream
other than letters or newlines,
.I flex
can't figure this out, and it will plan for possibly needing to back up
when it has scanned a token like "auto" and then the next character
is something other than a newline or a letter.  Previously it would
then just match the "auto" rule and be done, but now it has no "auto"
rule, only a "auto\\n" rule.  To eliminate the possibility of backing up,
we could either duplicate all rules but without final newlines, or,
since we never expect to encounter such an input and therefore don't
care how it's classified, we can introduce one more catch-all rule, this
one which doesn't include a newline:
.nf

    %%
    asm\\n    |
    auto\\n   |
    break\\n  |
    ... etc ...
    volatile\\n |
    while\\n  /* it's a keyword */

    [a-z]+\\n |
    [a-z]+   |
    .|\\n     /* it's not a keyword */

.fi
Compiled with
.B \-Cf,
this is about as fast as one can get a
.I flex
scanner to go for this particular problem.
.PP
A final note:
.I flex
is slow when matching NUL's, particularly when a token contains
multiple NUL's.
It's best to write rules which match
.I short
amounts of text if it's anticipated that the text will often include NUL's.
.PP
Another final note regarding performance: as mentioned above in the section
How the Input is Matched, dynamically resizing
.B yytext
to accommodate huge tokens is a slow process because it presently requires that
the (huge) token be rescanned from the beginning.  Thus if performance is
vital, you should attempt to match "large" quantities of text but not
"huge" quantities, where the cutoff between the two is at about 8K
characters/token.
.SH GENERATING C++ SCANNERS
.I flex
provides two different ways to generate scanners for use with C++.  The
first way is to simply compile a scanner generated by
.I flex
using a C++ compiler instead of a C compiler.  You should not encounter
any compilation errors (please report any you find to the email address
given in the Author section below).  You can then use C++ code in your
rule actions instead of C code.  Note that the default input source for
your scanner remains
.I yyin,
and default echoing is still done to
.I yyout.
Both of these remain
.I FILE *
variables and not C++
.I streams.
.PP
You can also use
.I flex
to generate a C++ scanner class, using the
.B \-+
option (or, equivalently,
.B %option c++),
which is automatically specified if the name of the flex
executable ends in a '+', such as
.I flex++.
When using this option, flex defaults to generating the scanner to the file
.B lex.yy.cc
instead of
.B lex.yy.c.
The generated scanner includes the header file
.I FlexLexer.h,
which defines the interface to two C++ classes.
.PP
The first class,
.B FlexLexer,
provides an abstract base class defining the general scanner class
interface.  It provides the following member functions:
.TP
.B const char* YYText()
returns the text of the most recently matched token, the equivalent of
.B yytext.
.TP
.B int YYLeng()
returns the length of the most recently matched token, the equivalent of
.B yyleng.
.TP
.B int lineno() const
returns the current input line number
(see
.B %option yylineno),
or
.B 1
if
.B %option yylineno
was not used.
.TP
.B void set_debug( int flag )
sets the debugging flag for the scanner, equivalent to assigning to
.B yy_flex_debug
(see the Options section above).  Note that you must build the scanner
using
.B %option debug
to include debugging information in it.
.TP
.B int debug() const
returns the current setting of the debugging flag.
.PP
Also provided are member functions equivalent to
.B yy_switch_to_buffer(),
.B yy_create_buffer()
(though the first argument is an
.B istream*
object pointer and not a
.B FILE*),
.B yy_flush_buffer(),
.B yy_delete_buffer(),
and
.B yyrestart()
(again, the first argument is an
.B istream*
object pointer).
.PP
The second class defined in
.I FlexLexer.h
is
.B yyFlexLexer,
which is derived from
.B FlexLexer.
It defines the following additional member functions:
.TP
.B
yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )
constructs a
.B yyFlexLexer
object using the given streams for input and output.  If not specified,
the streams default to
.B cin
and
.B cout,
respectively.
.TP
.B virtual int yylex()
performs the same role as
.B yylex()
does for ordinary flex scanners: it scans the input stream, consuming
tokens, until a rule's action returns a value.
If you derive a subclass 3285.B S 3286from 3287.B yyFlexLexer 3288and want to access the member functions and variables of 3289.B S 3290inside 3291.B yylex(), 3292then you need to use 3293.B %option yyclass="S" 3294to inform 3295.I flex 3296that you will be using that subclass instead of 3297.B yyFlexLexer. 3298In this case, rather than generating 3299.B yyFlexLexer::yylex(), 3300.I flex 3301generates 3302.B S::yylex() 3303(and also generates a dummy 3304.B yyFlexLexer::yylex() 3305that calls 3306.B yyFlexLexer::LexerError() 3307if called). 3308.TP 3309.B 3310virtual void switch_streams(istream* new_in = 0, 3311.B 3312ostream* new_out = 0) 3313reassigns 3314.B yyin 3315to 3316.B new_in 3317(if non-nil) 3318and 3319.B yyout 3320to 3321.B new_out 3322(ditto), deleting the previous input buffer if 3323.B yyin 3324is reassigned. 3325.TP 3326.B 3327int yylex( istream* new_in, ostream* new_out = 0 ) 3328first switches the input streams via 3329.B switch_streams( new_in, new_out ) 3330and then returns the value of 3331.B yylex(). 3332.PP 3333In addition, 3334.B yyFlexLexer 3335defines the following protected virtual functions which you can redefine 3336in derived classes to tailor the scanner: 3337.TP 3338.B 3339virtual int LexerInput( char* buf, int max_size ) 3340reads up to 3341.B max_size 3342characters into 3343.B buf 3344and returns the number of characters read. To indicate end-of-input, 3345return 0 characters. Note that "interactive" scanners (see the 3346.B \-B 3347and 3348.B \-I 3349flags) define the macro 3350.B YY_INTERACTIVE. 3351If you redefine 3352.B LexerInput() 3353and need to take different actions depending on whether or not 3354the scanner might be scanning an interactive input source, you can 3355test for the presence of this name via 3356.B #ifdef. 
3357.TP 3358.B 3359virtual void LexerOutput( const char* buf, int size ) 3360writes out 3361.B size 3362characters from the buffer 3363.B buf, 3364which, while NUL-terminated, may also contain "internal" NUL's if 3365the scanner's rules can match text with NUL's in them. 3366.TP 3367.B 3368virtual void LexerError( const char* msg ) 3369reports a fatal error message. The default version of this function 3370writes the message to the stream 3371.B cerr 3372and exits. 3373.PP 3374Note that a 3375.B yyFlexLexer 3376object contains its 3377.I entire 3378scanning state. Thus you can use such objects to create reentrant 3379scanners. You can instantiate multiple instances of the same 3380.B yyFlexLexer 3381class, and you can also combine multiple C++ scanner classes together 3382in the same program using the 3383.B \-P 3384option discussed above. 3385.PP 3386Finally, note that the 3387.B %array 3388feature is not available to C++ scanner classes; you must use 3389.B %pointer 3390(the default). 3391.PP 3392Here is an example of a simple C++ scanner: 3393.nf 3394 3395 // An example of using the flex C++ scanner class. 3396 3397 %{ 3398 int mylineno = 0; 3399 %} 3400 3401 string \\"[^\\n"]+\\" 3402 3403 ws [ \\t]+ 3404 3405 alpha [A-Za-z] 3406 dig [0-9] 3407 name ({alpha}|{dig}|\\$)({alpha}|{dig}|[_.\\-/$])* 3408 num1 [-+]?{dig}+\\.?([eE][-+]?{dig}+)? 3409 num2 [-+]?{dig}*\\.{dig}+([eE][-+]?{dig}+)? 
3410 number {num1}|{num2} 3411 3412 %% 3413 3414 {ws} /* skip blanks and tabs */ 3415 3416 "/*" { 3417 int c; 3418 3419 while((c = yyinput()) != 0) 3420 { 3421 if(c == '\\n') 3422 ++mylineno; 3423 3424 else if(c == '*') 3425 { 3426 if((c = yyinput()) == '/') 3427 break; 3428 else 3429 unput(c); 3430 } 3431 } 3432 } 3433 3434 {number} cout << "number " << YYText() << '\\n'; 3435 3436 \\n mylineno++; 3437 3438 {name} cout << "name " << YYText() << '\\n'; 3439 3440 {string} cout << "string " << YYText() << '\\n'; 3441 3442 %% 3443 3444 int main( int /* argc */, char** /* argv */ ) 3445 { 3446 FlexLexer* lexer = new yyFlexLexer; 3447 while(lexer->yylex() != 0) 3448 ; 3449 return 0; 3450 } 3451.fi 3452If you want to create multiple (different) lexer classes, you use the 3453.B \-P 3454flag (or the 3455.B prefix= 3456option) to rename each 3457.B yyFlexLexer 3458to some other 3459.B xxFlexLexer. 3460You then can include 3461.B <FlexLexer.h> 3462in your other sources once per lexer class, first renaming 3463.B yyFlexLexer 3464as follows: 3465.nf 3466 3467 #undef yyFlexLexer 3468 #define yyFlexLexer xxFlexLexer 3469 #include <FlexLexer.h> 3470 3471 #undef yyFlexLexer 3472 #define yyFlexLexer zzFlexLexer 3473 #include <FlexLexer.h> 3474 3475.fi 3476if, for example, you used 3477.B %option prefix="xx" 3478for one of your scanners and 3479.B %option prefix="zz" 3480for the other. 3481.PP 3482IMPORTANT: the present form of the scanning class is 3483.I experimental 3484and may change considerably between major releases. 3485.SH INCOMPATIBILITIES WITH LEX AND POSIX 3486.I flex 3487is a rewrite of the AT&T Unix 3488.I lex 3489tool (the two implementations do not share any code, though), 3490with some extensions and incompatibilities, both of which 3491are of concern to those who wish to write scanners acceptable 3492to either implementation. 
Flex is fully compliant with the POSIX 3493.I lex 3494specification, except that when using 3495.B %pointer 3496(the default), a call to 3497.B unput() 3498destroys the contents of 3499.B yytext, 3500which is counter to the POSIX specification. 3501.PP 3502In this section we discuss all of the known areas of incompatibility 3503between flex, AT&T lex, and the POSIX specification. 3504.PP 3505.I flex's 3506.B \-l 3507option turns on maximum compatibility with the original AT&T 3508.I lex 3509implementation, at the cost of a major loss in the generated scanner's 3510performance. We note below which incompatibilities can be overcome 3511using the 3512.B \-l 3513option. 3514.PP 3515.I flex 3516is fully compatible with 3517.I lex 3518with the following exceptions: 3519.IP - 3520The undocumented 3521.I lex 3522scanner internal variable 3523.B yylineno 3524is not supported unless 3525.B \-l 3526or 3527.B %option yylineno 3528is used. 3529.IP 3530.B yylineno 3531should be maintained on a per-buffer basis, rather than a per-scanner 3532(single global variable) basis. 3533.IP 3534.B yylineno 3535is not part of the POSIX specification. 3536.IP - 3537The 3538.B input() 3539routine is not redefinable, though it may be called to read characters 3540following whatever has been matched by a rule. If 3541.B input() 3542encounters an end-of-file the normal 3543.B yywrap() 3544processing is done. A ``real'' end-of-file is returned by 3545.B input() 3546as 3547.I EOF. 3548.IP 3549Input is instead controlled by defining the 3550.B YY_INPUT 3551macro. 3552.IP 3553The 3554.I flex 3555restriction that 3556.B input() 3557cannot be redefined is in accordance with the POSIX specification, 3558which simply does not specify any way of controlling the 3559scanner's input other than by making an initial assignment to 3560.I yyin. 3561.IP - 3562The 3563.B unput() 3564routine is not redefinable. This restriction is in accordance with POSIX. 
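.IP
Since input is controlled through
.B YY_INPUT
rather than by redefining
.B input(),
a scanner that should read from memory supplies its own definition of that
macro.  The sketch below exercises such a definition outside of any
scanner, so it is only an approximation of real use: the variables
.B src
and
.B src_pos,
the function
.B refill(),
and the local definition of
.B YY_NULL
are stand-ins invented for the example, and in a real scanner the macro
would be defined in the first section of the
.I flex
input and invoked by the generated code.
.nf

```c
#include <string.h>

#define YY_NULL 0   /* stand-in; a generated scanner supplies its own */

/* Hypothetical in-memory input source used only by this sketch. */
static const char *src = "hello, world";
static size_t src_pos = 0;

/*
 * Place up to max_size characters in buf and set result to the number
 * of characters stored, or to YY_NULL once the source is exhausted.
 */
#define YY_INPUT(buf, result, max_size)                              \
    do {                                                             \
        size_t left_ = strlen(src) - src_pos;                        \
        size_t n_ = left_ < (size_t)(max_size) ? left_               \
                                               : (size_t)(max_size); \
        memcpy((buf), src + src_pos, n_);                            \
        src_pos += n_;                                               \
        (result) = n_ ? (int)n_ : YY_NULL;                           \
    } while (0)

/* Stand-in for the scanner's buffer refill code, which is what
 * normally invokes YY_INPUT. */
int refill(char *buf, int max_size)
{
    int result;

    YY_INPUT(buf, result, max_size);
    return result;
}
```

.fi
Successive calls to
.B refill()
hand out the text in pieces and then report end of input, which is the
contract the scanner relies on.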
3565.IP - 3566.I flex 3567scanners are not as reentrant as 3568.I lex 3569scanners. In particular, if you have an interactive scanner and 3570an interrupt handler which long-jumps out of the scanner, and 3571the scanner is subsequently called again, you may get the following 3572message: 3573.nf 3574 3575 fatal flex scanner internal error--end of buffer missed 3576 3577.fi 3578To reenter the scanner, first use 3579.nf 3580 3581 yyrestart( yyin ); 3582 3583.fi 3584Note that this call will throw away any buffered input; usually this 3585isn't a problem with an interactive scanner. 3586.IP 3587Also note that flex C++ scanner classes 3588.I are 3589reentrant, so if using C++ is an option for you, you should use 3590them instead. See "Generating C++ Scanners" above for details. 3591.IP - 3592.B output() 3593is not supported. 3594Output from the 3595.B ECHO 3596macro is done to the file-pointer 3597.I yyout 3598(default 3599.I stdout). 3600.IP 3601.B output() 3602is not part of the POSIX specification. 3603.IP - 3604.I lex 3605does not support exclusive start conditions (%x), though they 3606are in the POSIX specification. 3607.IP - 3608When definitions are expanded, 3609.I flex 3610encloses them in parentheses. 3611With lex, the following: 3612.nf 3613 3614 NAME [A-Z][A-Z0-9]* 3615 %% 3616 foo{NAME}? printf( "Found it\\n" ); 3617 %% 3618 3619.fi 3620will not match the string "foo" because when the macro 3621is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?" 3622and the precedence is such that the '?' is associated with 3623"[A-Z0-9]*". With 3624.I flex, 3625the rule will be expanded to 3626"foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match. 3627.IP 3628Note that if the definition begins with 3629.B ^ 3630or ends with 3631.B $ 3632then it is 3633.I not 3634expanded with parentheses, to allow these operators to appear in 3635definitions without losing their special meanings. 
But the 3636.B <s>, /, 3637and 3638.B <<EOF>> 3639operators cannot be used in a 3640.I flex 3641definition. 3642.IP 3643Using 3644.B \-l 3645results in the 3646.I lex 3647behavior of no parentheses around the definition. 3648.IP 3649The POSIX specification is that the definition be enclosed in parentheses. 3650.IP - 3651Some implementations of 3652.I lex 3653allow a rule's action to begin on a separate line, if the rule's pattern 3654has trailing whitespace: 3655.nf 3656 3657 %% 3658 foo|bar<space here> 3659 { foobar_action(); } 3660 3661.fi 3662.I flex 3663does not support this feature. 3664.IP - 3665The 3666.I lex 3667.B %r 3668(generate a Ratfor scanner) option is not supported. It is not part 3669of the POSIX specification. 3670.IP - 3671After a call to 3672.B unput(), 3673.I yytext 3674is undefined until the next token is matched, unless the scanner 3675was built using 3676.B %array. 3677This is not the case with 3678.I lex 3679or the POSIX specification. The 3680.B \-l 3681option does away with this incompatibility. 3682.IP - 3683The precedence of the 3684.B {} 3685(numeric range) operator is different. 3686.I lex 3687interprets "abc{1,3}" as "match one, two, or 3688three occurrences of 'abc'", whereas 3689.I flex 3690interprets it as "match 'ab' 3691followed by one, two, or three occurrences of 'c'". The latter is 3692in agreement with the POSIX specification. 3693.IP - 3694The precedence of the 3695.B ^ 3696operator is different. 3697.I lex 3698interprets "^foo|bar" as "match either 'foo' at the beginning of a line, 3699or 'bar' anywhere", whereas 3700.I flex 3701interprets it as "match either 'foo' or 'bar' if they come at the beginning 3702of a line". The latter is in agreement with the POSIX specification. 3703.IP - 3704The special table-size declarations such as 3705.B %a 3706supported by 3707.I lex 3708are not required by 3709.I flex 3710scanners; 3711.I flex 3712ignores them. 
3713.IP - 3714The name 3715.B FLEX_SCANNER 3716is #define'd so scanners may be written for use with either 3717.I flex 3718or 3719.I lex. 3720Scanners also include 3721.B YY_FLEX_MAJOR_VERSION 3722and 3723.B YY_FLEX_MINOR_VERSION 3724indicating which version of 3725.I flex 3726generated the scanner 3727(for example, for the 2.5 release, these defines would be 2 and 5 3728respectively). 3729.PP 3730The following 3731.I flex 3732features are not included in 3733.I lex 3734or the POSIX specification: 3735.nf 3736 3737 C++ scanners 3738 %option 3739 start condition scopes 3740 start condition stacks 3741 interactive/non-interactive scanners 3742 yy_scan_string() and friends 3743 yyterminate() 3744 yy_set_interactive() 3745 yy_set_bol() 3746 YY_AT_BOL() 3747 <<EOF>> 3748 <*> 3749 YY_DECL 3750 YY_START 3751 YY_USER_ACTION 3752 YY_USER_INIT 3753 #line directives 3754 %{}'s around actions 3755 multiple actions on a line 3756 3757.fi 3758plus almost all of the flex flags. 3759The last feature in the list refers to the fact that with 3760.I flex 3761you can put multiple actions on the same line, separated with 3762semi-colons, while with 3763.I lex, 3764the following 3765.nf 3766 3767 foo handle_foo(); ++num_foos_seen; 3768 3769.fi 3770is (rather surprisingly) truncated to 3771.nf 3772 3773 foo handle_foo(); 3774 3775.fi 3776.I flex 3777does not truncate the action. Actions that are not enclosed in 3778braces are simply terminated at the end of the line. 3779.SH DIAGNOSTICS 3780.I warning, rule cannot be matched 3781indicates that the given rule 3782cannot be matched because it follows other rules that will 3783always match the same text as it. For 3784example, in the following "foo" cannot be matched because it comes after 3785an identifier "catch-all" rule: 3786.nf 3787 3788 [a-z]+ got_identifier(); 3789 foo got_foo(); 3790 3791.fi 3792Using 3793.B REJECT 3794in a scanner suppresses this warning. 
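.PP
To see why the second rule in the example above can never fire, it helps
to model the matching policy by hand.  The following is only a toy sketch
in plain C, not how flex itself works (flex compiles the rules into a
single automaton); the function names are hypothetical.  It applies the
two rules to the start of a string, prefers the longest match, and breaks
ties in favor of the rule listed first:
.nf

```c
#include <stddef.h>
#include <string.h>

/* Rule 0, [a-z]+ : length of the leading run of lowercase letters. */
size_t match_identifier(const char *s)
{
    size_t n = 0;

    while (s[n] >= 'a' && s[n] <= 'z')
        ++n;
    return n;
}

/* Rule 1, foo : 3 if s starts with the literal "foo", otherwise 0. */
size_t match_foo(const char *s)
{
    return strncmp(s, "foo", 3) == 0 ? 3 : 0;
}

/*
 * Return the index of the winning rule (0 or 1), or -1 if neither
 * rule matches at the start of s.
 */
int choose_rule(const char *s)
{
    size_t len[2];
    int best = -1;
    int i;

    len[0] = match_identifier(s);
    len[1] = match_foo(s);

    for (i = 0; i < 2; ++i)
        if (len[i] > 0 && (best < 0 || len[i] > len[best]))
            best = i;  /* strict '>' keeps the earlier rule on a tie */

    return best;
}
```

.fi
Whenever "foo" matches, the identifier rule matches at least as much text
and is listed first, so choose_rule() never returns 1.  Listing the "foo"
rule before the catch-all rule removes the warning.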
3795.PP 3796.I warning, 3797.B \-s 3798.I 3799option given but default rule can be matched 3800means that it is possible (perhaps only in a particular start condition) 3801that the default rule (match any single character) is the only one 3802that will match a particular input. Since 3803.B \-s 3804was given, presumably this is not intended. 3805.PP 3806.I reject_used_but_not_detected undefined 3807or 3808.I yymore_used_but_not_detected undefined - 3809These errors can occur at compile time. They indicate that the 3810scanner uses 3811.B REJECT 3812or 3813.B yymore() 3814but that 3815.I flex 3816failed to notice the fact, meaning that 3817.I flex 3818scanned the first two sections looking for occurrences of these actions 3819and failed to find any, but somehow you snuck some in (via a #include 3820file, for example). Use 3821.B %option reject 3822or 3823.B %option yymore 3824to indicate to flex that you really do use these features. 3825.PP 3826.I flex scanner jammed - 3827a scanner compiled with 3828.B \-s 3829has encountered an input string which wasn't matched by 3830any of its rules. This error can also occur due to internal problems. 3831.PP 3832.I token too large, exceeds YYLMAX - 3833your scanner uses 3834.B %array 3835and one of its rules matched a string longer than the 3836.B YYLMAX 3837constant (8K bytes by default). You can increase the value by 3838#define'ing 3839.B YYLMAX 3840in the definitions section of your 3841.I flex 3842input. 3843.PP 3844.I scanner requires \-8 flag to 3845.I use the character 'x' - 3846Your scanner specification includes recognizing the 8-bit character 3847.I 'x' 3848and you did not specify the \-8 flag, and your scanner defaulted to 7-bit 3849because you used the 3850.B \-Cf 3851or 3852.B \-CF 3853table compression options. See the discussion of the 3854.B \-7 3855flag for details. 
3856.PP 3857.I flex scanner push-back overflow - 3858you used 3859.B unput() 3860to push back so much text that the scanner's buffer could not hold 3861both the pushed-back text and the current token in 3862.B yytext. 3863Ideally the scanner should dynamically resize the buffer in this case, but at 3864present it does not. 3865.PP 3866.I 3867input buffer overflow, can't enlarge buffer because scanner uses REJECT - 3868the scanner was working on matching an extremely large token and needed 3869to expand the input buffer. This doesn't work with scanners that use 3870.B 3871REJECT. 3872.PP 3873.I 3874fatal flex scanner internal error--end of buffer missed -
| 2.\" 3.TH FLEX 1 "April 1995" "Version 2.5" 4.SH NAME 5flex \- fast lexical analyzer generator 6.SH SYNOPSIS 7.B flex 8.B [\-bcdfhilnpstvwBFILTV78+? \-C[aefFmr] \-ooutput \-Pprefix \-Sskeleton] 9.B [\-\-help \-\-version] 10.I [filename ...] 11.SH OVERVIEW 12This manual describes 13.I flex, 14a tool for generating programs that perform pattern-matching on text. The 15manual includes both tutorial and reference sections: 16.nf 17 18 Description 19 a brief overview of the tool 20 21 Some Simple Examples 22 23 Format Of The Input File 24 25 Patterns 26 the extended regular expressions used by flex 27 28 How The Input Is Matched 29 the rules for determining what has been matched 30 31 Actions 32 how to specify what to do when a pattern is matched 33 34 The Generated Scanner 35 details regarding the scanner that flex produces; 36 how to control the input source 37 38 Start Conditions 39 introducing context into your scanners, and 40 managing "mini-scanners" 41 42 Multiple Input Buffers 43 how to manipulate multiple input sources; how to 44 scan from strings instead of files 45 46 End-of-file Rules 47 special rules for matching the end of the input 48 49 Miscellaneous Macros 50 a summary of macros available to the actions 51 52 Values Available To The User 53 a summary of values available to the actions 54 55 Interfacing With Yacc 56 connecting flex scanners together with yacc parsers 57 58 Options 59 flex command-line options, and the "%option" 60 directive 61 62 Performance Considerations 63 how to make your scanner go as fast as possible 64 65 Generating C++ Scanners 66 the (experimental) facility for generating C++ 67 scanner classes 68 69 Incompatibilities With Lex And POSIX 70 how flex differs from AT&T lex and the POSIX lex 71 standard 72 73 Diagnostics 74 those error messages produced by flex (or scanners 75 it generates) whose meanings might not be apparent 76 77 Files 78 files used by flex 79 80 Deficiencies / Bugs 81 known problems with flex 82 83 See Also 84 
other documentation, related tools 85 86 Author 87 includes contact information 88 89.fi 90.SH DESCRIPTION 91.I flex 92is a tool for generating 93.I scanners: 94programs which recognize lexical patterns in text. 95.I flex 96reads 97the given input files, or its standard input if no file names are given, 98for a description of a scanner to generate. The description is in 99the form of pairs 100of regular expressions and C code, called 101.I rules. flex 102generates as output a C source file, 103.B lex.yy.c, 104which defines a routine 105.B yylex(). 106This file is compiled and linked with the 107.B \-ll 108library to produce an executable. When the executable is run, 109it analyzes its input for occurrences 110of the regular expressions. Whenever it finds one, it executes 111the corresponding C code. 112.SH SOME SIMPLE EXAMPLES 113First some simple examples to get the flavor of how one uses 114.I flex. 115The following 116.I flex 117input specifies a scanner which whenever it encounters the string 118"username" will replace it with the user's login name: 119.nf 120 121 %% 122 username printf( "%s", getlogin() ); 123 124.fi 125By default, any text not matched by a 126.I flex 127scanner 128is copied to the output, so the net effect of this scanner is 129to copy its input file to its output with each occurrence 130of "username" expanded. 131In this input, there is just one rule. "username" is the 132.I pattern 133and the "printf" is the 134.I action. 135The "%%" marks the beginning of the rules. 136.PP 137Here's another simple example: 138.nf 139 140 %{ 141 int num_lines = 0, num_chars = 0; 142 %} 143 144 %% 145 \\n ++num_lines; ++num_chars; 146 . ++num_chars; 147 148 %% 149 main() 150 { 151 yylex(); 152 printf( "# of lines = %d, # of chars = %d\\n", 153 num_lines, num_chars ); 154 } 155 156.fi 157This scanner counts the number of characters and the number 158of lines in its input (it produces no output other than the 159final report on the counts). 
The first line
declares two globals, "num_lines" and "num_chars", which are accessible
both inside
.B yylex()
and in the
.B main()
routine declared after the second "%%".  There are two rules, one
which matches a newline ("\\n") and increments both the line count and
the character count, and one which matches any character other than
a newline (indicated by the "." regular expression).
.PP
A somewhat more complicated example:
.nf

    /* scanner for a toy Pascal-like language */

    %{
    /* need this for the calls to atoi() and atof() below */
    #include <stdlib.h>
    %}

    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*

    %%

    {DIGIT}+    {
                printf( "An integer: %s (%d)\\n", yytext,
                        atoi( yytext ) );
                }

    {DIGIT}+"."{DIGIT}*        {
                printf( "A float: %s (%g)\\n", yytext,
                        atof( yytext ) );
                }

    if|then|begin|end|procedure|function        {
                printf( "A keyword: %s\\n", yytext );
                }

    {ID}        printf( "An identifier: %s\\n", yytext );

    "+"|"-"|"*"|"/"   printf( "An operator: %s\\n", yytext );

    "{"[^}\\n]*"}"     /* eat up one-line comments */

    [ \\t\\n]+          /* eat up whitespace */

    .           printf( "Unrecognized character: %s\\n", yytext );

    %%

    int main( int argc, char **argv )
        {
        ++argv, --argc;  /* skip over program name */
        if ( argc > 0 )
                yyin = fopen( argv[0], "r" );
        else
                yyin = stdin;

        yylex();
        return 0;
        }

.fi
This is the beginning of a simple scanner for a language like
Pascal.  It identifies different types of
.I tokens
and reports on what it has seen.
.PP
The details of this example will be explained in the following
sections.
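.PP
The main() routine in the example above assigns the result of fopen()
to yyin without checking it, so a mistyped file name leaves the scanner
with a null input stream.  A minimal sketch of a more defensive variant
follows; open_input() is a hypothetical helper, not part of flex, and it
assumes the program name has already been skipped as in the example:
.nf

```c
#include <stdio.h>

/*
 * Hypothetical helper: choose the stream the scanner should read.
 * argc and argv are main()'s arguments with the program name already
 * skipped.  Returns stdin when no file name was given, the opened
 * file otherwise, or NULL if that file cannot be opened.
 */
FILE *open_input(int argc, char **argv)
{
    if (argc > 0)
        return fopen(argv[0], "r");
    return stdin;
}
```

.fi
In a scanner, the result would be assigned to yyin before the first call
to yylex(), with main() reporting an error (for example via perror())
and exiting when the helper returns NULL.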
232.SH FORMAT OF THE INPUT FILE 233The 234.I flex 235input file consists of three sections, separated by a line with just 236.B %% 237in it: 238.nf 239 240 definitions 241 %% 242 rules 243 %% 244 user code 245 246.fi 247The 248.I definitions 249section contains declarations of simple 250.I name 251definitions to simplify the scanner specification, and declarations of 252.I start conditions, 253which are explained in a later section. 254.PP 255Name definitions have the form: 256.nf 257 258 name definition 259 260.fi 261The "name" is a word beginning with a letter or an underscore ('_') 262followed by zero or more letters, digits, '_', or '-' (dash). 263The definition is taken to begin at the first non-white-space character 264following the name and continuing to the end of the line. 265The definition can subsequently be referred to using "{name}", which 266will expand to "(definition)". For example, 267.nf 268 269 DIGIT [0-9] 270 ID [a-z][a-z0-9]* 271 272.fi 273defines "DIGIT" to be a regular expression which matches a 274single digit, and 275"ID" to be a regular expression which matches a letter 276followed by zero-or-more letters-or-digits. 277A subsequent reference to 278.nf 279 280 {DIGIT}+"."{DIGIT}* 281 282.fi 283is identical to 284.nf 285 286 ([0-9])+"."([0-9])* 287 288.fi 289and matches one-or-more digits followed by a '.' followed 290by zero-or-more digits. 291.PP 292The 293.I rules 294section of the 295.I flex 296input contains a series of rules of the form: 297.nf 298 299 pattern action 300 301.fi 302where the pattern must be unindented and the action must begin 303on the same line. 304.PP 305See below for a further description of patterns and actions. 306.PP 307Finally, the user code section is simply copied to 308.B lex.yy.c 309verbatim. 310It is used for companion routines which call or are called 311by the scanner. The presence of this section is optional; 312if it is missing, the second 313.B %% 314in the input file may be skipped, too. 
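.PP
The name-definition expansion described earlier in this section, where
each {name} reference becomes "(definition)", can be pictured as a plain
text substitution.  The sketch below is a toy model in C under that
assumption only; expand() is a hypothetical function, handles a single
definition, and ignores details a real implementation must deal with
(such as braces inside quoted strings or character classes):
.nf

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy pattern to out, replacing every "{name}" with "(definition)".
 * Returns 0 on success, or -1 if out (outsz bytes) is too small or
 * the name is too long for the local buffer.
 */
int expand(const char *pattern, const char *name,
           const char *definition, char *out, size_t outsz)
{
    char ref[64];
    size_t reflen, used = 0;
    int n;

    if (outsz == 0)
        return -1;
    out[0] = 0;

    /* Build the reference text to search for, e.g. "{DIGIT}". */
    n = snprintf(ref, sizeof ref, "{%s}", name);
    if (n < 0 || (size_t) n >= sizeof ref)
        return -1;
    reflen = (size_t) n;

    while (*pattern) {
        if (strncmp(pattern, ref, reflen) == 0) {
            /* A reference: emit the parenthesized definition. */
            n = snprintf(out + used, outsz - used, "(%s)", definition);
            pattern += reflen;
        } else {
            /* Anything else is copied through unchanged. */
            n = snprintf(out + used, outsz - used, "%c", *pattern);
            pattern += 1;
        }
        if (n < 0 || (size_t) n >= outsz - used)
            return -1;
        used += (size_t) n;
    }
    return 0;
}
```

.fi
The added parentheses are what keep a following operator such as '+' or
'*' applying to the whole definition rather than to its last element.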
315.PP 316In the definitions and rules sections, any 317.I indented 318text or text enclosed in 319.B %{ 320and 321.B %} 322is copied verbatim to the output (with the %{}'s removed). 323The %{}'s must appear unindented on lines by themselves. 324.PP 325In the rules section, 326any indented or %{} text appearing before the 327first rule may be used to declare variables 328which are local to the scanning routine and (after the declarations) 329code which is to be executed whenever the scanning routine is entered. 330Other indented or %{} text in the rule section is still copied to the output, 331but its meaning is not well-defined and it may well cause compile-time 332errors (this feature is present for 333.I POSIX 334compliance; see below for other such features). 335.PP 336In the definitions section (but not in the rules section), 337an unindented comment (i.e., a line 338beginning with "/*") is also copied verbatim to the output up 339to the next "*/". 340.SH PATTERNS 341The patterns in the input are written using an extended set of regular 342expressions. These are: 343.nf 344 345 x match the character 'x' 346 . any character (byte) except newline 347 [xyz] a "character class"; in this case, the pattern 348 matches either an 'x', a 'y', or a 'z' 349 [abj-oZ] a "character class" with a range in it; matches 350 an 'a', a 'b', any letter from 'j' through 'o', 351 or a 'Z' 352 [^A-Z] a "negated character class", i.e., any character 353 but those in the class. In this case, any 354 character EXCEPT an uppercase letter. 355 [^A-Z\\n] any character EXCEPT an uppercase letter or 356 a newline 357 r* zero or more r's, where r is any regular expression 358 r+ one or more r's 359 r? 
               zero or one r's (that is, "an optional r")
    r{2,5}     anywhere from two to five r's
    r{2,}      two or more r's
    r{4}       exactly 4 r's
    {name}     the expansion of the "name" definition
               (see above)
    "[xyz]\\"foo"
               the literal string: [xyz]"foo
    \\X         if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
               then the ANSI-C interpretation of \\X.
               Otherwise, a literal 'X' (used to escape
               operators such as '*')
    \\0         a NUL character (ASCII code 0)
    \\123       the character with octal value 123
    \\x2a       the character with hexadecimal value 2a
    (r)        match an r; parentheses are used to override
               precedence (see below)


    rs         the regular expression r followed by the
               regular expression s; called "concatenation"


    r|s        either an r or an s


    r/s        an r but only if it is followed by an s.  The
               text matched by s is included when determining
               whether this rule is the "longest match",
               but is then returned to the input before
               the action is executed.  So the action only
               sees the text matched by r.  This type
               of pattern is called "trailing context".
               (There are some combinations of r/s that flex
               cannot match correctly; see notes in the
               Deficiencies / Bugs section below regarding
               "dangerous trailing context".)
    ^r         an r, but only at the beginning of a line (i.e.,
               when just starting to scan, or right after a
               newline has been scanned).
    r$         an r, but only at the end of a line (i.e., just
               before a newline).  Equivalent to "r/\\n".

               Note that flex's notion of "newline" is exactly
               whatever the C compiler used to compile flex
               interprets '\\n' as; in particular, on some DOS
               systems you must either filter out \\r's in the
               input yourself, or explicitly use r/\\r\\n for "r$".
407 408 409 <s>r an r, but only in start condition s (see 410 below for discussion of start conditions) 411 <s1,s2,s3>r 412 same, but in any of start conditions s1, 413 s2, or s3 414 <*>r an r in any start condition, even an exclusive one. 415 416 417 <<EOF>> an end-of-file 418 <s1,s2><<EOF>> 419 an end-of-file when in start condition s1 or s2 420 421.fi 422Note that inside of a character class, all regular expression operators 423lose their special meaning except escape ('\\') and the character class 424operators, '-', ']', and, at the beginning of the class, '^'. 425.PP 426The regular expressions listed above are grouped according to 427precedence, from highest precedence at the top to lowest at the bottom. 428Those grouped together have equal precedence. For example, 429.nf 430 431 foo|bar* 432 433.fi 434is the same as 435.nf 436 437 (foo)|(ba(r*)) 438 439.fi 440since the '*' operator has higher precedence than concatenation, 441and concatenation higher than alternation ('|'). This pattern 442therefore matches 443.I either 444the string "foo" 445.I or 446the string "ba" followed by zero-or-more r's. 447To match "foo" or zero-or-more "bar"'s, use: 448.nf 449 450 foo|(bar)* 451 452.fi 453and to match zero-or-more "foo"'s-or-"bar"'s: 454.nf 455 456 (foo|bar)* 457 458.fi 459.PP 460In addition to characters and ranges of characters, character classes 461can also contain character class 462.I expressions. 463These are expressions enclosed inside 464.B [: 465and 466.B :] 467delimiters (which themselves must appear between the '[' and ']' of the 468character class; other elements may occur inside the character class, too). 469The valid expressions are: 470.nf 471 472 [:alnum:] [:alpha:] [:blank:] 473 [:cntrl:] [:digit:] [:graph:] 474 [:lower:] [:print:] [:punct:] 475 [:space:] [:upper:] [:xdigit:] 476 477.fi 478These expressions all designate a set of characters equivalent to 479the corresponding standard C 480.B isXXX 481function. 
For example,
.B [:alnum:]
designates those characters for which
.B isalnum()
returns true - i.e., any alphabetic or numeric character.
Some systems don't provide
.B isblank(),
so flex defines
.B [:blank:]
as a blank or a tab.
.PP
For example, the following character classes are all equivalent:
.nf

    [[:alnum:]]
    [[:alpha:][:digit:]]
    [[:alpha:]0-9]
    [a-zA-Z0-9]

.fi
If your scanner is case-insensitive (the
.B \-i
flag), then
.B [:upper:]
and
.B [:lower:]
are equivalent to
.B [:alpha:].
.PP
Some notes on patterns:
.IP -
A negated character class such as the example "[^A-Z]"
above
.I will match a newline
unless "\\n" (or an equivalent escape sequence) is one of the
characters explicitly present in the negated character class
(e.g., "[^A-Z\\n]").  This is unlike how many other regular
expression tools treat negated character classes, but unfortunately
the inconsistency is historically entrenched.
Matching newlines means that a pattern like [^"]* can match the entire
input unless there's another quote in the input.
.IP -
A rule can have at most one instance of trailing context (the '/' operator
or the '$' operator).  The start condition, '^', and "<<EOF>>" patterns
can only occur at the beginning of a pattern, and, like '/' and '$',
cannot be grouped inside parentheses.  A '^' which does not occur at
the beginning of a rule or a '$' which does not occur at the end of
a rule loses its special properties and is treated as a normal character.
.IP
The following are illegal:
.nf

    foo/bar$
    <sc1>foo<sc2>bar

.fi
Note that the first of these can be written "foo/bar\\n".
538.IP 539The following will result in '$' or '^' being treated as a normal character: 540.nf 541 542 foo|(bar$) 543 foo|^bar 544 545.fi 546If what's wanted is a "foo" or a bar-followed-by-a-newline, the following 547could be used (the special '|' action is explained below): 548.nf 549 550 foo | 551 bar$ /* action goes here */ 552 553.fi 554A similar trick will work for matching a foo or a 555bar-at-the-beginning-of-a-line. 556.SH HOW THE INPUT IS MATCHED 557When the generated scanner is run, it analyzes its input looking 558for strings which match any of its patterns. If it finds more than 559one match, it takes the one matching the most text (for trailing 560context rules, this includes the length of the trailing part, even 561though it will then be returned to the input). If it finds two 562or more matches of the same length, the 563rule listed first in the 564.I flex 565input file is chosen. 566.PP 567Once the match is determined, the text corresponding to the match 568(called the 569.I token) 570is made available in the global character pointer 571.B yytext, 572and its length in the global integer 573.B yyleng. 574The 575.I action 576corresponding to the matched pattern is then executed (a more 577detailed description of actions follows), and then the remaining 578input is scanned for another match. 579.PP 580If no match is found, then the 581.I default rule 582is executed: the next character in the input is considered matched and 583copied to the standard output. Thus, the simplest legal 584.I flex 585input is: 586.nf 587 588 %% 589 590.fi 591which generates a scanner that simply copies its input (one character 592at a time) to its output. 593.PP 594Note that 595.B yytext 596can be defined in two different ways: either as a character 597.I pointer 598or as a character 599.I array. 
You can control which definition
.I flex
uses by including one of the special directives
.B %pointer
or
.B %array
in the first (definitions) section of your flex input.  The default is
.B %pointer,
unless you use the
.B \-l
lex compatibility option, in which case
.B yytext
will be an array.
The advantage of using
.B %pointer
is substantially faster scanning and no buffer overflow when matching
very large tokens (unless you run out of dynamic memory).  The disadvantage
is that you are restricted in how your actions can modify
.B yytext
(see the next section), and calls to the
.B unput()
function destroy the present contents of
.B yytext,
which can be a considerable porting headache when moving between different
.I lex
versions.
.PP
The advantage of
.B %array
is that you can then modify
.B yytext
to your heart's content, and calls to
.B unput()
do not destroy
.B yytext
(see below).  Furthermore, existing
.I lex
programs sometimes access
.B yytext
externally using declarations of the form:
.nf
    extern char yytext[];
.fi
This definition is erroneous when used with
.B %pointer,
but correct for
.B %array.
.PP
.B %array
defines
.B yytext
to be an array of
.B YYLMAX
characters, which defaults to a fairly large value (8K bytes).  You can change
the size by simply #define'ing
.B YYLMAX
to a different value in the first section of your
.I flex
input.  As mentioned above, with
.B %pointer
yytext grows dynamically to accommodate large tokens.  While this means your
.B %pointer
scanner can accommodate very large tokens (such as matching entire blocks
of comments), bear in mind that each time the scanner must resize
.B yytext
it also must rescan the entire token from the beginning, so matching such
tokens can prove slow.
667.B yytext 668presently does 669.I not 670dynamically grow if a call to 671.B unput() 672results in too much text being pushed back; instead, a run-time error results. 673.PP 674Also note that you cannot use 675.B %array 676with C++ scanner classes 677(the 678.B c++ 679option; see below). 680.SH ACTIONS 681Each pattern in a rule has a corresponding action, which can be any 682arbitrary C statement. The pattern ends at the first non-escaped 683whitespace character; the remainder of the line is its action. If the 684action is empty, then when the pattern is matched the input token 685is simply discarded. For example, here is the specification for a program 686which deletes all occurrences of "zap me" from its input: 687.nf 688 689 %% 690 "zap me" 691 692.fi 693(It will copy all other characters in the input to the output since 694they will be matched by the default rule.) 695.PP 696Here is a program which compresses multiple blanks and tabs down to 697a single blank, and throws away whitespace found at the end of a line: 698.nf 699 700 %% 701 [ \\t]+ putchar( ' ' ); 702 [ \\t]+$ /* ignore this token */ 703 704.fi 705.PP 706If the action contains a '{', then the action spans till the balancing '}' 707is found, and the action may cross multiple lines. 708.I flex 709knows about C strings and comments and won't be fooled by braces found 710within them, but also allows actions to begin with 711.B %{ 712and will consider the action to be all the text up to the next 713.B %} 714(regardless of ordinary braces inside the action). 715.PP 716An action consisting solely of a vertical bar ('|') means "same as 717the action for the next rule." See below for an illustration. 718.PP 719Actions can include arbitrary C code, including 720.B return 721statements to return a value to whatever routine called 722.B yylex(). 
723Each time 724.B yylex() 725is called it continues processing tokens from where it last left 726off until it either reaches 727the end of the file or executes a return. 728.PP 729Actions are free to modify 730.B yytext 731except for lengthening it (adding 732characters to its end--these will overwrite later characters in the 733input stream). This however does not apply when using 734.B %array 735(see above); in that case, 736.B yytext 737may be freely modified in any way. 738.PP 739Actions are free to modify 740.B yyleng 741except they should not do so if the action also includes use of 742.B yymore() 743(see below). 744.PP 745There are a number of special directives which can be included within 746an action: 747.IP - 748.B ECHO 749copies yytext to the scanner's output. 750.IP - 751.B BEGIN 752followed by the name of a start condition places the scanner in the 753corresponding start condition (see below). 754.IP - 755.B REJECT 756directs the scanner to proceed on to the "second best" rule which matched the 757input (or a prefix of the input). The rule is chosen as described 758above in "How the Input is Matched", and 759.B yytext 760and 761.B yyleng 762set up appropriately. 763It may either be one which matched as much text 764as the originally chosen rule but came later in the 765.I flex 766input file, or one which matched less text. 767For example, the following will both count the 768words in the input and call the routine special() whenever "frob" is seen: 769.nf 770 771 int word_count = 0; 772 %% 773 774 frob special(); REJECT; 775 [^ \\t\\n]+ ++word_count; 776 777.fi 778Without the 779.B REJECT, 780any "frob"'s in the input would not be counted as words, since the 781scanner normally executes only one action per token. 782Multiple 783.B REJECT's 784are allowed, each one finding the next best choice to the currently 785active rule. 
For example, when the following scanner scans the token 786"abcd", it will write "abcdabcaba" to the output: 787.nf 788 789 %% 790 a | 791 ab | 792 abc | 793 abcd ECHO; REJECT; 794 .|\\n /* eat up any unmatched character */ 795 796.fi 797(The first three rules share the fourth's action since they use 798the special '|' action.) 799.B REJECT 800is a particularly expensive feature in terms of scanner performance; 801if it is used in 802.I any 803of the scanner's actions it will slow down 804.I all 805of the scanner's matching. Furthermore, 806.B REJECT 807cannot be used with the 808.I -Cf 809or 810.I -CF 811options (see below). 812.IP 813Note also that unlike the other special actions, 814.B REJECT 815is a 816.I branch; 817code immediately following it in the action will 818.I not 819be executed. 820.IP - 821.B yymore() 822tells the scanner that the next time it matches a rule, the corresponding 823token should be 824.I appended 825onto the current value of 826.B yytext 827rather than replacing it. For example, given the input "mega-kludge" 828the following will write "mega-mega-kludge" to the output: 829.nf 830 831 %% 832 mega- ECHO; yymore(); 833 kludge ECHO; 834 835.fi 836First "mega-" is matched and echoed to the output. Then "kludge" 837is matched, but the previous "mega-" is still hanging around at the 838beginning of 839.B yytext 840so the 841.B ECHO 842for the "kludge" rule will actually write "mega-kludge". 843.PP 844Two notes regarding use of 845.B yymore(). 846First, 847.B yymore() 848depends on the value of 849.I yyleng 850correctly reflecting the size of the current token, so you must not 851modify 852.I yyleng 853if you are using 854.B yymore(). 855Second, the presence of 856.B yymore() 857in the scanner's action entails a minor performance penalty in the 858scanner's matching speed. 
.IP -
.B yyless(n)
returns all but the first
.I n
characters of the current token back to the input stream, where they
will be rescanned when the scanner looks for the next match.
.B yytext
and
.B yyleng
are adjusted appropriately (e.g.,
.B yyleng
will now be equal to
.I n
). For example, on the input "foobar" the following will write out
"foobarbar":
.nf

    %%
    foobar    ECHO; yyless(3);
    [a-z]+    ECHO;

.fi
An argument of 0 to
.B yyless
will cause the entire current input string to be scanned again. Unless you've
changed how the scanner will subsequently process its input (using
.B BEGIN,
for example), this will result in an endless loop.
.PP
Note that
.B yyless
is a macro and can only be used in the flex input file, not from
other source files.
.IP -
.B unput(c)
puts the character
.I c
back onto the input stream. It will be the next character scanned.
The following action will take the current token and cause it
to be rescanned enclosed in parentheses.
.nf

    {
    int i;
    /* Copy yytext because unput() trashes yytext */
    char *yycopy = strdup( yytext );
    unput( ')' );
    for ( i = yyleng - 1; i >= 0; --i )
        unput( yycopy[i] );
    unput( '(' );
    free( yycopy );
    }

.fi
Note that since each
.B unput()
puts the given character back at the
.I beginning
of the input stream, pushing back strings must be done back-to-front.
.PP
An important potential problem when using
.B unput()
is that if you are using
.B %pointer
(the default), a call to
.B unput()
.I destroys
the contents of
.I yytext,
starting with its rightmost character and devouring one character to
the left with each call.
If you need the value of yytext preserved
after a call to
.B unput()
(as in the above example),
you must either first copy it elsewhere, or build your scanner using
.B %array
instead (see How The Input Is Matched).
.PP
Finally, note that you cannot put back
.B EOF
to attempt to mark the input stream with an end-of-file.
.IP -
.B input()
reads the next character from the input stream. For example,
the following is one way to eat up C comments:
.nf

    %%
    "/*"        {
                register int c;

                for ( ; ; )
                    {
                    while ( (c = input()) != '*' &&
                            c != EOF )
                        ;    /* eat up text of comment */

                    if ( c == '*' )
                        {
                        while ( (c = input()) == '*' )
                            ;
                        if ( c == '/' )
                            break;    /* found the end */
                        }

                    if ( c == EOF )
                        {
                        error( "EOF in comment" );
                        break;
                        }
                    }
                }

.fi
(Note that if the scanner is compiled using
.B C++,
then
.B input()
is instead referred to as
.B yyinput(),
in order to avoid a name clash with the
.B C++
stream by the name of
.I input.)
.IP -
.B YY_FLUSH_BUFFER
flushes the scanner's internal buffer
so that the next time the scanner attempts to match a token, it will
first refill the buffer using
.B YY_INPUT
(see The Generated Scanner, below). This action is a special case
of the more general
.B yy_flush_buffer()
function, described below in the section Multiple Input Buffers.
.IP -
.B yyterminate()
can be used in lieu of a return statement in an action. It terminates
the scanner and returns a 0 to the scanner's caller, indicating "all done".
By default,
.B yyterminate()
is also called when an end-of-file is encountered. It is a macro and
may be redefined.
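.PP
The back-to-front ordering that unput() requires can be tried outside a
scanner. The following is a minimal C sketch, not flex code: demo_unput(),
demo_input() and demo_unput_parenthesized() are hypothetical stand-ins that
model the input stream as a small stack of pushed-back characters, on the
assumption that each pushed-back character becomes the next one read.
.nf

```c
#include <stddef.h>

/* Hypothetical stand-in for the scanner's input stream: a stack of
 * pushed-back characters.  None of these names come from flex. */
#define PUSHBACK_MAX 64
static char pushback[PUSHBACK_MAX];
static size_t pushback_len = 0;

/* Like unput(c): c becomes the next character to be read. */
static int demo_unput(int c)
{
    if (pushback_len >= PUSHBACK_MAX)
        return -1;              /* no room left in this toy buffer */
    pushback[pushback_len++] = (char) c;
    return 0;
}

/* Like input(): returns the next character, or -1 when empty. */
static int demo_input(void)
{
    if (pushback_len == 0)
        return -1;
    return (unsigned char) pushback[--pushback_len];
}

/* Push text back so that it is re-read enclosed in parentheses.
 * The closing parenthesis goes first because it must be read last. */
static void demo_unput_parenthesized(const char *text, size_t len)
{
    size_t i;

    demo_unput(')');
    for (i = len; i > 0; --i)
        demo_unput(text[i - 1]);
    demo_unput('(');
}
```

.fi
After demo_unput_parenthesized("abc", 3), successive calls to demo_input()
return the characters of "(abc)" in order, which mirrors the parenthesizing
action shown for unput().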
.SH THE GENERATED SCANNER
The output of
.I flex
is the file
.B lex.yy.c,
which contains the scanning routine
.B yylex(),
a number of tables used by it for matching tokens, and a number
of auxiliary routines and macros. By default,
.B yylex()
is declared as follows:
.nf

    int yylex()
        {
        ... various definitions and the actions in here ...
        }

.fi
(If your environment supports function prototypes, then it will
be "int yylex( void )".) This definition may be changed by defining
the "YY_DECL" macro. For example, you could use:
.nf

    #define YY_DECL float lexscan( a, b ) float a, b;

.fi
to give the scanning routine the name
.I lexscan,
returning a float, and taking two floats as arguments. Note that
if you give arguments to the scanning routine using a
K&R-style/non-prototyped function declaration, you must terminate
the definition with a semi-colon (;).
.PP
Whenever
.B yylex()
is called, it scans tokens from the global input file
.I yyin
(which defaults to stdin). It continues until it either reaches
an end-of-file (at which point it returns the value 0) or
one of its actions executes a
.I return
statement.
.PP
If the scanner reaches an end-of-file, subsequent calls are undefined
unless either
.I yyin
is pointed at a new input file (in which case scanning continues from
that file), or
.B yyrestart()
is called.
.B yyrestart()
takes one argument, a
.B FILE *
pointer (which can be nil, if you've set up
.B YY_INPUT
to scan from a source other than
.I yyin),
and initializes
.I yyin
for scanning from that file.
Essentially there is no difference between
just assigning
.I yyin
to a new input file or using
.B yyrestart()
to do so; the latter is available for compatibility with previous versions
of
.I flex,
and because it can be used to switch input files in the middle of scanning.
It can also be used to throw away the current input buffer, by calling
it with an argument of
.I yyin;
but it is better to use
.B YY_FLUSH_BUFFER
(see above).
Note that
.B yyrestart()
does
.I not
reset the start condition to
.B INITIAL
(see Start Conditions, below).
.PP
If
.B yylex()
stops scanning due to executing a
.I return
statement in one of the actions, the scanner may then be called again and it
will resume scanning where it left off.
.PP
By default (and for purposes of efficiency), the scanner uses
block-reads rather than simple
.I getc()
calls to read characters from
.I yyin.
How it gets its input can be controlled by defining the
.B YY_INPUT
macro.
YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its
action is to place up to
.I max_size
characters in the character array
.I buf
and return in the integer variable
.I result
either the
number of characters read or the constant YY_NULL (0 on Unix systems)
to indicate EOF. The default YY_INPUT reads from the
global file-pointer "yyin".
.PP
A sample definition of YY_INPUT (in the definitions
section of the input file):
.nf

    %{
    #define YY_INPUT(buf,result,max_size) \\
        { \\
        int c = getchar(); \\
        result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \\
        }
    %}

.fi
This definition will change the input processing to occur
one character at a time.
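.PP
The same calling sequence can also deliver input in blocks from a source
other than a file. The following is a minimal C sketch under stated
assumptions: set_source() and read_source() are hypothetical helpers, not
part of flex, and 0 is used as the end-of-input value because that is the
value given above for YY_NULL on Unix systems.
.nf

```c
#include <string.h>

/* Hypothetical in-memory source; these names are not part of flex. */
static const char *src_text = "";
static size_t src_pos = 0;

static void set_source(const char *text)
{
    src_text = text;
    src_pos = 0;
}

/* Copy up to max_size characters into buf and return the count.
 * A return value of 0 signals end of input. */
static int read_source(char *buf, int max_size)
{
    size_t left = strlen(src_text) - src_pos;
    size_t n;

    if (max_size <= 0)
        return 0;
    n = left < (size_t) max_size ? left : (size_t) max_size;
    memcpy(buf, src_text + src_pos, n);
    src_pos += n;
    return (int) n;
}

/* In a scanner this definition would sit between %{ and %} in the
 * definitions section, together with the helpers above. */
#define YY_INPUT(buf, result, max_size) (result = read_source(buf, max_size))
```

.fi
Each invocation hands over as many characters as fit, so far fewer calls
are needed than with the one-character-at-a-time definition.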
.PP
When the scanner receives an end-of-file indication from YY_INPUT,
it then checks the
.B yywrap()
function. If
.B yywrap()
returns false (zero), then it is assumed that the
function has gone ahead and set up
.I yyin
to point to another input file, and scanning continues. If it returns
true (non-zero), then the scanner terminates, returning 0 to its
caller. Note that in either case, the start condition remains unchanged;
it does
.I not
revert to
.B INITIAL.
.PP
If you do not supply your own version of
.B yywrap(),
then you must either use
.B %option noyywrap
(in which case the scanner behaves as though
.B yywrap()
returned 1), or you must link with
.B \-ll
to obtain the default version of the routine, which always returns 1.
.PP
Three routines are available for scanning from in-memory buffers rather
than files:
.B yy_scan_string(), yy_scan_bytes(),
and
.B yy_scan_buffer().
See the discussion of them below in the section Multiple Input Buffers.
.PP
The scanner writes its
.B ECHO
output to the
.I yyout
global (default, stdout), which may be redefined by the user simply
by assigning it to some other
.B FILE
pointer.
.SH START CONDITIONS
.I flex
provides a mechanism for conditionally activating rules. Any rule
whose pattern is prefixed with "<sc>" will only be active when
the scanner is in the start condition named "sc". For example,
.nf

    <STRING>[^"]*        { /* eat up the string body ... */
                ...
                }

.fi
will be active only when the scanner is in the "STRING" start
condition, and
.nf

    <INITIAL,STRING,QUOTE>\\.        { /* handle an escape ... */
                ...
                }

.fi
will be active only when the current start condition is
either "INITIAL", "STRING", or "QUOTE".
.PP
Start conditions
are declared in the definitions (first) section of the input
using unindented lines beginning with either
.B %s
or
.B %x
followed by a list of names.
The former declares
.I inclusive
start conditions, the latter
.I exclusive
start conditions. A start condition is activated using the
.B BEGIN
action. Until the next
.B BEGIN
action is executed, rules with the given start
condition will be active and
rules with other start conditions will be inactive.
If the start condition is
.I inclusive,
then rules with no start conditions at all will also be active.
If it is
.I exclusive,
then
.I only
rules qualified with the start condition will be active.
A set of rules contingent on the same exclusive start condition
describes a scanner which is independent of any of the other rules in the
.I flex
input. Because of this,
exclusive start conditions make it easy to specify "mini-scanners"
which scan portions of the input that are syntactically different
from the rest (e.g., comments).
.PP
If the distinction between inclusive and exclusive start conditions
is still a little vague, here's a simple example illustrating the
connection between the two. The set of rules:
.nf

    %s example
    %%

    <example>foo   do_something();

    bar            something_else();

.fi
is equivalent to
.nf

    %x example
    %%

    <example>foo   do_something();

    <INITIAL,example>bar    something_else();

.fi
Without the
.B <INITIAL,example>
qualifier, the
.I bar
pattern in the second example wouldn't be active (i.e., couldn't match)
when in start condition
.B example.
If we just used
.B <example>
to qualify
.I bar,
though, then it would only be active in
.B example
and not in
.B INITIAL,
while in the first example it's active in both, because in the first
example the
.B example
start condition is an
.I inclusive
.B (%s)
start condition.
.PP
Also note that the special start-condition specifier
.B <*>
matches every start condition. Thus, the above example could also
have been written:
.nf

    %x example
    %%

    <example>foo   do_something();

    <*>bar    something_else();

.fi
.PP
The default rule (to
.B ECHO
any unmatched character) remains active in start conditions. It
is equivalent to:
.nf

    <*>.|\\n     ECHO;

.fi
.PP
.B BEGIN(0)
returns to the original state where only the rules with
no start conditions are active. This state can also be
referred to as the start-condition "INITIAL", so
.B BEGIN(INITIAL)
is equivalent to
.B BEGIN(0).
(The parentheses around the start condition name are not required but
are considered good style.)
.PP
.B BEGIN
actions can also be given as indented code at the beginning
of the rules section. For example, the following will cause
the scanner to enter the "SPECIAL" start condition whenever
.B yylex()
is called and the global variable
.I enter_special
is true:
.nf

            int enter_special;

    %x SPECIAL
    %%
            if ( enter_special )
                BEGIN(SPECIAL);

    <SPECIAL>blahblahblah
    ...more rules follow...

.fi
.PP
To illustrate the uses of start conditions,
here is a scanner which provides two different interpretations
of a string like "123.456". By default it will treat it as
three tokens, the integer "123", a dot ('.'), and the integer "456".
But if the string is preceded earlier in the line by the string
"expect-floats"
it will treat it as a single token, the floating-point number
123.456:
.nf

    %{
    #include <stdlib.h>
    %}
    %s expect

    %%
    expect-floats        BEGIN(expect);

    <expect>[0-9]+"."[0-9]+      {
                printf( "found a float, = %f\\n",
                        atof( yytext ) );
                }
    <expect>\\n           {
                /* that's the end of the line, so
                 * we need another "expect-floats"
                 * before we'll recognize any more
                 * numbers
                 */
                BEGIN(INITIAL);
                }

    [0-9]+      {
                printf( "found an integer, = %d\\n",
                        atoi( yytext ) );
                }

    "."         printf( "found a dot\\n" );

.fi
Here is a scanner which recognizes (and discards) C comments while
maintaining a count of the current input line.
.nf

    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\\n]*        /* eat anything that's not a '*' */
    <comment>"*"+[^*/\\n]*   /* eat up '*'s not followed by '/'s */
    <comment>\\n             ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);

.fi
This scanner goes to a bit of trouble to match as much
text as possible with each rule. In general, when attempting to write
a high-speed scanner try to match as much as possible in each rule, as
it's a big win.
.PP
Note that start-condition names are really integer values and
can be stored as such. Thus, the above could be extended in the
following fashion:
.nf

    %x comment foo
    %%
            int line_num = 1;
            int comment_caller;

    "/*"         {
                 comment_caller = INITIAL;
                 BEGIN(comment);
                 }

    ...

    <foo>"/*"    {
                 comment_caller = foo;
                 BEGIN(comment);
                 }

    <comment>[^*\\n]*        /* eat anything that's not a '*' */
    <comment>"*"+[^*/\\n]*   /* eat up '*'s not followed by '/'s */
    <comment>\\n             ++line_num;
    <comment>"*"+"/"        BEGIN(comment_caller);

.fi
Furthermore, you can access the current start condition using
the integer-valued
.B YY_START
macro. For example, the above assignments to
.I comment_caller
could instead be written
.nf

    comment_caller = YY_START;

.fi
Flex provides
.B YYSTATE
as an alias for
.B YY_START
(since that is what's used by AT&T
.I lex).
.PP
Note that start conditions do not have their own name-space; %s's and %x's
declare names in the same fashion as #define's.
.PP
Finally, here's an example of how to match C-style quoted strings using
exclusive start conditions, including expanded escape sequences (but
not including checking for a string that's too long):
.nf

    %x str

    %%
            char string_buf[MAX_STR_CONST];
            char *string_buf_ptr;


    \\"      string_buf_ptr = string_buf; BEGIN(str);

    <str>\\"        { /* saw closing quote - all done */
            BEGIN(INITIAL);
            *string_buf_ptr = '\\0';
            /* return string constant token type and
             * value to parser
             */
            }

    <str>\\n        {
            /* error - unterminated string constant */
            /* generate error message */
            }

    <str>\\\\[0-7]{1,3} {
            /* octal escape sequence */
            unsigned int result;

            (void) sscanf( yytext + 1, "%o", &result );

            if ( result > 0xff )
                    {
                    /* error, constant is out-of-bounds */
                    }
            else
                    *string_buf_ptr++ = result;
            }

    <str>\\\\[0-9]+ {
            /* generate error - bad escape sequence; something
             * like '\\48' or '\\0777777'
             */
            }

    <str>\\\\n  *string_buf_ptr++ = '\\n';
    <str>\\\\t  *string_buf_ptr++ = '\\t';
    <str>\\\\r
            *string_buf_ptr++ = '\\r';
    <str>\\\\b  *string_buf_ptr++ = '\\b';
    <str>\\\\f  *string_buf_ptr++ = '\\f';

    <str>\\\\(.|\\n)  *string_buf_ptr++ = yytext[1];

    <str>[^\\\\\\n\\"]+        {
            char *yptr = yytext;

            while ( *yptr )
                    *string_buf_ptr++ = *yptr++;
            }

.fi
.PP
Often, such as in some of the examples above, you wind up writing a
whole bunch of rules all preceded by the same start condition(s). Flex
makes this a little easier and cleaner by introducing a notion of
start condition
.I scope.
A start condition scope is begun with:
.nf

    <SCs>{

.fi
where
.I SCs
is a list of one or more start conditions. Inside the start condition
scope, every rule automatically has the prefix
.I <SCs>
applied to it, until a
.I '}'
which matches the initial
.I '{'.
So, for example,
.nf

    <ESC>{
        "\\\\n"   return '\\n';
        "\\\\r"   return '\\r';
        "\\\\f"   return '\\f';
        "\\\\0"   return '\\0';
    }

.fi
is equivalent to:
.nf

    <ESC>"\\\\n"  return '\\n';
    <ESC>"\\\\r"  return '\\r';
    <ESC>"\\\\f"  return '\\f';
    <ESC>"\\\\0"  return '\\0';

.fi
Start condition scopes may be nested.
.PP
Three routines are available for manipulating stacks of start conditions:
.TP
.B void yy_push_state(int new_state)
pushes the current start condition onto the top of the start condition
stack and switches to
.I new_state
as though you had used
.B BEGIN new_state
(recall that start condition names are also integers).
.TP
.B void yy_pop_state()
pops the top of the stack and switches to it via
.B BEGIN.
.TP
.B int yy_top_state()
returns the top of the stack without altering the stack's contents.
.PP
The start condition stack grows dynamically and so has no built-in
size limitation. If memory is exhausted, program execution aborts.
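.PP
The push, pop and top behavior described above can be modeled in a few
lines of plain C. This is only an illustrative sketch and makes no claim
about how flex implements its stack: sc_push(), sc_pop(), sc_top() and
sc_current are hypothetical names, and 0 is assumed to play the role of
INITIAL.
.nf

```c
#include <stdlib.h>

/* Hypothetical model of a start-condition stack; flex's own
 * implementation may differ.  Start conditions are plain ints. */
static int *sc_stack = NULL;
static size_t sc_depth = 0, sc_cap = 0;
static int sc_current = 0;      /* 0 plays the role of INITIAL */

/* Save the current start condition, then switch to new_state. */
static void sc_push(int new_state)
{
    if (sc_depth == sc_cap) {
        size_t new_cap = sc_cap ? sc_cap * 2 : 8;
        int *p = realloc(sc_stack, new_cap * sizeof *p);

        if (p == NULL)
            abort();            /* out of memory: execution aborts */
        sc_stack = p;
        sc_cap = new_cap;
    }
    sc_stack[sc_depth++] = sc_current;
    sc_current = new_state;     /* the effect of BEGIN new_state */
}

/* Return to the most recently saved start condition. */
static void sc_pop(void)
{
    if (sc_depth == 0)
        abort();                /* this sketch treats underflow as fatal */
    sc_current = sc_stack[--sc_depth];
}

/* Peek at the saved start condition; the stack must not be empty. */
static int sc_top(void)
{
    return sc_stack[sc_depth - 1];
}
```

.fi
Growing the array by doubling keeps the stack free of a fixed depth limit,
which is the property the text attributes to the real start condition stack.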
.PP
To use start condition stacks, your scanner must include a
.B %option stack
directive (see Options below).
.SH MULTIPLE INPUT BUFFERS
Some scanners (such as those which support "include" files)
require reading from several input streams. As
.I flex
scanners do a large amount of buffering, one cannot control
where the next input will be read from by simply writing a
.B YY_INPUT
which is sensitive to the scanning context.
.B YY_INPUT
is only called when the scanner reaches the end of its buffer, which
may be a long time after scanning a statement such as an "include"
which requires switching the input source.
.PP
To negotiate these sorts of problems,
.I flex
provides a mechanism for creating and switching between multiple
input buffers. An input buffer is created by using:
.nf

    YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )

.fi
which takes a
.I FILE
pointer and a size and creates a buffer associated with the given
file and large enough to hold
.I size
characters (when in doubt, use
.B YY_BUF_SIZE
for the size). It returns a
.B YY_BUFFER_STATE
handle, which may then be passed to other routines (see below). The
.B YY_BUFFER_STATE
type is a pointer to an opaque
.B struct yy_buffer_state
structure, so you may safely initialize YY_BUFFER_STATE variables to
.B ((YY_BUFFER_STATE) 0)
if you wish, and also refer to the opaque structure in order to
correctly declare input buffers in source files other than that
of your scanner. Note that the
.I FILE
pointer in the call to
.B yy_create_buffer
is only used as the value of
.I yyin
seen by
.B YY_INPUT;
if you redefine
.B YY_INPUT
so it no longer uses
.I yyin,
then you can safely pass a nil
.I FILE
pointer to
.B yy_create_buffer.
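.PP
As a minimal sketch of declaring a buffer handle in a source file other
than the scanner, the fragment below relies only on the description above
(the handle is a pointer to an opaque struct yy_buffer_state). The variable
saved_buffer and the function have_saved_buffer() are hypothetical names
used for illustration.
.nf

```c
/* In a source file other than the scanner: refer to the opaque
 * structure by name only, since its members are not needed. */
struct yy_buffer_state;
typedef struct yy_buffer_state *YY_BUFFER_STATE;

/* Hypothetical bookkeeping for an application that keeps one
 * buffer handle around; initialized to the nil handle. */
static YY_BUFFER_STATE saved_buffer = (YY_BUFFER_STATE) 0;

/* Report whether a handle has been stored yet. */
static int have_saved_buffer(void)
{
    return saved_buffer != (YY_BUFFER_STATE) 0;
}
```

.fi
Because the structure stays opaque, such a file can store and compare
handles without depending on the scanner's internal layout.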
You select a particular buffer to scan from using:
.nf

    void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )

.fi
This switches the scanner's input buffer so subsequent tokens will
come from
.I new_buffer.
Note that
.B yy_switch_to_buffer()
may be used by yywrap() to set things up for continued scanning, instead
of opening a new file and pointing
.I yyin
at it. Note also that switching input sources via either
.B yy_switch_to_buffer()
or
.B yywrap()
does
.I not
change the start condition.
.nf

    void yy_delete_buffer( YY_BUFFER_STATE buffer )

.fi
is used to reclaim the storage associated with a buffer. (
.B buffer
can be nil, in which case the routine does nothing.)
You can also clear the current contents of a buffer using:
.nf

    void yy_flush_buffer( YY_BUFFER_STATE buffer )

.fi
This function discards the buffer's contents,
so the next time the scanner attempts to match a token from the
buffer, it will first fill the buffer anew using
.B YY_INPUT.
.PP
.B yy_new_buffer()
is an alias for
.B yy_create_buffer(),
provided for compatibility with the C++ use of
.I new
and
.I delete
for creating and destroying dynamic objects.
.PP
Finally, the
.B YY_CURRENT_BUFFER
macro returns a
.B YY_BUFFER_STATE
handle to the current buffer.
.PP
Here is an example of using these features for writing a scanner
which expands include files (the
.B <<EOF>>
feature is discussed below):
.nf

    /* the "incl" state is used for picking up the name
     * of an include file
     */
    %x incl

    %{
    #define MAX_INCLUDE_DEPTH 10
    YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
    int include_stack_ptr = 0;
    %}

    %%
    include             BEGIN(incl);

    [a-z]+              ECHO;
    [^a-z\\n]*\\n?
                        ECHO;

    <incl>[ \\t]*      /* eat the whitespace */
    <incl>[^ \\t\\n]+   { /* got the include file name */
            if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
                {
                fprintf( stderr, "Includes nested too deeply\\n" );
                exit( 1 );
                }

            include_stack[include_stack_ptr++] =
                YY_CURRENT_BUFFER;

            yyin = fopen( yytext, "r" );

            if ( ! yyin )
                error( ... );

            yy_switch_to_buffer(
                yy_create_buffer( yyin, YY_BUF_SIZE ) );

            BEGIN(INITIAL);
            }

    <<EOF>> {
            if ( --include_stack_ptr < 0 )
                {
                yyterminate();
                }

            else
                {
                yy_delete_buffer( YY_CURRENT_BUFFER );
                yy_switch_to_buffer(
                     include_stack[include_stack_ptr] );
                }
            }

.fi
Three routines are available for setting up input buffers for
scanning in-memory strings instead of files. All of them create
a new input buffer for scanning the string, and return a corresponding
.B YY_BUFFER_STATE
handle (which you should delete with
.B yy_delete_buffer()
when done with it). They also switch to the new buffer using
.B yy_switch_to_buffer(),
so the next call to
.B yylex()
will start scanning the string.
.TP
.B yy_scan_string(const char *str)
scans a NUL-terminated string.
.TP
.B yy_scan_bytes(const char *bytes, int len)
scans
.I len
bytes (including possibly NUL's)
starting at location
.I bytes.
.PP
Note that both of these functions create and scan a
.I copy
of the string or bytes. (This may be desirable, since
.B yylex()
modifies the contents of the buffer it is scanning.) You can avoid the
copy by using:
.TP
.B yy_scan_buffer(char *base, yy_size_t size)
which scans in place the buffer starting at
.I base,
consisting of
.I size
bytes, the last two bytes of which
.I must
be
.B YY_END_OF_BUFFER_CHAR
(ASCII NUL).
These last two bytes are not scanned; thus, scanning
consists of
.B base[0]
through
.B base[size-3],
inclusive.
.IP
If you fail to set up
.I base
in this manner (i.e., forget the final two
.B YY_END_OF_BUFFER_CHAR
bytes), then
.B yy_scan_buffer()
returns a nil pointer instead of creating a new input buffer.
.IP
The type
.B yy_size_t
is an integral type to which you can cast an integer expression
reflecting the size of the buffer.
.SH END-OF-FILE RULES
The special rule "<<EOF>>" indicates
actions which are to be taken when an end-of-file is
encountered and yywrap() returns non-zero (i.e., indicates
no further files to process). The action must finish
by doing one of four things:
.IP -
assigning
.I yyin
to a new input file (in previous versions of flex, after doing the
assignment you had to call the special action
.B YY_NEW_FILE;
this is no longer necessary);
.IP -
executing a
.I return
statement;
.IP -
executing the special
.B yyterminate()
action;
.IP -
or, switching to a new buffer using
.B yy_switch_to_buffer()
as shown in the example above.
.PP
<<EOF>> rules may not be used with other
patterns; they may only be qualified with a list of start
conditions. If an unqualified <<EOF>> rule is given, it
applies to
.I all
start conditions which do not already have <<EOF>> actions. To
specify an <<EOF>> rule for only the initial start condition, use
.nf

    <INITIAL><<EOF>>

.fi
.PP
These rules are useful for catching things like unclosed comments.
An example:
.nf

    %x quote
    %%

    ...other rules for dealing with quotes...

    <quote><<EOF>>   {
             error( "unterminated quote" );
             yyterminate();
             }
    <<EOF>>  {
             if ( *++filelist )
                 yyin = fopen( *filelist, "r" );
             else
                yyterminate();
             }

.fi
.SH MISCELLANEOUS MACROS
The macro
.B YY_USER_ACTION
can be defined to provide an action
which is always executed prior to the matched rule's action. For example,
it could be #define'd to call a routine to convert yytext to lower-case.
When
.B YY_USER_ACTION
is invoked, the variable
.I yy_act
gives the number of the matched rule (rules are numbered starting with 1).
Suppose you want to profile how often each of your rules is matched. The
following would do the trick:
.nf

    #define YY_USER_ACTION ++ctr[yy_act]

.fi
where
.I ctr
is an array to hold the counts for the different rules. Note that
the macro
.B YY_NUM_RULES
gives the total number of rules (including the default rule, even if
you use
.B \-s),
so a correct declaration for
.I ctr
is:
.nf

    int ctr[YY_NUM_RULES];

.fi
.PP
The macro
.B YY_USER_INIT
may be defined to provide an action which is always executed before
the first scan (and before the scanner's internal initializations are done).
For example, it could be used to call a routine to read
in a data table or open a logging file.
.PP
The macro
.B yy_set_interactive(is_interactive)
can be used to control whether the current buffer is considered
.I interactive.
An interactive buffer is processed more slowly,
but must be used when the scanner's input source is indeed
interactive to avoid problems due to waiting to fill buffers
(see the discussion of the
.B \-I
flag below). A non-zero value
in the macro invocation marks the buffer as interactive, a zero
value as non-interactive.
Note that use of this macro overrides
.B %option interactive ,
.B %option always-interactive
or
.B %option never-interactive
(see Options below).
.B yy_set_interactive()
must be invoked prior to beginning to scan the buffer that is
(or is not) to be considered interactive.
.PP
The macro
.B yy_set_bol(at_bol)
can be used to control whether the current buffer's scanning
context for the next token match is done as though at the
beginning of a line. A non-zero macro argument makes rules anchored with
\&'^' active, while a zero argument makes '^' rules inactive.
.PP
The macro
.B YY_AT_BOL()
returns true if the next token scanned from the current buffer
will have '^' rules active, false otherwise.
.PP
In the generated scanner, the actions are all gathered in one large
switch statement and separated using
.B YY_BREAK,
which may be redefined. By default, it is simply a "break", to separate
each rule's action from the following rule's.
Redefining
.B YY_BREAK
allows, for example, C++ users to
#define YY_BREAK to do nothing (while being very careful that every
rule ends with a "break" or a "return"!). This avoids
unreachable statement warnings in cases where a rule's action ends with
"return" and the
.B YY_BREAK
following it can therefore never be reached.
.SH VALUES AVAILABLE TO THE USER
This section summarizes the various values available to the user
in the rule actions.
.IP -
.B char *yytext
holds the text of the current token. It may be modified but not lengthened
(you cannot append characters to the end).
.IP
If the special directive
.B %array
appears in the first section of the scanner description, then
.B yytext
is instead declared
.B char yytext[YYLMAX],
where
.B YYLMAX
is a macro definition that you can redefine in the first section
if you don't like the default value (generally 8KB). Using
.B %array
results in somewhat slower scanners, but the value of
.B yytext
becomes immune to calls to
.I input()
and
.I unput(),
which potentially destroy its value when
.B yytext
is a character pointer. The opposite of
.B %array
is
.B %pointer,
which is the default.
.IP
You cannot use
.B %array
when generating C++ scanner classes
(the
.B \-+
flag).
.IP -
.B int yyleng
holds the length of the current token.
.IP -
.B FILE *yyin
is the file which by default
.I flex
reads from. It may be redefined but doing so only makes sense before
scanning begins or after an EOF has been encountered. Changing it in
the midst of scanning will have unexpected results since
.I flex
buffers its input; use
.B yyrestart()
instead.
Once scanning terminates because an end-of-file
has been seen, you can point
.I yyin
at the new input file and then call the scanner again to continue scanning.
.IP -
.B void yyrestart( FILE *new_file )
may be called to point
.I yyin
at the new input file. The switch-over to the new file is immediate
(any previously buffered-up input is lost). Note that calling
.B yyrestart()
with
.I yyin
as an argument thus throws away the current input buffer and continues
scanning the same input file.
.IP -
.B FILE *yyout
is the file to which
.B ECHO
actions are done. It can be reassigned by the user.
.IP -
.B YY_CURRENT_BUFFER
returns a
.B YY_BUFFER_STATE
handle to the current buffer.
.IP -
.B YY_START
returns an integer value corresponding to the current start
condition. You can subsequently use this value with
.B BEGIN
to return to that start condition.
.SH INTERFACING WITH YACC
One of the main uses of
.I flex
is as a companion to the
.I yacc
parser-generator.
.I yacc
parsers expect to call a routine named
.B yylex()
to find the next input token. The routine is supposed to
return the type of the next token as well as putting any associated
value in the global
.B yylval.
To use
.I flex
with
.I yacc,
one specifies the
.B \-d
option to
.I yacc
to instruct it to generate the file
.B y.tab.h
containing definitions of all the
.B %tokens
appearing in the
.I yacc
input. This file is then included in the
.I flex
scanner. For example, if one of the tokens is "TOK_NUMBER",
part of the scanner might look like:
.nf

    %{
    #include "y.tab.h"
    %}

    %%

    [0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;

.fi
.SH OPTIONS
.I flex
has the following options:
.TP
.B \-b
Generate backing-up information to
.I lex.backup.
This is a list of scanner states which require backing up
and the input characters on which they do so. By adding rules one
can remove backing-up states. If
.I all
backing-up states are eliminated and
.B \-Cf
or
.B \-CF
is used, the generated scanner will run faster (see the
.B \-p
flag). Only users who wish to squeeze every last cycle out of their
scanners need worry about this option. (See the section on Performance
Considerations below.)
.TP
.B \-c
is a do-nothing, deprecated option included for POSIX compliance.
.TP
.B \-d
makes the generated scanner run in
.I debug
mode.
Whenever a pattern is recognized and the global 2097.B yy_flex_debug 2098is non-zero (which is the default), 2099the scanner will write to 2100.I stderr 2101a line of the form: 2102.nf 2103 2104 --accepting rule at line 53 ("the matched text") 2105 2106.fi 2107The line number refers to the location of the rule in the file 2108defining the scanner (i.e., the file that was fed to flex). Messages 2109are also generated when the scanner backs up, accepts the 2110default rule, reaches the end of its input buffer (or encounters 2111a NUL; at this point, the two look the same as far as the scanner's concerned), 2112or reaches an end-of-file. 2113.TP 2114.B \-f 2115specifies 2116.I fast scanner. 2117No table compression is done and stdio is bypassed. 2118The result is large but fast. This option is equivalent to 2119.B \-Cfr 2120(see below). 2121.TP 2122.B \-h 2123generates a "help" summary of 2124.I flex's 2125options to 2126.I stdout 2127and then exits. 2128.B \-? 2129and 2130.B \-\-help 2131are synonyms for 2132.B \-h. 2133.TP 2134.B \-i 2135instructs 2136.I flex 2137to generate a 2138.I case-insensitive 2139scanner. The case of letters given in the 2140.I flex 2141input patterns will 2142be ignored, and tokens in the input will be matched regardless of case. The 2143matched text given in 2144.I yytext 2145will have the preserved case (i.e., it will not be folded). 2146.TP 2147.B \-l 2148turns on maximum compatibility with the original AT&T 2149.I lex 2150implementation. Note that this does not mean 2151.I full 2152compatibility. Use of this option costs a considerable amount of 2153performance, and it cannot be used with the 2154.B \-+, -f, -F, -Cf, 2155or 2156.B -CF 2157options. For details on the compatibilities it provides, see the section 2158"Incompatibilities With Lex And POSIX" below. This option also results 2159in the name 2160.B YY_FLEX_LEX_COMPAT 2161being #define'd in the generated scanner. 
2162.TP 2163.B \-n 2164is another do-nothing, deprecated option included only for 2165POSIX compliance. 2166.TP 2167.B \-p 2168generates a performance report to stderr. The report 2169consists of comments regarding features of the 2170.I flex 2171input file which will cause a serious loss of performance in the resulting 2172scanner. If you give the flag twice, you will also get comments regarding 2173features that lead to minor performance losses. 2174.IP 2175Note that the use of 2176.B REJECT, 2177.B %option yylineno, 2178and variable trailing context (see the Deficiencies / Bugs section below) 2179entails a substantial performance penalty; use of 2180.I yymore(), 2181the 2182.B ^ 2183operator, 2184and the 2185.B \-I 2186flag entail minor performance penalties. 2187.TP 2188.B \-s 2189causes the 2190.I default rule 2191(that unmatched scanner input is echoed to 2192.I stdout) 2193to be suppressed. If the scanner encounters input that does not 2194match any of its rules, it aborts with an error. This option is 2195useful for finding holes in a scanner's rule set. 2196.TP 2197.B \-t 2198instructs 2199.I flex 2200to write the scanner it generates to standard output instead 2201of 2202.B lex.yy.c. 2203.TP 2204.B \-v 2205specifies that 2206.I flex 2207should write to 2208.I stderr 2209a summary of statistics regarding the scanner it generates. 2210Most of the statistics are meaningless to the casual 2211.I flex 2212user, but the first line identifies the version of 2213.I flex 2214(same as reported by 2215.B \-V), 2216and the next line the flags used when generating the scanner, including 2217those that are on by default. 2218.TP 2219.B \-w 2220suppresses warning messages. 2221.TP 2222.B \-B 2223instructs 2224.I flex 2225to generate a 2226.I batch 2227scanner, the opposite of 2228.I interactive 2229scanners generated by 2230.B \-I 2231(see below). 
In general, you use 2232.B \-B 2233when you are 2234.I certain 2235that your scanner will never be used interactively, and you want to 2236squeeze a 2237.I little 2238more performance out of it. If your goal is instead to squeeze out a 2239.I lot 2240more performance, you should be using the 2241.B \-Cf 2242or 2243.B \-CF 2244options (discussed below), which turn on 2245.B \-B 2246automatically anyway. 2247.TP 2248.B \-F 2249specifies that the 2250.ul 2251fast 2252scanner table representation should be used (and stdio 2253bypassed). This representation is 2254about as fast as the full table representation 2255.B (-f), 2256and for some sets of patterns will be considerably smaller (and for 2257others, larger). In general, if the pattern set contains both "keywords" 2258and a catch-all, "identifier" rule, such as in the set: 2259.nf 2260 2261 "case" return TOK_CASE; 2262 "switch" return TOK_SWITCH; 2263 ... 2264 "default" return TOK_DEFAULT; 2265 [a-z]+ return TOK_ID; 2266 2267.fi 2268then you're better off using the full table representation. If only 2269the "identifier" rule is present and you then use a hash table or some such 2270to detect the keywords, you're better off using 2271.B -F. 2272.IP 2273This option is equivalent to 2274.B \-CFr 2275(see below). It cannot be used with 2276.B \-+. 2277.TP 2278.B \-I 2279instructs 2280.I flex 2281to generate an 2282.I interactive 2283scanner. An interactive scanner is one that only looks ahead to decide 2284what token has been matched if it absolutely must. It turns out that 2285always looking one extra character ahead, even if the scanner has already 2286seen enough text to disambiguate the current token, is a bit faster than 2287only looking ahead when necessary. But scanners that always look ahead 2288give dreadful interactive performance; for example, when a user types 2289a newline, it is not recognized as a newline token until they enter 2290.I another 2291token, which often means typing in another whole line. 
2292.IP 2293.I Flex 2294scanners default to 2295.I interactive 2296unless you use the 2297.B \-Cf 2298or 2299.B \-CF 2300table-compression options (see below). That's because if you're looking 2301for high-performance you should be using one of these options, so if you 2302didn't, 2303.I flex 2304assumes you'd rather trade off a bit of run-time performance for intuitive 2305interactive behavior. Note also that you 2306.I cannot 2307use 2308.B \-I 2309in conjunction with 2310.B \-Cf 2311or 2312.B \-CF. 2313Thus, this option is not really needed; it is on by default for all those 2314cases in which it is allowed. 2315.IP 2316Note that if 2317.B isatty() 2318returns false for the scanner input, flex will revert to batch mode, even if 2319.B \-I 2320was specified. To force interactive mode no matter what, use 2321.B %option always-interactive 2322(see Options below). 2323.IP 2324You can force a scanner to 2325.I not 2326be interactive by using 2327.B \-B 2328(see above). 2329.TP 2330.B \-L 2331instructs 2332.I flex 2333not to generate 2334.B #line 2335directives. Without this option, 2336.I flex 2337peppers the generated scanner 2338with #line directives so error messages in the actions will be correctly 2339located with respect to either the original 2340.I flex 2341input file (if the errors are due to code in the input file), or 2342.B lex.yy.c 2343(if the errors are 2344.I flex's 2345fault -- you should report these sorts of errors to the email address 2346given below). 2347.TP 2348.B \-T 2349makes 2350.I flex 2351run in 2352.I trace 2353mode. It will generate a lot of messages to 2354.I stderr 2355concerning 2356the form of the input and the resultant non-deterministic and deterministic 2357finite automata. This option is mostly for use in maintaining 2358.I flex. 2359.TP 2360.B \-V 2361prints the version number to 2362.I stdout 2363and exits. 2364.B \-\-version 2365is a synonym for 2366.B \-V. 
2367.TP 2368.B \-7 2369instructs 2370.I flex 2371to generate a 7-bit scanner, i.e., one which can only recognize 7-bit 2372characters in its input. The advantage of using 2373.B \-7 2374is that the scanner's tables can be up to half the size of those generated 2375using the 2376.B \-8 2377option (see below). The disadvantage is that such scanners often hang 2378or crash if their input contains an 8-bit character. 2379.IP 2380Note, however, that unless you generate your scanner using the 2381.B \-Cf 2382or 2383.B \-CF 2384table compression options, use of 2385.B \-7 2386will save only a small amount of table space, and make your scanner 2387considerably less portable. 2388.I Flex's 2389default behavior is to generate an 8-bit scanner unless you use 2390.B \-Cf 2391or 2392.B \-CF, 2393in which case 2394.I flex 2395defaults to generating 7-bit scanners unless your site was always 2396configured to generate 8-bit scanners (as will often be the case 2397with non-USA sites). You can tell whether flex generated a 7-bit 2398or an 8-bit scanner by inspecting the flag summary in the 2399.B \-v 2400output as described above. 2401.IP 2402Note that if you use 2403.B \-Cfe 2404or 2405.B \-CFe 2406(those table compression options, but also using equivalence classes as 2407discussed below), flex still defaults to generating an 8-bit 2408scanner, since usually with these compression options full 8-bit tables 2409are not much more expensive than 7-bit tables. 2410.TP 2411.B \-8 2412instructs 2413.I flex 2414to generate an 8-bit scanner, i.e., one which can recognize 8-bit 2415characters. This flag is only needed for scanners generated using 2416.B \-Cf 2417or 2418.B \-CF, 2419as otherwise flex defaults to generating an 8-bit scanner anyway. 2420.IP 2421See the discussion of 2422.B \-7 2423above for flex's default behavior and the tradeoffs between 7-bit 2424and 8-bit scanners. 2425.TP 2426.B \-+ 2427specifies that you want flex to generate a C++ 2428scanner class. 
See the section on Generating C++ Scanners below for 2429details. 2430.TP 2431.B \-C[aefFmr] 2432controls the degree of table compression and, more generally, trade-offs 2433between small scanners and fast scanners. 2434.IP 2435.B \-Ca 2436("align") instructs flex to trade off larger tables in the 2437generated scanner for faster performance because the elements of 2438the tables are better aligned for memory access and computation. On some 2439RISC architectures, fetching and manipulating longwords is more efficient 2440than with smaller-sized units such as shortwords. This option can 2441double the size of the tables used by your scanner. 2442.IP 2443.B \-Ce 2444directs 2445.I flex 2446to construct 2447.I equivalence classes, 2448i.e., sets of characters 2449which have identical lexical properties (for example, if the only 2450appearance of digits in the 2451.I flex 2452input is in the character class 2453"[0-9]" then the digits '0', '1', ..., '9' will all be put 2454in the same equivalence class). Equivalence classes usually give 2455dramatic reductions in the final table/object file sizes (typically 2456a factor of 2-5) and are pretty cheap performance-wise (one array 2457look-up per character scanned). 2458.IP 2459.B \-Cf 2460specifies that the 2461.I full 2462scanner tables should be generated - 2463.I flex 2464should not compress the 2465tables by taking advantage of similar transition functions for 2466different states. 2467.IP 2468.B \-CF 2469specifies that the alternate fast scanner representation (described 2470above under the 2471.B \-F 2472flag) 2473should be used. This option cannot be used with 2474.B \-+. 2475.IP 2476.B \-Cm 2477directs 2478.I flex 2479to construct 2480.I meta-equivalence classes, 2481which are sets of equivalence classes (or characters, if equivalence 2482classes are not being used) that are commonly used together. 
Meta-equivalence 2483classes are often a big win when using compressed tables, but they 2484have a moderate performance impact (one or two "if" tests and one 2485array look-up per character scanned). 2486.IP 2487.B \-Cr 2488causes the generated scanner to 2489.I bypass 2490use of the standard I/O library (stdio) for input. Instead of calling 2491.B fread() 2492or 2493.B getc(), 2494the scanner will use the 2495.B read() 2496system call, resulting in a performance gain which varies from system 2497to system, but in general is probably negligible unless you are also using 2498.B \-Cf 2499or 2500.B \-CF. 2501Using 2502.B \-Cr 2503can cause strange behavior if, for example, you read from 2504.I yyin 2505using stdio prior to calling the scanner (because the scanner will miss 2506whatever text your previous reads left in the stdio input buffer). 2507.IP 2508.B \-Cr 2509has no effect if you define 2510.B YY_INPUT 2511(see The Generated Scanner above). 2512.IP 2513A lone 2514.B \-C 2515specifies that the scanner tables should be compressed but neither 2516equivalence classes nor meta-equivalence classes should be used. 2517.IP 2518The options 2519.B \-Cf 2520or 2521.B \-CF 2522and 2523.B \-Cm 2524do not make sense together - there is no opportunity for meta-equivalence 2525classes if the table is not being compressed. Otherwise the options 2526may be freely mixed, and are cumulative. 2527.IP 2528The default setting is 2529.B \-Cem, 2530which specifies that 2531.I flex 2532should generate equivalence classes 2533and meta-equivalence classes. This setting provides the highest 2534degree of table compression. 
You can trade off 2535faster-executing scanners at the cost of larger tables with 2536the following generally being true: 2537.nf 2538 2539 slowest & smallest 2540 -Cem 2541 -Cm 2542 -Ce 2543 -C 2544 -C{f,F}e 2545 -C{f,F} 2546 -C{f,F}a 2547 fastest & largest 2548 2549.fi 2550Note that scanners with the smallest tables are usually generated and 2551compiled the quickest, so 2552during development you will usually want to use the default, maximal 2553compression. 2554.IP 2555.B \-Cfe 2556is often a good compromise between speed and size for production 2557scanners. 2558.TP 2559.B \-ooutput 2560directs flex to write the scanner to the file 2561.B output 2562instead of 2563.B lex.yy.c. 2564If you combine 2565.B \-o 2566with the 2567.B \-t 2568option, then the scanner is written to 2569.I stdout 2570but its 2571.B #line 2572directives (see the 2573.B \\-L 2574option above) refer to the file 2575.B output. 2576.TP 2577.B \-Pprefix 2578changes the default 2579.I "yy" 2580prefix used by 2581.I flex 2582for all globally-visible variable and function names to instead be 2583.I prefix. 2584For example, 2585.B \-Pfoo 2586changes the name of 2587.B yytext 2588to 2589.B footext. 2590It also changes the name of the default output file from 2591.B lex.yy.c 2592to 2593.B lex.foo.c. 2594Here are all of the names affected: 2595.nf 2596 2597 yy_create_buffer 2598 yy_delete_buffer 2599 yy_flex_debug 2600 yy_init_buffer 2601 yy_flush_buffer 2602 yy_load_buffer_state 2603 yy_switch_to_buffer 2604 yyin 2605 yyleng 2606 yylex 2607 yylineno 2608 yyout 2609 yyrestart 2610 yytext 2611 yywrap 2612 2613.fi 2614(If you are using a C++ scanner, then only 2615.B yywrap 2616and 2617.B yyFlexLexer 2618are affected.) 2619Within your scanner itself, you can still refer to the global variables 2620and functions using either version of their name; but externally, they 2621have the modified name. 
2622.IP 2623This option lets you easily link together multiple 2624.I flex 2625programs into the same executable. Note, though, that using this 2626option also renames 2627.B yywrap(), 2628so you now 2629.I must 2630either 2631provide your own (appropriately-named) version of the routine for your 2632scanner, or use 2633.B %option noyywrap, 2634as linking with 2635.B \-ll 2636no longer provides one for you by default. 2637.TP 2638.B \-Sskeleton_file 2639overrides the default skeleton file from which 2640.I flex 2641constructs its scanners. You'll never need this option unless you are doing 2642.I flex 2643maintenance or development. 2644.PP 2645.I flex 2646also provides a mechanism for controlling options within the 2647scanner specification itself, rather than from the flex command-line. 2648This is done by including 2649.B %option 2650directives in the first section of the scanner specification. 2651You can specify multiple options with a single 2652.B %option 2653directive, and multiple directives in the first section of your flex input 2654file. 2655.PP 2656Most options are given simply as names, optionally preceded by the 2657word "no" (with no intervening whitespace) to negate their meaning. 
2658A number are equivalent to flex flags or their negation: 2659.nf 2660 2661 7bit -7 option 2662 8bit -8 option 2663 align -Ca option 2664 backup -b option 2665 batch -B option 2666 c++ -+ option 2667 2668 caseful or 2669 case-sensitive opposite of -i (default) 2670 2671 case-insensitive or 2672 caseless -i option 2673 2674 debug -d option 2675 default opposite of -s option 2676 ecs -Ce option 2677 fast -F option 2678 full -f option 2679 interactive -I option 2680 lex-compat -l option 2681 meta-ecs -Cm option 2682 perf-report -p option 2683 read -Cr option 2684 stdout -t option 2685 verbose -v option 2686 warn opposite of -w option 2687 (use "%option nowarn" for -w) 2688 2689 array equivalent to "%array" 2690 pointer equivalent to "%pointer" (default) 2691 2692.fi 2693Some 2694.B %option's 2695provide features otherwise not available: 2696.TP 2697.B always-interactive 2698instructs flex to generate a scanner which always considers its input 2699"interactive". Normally, on each new input file the scanner calls 2700.B isatty() 2701in an attempt to determine whether 2702the scanner's input source is interactive and thus should be read a 2703character at a time. When this option is used, however, then no 2704such call is made. 2705.TP 2706.B main 2707directs flex to provide a default 2708.B main() 2709program for the scanner, which simply calls 2710.B yylex(). 2711This option implies 2712.B noyywrap 2713(see below). 2714.TP 2715.B never-interactive 2716instructs flex to generate a scanner which never considers its input 2717"interactive" (again, no call made to 2718.B isatty()). 2719This is the opposite of 2720.B always-interactive. 2721.TP 2722.B stack 2723enables the use of start condition stacks (see Start Conditions above). 2724.TP 2725.B stdinit 2726if set (i.e., 2727.B %option stdinit) 2728initializes 2729.I yyin 2730and 2731.I yyout 2732to 2733.I stdin 2734and 2735.I stdout, 2736instead of the default of 2737.I nil. 
2738Some existing 2739.I lex 2740programs depend on this behavior, even though it is not compliant with 2741ANSI C, which does not require 2742.I stdin 2743and 2744.I stdout 2745to be compile-time constant. 2746.TP 2747.B yylineno 2748directs 2749.I flex 2750to generate a scanner that maintains the number of the current line 2751read from its input in the global variable 2752.B yylineno. 2753This option is implied by 2754.B %option lex-compat. 2755.TP 2756.B yywrap 2757if unset (i.e., 2758.B %option noyywrap), 2759makes the scanner not call 2760.B yywrap() 2761upon an end-of-file, but simply assume that there are no more 2762files to scan (until the user points 2763.I yyin 2764at a new file and calls 2765.B yylex() 2766again). 2767.PP 2768.I flex 2769scans your rule actions to determine whether you use the 2770.B REJECT 2771or 2772.B yymore() 2773features. The 2774.B reject 2775and 2776.B yymore 2777options are available to override its decision as to whether you use the 2778options, either by setting them (e.g., 2779.B %option reject) 2780to indicate the feature is indeed used, or 2781unsetting them to indicate it actually is not used 2782(e.g., 2783.B %option noyymore). 2784.PP 2785Three options take string-delimited values, offset with '=': 2786.nf 2787 2788 %option outfile="ABC" 2789 2790.fi 2791is equivalent to 2792.B -oABC, 2793and 2794.nf 2795 2796 %option prefix="XYZ" 2797 2798.fi 2799is equivalent to 2800.B -PXYZ. 2801Finally, 2802.nf 2803 2804 %option yyclass="foo" 2805 2806.fi 2807only applies when generating a C++ scanner ( 2808.B \-+ 2809option). It informs 2810.I flex 2811that you have derived 2812.B foo 2813as a subclass of 2814.B yyFlexLexer, 2815so 2816.I flex 2817will place your actions in the member function 2818.B foo::yylex() 2819instead of 2820.B yyFlexLexer::yylex(). 2821It also generates a 2822.B yyFlexLexer::yylex() 2823member function that emits a run-time error (by invoking 2824.B yyFlexLexer::LexerError()) 2825if called. 
2826See Generating C++ Scanners, below, for additional information. 2827.PP 2828A number of options are available for lint purists who want to suppress 2829the appearance of unneeded routines in the generated scanner. Each of the 2830following, if unset 2831(e.g., 2832.B %option nounput 2833), results in the corresponding routine not appearing in 2834the generated scanner: 2835.nf 2836 2837 input, unput 2838 yy_push_state, yy_pop_state, yy_top_state 2839 yy_scan_buffer, yy_scan_bytes, yy_scan_string 2840 2841.fi 2842(though 2843.B yy_push_state() 2844and friends won't appear anyway unless you use 2845.B %option stack). 2846.SH PERFORMANCE CONSIDERATIONS 2847The main design goal of 2848.I flex 2849is that it generate high-performance scanners. It has been optimized 2850for dealing well with large sets of rules. Aside from the effects on 2851scanner speed of the table compression 2852.B \-C 2853options outlined above, 2854there are a number of options/actions which degrade performance. These 2855are, from most expensive to least: 2856.nf 2857 2858 REJECT 2859 %option yylineno 2860 arbitrary trailing context 2861 2862 pattern sets that require backing up 2863 %array 2864 %option interactive 2865 %option always-interactive 2866 2867 '^' beginning-of-line operator 2868 yymore() 2869 2870.fi 2871with the first three all being quite expensive and the last two 2872being quite cheap. Note also that 2873.B unput() 2874is implemented as a routine call that potentially does quite a bit of 2875work, while 2876.B yyless() 2877is a quite-cheap macro; so if just putting back some excess text you 2878scanned, use 2879.B yyless(). 2880.PP 2881.B REJECT 2882should be avoided at all costs when performance is important. 2883It is a particularly expensive option. 2884.PP 2885Getting rid of backing up is messy and often may be an enormous 2886amount of work for a complicated scanner. In principle, one begins 2887by using the 2888.B \-b 2889flag to generate a 2890.I lex.backup 2891file. 
For example, on the input 2892.nf 2893 2894 %% 2895 foo return TOK_KEYWORD; 2896 foobar return TOK_KEYWORD; 2897 2898.fi 2899the file looks like: 2900.nf 2901 2902 State #6 is non-accepting - 2903 associated rule line numbers: 2904 2 3 2905 out-transitions: [ o ] 2906 jam-transitions: EOF [ \\001-n p-\\177 ] 2907 2908 State #8 is non-accepting - 2909 associated rule line numbers: 2910 3 2911 out-transitions: [ a ] 2912 jam-transitions: EOF [ \\001-` b-\\177 ] 2913 2914 State #9 is non-accepting - 2915 associated rule line numbers: 2916 3 2917 out-transitions: [ r ] 2918 jam-transitions: EOF [ \\001-q s-\\177 ] 2919 2920 Compressed tables always back up. 2921 2922.fi 2923The first few lines tell us that there's a scanner state in 2924which it can make a transition on an 'o' but not on any other 2925character, and that in that state the currently scanned text does not match 2926any rule. The state occurs when trying to match the rules found 2927at lines 2 and 3 in the input file. 2928If the scanner is in that state and then reads 2929something other than an 'o', it will have to back up to find 2930a rule which is matched. With 2931a bit of headscratching one can see that this must be the 2932state it's in when it has seen "fo". When this has happened, 2933if anything other than another 'o' is seen, the scanner will 2934have to back up to simply match the 'f' (by the default rule). 2935.PP 2936The comment regarding State #8 indicates there's a problem 2937when "foob" has been scanned. Indeed, on any character other 2938than an 'a', the scanner will have to back up to accept "foo". 2939Similarly, the comment for State #9 concerns when "fooba" has 2940been scanned and an 'r' does not follow. 2941.PP 2942The final comment reminds us that there's no point going to 2943all the trouble of removing backing up from the rules unless 2944we're using 2945.B \-Cf 2946or 2947.B \-CF, 2948since there's no performance gain doing so with compressed scanners. 
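.PP
The three backing-up states can be reproduced with a small hand-written matcher. The following C sketch is only an illustration of what "backing up" means, not the table-driven code that flex generates; the names scan_result and scan_keyword are hypothetical. It follows the transitions for "foo" and "foobar", remembers the last accepting position, and reports how many characters had to be given back.

```c
#include <stddef.h>

/* Hypothetical illustration only; flex's generated scanner is table driven. */
typedef struct {
    size_t matched;    /* length of the token finally accepted */
    size_t backed_up;  /* characters read past the last accepting position */
} scan_result;

/* Match the rules "foo" and "foobar", falling back to the default rule
 * (any single character) when neither keyword matches. */
scan_result scan_keyword(const char *input)
{
    static const char chain[] = "foobar";    /* the only path of transitions */
    size_t pos = 0;                          /* characters read so far */
    size_t last_accept = input[0] ? 1 : 0;   /* default rule accepts 1 char */
    scan_result r;

    while (pos < sizeof chain - 1 && input[pos] == chain[pos]) {
        ++pos;
        if (pos == 3 || pos == 6)            /* "foo" or "foobar" accepted */
            last_accept = pos;
    }

    r.matched = last_accept;
    r.backed_up = pos > last_accept ? pos - last_accept : 0;
    return r;
}
```

In this sketch the input "fox" stops after "fo" and gives back one character, which corresponds to State #6, while "foobaz" stops after "fooba" and gives back two, which corresponds to State #9.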
2949.PP 2950The way to remove the backing up is to add "error" rules: 2951.nf 2952 2953 %% 2954 foo return TOK_KEYWORD; 2955 foobar return TOK_KEYWORD; 2956 2957 fooba | 2958 foob | 2959 fo { 2960 /* false alarm, not really a keyword */ 2961 return TOK_ID; 2962 } 2963 2964.fi 2965.PP 2966Eliminating backing up among a list of keywords can also be 2967done using a "catch-all" rule: 2968.nf 2969 2970 %% 2971 foo return TOK_KEYWORD; 2972 foobar return TOK_KEYWORD; 2973 2974 [a-z]+ return TOK_ID; 2975 2976.fi 2977This is usually the best solution when appropriate. 2978.PP 2979Backing up messages tend to cascade. 2980With a complicated set of rules it's not uncommon to get hundreds 2981of messages. If one can decipher them, though, it often 2982only takes a dozen or so rules to eliminate the backing up (though 2983it's easy to make a mistake and have an error rule accidentally match 2984a valid token. A possible future 2985.I flex 2986feature will be to automatically add rules to eliminate backing up). 2987.PP 2988It's important to keep in mind that you gain the benefits of eliminating 2989backing up only if you eliminate 2990.I every 2991instance of backing up. Leaving just one means you gain nothing. 2992.PP 2993.I Variable 2994trailing context (where both the leading and trailing parts do not have 2995a fixed length) entails almost the same performance loss as 2996.B REJECT 2997(i.e., substantial). So when possible a rule like: 2998.nf 2999 3000 %% 3001 mouse|rat/(cat|dog) run(); 3002 3003.fi 3004is better written: 3005.nf 3006 3007 %% 3008 mouse/cat|dog run(); 3009 rat/cat|dog run(); 3010 3011.fi 3012or as 3013.nf 3014 3015 %% 3016 mouse|rat/cat run(); 3017 mouse|rat/dog run(); 3018 3019.fi 3020Note that here the special '|' action does 3021.I not 3022provide any savings, and can even make things worse (see 3023Deficiencies / Bugs below). 
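.PP
To see why a fixed-length part helps, consider what the rewritten rules ask of the scanner. The C sketch below is a hypothetical model, not flex's implementation; starts_with and match_head are names invented for the illustration. Each rule has a head of known length, so once the head and one of the trailing alternatives have been seen, the end of the token is known without any search.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: nonzero if s begins with prefix. */
int starts_with(const char *s, const char *prefix)
{
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Sketch of the rules  mouse/cat|dog  and  rat/cat|dog.
 * Returns the length of the token to consume (the head only), or 0 if
 * neither rule matches.  The trailing text is examined, never consumed. */
size_t match_head(const char *input)
{
    static const char *const heads[] = { "mouse", "rat" };
    static const char *const tails[] = { "cat", "dog" };
    size_t h, t;

    for (h = 0; h < 2; ++h) {
        size_t head_len = strlen(heads[h]);   /* fixed for each rule */
        if (!starts_with(input, heads[h]))
            continue;
        for (t = 0; t < 2; ++t)
            if (starts_with(input + head_len, tails[t]))
                return head_len;
    }
    return 0;
}
```

With this model, "mousecat" yields a token of length 5 and leaves "cat" in the input, while "mouse" on its own matches neither rule.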
3024.LP 3025Another area where the user can increase a scanner's performance 3026(and one that's easier to implement) arises from the fact that 3027the longer the tokens matched, the faster the scanner will run. 3028This is because with long tokens the processing of most input 3029characters takes place in the (short) inner scanning loop, and 3030does not often have to go through the additional work of setting up 3031the scanning environment (e.g., 3032.B yytext) 3033for the action. Recall the scanner for C comments: 3034.nf 3035 3036 %x comment 3037 %% 3038 int line_num = 1; 3039 3040 "/*" BEGIN(comment); 3041 3042 <comment>[^*\\n]* 3043 <comment>"*"+[^*/\\n]* 3044 <comment>\\n ++line_num; 3045 <comment>"*"+"/" BEGIN(INITIAL); 3046 3047.fi 3048This could be sped up by writing it as: 3049.nf 3050 3051 %x comment 3052 %% 3053 int line_num = 1; 3054 3055 "/*" BEGIN(comment); 3056 3057 <comment>[^*\\n]* 3058 <comment>[^*\\n]*\\n ++line_num; 3059 <comment>"*"+[^*/\\n]* 3060 <comment>"*"+[^*/\\n]*\\n ++line_num; 3061 <comment>"*"+"/" BEGIN(INITIAL); 3062 3063.fi 3064Now instead of each newline requiring the processing of another 3065action, recognizing the newlines is "distributed" over the other rules 3066to keep the matched text as long as possible. Note that 3067.I adding 3068rules does 3069.I not 3070slow down the scanner! The speed of the scanner is independent 3071of the number of rules or (modulo the considerations given at the 3072beginning of this section) how complicated the rules are with 3073regard to operators such as '*' and '|'. 3074.PP 3075A final example in speeding up a scanner: suppose you want to scan 3076through a file containing identifiers and keywords, one per line 3077and with no other extraneous characters, and recognize all the 3078keywords. A natural first approach is: 3079.nf 3080 3081 %% 3082 asm | 3083 auto | 3084 break | 3085 ... etc ... 
3086 volatile | 3087 while /* it's a keyword */ 3088 3089 .|\\n /* it's not a keyword */ 3090 3091.fi 3092To eliminate the back-tracking, introduce a catch-all rule: 3093.nf 3094 3095 %% 3096 asm | 3097 auto | 3098 break | 3099 ... etc ... 3100 volatile | 3101 while /* it's a keyword */ 3102 3103 [a-z]+ | 3104 .|\\n /* it's not a keyword */ 3105 3106.fi 3107Now, if it's guaranteed that there's exactly one word per line, 3108then we can reduce the total number of matches by a half by 3109merging in the recognition of newlines with that of the other 3110tokens: 3111.nf 3112 3113 %% 3114 asm\\n | 3115 auto\\n | 3116 break\\n | 3117 ... etc ... 3118 volatile\\n | 3119 while\\n /* it's a keyword */ 3120 3121 [a-z]+\\n | 3122 .|\\n /* it's not a keyword */ 3123 3124.fi 3125One has to be careful here, as we have now reintroduced backing up 3126into the scanner. In particular, while 3127.I we 3128know that there will never be any characters in the input stream 3129other than letters or newlines, 3130.I flex 3131can't figure this out, and it will plan for possibly needing to back up 3132when it has scanned a token like "auto" and then the next character 3133is something other than a newline or a letter. Previously it would 3134then just match the "auto" rule and be done, but now it has no "auto" 3135rule, only a "auto\\n" rule. To eliminate the possibility of backing up, 3136we could either duplicate all rules but without final newlines, or, 3137since we never expect to encounter such an input and therefore don't 3138care how it's classified, we can introduce one more catch-all rule, this 3139one which doesn't include a newline: 3140.nf 3141 3142 %% 3143 asm\\n | 3144 auto\\n | 3145 break\\n | 3146 ... etc ... 3147 volatile\\n | 3148 while\\n /* it's a keyword */ 3149 3150 [a-z]+\\n | 3151 [a-z]+ | 3152 .|\\n /* it's not a keyword */ 3153 3154.fi 3155Compiled with 3156.B \-Cf, 3157this is about as fast as one can get a 3158.I flex 3159scanner to go for this particular problem. 
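.PP
The claim that merging the newline into each rule halves the number of matches can be checked with a small model. The following C sketch is a hypothetical illustration, not part of flex; count_separate and count_merged are invented names. It counts how many actions would run when words and newlines are separate tokens, and how many run when each word absorbs its newline.

```c
#include <stddef.h>

/* Hypothetical model: each word and each newline is a separate token,
 * as with the rule set that uses [a-z]+ and a separate newline rule. */
size_t count_separate(const char *input)
{
    size_t actions = 0;
    size_t i = 0;

    while (input[i] != 0) {
        if (input[i] == '\n') {
            ++i;                       /* newline token */
        } else {
            while (input[i] != 0 && input[i] != '\n')
                ++i;                   /* one word token */
        }
        ++actions;
    }
    return actions;
}

/* Hypothetical model: each word absorbs the newline that follows it,
 * as with the rules that end in a newline. */
size_t count_merged(const char *input)
{
    size_t actions = 0;
    size_t i = 0;

    while (input[i] != 0) {
        while (input[i] != 0 && input[i] != '\n')
            ++i;                       /* the word, if any */
        if (input[i] == '\n')
            ++i;                       /* its newline joins the same token */
        ++actions;
    }
    return actions;
}
```

For a three-line input with one word per line, the first function counts six actions and the second counts three.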
3160.PP 3161A final note: 3162.I flex 3163is slow when matching NUL's, particularly when a token contains 3164multiple NUL's. 3165It's best to write rules which match 3166.I short 3167amounts of text if it's anticipated that the text will often include NUL's. 3168.PP 3169Another final note regarding performance: as mentioned above in the section 3170How the Input is Matched, dynamically resizing 3171.B yytext 3172to accommodate huge tokens is a slow process because it presently requires that 3173the (huge) token be rescanned from the beginning. Thus if performance is 3174vital, you should attempt to match "large" quantities of text but not 3175"huge" quantities, where the cutoff between the two is at about 8K 3176characters/token. 3177.SH GENERATING C++ SCANNERS 3178.I flex 3179provides two different ways to generate scanners for use with C++. The 3180first way is to simply compile a scanner generated by 3181.I flex 3182using a C++ compiler instead of a C compiler. You should not encounter 3183any compilation errors (please report any you find to the email address 3184given in the Author section below). You can then use C++ code in your 3185rule actions instead of C code. Note that the default input source for 3186your scanner remains 3187.I yyin, 3188and default echoing is still done to 3189.I yyout. 3190Both of these remain 3191.I FILE * 3192variables and not C++ 3193.I streams. 3194.PP 3195You can also use 3196.I flex 3197to generate a C++ scanner class, using the 3198.B \-+ 3199option (or, equivalently, 3200.B %option c++), 3201which is automatically specified if the name of the flex 3202executable ends in a '+', such as 3203.I flex++. 3204When using this option, flex defaults to generating the scanner to the file 3205.B lex.yy.cc 3206instead of 3207.B lex.yy.c. 3208The generated scanner includes the header file 3209.I FlexLexer.h, 3210which defines the interface to two C++ classes. 
3211.PP 3212The first class, 3213.B FlexLexer, 3214provides an abstract base class defining the general scanner class 3215interface. It provides the following member functions: 3216.TP 3217.B const char* YYText() 3218returns the text of the most recently matched token, the equivalent of 3219.B yytext. 3220.TP 3221.B int YYLeng() 3222returns the length of the most recently matched token, the equivalent of 3223.B yyleng. 3224.TP 3225.B int lineno() const 3226returns the current input line number 3227(see 3228.B %option yylineno), 3229or 3230.B 1 3231if 3232.B %option yylineno 3233was not used. 3234.TP 3235.B void set_debug( int flag ) 3236sets the debugging flag for the scanner, equivalent to assigning to 3237.B yy_flex_debug 3238(see the Options section above). Note that you must build the scanner 3239using 3240.B %option debug 3241to include debugging information in it. 3242.TP 3243.B int debug() const 3244returns the current setting of the debugging flag. 3245.PP 3246Also provided are member functions equivalent to 3247.B yy_switch_to_buffer(), 3248.B yy_create_buffer() 3249(though the first argument is an 3250.B istream* 3251object pointer and not a 3252.B FILE*), 3253.B yy_flush_buffer(), 3254.B yy_delete_buffer(), 3255and 3256.B yyrestart() 3257(again, the first argument is an 3258.B istream* 3259object pointer). 3260.PP 3261The second class defined in 3262.I FlexLexer.h 3263is 3264.B yyFlexLexer, 3265which is derived from 3266.B FlexLexer. 3267It defines the following additional member functions: 3268.TP 3269.B 3270yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 ) 3271constructs a 3272.B yyFlexLexer 3273object using the given streams for input and output. If not specified, 3274the streams default to 3275.B cin 3276and 3277.B cout, 3278respectively. 3279.TP 3280.B virtual int yylex() 3281performs the same role as 3282.B yylex() 3283does for ordinary flex scanners: it scans the input stream, consuming 3284tokens, until a rule's action returns a value. 
If you derive a subclass
.B S
from
.B yyFlexLexer
and want to access the member functions and variables of
.B S
inside
.B yylex(),
then you need to use
.B %option yyclass="S"
to inform
.I flex
that you will be using that subclass instead of
.B yyFlexLexer.
In this case, rather than generating
.B yyFlexLexer::yylex(),
.I flex
generates
.B S::yylex()
(and also generates a dummy
.B yyFlexLexer::yylex()
that calls
.B yyFlexLexer::LexerError()
if called).
.TP
.B
virtual void switch_streams(istream* new_in = 0,
.B
ostream* new_out = 0)
reassigns
.B yyin
to
.B new_in
(if non-nil)
and
.B yyout
to
.B new_out
(ditto), deleting the previous input buffer if
.B yyin
is reassigned.
.TP
.B
int yylex( istream* new_in, ostream* new_out = 0 )
first switches the input streams via
.B switch_streams( new_in, new_out )
and then returns the value of
.B yylex().
.PP
In addition,
.B yyFlexLexer
defines the following protected virtual functions which you can redefine
in derived classes to tailor the scanner:
.TP
.B
virtual int LexerInput( char* buf, int max_size )
reads up to
.B max_size
characters into
.B buf
and returns the number of characters read. To indicate end-of-input,
return 0 characters. Note that "interactive" scanners (see the
.B \-B
and
.B \-I
flags) define the macro
.B YY_INTERACTIVE.
If you redefine
.B LexerInput()
and need to take different actions depending on whether or not
the scanner might be scanning an interactive input source, you can
test for the presence of this name via
.B #ifdef.
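.IP
A minimal sketch of such a redefinition, assuming the
.B LexerInput()
signature shown above (the class name StringLexer and its members are
hypothetical, and the declaration in your installed
.I FlexLexer.h
may differ), reads the scanner's input from a NUL-terminated string:
.nf

    #include <FlexLexer.h>
    #include <string.h>

    // Hypothetical subclass: scans text held in memory.
    class StringLexer : public yyFlexLexer {
    public:
        StringLexer( const char* text )
            : src( text ), left( (int) strlen( text ) )
            { }

    protected:
        // Hand over at most max_size of the remaining characters;
        // returning 0 indicates end-of-input.
        virtual int LexerInput( char* buf, int max_size )
            {
            int n = left < max_size ? left : max_size;
            memcpy( buf, src, n );
            src += n;
            left -= n;
            return n;
            }

    private:
        const char* src;   // next unread character
        int left;          // characters remaining
    };

.fi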
.TP
.B
virtual void LexerOutput( const char* buf, int size )
writes out
.B size
characters from the buffer
.B buf,
which, while NUL-terminated, may also contain "internal" NUL's if
the scanner's rules can match text with NUL's in them.
.TP
.B
virtual void LexerError( const char* msg )
reports a fatal error message. The default version of this function
writes the message to the stream
.B cerr
and exits.
.PP
Note that a
.B yyFlexLexer
object contains its
.I entire
scanning state. Thus you can use such objects to create reentrant
scanners. You can instantiate multiple instances of the same
.B yyFlexLexer
class, and you can also combine multiple C++ scanner classes together
in the same program using the
.B \-P
option discussed above.
.PP
Finally, note that the
.B %array
feature is not available to C++ scanner classes; you must use
.B %pointer
(the default).
.PP
Here is an example of a simple C++ scanner:
.nf

        // An example of using the flex C++ scanner class.

    %{
    int mylineno = 0;
    %}

    string  \\"[^\\n"]+\\"

    ws      [ \\t]+

    alpha   [A-Za-z]
    dig     [0-9]
    name    ({alpha}|{dig}|\\$)({alpha}|{dig}|[_.\\-/$])*
    num1    [-+]?{dig}+\\.?([eE][-+]?{dig}+)?
    num2    [-+]?{dig}*\\.{dig}+([eE][-+]?{dig}+)?
    number  {num1}|{num2}

    %%

    {ws}    /* skip blanks and tabs */

    "/*"    {
            int c;

            while((c = yyinput()) != 0)
                {
                if(c == '\\n')
                    ++mylineno;

                else if(c == '*')
                    {
                    if((c = yyinput()) == '/')
                        break;
                    else
                        unput(c);
                    }
                }
            }

    {number}  cout << "number " << YYText() << '\\n';

    \\n        mylineno++;

    {name}    cout << "name " << YYText() << '\\n';

    {string}  cout << "string " << YYText() << '\\n';

    %%

    int main( int /* argc */, char** /* argv */ )
        {
        FlexLexer* lexer = new yyFlexLexer;
        while(lexer->yylex() != 0)
            ;
        return 0;
        }
.fi
If you want to create multiple (different) lexer classes, you use the
.B \-P
flag (or the
.B prefix=
option) to rename each
.B yyFlexLexer
to some other
.B xxFlexLexer.
You then can include
.B <FlexLexer.h>
in your other sources once per lexer class, first renaming
.B yyFlexLexer
as follows:
.nf

    #undef yyFlexLexer
    #define yyFlexLexer xxFlexLexer
    #include <FlexLexer.h>

    #undef yyFlexLexer
    #define yyFlexLexer zzFlexLexer
    #include <FlexLexer.h>

.fi
if, for example, you used
.B %option prefix="xx"
for one of your scanners and
.B %option prefix="zz"
for the other.
.PP
IMPORTANT: the present form of the scanning class is
.I experimental
and may change considerably between major releases.
.SH INCOMPATIBILITIES WITH LEX AND POSIX
.I flex
is a rewrite of the AT&T Unix
.I lex
tool (the two implementations do not share any code, though),
with some extensions and incompatibilities, both of which
are of concern to those who wish to write scanners acceptable
to either implementation.
Flex is fully compliant with the POSIX
.I lex
specification, except that when using
.B %pointer
(the default), a call to
.B unput()
destroys the contents of
.B yytext,
which is counter to the POSIX specification.
.PP
In this section we discuss all of the known areas of incompatibility
between flex, AT&T lex, and the POSIX specification.
.PP
.I flex's
.B \-l
option turns on maximum compatibility with the original AT&T
.I lex
implementation, at the cost of a major loss in the generated scanner's
performance. We note below which incompatibilities can be overcome
using the
.B \-l
option.
.PP
.I flex
is fully compatible with
.I lex
with the following exceptions:
.IP -
The undocumented
.I lex
scanner internal variable
.B yylineno
is not supported unless
.B \-l
or
.B %option yylineno
is used.
.IP
.B yylineno
should be maintained on a per-buffer basis, rather than a per-scanner
(single global variable) basis.
.IP
.B yylineno
is not part of the POSIX specification.
.IP -
The
.B input()
routine is not redefinable, though it may be called to read characters
following whatever has been matched by a rule. If
.B input()
encounters an end-of-file the normal
.B yywrap()
processing is done. A ``real'' end-of-file is returned by
.B input()
as
.I EOF.
.IP
Input is instead controlled by defining the
.B YY_INPUT
macro.
.IP
The
.I flex
restriction that
.B input()
cannot be redefined is in accordance with the POSIX specification,
which simply does not specify any way of controlling the
scanner's input other than by making an initial assignment to
.I yyin.
.IP -
The
.B unput()
routine is not redefinable. This restriction is in accordance with POSIX.
.IP -
.I flex
scanners are not as reentrant as
.I lex
scanners. In particular, if you have an interactive scanner and
an interrupt handler which long-jumps out of the scanner, and
the scanner is subsequently called again, you may get the following
message:
.nf

    fatal flex scanner internal error--end of buffer missed

.fi
To reenter the scanner, first use
.nf

    yyrestart( yyin );

.fi
Note that this call will throw away any buffered input; usually this
isn't a problem with an interactive scanner.
.IP
Also note that flex C++ scanner classes
.I are
reentrant, so if using C++ is an option for you, you should use
them instead. See "Generating C++ Scanners" above for details.
.IP -
.B output()
is not supported.
Output from the
.B ECHO
macro is done to the file-pointer
.I yyout
(default
.I stdout).
.IP
.B output()
is not part of the POSIX specification.
.IP -
.I lex
does not support exclusive start conditions (%x), though they
are in the POSIX specification.
.IP -
When definitions are expanded,
.I flex
encloses them in parentheses.
With lex, the following:
.nf

    NAME    [A-Z][A-Z0-9]*
    %%
    foo{NAME}?      printf( "Found it\\n" );
    %%

.fi
will not match the string "foo" because when the macro
is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
and the precedence is such that the '?' is associated with
"[A-Z0-9]*". With
.I flex,
the rule will be expanded to
"foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match.
.IP
Note that if the definition begins with
.B ^
or ends with
.B $
then it is
.I not
expanded with parentheses, to allow these operators to appear in
definitions without losing their special meanings.
But the
.B <s>, /,
and
.B <<EOF>>
operators cannot be used in a
.I flex
definition.
.IP
Using
.B \-l
results in the
.I lex
behavior of no parentheses around the definition.
.IP
The POSIX specification is that the definition be enclosed in parentheses.
.IP -
Some implementations of
.I lex
allow a rule's action to begin on a separate line, if the rule's pattern
has trailing whitespace:
.nf

    %%
    foo|bar<space here>
      { foobar_action(); }

.fi
.I flex
does not support this feature.
.IP -
The
.I lex
.B %r
(generate a Ratfor scanner) option is not supported. It is not part
of the POSIX specification.
.IP -
After a call to
.B unput(),
.I yytext
is undefined until the next token is matched, unless the scanner
was built using
.B %array.
This is not the case with
.I lex
or the POSIX specification. The
.B \-l
option does away with this incompatibility.
.IP -
The precedence of the
.B {}
(numeric range) operator is different.
.I lex
interprets "abc{1,3}" as "match one, two, or
three occurrences of 'abc'", whereas
.I flex
interprets it as "match 'ab'
followed by one, two, or three occurrences of 'c'". The latter is
in agreement with the POSIX specification.
.IP -
The precedence of the
.B ^
operator is different.
.I lex
interprets "^foo|bar" as "match either 'foo' at the beginning of a line,
or 'bar' anywhere", whereas
.I flex
interprets it as "match either 'foo' or 'bar' if they come at the beginning
of a line". The latter is in agreement with the POSIX specification.
.IP -
The special table-size declarations such as
.B %a
supported by
.I lex
are not required by
.I flex
scanners;
.I flex
ignores them.
.IP -
The name
.B FLEX_SCANNER
is #define'd so scanners may be written for use with either
.I flex
or
.I lex.
Scanners also include
.B YY_FLEX_MAJOR_VERSION
and
.B YY_FLEX_MINOR_VERSION
indicating which version of
.I flex
generated the scanner
(for example, for the 2.5 release, these defines would be 2 and 5
respectively).
.PP
The following
.I flex
features are not included in
.I lex
or the POSIX specification:
.nf

    C++ scanners
    %option
    start condition scopes
    start condition stacks
    interactive/non-interactive scanners
    yy_scan_string() and friends
    yyterminate()
    yy_set_interactive()
    yy_set_bol()
    YY_AT_BOL()
    <<EOF>>
    <*>
    YY_DECL
    YY_START
    YY_USER_ACTION
    YY_USER_INIT
    #line directives
    %{}'s around actions
    multiple actions on a line

.fi
plus almost all of the flex flags.
The last feature in the list refers to the fact that with
.I flex
you can put multiple actions on the same line, separated with
semi-colons, while with
.I lex,
the following
.nf

    foo    handle_foo(); ++num_foos_seen;

.fi
is (rather surprisingly) truncated to
.nf

    foo    handle_foo();

.fi
.I flex
does not truncate the action. Actions that are not enclosed in
braces are simply terminated at the end of the line.
.SH DIAGNOSTICS
.I warning, rule cannot be matched
indicates that the given rule
cannot be matched because it follows other rules that will
always match the same text as it. For
example, in the following "foo" cannot be matched because it comes after
an identifier "catch-all" rule:
.nf

    [a-z]+    got_identifier();
    foo       got_foo();

.fi
Using
.B REJECT
in a scanner suppresses this warning.
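.PP
A minimal sketch of the usual remedy, reusing the placeholder actions
got_identifier() and got_foo() from the example above: list the more
specific rule first. When two rules match the same amount of text,
.I flex
chooses the one listed first, so "foo" reaches got_foo() while longer
identifiers such as "food" still reach got_identifier():
.nf

    foo       got_foo();
    [a-z]+    got_identifier();

.fi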
.PP
.I warning,
.B \-s
.I
option given but default rule can be matched
means that it is possible (perhaps only in a particular start condition)
that the default rule (match any single character) is the only one
that will match a particular input. Since
.B \-s
was given, presumably this is not intended.
.PP
.I reject_used_but_not_detected undefined
or
.I yymore_used_but_not_detected undefined -
These errors can occur at compile time. They indicate that the
scanner uses
.B REJECT
or
.B yymore()
but that
.I flex
failed to notice the fact, meaning that
.I flex
scanned the first two sections looking for occurrences of these actions
and failed to find any, but somehow you snuck some in (via a #include
file, for example). Use
.B %option reject
or
.B %option yymore
to indicate to flex that you really do use these features.
.PP
.I flex scanner jammed -
a scanner compiled with
.B \-s
has encountered an input string which wasn't matched by
any of its rules. This error can also occur due to internal problems.
.PP
.I token too large, exceeds YYLMAX -
your scanner uses
.B %array
and one of its rules matched a string longer than the
.B YYLMAX
constant (8K bytes by default). You can increase the value by
#define'ing
.B YYLMAX
in the definitions section of your
.I flex
input.
.PP
.I scanner requires \-8 flag to
.I use the character 'x' -
Your scanner specification includes recognizing the 8-bit character
.I 'x'
and you did not specify the \-8 flag, and your scanner defaulted to 7-bit
because you used the
.B \-Cf
or
.B \-CF
table compression options. See the discussion of the
.B \-7
flag for details.
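.PP
As a minimal sketch of the
.B YYLMAX
remedy described above (the value 16384 is an arbitrary assumption chosen
only for illustration), the definitions section of a scanner that uses
.B %array
might begin:
.nf

    %{
    /* Raise the %array token limit above the 8K default. */
    #define YYLMAX 16384
    %}
    %array

.fi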
.PP
.I flex scanner push-back overflow -
you used
.B unput()
to push back so much text that the scanner's buffer could not hold
both the pushed-back text and the current token in
.B yytext.
Ideally the scanner should dynamically resize the buffer in this case, but at
present it does not.
.PP
.I
input buffer overflow, can't enlarge buffer because scanner uses REJECT -
the scanner was working on matching an extremely large token and needed
to expand the input buffer. This doesn't work with scanners that use
.B
REJECT.
.PP
.I
fatal flex scanner internal error--end of buffer missed -
This can occur in a scanner which is reentered after a long-jump
has jumped out (or over) the scanner's activation frame. Before
reentering the scanner, use:
.nf

    yyrestart( yyin );

.fi
or, as noted above, switch to using the C++ scanner class.
.PP
.I too many start conditions in <> construct! -
you listed more start conditions in a <> construct than exist (so
you must have listed at least one of them twice).
.SH FILES
.TP
.B \-ll
library with which scanners must be linked.
.TP
.I lex.yy.c
generated scanner (called
.I lexyy.c
on some systems).
.TP
.I lex.yy.cc
generated C++ scanner class, when using
.B -+.
.TP
.I <FlexLexer.h>
header file defining the C++ scanner base class,
.B FlexLexer,
and its derived class,
.B yyFlexLexer.
.TP
.I flex.skl
skeleton scanner. This file is only used when building flex, not when
flex executes.
.TP
.I lex.backup
backing-up information for
.B \-b
flag (called
.I lex.bck
on some systems).
.SH DEFICIENCIES / BUGS
Some trailing context
patterns cannot be properly matched and generate
warning messages ("dangerous trailing context"). These are
patterns where the ending of the
first part of the rule matches the beginning of the second
part, such as "zx*/xy*", where the 'x*' matches the 'x' at
the beginning of the trailing context. (Note that the POSIX draft
states that the text matched by such patterns is undefined.)
.PP
For some trailing context rules, parts which are actually fixed-length are
not recognized as such, leading to the above mentioned performance loss.
In particular, parts using '|' or {n} (such as "foo{3}") are always
considered variable-length.
.PP
Combining trailing context with the special '|' action can result in
.I fixed
trailing context being turned into the more expensive
.I variable
trailing context.
This happens, for example, in the following:
.nf

    %%
    abc      |
    xyz/def

.fi
.PP
Use of
.B unput()
invalidates yytext and yyleng, unless the
.B %array
directive
or the
.B \-l
option has been used.
.PP
Pattern-matching of NUL's is substantially slower than matching other
characters.
.PP
Dynamic resizing of the input buffer is slow, as it entails rescanning
all the text matched so far by the current (generally huge) token.
.PP
Due to both buffering of input and read-ahead, you cannot intermix
calls to <stdio.h> routines, such as, for example,
.B getchar(),
with
.I flex
rules and expect it to work. Call
.B input()
instead.
.PP
The total number of table entries listed by the
.B \-v
flag excludes the number of table entries needed to determine
what rule has been matched. The number of entries is equal
to the number of DFA states if the scanner does not use
.B REJECT,
and somewhat greater than the number of states if it does.
.PP
.B REJECT
cannot be used with the
.B \-f
or
.B \-F
options.
.PP
The
.I flex
internal algorithms need documentation.
.SH SEE ALSO
lex(1), yacc(1), sed(1), awk(1).
.PP
John Levine, Tony Mason, and Doug Brown,
.I Lex & Yacc,
O'Reilly and Associates. Be sure to get the 2nd edition.
.PP
M. E. Lesk and E. Schmidt,
.I LEX \- Lexical Analyzer Generator
.PP
Alfred Aho, Ravi Sethi and Jeffrey Ullman,
.I Compilers: Principles, Techniques and Tools,
Addison-Wesley (1986). Describes the pattern-matching techniques used by
.I flex
(deterministic finite automata).
.SH AUTHOR
Vern Paxson, with the help of many ideas and much inspiration from
Van Jacobson. Original version by Jef Poskanzer. The fast table
representation is a partial implementation of a design done by Van
Jacobson.
The implementation was done by Kevin Gong and Vern Paxson.
.PP
Thanks to the many
.I flex
beta-testers, feedbackers, and contributors, especially Francois Pinard,
Casey Leedom,
Robert Abramovitz,
Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai,
Neal Becker, Nelson H.F. Beebe, benson@odi.com,
Karl Berry, Peter A. Bigot, Simon Blanchard,
Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher,
Brian Clapper, J.T. Conklin,
Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David
Daniels, Chris G. Demetriou, Theo Deraadt,
Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl,
Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz,
Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel,
Jan Hajic, Charles Hemphill, NORO Hideo,
Jarkko Hietaniemi, Scott Hofmann,
Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
Amir Katz, ken@ken.hilco.com, Kevin B. Kenny,
Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht,
Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle,
David Loffredo, Mike Long,
Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall,
Bengt Martensson, Chris Metcalf,
Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
G.T.
Nicol, Landon Noll, James Nordby, Marc Nozell,
Richard Ohnemus, Karsten Pahnke,
Sven Panne, Roland Pesch, Walter Pelissero, Gaumond
Pierre, Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha,
Frederic Raimbault, Pat Rankin, Rick Richardson,
Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini,
Andreas Scherer, Darrell Schiebel, Raf Schietekat,
Doug Schmidt, Philippe Schnoebelen, Andreas Schwab,
Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Strvmquist,
Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor,
Chris Thewalt, Richard M. Timoney, Jodi Tsai,
Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams, Ken
Yap, Ron Zellar, Nathan Zelle, David Zuhn,
and those whose names have slipped my marginal
mail-archiving skills but whose contributions are appreciated all the
same.
.PP
Thanks to Keith Bostic, Jon Forrest, Noah Friedman,
John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various
distribution headaches.
.PP
Thanks to Esmond Pitt and Earle Horton for 8-bit character support; to
Benson Margulies and Fred Burke for C++ support; to Kent Williams and Tom
Epperly for C++ class support; to Ove Ewerlid for support of NUL's; and to
Eric Hughes for support of multiple buffers.
.PP
This work was primarily done when I was with the Real Time Systems Group
at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there
for the support I received.
.PP
Send comments to vern@ee.lbl.gov.