lex.1 (2309) | lex.1 (16519) |
---|---|
1.TH FLEX 1 "November 1993" "Version 2.4" | 1.TH FLEX 1 "April 1995" "Version 2.5" |
2.SH NAME 3flex \- fast lexical analyzer generator 4.SH SYNOPSIS 5.B flex | 2.SH NAME 3flex \- fast lexical analyzer generator 4.SH SYNOPSIS 5.B flex |
6.B [\-bcdfhilnpstvwBFILTV78+ \-C[aefFmr] \-Pprefix \-Sskeleton] | 6.B [\-bcdfhilnpstvwBFILTV78+? \-C[aefFmr] \-ooutput \-Pprefix \-Sskeleton] 7.B [\-\-help \-\-version] |
7.I [filename ...] | 8.I [filename ...] |
9.SH OVERVIEW 10This manual describes 11.I flex, 12a tool for generating programs that perform pattern-matching on text. The 13manual includes both tutorial and reference sections: 14.nf 15 16 Description 17 a brief overview of the tool 18 19 Some Simple Examples 20 21 Format Of The Input File 22 23 Patterns 24 the extended regular expressions used by flex 25 26 How The Input Is Matched 27 the rules for determining what has been matched 28 29 Actions 30 how to specify what to do when a pattern is matched 31 32 The Generated Scanner 33 details regarding the scanner that flex produces; 34 how to control the input source 35 36 Start Conditions 37 introducing context into your scanners, and 38 managing "mini-scanners" 39 40 Multiple Input Buffers 41 how to manipulate multiple input sources; how to 42 scan from strings instead of files 43 44 End-of-file Rules 45 special rules for matching the end of the input 46 47 Miscellaneous Macros 48 a summary of macros available to the actions 49 50 Values Available To The User 51 a summary of values available to the actions 52 53 Interfacing With Yacc 54 connecting flex scanners together with yacc parsers 55 56 Options 57 flex command-line options, and the "%option" 58 directive 59 60 Performance Considerations 61 how to make your scanner go as fast as possible 62 63 Generating C++ Scanners 64 the (experimental) facility for generating C++ 65 scanner classes 66 67 Incompatibilities With Lex And POSIX 68 how flex differs from AT&T lex and the POSIX lex 69 standard 70 71 Diagnostics 72 those error messages produced by flex (or scanners 73 it generates) whose meanings might not be apparent 74 75 Files 76 files used by flex 77 78 Deficiencies / Bugs 79 known problems with flex 80 81 See Also 82 other documentation, related tools 83 84 Author 85 includes contact information 86 87.fi |
|
8.SH DESCRIPTION 9.I flex 10is a tool for generating 11.I scanners: 12programs which recognize lexical patterns in text. 13.I flex 14reads 15the given input files, or its standard input if no file names are given, --- 6 unchanged lines hidden --- 22which defines a routine 23.B yylex(). 24This file is compiled and linked with the 25.B \-ll 26library to produce an executable. When the executable is run, 27it analyzes its input for occurrences 28of the regular expressions. Whenever it finds one, it executes 29the corresponding C code. | 88.SH DESCRIPTION 89.I flex 90is a tool for generating 91.I scanners: 92programs which recognize lexical patterns in text. 93.I flex 94reads 95the given input files, or its standard input if no file names are given, --- 6 unchanged lines hidden --- 102which defines a routine 103.B yylex(). 104This file is compiled and linked with the 105.B \-ll 106library to produce an executable. When the executable is run, 107it analyzes its input for occurrences 108of the regular expressions. Whenever it finds one, it executes 109the corresponding C code. |
110.SH SOME SIMPLE EXAMPLES |
|
30.PP | 111.PP |
31For full documentation, see 32.B lexdoc(1). 33This manual entry is intended for use as a quick reference. | 112First some simple examples to get the flavor of how one uses 113.I flex. 114The following 115.I flex 116input specifies a scanner which whenever it encounters the string 117"username" will replace it with the user's login name: 118.nf 119 120 %% 121 username printf( "%s", getlogin() ); 122 123.fi 124By default, any text not matched by a 125.I flex 126scanner 127is copied to the output, so the net effect of this scanner is 128to copy its input file to its output with each occurrence 129of "username" expanded. 130In this input, there is just one rule. "username" is the 131.I pattern 132and the "printf" is the 133.I action. 134The "%%" marks the beginning of the rules. 135.PP 136Here's another simple example: 137.nf 138 139 int num_lines = 0, num_chars = 0; 140 141 %% 142 \\n ++num_lines; ++num_chars; 143 . ++num_chars; 144 145 %% 146 main() 147 { 148 yylex(); 149 printf( "# of lines = %d, # of chars = %d\\n", 150 num_lines, num_chars ); 151 } 152 153.fi 154This scanner counts the number of characters and the number 155of lines in its input (it produces no output other than the 156final report on the counts). The first line 157declares two globals, "num_lines" and "num_chars", which are accessible 158both inside 159.B yylex() 160and in the 161.B main() 162routine declared after the second "%%". There are two rules, one 163which matches a newline ("\\n") and increments both the line count and 164the character count, and one which matches any character other than 165a newline (indicated by the "." regular expression). 
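As a cross-check on what the line-counting scanner above computes, here is a minimal hand-written C sketch of the same tally. It is not flex output and does not use the generated scanner; the names `struct counts` and `count_text` are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: the two counters kept by the scanner above. */
struct counts {
    int num_lines;
    int num_chars;
};

/* Tally text the way the two rules do: "\n" bumps both counters,
   "." (any other character) bumps only num_chars. */
struct counts count_text(const char *text)
{
    struct counts c = { 0, 0 };
    const char *p;

    assert(text != NULL);
    for (p = text; *p != '\0'; ++p) {
        if (*p == '\n')
            ++c.num_lines;
        ++c.num_chars;
    }
    return c;
}
```

Every character, newline or not, is counted exactly once in `num_chars`, which mirrors the fact that exactly one of the two rules fires for each input character.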
166.PP 167A somewhat more complicated example: 168.nf 169 170 /* scanner for a toy Pascal-like language */ 171 172 %{ 173 /* need this for the call to atof() below */ 174 #include <math.h> 175 %} 176 177 DIGIT [0-9] 178 ID [a-z][a-z0-9]* 179 180 %% 181 182 {DIGIT}+ { 183 printf( "An integer: %s (%d)\\n", yytext, 184 atoi( yytext ) ); 185 } 186 187 {DIGIT}+"."{DIGIT}* { 188 printf( "A float: %s (%g)\\n", yytext, 189 atof( yytext ) ); 190 } 191 192 if|then|begin|end|procedure|function { 193 printf( "A keyword: %s\\n", yytext ); 194 } 195 196 {ID} printf( "An identifier: %s\\n", yytext ); 197 198 "+"|"-"|"*"|"/" printf( "An operator: %s\\n", yytext ); 199 200 "{"[^}\\n]*"}" /* eat up one-line comments */ 201 202 [ \\t\\n]+ /* eat up whitespace */ 203 204 . printf( "Unrecognized character: %s\\n", yytext ); 205 206 %% 207 208 main( argc, argv ) 209 int argc; 210 char **argv; 211 { 212 ++argv, --argc; /* skip over program name */ 213 if ( argc > 0 ) 214 yyin = fopen( argv[0], "r" ); 215 else 216 yyin = stdin; 217 218 yylex(); 219 } 220 221.fi 222This is the beginnings of a simple scanner for a language like 223Pascal. It identifies different types of 224.I tokens 225and reports on what it has seen. 226.PP 227The details of this example will be explained in the following 228sections. 229.SH FORMAT OF THE INPUT FILE 230The 231.I flex 232input file consists of three sections, separated by a line with just 233.B %% 234in it: 235.nf 236 237 definitions 238 %% 239 rules 240 %% 241 user code 242 243.fi 244The 245.I definitions 246section contains declarations of simple 247.I name 248definitions to simplify the scanner specification, and declarations of 249.I start conditions, 250which are explained in a later section. 251.PP 252Name definitions have the form: 253.nf 254 255 name definition 256 257.fi 258The "name" is a word beginning with a letter or an underscore ('_') 259followed by zero or more letters, digits, '_', or '-' (dash). 
260The definition is taken to begin at the first non-white-space character 261following the name and continuing to the end of the line. 262The definition can subsequently be referred to using "{name}", which 263will expand to "(definition)". For example, 264.nf 265 266 DIGIT [0-9] 267 ID [a-z][a-z0-9]* 268 269.fi 270defines "DIGIT" to be a regular expression which matches a 271single digit, and 272"ID" to be a regular expression which matches a letter 273followed by zero-or-more letters-or-digits. 274A subsequent reference to 275.nf 276 277 {DIGIT}+"."{DIGIT}* 278 279.fi 280is identical to 281.nf 282 283 ([0-9])+"."([0-9])* 284 285.fi 286and matches one-or-more digits followed by a '.' followed 287by zero-or-more digits. 288.PP 289The 290.I rules 291section of the 292.I flex 293input contains a series of rules of the form: 294.nf 295 296 pattern action 297 298.fi 299where the pattern must be unindented and the action must begin 300on the same line. 301.PP 302See below for a further description of patterns and actions. 303.PP 304Finally, the user code section is simply copied to 305.B lex.yy.c 306verbatim. 307It is used for companion routines which call or are called 308by the scanner. The presence of this section is optional; 309if it is missing, the second 310.B %% 311in the input file may be skipped, too. 312.PP 313In the definitions and rules sections, any 314.I indented 315text or text enclosed in 316.B %{ 317and 318.B %} 319is copied verbatim to the output (with the %{}'s removed). 320The %{}'s must appear unindented on lines by themselves. 321.PP 322In the rules section, 323any indented or %{} text appearing before the 324first rule may be used to declare variables 325which are local to the scanning routine and (after the declarations) 326code which is to be executed whenever the scanning routine is entered. 
327Other indented or %{} text in the rule section is still copied to the output, 328but its meaning is not well-defined and it may well cause compile-time 329errors (this feature is present for 330.I POSIX 331compliance; see below for other such features). 332.PP 333In the definitions section (but not in the rules section), 334an unindented comment (i.e., a line 335beginning with "/*") is also copied verbatim to the output up 336to the next "*/". 337.SH PATTERNS 338The patterns in the input are written using an extended set of regular 339expressions. These are: 340.nf 341 342 x match the character 'x' 343 . any character (byte) except newline 344 [xyz] a "character class"; in this case, the pattern 345 matches either an 'x', a 'y', or a 'z' 346 [abj-oZ] a "character class" with a range in it; matches 347 an 'a', a 'b', any letter from 'j' through 'o', 348 or a 'Z' 349 [^A-Z] a "negated character class", i.e., any character 350 but those in the class. In this case, any 351 character EXCEPT an uppercase letter. 352 [^A-Z\\n] any character EXCEPT an uppercase letter or 353 a newline 354 r* zero or more r's, where r is any regular expression 355 r+ one or more r's 356 r? zero or one r's (that is, "an optional r") 357 r{2,5} anywhere from two to five r's 358 r{2,} two or more r's 359 r{4} exactly 4 r's 360 {name} the expansion of the "name" definition 361 (see above) 362 "[xyz]\\"foo" 363 the literal string: [xyz]"foo 364 \\X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', 365 then the ANSI-C interpretation of \\x. 
366 Otherwise, a literal 'X' (used to escape 367 operators such as '*') 368 \\0 a NUL character (ASCII code 0) 369 \\123 the character with octal value 123 370 \\x2a the character with hexadecimal value 2a 371 (r) match an r; parentheses are used to override 372 precedence (see below) 373 374 375 rs the regular expression r followed by the 376 regular expression s; called "concatenation" 377 378 379 r|s either an r or an s 380 381 382 r/s an r but only if it is followed by an s. The 383 text matched by s is included when determining 384 whether this rule is the "longest match", 385 but is then returned to the input before 386 the action is executed. So the action only 387 sees the text matched by r. This type 388 of pattern is called "trailing context". 389 (There are some combinations of r/s that flex 390 cannot match correctly; see notes in the 391 Deficiencies / Bugs section below regarding 392 "dangerous trailing context".) 393 ^r an r, but only at the beginning of a line (i.e., 394 when just starting to scan, or right after a 395 newline has been scanned). 396 r$ an r, but only at the end of a line (i.e., just 397 before a newline). Equivalent to "r/\\n". 398 399 Note that flex's notion of "newline" is exactly 400 whatever the C compiler used to compile flex 401 interprets '\\n' as; in particular, on some DOS 402 systems you must either filter out \\r's in the 403 input yourself, or explicitly use r/\\r\\n for "r$". 404 405 406 <s>r an r, but only in start condition s (see 407 below for discussion of start conditions) 408 <s1,s2,s3>r 409 same, but in any of start conditions s1, 410 s2, or s3 411 <*>r an r in any start condition, even an exclusive one. 
412 413 414 <<EOF>> an end-of-file 415 <s1,s2><<EOF>> 416 an end-of-file when in start condition s1 or s2 417 418.fi 419Note that inside of a character class, all regular expression operators 420lose their special meaning except escape ('\\') and the character class 421operators, '-', ']', and, at the beginning of the class, '^'. 422.PP 423The regular expressions listed above are grouped according to 424precedence, from highest precedence at the top to lowest at the bottom. 425Those grouped together have equal precedence. For example, 426.nf 427 428 foo|bar* 429 430.fi 431is the same as 432.nf 433 434 (foo)|(ba(r*)) 435 436.fi 437since the '*' operator has higher precedence than concatenation, 438and concatenation higher than alternation ('|'). This pattern 439therefore matches 440.I either 441the string "foo" 442.I or 443the string "ba" followed by zero-or-more r's. 444To match "foo" or zero-or-more "bar"'s, use: 445.nf 446 447 foo|(bar)* 448 449.fi 450and to match zero-or-more "foo"'s-or-"bar"'s: 451.nf 452 453 (foo|bar)* 454 455.fi 456.PP 457In addition to characters and ranges of characters, character classes 458can also contain character class 459.I expressions. 460These are expressions enclosed inside 461.B [: 462and 463.B :] 464delimiters (which themselves must appear between the '[' and ']' of the 465character class; other elements may occur inside the character class, too). 466The valid expressions are: 467.nf 468 469 [:alnum:] [:alpha:] [:blank:] 470 [:cntrl:] [:digit:] [:graph:] 471 [:lower:] [:print:] [:punct:] 472 [:space:] [:upper:] [:xdigit:] 473 474.fi 475These expressions all designate a set of characters equivalent to 476the corresponding standard C 477.B isXXX 478function. For example, 479.B [:alnum:] 480designates those characters for which 481.B isalnum() 482returns true - i.e., any alphabetic or numeric. 483Some systems don't provide 484.B isblank(), 485so flex defines 486.B [:blank:] 487as a blank or a tab. 
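The correspondence between character class expressions and the C isXXX functions can be sketched directly in C. This is an illustration only, assuming the default "C" locale and 7-bit character codes; the function names are hypothetical and are not part of flex.

```c
#include <ctype.h>

/* [:blank:] as the text defines it: a blank or a tab. */
int in_blank(int c)
{
    return c == ' ' || c == '\t';
}

/* [[:alpha:][:digit:]]: the union of two class expressions. */
int in_alpha_or_digit(int c)
{
    return isalpha(c) || isdigit(c);
}

/* [[:alnum:]]: a single class expression. */
int in_alnum(int c)
{
    return isalnum(c) != 0;
}

/* Returns 1 if the two classes agree on every 7-bit character code. */
int classes_agree(void)
{
    int c;

    for (c = 0; c < 128; ++c)
        if (in_alpha_or_digit(c) != in_alnum(c))
            return 0;
    return 1;
}
```

Because `isalnum()` is true exactly when `isalpha()` or `isdigit()` is true, a class written as the union of `[:alpha:]` and `[:digit:]` accepts the same characters as `[:alnum:]`.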
488.PP 489For example, the following character classes are all equivalent: 490.nf 491 492 [[:alnum:]] 493 [[:alpha:][:digit:]] 494 [[:alpha:]0-9] 495 [a-zA-Z0-9] 496 497.fi 498If your scanner is case-insensitive (the 499.B \-i 500flag), then 501.B [:upper:] 502and 503.B [:lower:] 504are equivalent to 505.B [:alpha:]. 506.PP 507Some notes on patterns: 508.IP - 509A negated character class such as the example "[^A-Z]" 510above 511.I will match a newline 512unless "\\n" (or an equivalent escape sequence) is one of the 513characters explicitly present in the negated character class 514(e.g., "[^A-Z\\n]"). This is unlike how many other regular 515expression tools treat negated character classes, but unfortunately 516the inconsistency is historically entrenched. 517Matching newlines means that a pattern like [^"]* can match the entire 518input unless there's another quote in the input. 519.IP - 520A rule can have at most one instance of trailing context (the '/' operator 521or the '$' operator). The start condition, '^', and "<<EOF>>" patterns 522can only occur at the beginning of a pattern, and, as well as with '/' and '$', 523cannot be grouped inside parentheses. A '^' which does not occur at 524the beginning of a rule or a '$' which does not occur at the end of 525a rule loses its special properties and is treated as a normal character. 526.IP 527The following are illegal: 528.nf 529 530 foo/bar$ 531 <sc1>foo<sc2>bar 532 533.fi 534Note that the first of these can be written "foo/bar\\n". 535.IP 536The following will result in '$' or '^' being treated as a normal character: 537.nf 538 539 foo|(bar$) 540 foo|^bar 541 542.fi 543If what's wanted is a "foo" or a bar-followed-by-a-newline, the following 544could be used (the special '|' action is explained below): 545.nf 546 547 foo | 548 bar$ /* action goes here */ 549 550.fi 551A similar trick will work for matching a foo or a 552bar-at-the-beginning-of-a-line. 
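The precedence rules given earlier in this section (repetition binds tighter than concatenation, which binds tighter than alternation) can be tried out without building a scanner. The sketch below uses the POSIX regcomp() and regexec() functions as a stand-in, on the assumption that POSIX extended regular expressions rank these three operators the same way; it does not exercise flex's own matcher, and the helper name `matches_whole` is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if text matches the extended regular expression pattern,
   0 if it does not, and -1 if the pattern fails to compile.
   The caller anchors the pattern with ^ and $ to test the whole string. */
int matches_whole(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With the pattern "^(foo|bar*)$" the string "barrr" is accepted and "barbar" is rejected, since the '*' applies only to the final 'r'. With "^(foo|(bar)*)$" the outcome is reversed.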
553.SH HOW THE INPUT IS MATCHED 554When the generated scanner is run, it analyzes its input looking 555for strings which match any of its patterns. If it finds more than 556one match, it takes the one matching the most text (for trailing 557context rules, this includes the length of the trailing part, even 558though it will then be returned to the input). If it finds two 559or more matches of the same length, the 560rule listed first in the 561.I flex 562input file is chosen. 563.PP 564Once the match is determined, the text corresponding to the match 565(called the 566.I token) 567is made available in the global character pointer 568.B yytext, 569and its length in the global integer 570.B yyleng. 571The 572.I action 573corresponding to the matched pattern is then executed (a more 574detailed description of actions follows), and then the remaining 575input is scanned for another match. 576.PP 577If no match is found, then the 578.I default rule 579is executed: the next character in the input is considered matched and 580copied to the standard output. Thus, the simplest legal 581.I flex 582input is: 583.nf 584 585 %% 586 587.fi 588which generates a scanner that simply copies its input (one character 589at a time) to its output. 590.PP 591Note that 592.B yytext 593can be defined in two different ways: either as a character 594.I pointer 595or as a character 596.I array. 597You can control which definition 598.I flex 599uses by including one of the special directives 600.B %pointer 601or 602.B %array 603in the first (definitions) section of your flex input. The default is 604.B %pointer, 605unless you use the 606.B -l 607lex compatibility option, in which case 608.B yytext 609will be an array. 610The advantage of using 611.B %pointer 612is substantially faster scanning and no buffer overflow when matching 613very large tokens (unless you run out of dynamic memory). 
The disadvantage 614is that you are restricted in how your actions can modify 615.B yytext 616(see the next section), and calls to the 617.B unput() 618function destroy the present contents of 619.B yytext, 620which can be a considerable porting headache when moving between different 621.I lex 622versions. 623.PP 624The advantage of 625.B %array 626is that you can then modify 627.B yytext 628to your heart's content, and calls to 629.B unput() 630do not destroy 631.B yytext 632(see below). Furthermore, existing 633.I lex 634programs sometimes access 635.B yytext 636externally using declarations of the form: 637.nf 638 extern char yytext[]; 639.fi 640This definition is erroneous when used with 641.B %pointer, 642but correct for 643.B %array. 644.PP 645.B %array 646defines 647.B yytext 648to be an array of 649.B YYLMAX 650characters, which defaults to a fairly large value. You can change 651the size by simply #define'ing 652.B YYLMAX 653to a different value in the first section of your 654.I flex 655input. As mentioned above, with 656.B %pointer 657yytext grows dynamically to accommodate large tokens. While this means your 658.B %pointer 659scanner can accommodate very large tokens (such as matching entire blocks 660of comments), bear in mind that each time the scanner must resize 661.B yytext 662it also must rescan the entire token from the beginning, so matching such 663tokens can prove slow. 664.B yytext 665presently does 666.I not 667dynamically grow if a call to 668.B unput() 669results in too much text being pushed back; instead, a run-time error results. 670.PP 671Also note that you cannot use 672.B %array 673with C++ scanner classes 674(the 675.B c++ 676option; see below). 677.SH ACTIONS 678Each pattern in a rule has a corresponding action, which can be any 679arbitrary C statement. The pattern ends at the first non-escaped 680whitespace character; the remainder of the line is its action. 
If the 681action is empty, then when the pattern is matched the input token 682is simply discarded. For example, here is the specification for a program 683which deletes all occurrences of "zap me" from its input: 684.nf 685 686 %% 687 "zap me" 688 689.fi 690(It will copy all other characters in the input to the output since 691they will be matched by the default rule.) 692.PP 693Here is a program which compresses multiple blanks and tabs down to 694a single blank, and throws away whitespace found at the end of a line: 695.nf 696 697 %% 698 [ \\t]+ putchar( ' ' ); 699 [ \\t]+$ /* ignore this token */ 700 701.fi 702.PP 703If the action contains a '{', then the action spans till the balancing '}' 704is found, and the action may cross multiple lines. 705.I flex 706knows about C strings and comments and won't be fooled by braces found 707within them, but also allows actions to begin with 708.B %{ 709and will consider the action to be all the text up to the next 710.B %} 711(regardless of ordinary braces inside the action). 712.PP 713An action consisting solely of a vertical bar ('|') means "same as 714the action for the next rule." See below for an illustration. 715.PP 716Actions can include arbitrary C code, including 717.B return 718statements to return a value to whatever routine called 719.B yylex(). 720Each time 721.B yylex() 722is called it continues processing tokens from where it last left 723off until it either reaches 724the end of the file or executes a return. 725.PP 726Actions are free to modify 727.B yytext 728except for lengthening it (adding 729characters to its end--these will overwrite later characters in the 730input stream). This however does not apply when using 731.B %array 732(see above); in that case, 733.B yytext 734may be freely modified in any way. 735.PP 736Actions are free to modify 737.B yyleng 738except they should not do so if the action also includes use of 739.B yymore() 740(see below). 
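The blank-compressing scanner shown earlier in this section relies on the longest-match rule: a run of blanks followed by a newline matches the "$" rule with one more character of trailing context, so that rule wins and the run is discarded. The following hand-written C sketch models the intended effect under that reading. It is not generated by flex, and `compress_blanks` is a hypothetical name.

```c
#include <stddef.h>

/* Copy in to out, replacing each run of blanks and tabs with one blank,
   except that a run immediately followed by a newline is dropped.
   out must have room for strlen(in) + 1 bytes.  Returns out. */
char *compress_blanks(const char *in, char *out)
{
    char *o = out;

    while (*in != '\0') {
        if (*in == ' ' || *in == '\t') {
            while (*in == ' ' || *in == '\t')
                ++in;           /* consume the whole run: [ \t]+ */
            if (*in != '\n')
                *o++ = ' ';     /* first rule: putchar( ' ' ) */
            /* otherwise the [ \t]+$ rule applies: discard the run */
        } else {
            *o++ = *in++;       /* default rule: copy the character */
        }
    }
    *o = '\0';
    return out;
}
```

A run of blanks at the very end of the input, with no newline after it, is reduced to a single blank rather than dropped, because "$" requires a following newline.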
741.PP 742There are a number of special directives which can be included within 743an action: 744.IP - 745.B ECHO 746copies yytext to the scanner's output. 747.IP - 748.B BEGIN 749followed by the name of a start condition places the scanner in the 750corresponding start condition (see below). 751.IP - 752.B REJECT 753directs the scanner to proceed on to the "second best" rule which matched the 754input (or a prefix of the input). The rule is chosen as described 755above in "How the Input is Matched", and 756.B yytext 757and 758.B yyleng 759set up appropriately. 760It may either be one which matched as much text 761as the originally chosen rule but came later in the 762.I flex 763input file, or one which matched less text. 764For example, the following will both count the 765words in the input and call the routine special() whenever "frob" is seen: 766.nf 767 768 int word_count = 0; 769 %% 770 771 frob special(); REJECT; 772 [^ \\t\\n]+ ++word_count; 773 774.fi 775Without the 776.B REJECT, 777any "frob"'s in the input would not be counted as words, since the 778scanner normally executes only one action per token. 779Multiple 780.B REJECT's 781are allowed, each one finding the next best choice to the currently 782active rule. For example, when the following scanner scans the token 783"abcd", it will write "abcdabcaba" to the output: 784.nf 785 786 %% 787 a | 788 ab | 789 abc | 790 abcd ECHO; REJECT; 791 .|\\n /* eat up any unmatched character */ 792 793.fi 794(The first three rules share the fourth's action since they use 795the special '|' action.) 796.B REJECT 797is a particularly expensive feature in terms of scanner performance; 798if it is used in 799.I any 800of the scanner's actions it will slow down 801.I all 802of the scanner's matching. Furthermore, 803.B REJECT 804cannot be used with the 805.I -Cf 806or 807.I -CF 808options (see below). 
809.IP 810Note also that unlike the other special actions, 811.B REJECT 812is a 813.I branch; 814code immediately following it in the action will 815.I not 816be executed. 817.IP - 818.B yymore() 819tells the scanner that the next time it matches a rule, the corresponding 820token should be 821.I appended 822onto the current value of 823.B yytext 824rather than replacing it. For example, given the input "mega-kludge" 825the following will write "mega-mega-kludge" to the output: 826.nf 827 828 %% 829 mega- ECHO; yymore(); 830 kludge ECHO; 831 832.fi 833First "mega-" is matched and echoed to the output. Then "kludge" 834is matched, but the previous "mega-" is still hanging around at the 835beginning of 836.B yytext 837so the 838.B ECHO 839for the "kludge" rule will actually write "mega-kludge". 840.PP 841Two notes regarding use of 842.B yymore(). 843First, 844.B yymore() 845depends on the value of 846.I yyleng 847correctly reflecting the size of the current token, so you must not 848modify 849.I yyleng 850if you are using 851.B yymore(). 852Second, the presence of 853.B yymore() 854in the scanner's action entails a minor performance penalty in the 855scanner's matching speed. 856.IP - 857.B yyless(n) 858returns all but the first 859.I n 860characters of the current token back to the input stream, where they 861will be rescanned when the scanner looks for the next match. 862.B yytext 863and 864.B yyleng 865are adjusted appropriately (e.g., 866.B yyleng 867will now be equal to 868.I n 869). For example, on the input "foobar" the following will write out 870"foobarbar": 871.nf 872 873 %% 874 foobar ECHO; yyless(3); 875 [a-z]+ ECHO; 876 877.fi 878An argument of 0 to 879.B yyless 880will cause the entire current input string to be scanned again. Unless you've 881changed how the scanner will subsequently process its input (using 882.B BEGIN, 883for example), this will result in an endless loop. 
884.PP 885Note that 886.B yyless 887is a macro and can only be used in the flex input file, not from 888other source files. 889.IP - 890.B unput(c) 891puts the character 892.I c 893back onto the input stream. It will be the next character scanned. 894The following action will take the current token and cause it 895to be rescanned enclosed in parentheses. 896.nf 897 898 { 899 int i; 900 /* Copy yytext because unput() trashes yytext */ 901 char *yycopy = strdup( yytext ); 902 unput( ')' ); 903 for ( i = yyleng - 1; i >= 0; --i ) 904 unput( yycopy[i] ); 905 unput( '(' ); 906 free( yycopy ); 907 } 908 909.fi 910Note that since each 911.B unput() 912puts the given character back at the 913.I beginning 914of the input stream, pushing back strings must be done back-to-front. 915.PP 916An important potential problem when using 917.B unput() 918is that if you are using 919.B %pointer 920(the default), a call to 921.B unput() 922.I destroys 923the contents of 924.I yytext, 925starting with its rightmost character and devouring one character to 926the left with each call. If you need the value of yytext preserved 927after a call to 928.B unput() 929(as in the above example), 930you must either first copy it elsewhere, or build your scanner using 931.B %array 932instead (see How The Input Is Matched). 933.PP 934Finally, note that you cannot put back 935.B EOF 936to attempt to mark the input stream with an end-of-file. 937.IP - 938.B input() 939reads the next character from the input stream. 
For example, 940the following is one way to eat up C comments: 941.nf 942 943 %% 944 "/*" { 945 register int c; 946 947 for ( ; ; ) 948 { 949 while ( (c = input()) != '*' && 950 c != EOF ) 951 ; /* eat up text of comment */ 952 953 if ( c == '*' ) 954 { 955 while ( (c = input()) == '*' ) 956 ; 957 if ( c == '/' ) 958 break; /* found the end */ 959 } 960 961 if ( c == EOF ) 962 { 963 error( "EOF in comment" ); 964 break; 965 } 966 } 967 } 968 969.fi 970(Note that if the scanner is compiled using 971.B C++, 972then 973.B input() 974is instead referred to as 975.B yyinput(), 976in order to avoid a name clash with the 977.B C++ 978stream by the name of 979.I input.) 980.IP - 981.B YY_FLUSH_BUFFER 982flushes the scanner's internal buffer 983so that the next time the scanner attempts to match a token, it will 984first refill the buffer using 985.B YY_INPUT 986(see The Generated Scanner, below). This action is a special case 987of the more general 988.B yy_flush_buffer() 989function, described below in the section Multiple Input Buffers. 990.IP - 991.B yyterminate() 992can be used in lieu of a return statement in an action. It terminates 993the scanner and returns a 0 to the scanner's caller, indicating "all done". 994By default, 995.B yyterminate() 996is also called when an end-of-file is encountered. It is a macro and 997may be redefined. 998.SH THE GENERATED SCANNER 999The output of 1000.I flex 1001is the file 1002.B lex.yy.c, 1003which contains the scanning routine 1004.B yylex(), 1005a number of tables used by it for matching tokens, and a number 1006of auxiliary routines and macros. By default, 1007.B yylex() 1008is declared as follows: 1009.nf 1010 1011 int yylex() 1012 { 1013 ... various definitions and the actions in here ... 1014 } 1015 1016.fi 1017(If your environment supports function prototypes, then it will 1018be "int yylex( void )".) This definition may be changed by defining 1019the "YY_DECL" macro. 
For example, you could use: 1020.nf 1021 1022 #define YY_DECL float lexscan( a, b ) float a, b; 1023 1024.fi 1025to give the scanning routine the name 1026.I lexscan, 1027returning a float, and taking two floats as arguments. Note that 1028if you give arguments to the scanning routine using a 1029K&R-style/non-prototyped function declaration, you must terminate 1030the definition with a semi-colon (;). 1031.PP 1032Whenever 1033.B yylex() 1034is called, it scans tokens from the global input file 1035.I yyin 1036(which defaults to stdin). It continues until it either reaches 1037an end-of-file (at which point it returns the value 0) or 1038one of its actions executes a 1039.I return 1040statement. 1041.PP 1042If the scanner reaches an end-of-file, subsequent calls are undefined 1043unless either 1044.I yyin 1045is pointed at a new input file (in which case scanning continues from 1046that file), or 1047.B yyrestart() 1048is called. 1049.B yyrestart() 1050takes one argument, a 1051.B FILE * 1052pointer (which can be nil, if you've set up 1053.B YY_INPUT 1054to scan from a source other than 1055.I yyin), 1056and initializes 1057.I yyin 1058for scanning from that file. Essentially there is no difference between 1059just assigning 1060.I yyin 1061to a new input file or using 1062.B yyrestart() 1063to do so; the latter is available for compatibility with previous versions 1064of 1065.I flex, 1066and because it can be used to switch input files in the middle of scanning. 1067It can also be used to throw away the current input buffer, by calling 1068it with an argument of 1069.I yyin; 1070but better is to use 1071.B YY_FLUSH_BUFFER 1072(see above). 1073Note that 1074.B yyrestart() 1075does 1076.I not 1077reset the start condition to 1078.B INITIAL 1079(see Start Conditions, below). 
1080.PP 1081If 1082.B yylex() 1083stops scanning due to executing a 1084.I return 1085statement in one of the actions, the scanner may then be called again and it 1086will resume scanning where it left off. 1087.PP 1088By default (and for purposes of efficiency), the scanner uses 1089block-reads rather than simple 1090.I getc() 1091calls to read characters from 1092.I yyin. 1093The nature of how it gets its input can be controlled by defining the 1094.B YY_INPUT 1095macro. 1096YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its 1097action is to place up to 1098.I max_size 1099characters in the character array 1100.I buf 1101and return in the integer variable 1102.I result 1103either the 1104number of characters read or the constant YY_NULL (0 on Unix systems) 1105to indicate EOF. The default YY_INPUT reads from the 1106global file-pointer "yyin". 1107.PP 1108A sample definition of YY_INPUT (in the definitions 1109section of the input file): 1110.nf 1111 1112 %{ 1113 #define YY_INPUT(buf,result,max_size) \\ 1114 { \\ 1115 int c = getchar(); \\ 1116 result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \\ 1117 } 1118 %} 1119 1120.fi 1121This definition will change the input processing to occur 1122one character at a time. 1123.PP 1124When the scanner receives an end-of-file indication from YY_INPUT, 1125it then checks the 1126.B yywrap() 1127function. If 1128.B yywrap() 1129returns false (zero), then it is assumed that the 1130function has gone ahead and set up 1131.I yyin 1132to point to another input file, and scanning continues. If it returns 1133true (non-zero), then the scanner terminates, returning 0 to its 1134caller. Note that in either case, the start condition remains unchanged; 1135it does 1136.I not 1137revert to 1138.B INITIAL. 
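The YY_INPUT calling sequence described above can be exercised outside a scanner. The sketch below defines a YY_INPUT that reads from an in-memory string instead of yyin. It is an illustration under stated assumptions: YY_NULL is defined here only so the fragment compiles on its own (in a real scanner flex supplies it), and `src_text`, `src_pos`, `set_source`, and `refill` are hypothetical names, not part of flex.

```c
#include <string.h>

#ifndef YY_NULL
#define YY_NULL 0   /* assumption: flex supplies this in a real scanner */
#endif

/* Hypothetical in-memory source; these names are not part of flex. */
static const char *src_text = "";
static size_t src_pos = 0;

/* Same calling sequence as in the text: place up to max_size characters
   in buf and set result to the count, or to YY_NULL at end of input. */
#define YY_INPUT(buf, result, max_size) \
    { \
        size_t left_ = strlen(src_text) - src_pos; \
        size_t n_ = left_ < (size_t)(max_size) ? left_ : (size_t)(max_size); \
        memcpy((buf), src_text + src_pos, n_); \
        src_pos += n_; \
        (result) = (n_ > 0) ? (int)n_ : YY_NULL; \
    }

/* Point the sketch at a new string and rewind. */
void set_source(const char *text)
{
    src_text = text;
    src_pos = 0;
}

/* Stand-in for the scanner's refill step: invokes YY_INPUT once. */
int refill(char *buf, int max_size)
{
    int result;

    YY_INPUT(buf, result, max_size);
    return result;
}
```

Reading in blocks of up to max_size characters, rather than one character per call as in the getchar() example, keeps the number of refills small.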
1139.PP 1140If you do not supply your own version of 1141.B yywrap(), 1142then you must either use 1143.B %option noyywrap 1144(in which case the scanner behaves as though 1145.B yywrap() 1146returned 1), or you must link with 1147.B \-ll 1148to obtain the default version of the routine, which always returns 1. 1149.PP 1150Three routines are available for scanning from in-memory buffers rather 1151than files: 1152.B yy_scan_string(), yy_scan_bytes(), 1153and 1154.B yy_scan_buffer(). 1155See the discussion of them below in the section Multiple Input Buffers. 1156.PP 1157The scanner writes its 1158.B ECHO 1159output to the 1160.I yyout 1161global (default, stdout), which may be redefined by the user simply 1162by assigning it to some other 1163.B FILE 1164pointer. 1165.SH START CONDITIONS 1166.I flex 1167provides a mechanism for conditionally activating rules. Any rule 1168whose pattern is prefixed with "<sc>" will only be active when 1169the scanner is in the start condition named "sc". For example, 1170.nf 1171 1172 <STRING>[^"]* { /* eat up the string body ... */ 1173 ... 1174 } 1175 1176.fi 1177will be active only when the scanner is in the "STRING" start 1178condition, and 1179.nf 1180 1181 <INITIAL,STRING,QUOTE>\\. { /* handle an escape ... */ 1182 ... 1183 } 1184 1185.fi 1186will be active only when the current start condition is 1187either "INITIAL", "STRING", or "QUOTE". 1188.PP 1189Start conditions 1190are declared in the definitions (first) section of the input 1191using unindented lines beginning with either 1192.B %s 1193or 1194.B %x 1195followed by a list of names. 1196The former declares 1197.I inclusive 1198start conditions, the latter 1199.I exclusive 1200start conditions. A start condition is activated using the 1201.B BEGIN 1202action. Until the next 1203.B BEGIN 1204action is executed, rules with the given start 1205condition will be active and 1206rules with other start conditions will be inactive. 
1207If the start condition is 1208.I inclusive, 1209then rules with no start conditions at all will also be active. 1210If it is 1211.I exclusive, 1212then 1213.I only 1214rules qualified with the start condition will be active. 1215A set of rules contingent on the same exclusive start condition 1216describes a scanner which is independent of any of the other rules in the 1217.I flex 1218input. Because of this, 1219exclusive start conditions make it easy to specify "mini-scanners" 1220which scan portions of the input that are syntactically different 1221from the rest (e.g., comments). 1222.PP 1223If the distinction between inclusive and exclusive start conditions 1224is still a little vague, here's a simple example illustrating the 1225connection between the two. The set of rules: 1226.nf 1227 1228 %s example 1229 %% 1230 1231 <example>foo do_something(); 1232 1233 bar something_else(); 1234 1235.fi 1236is equivalent to 1237.nf 1238 1239 %x example 1240 %% 1241 1242 <example>foo do_something(); 1243 1244 <INITIAL,example>bar something_else(); 1245 1246.fi 1247Without the 1248.B <INITIAL,example> 1249qualifier, the 1250.I bar 1251pattern in the second example wouldn't be active (i.e., couldn't match) 1252when in start condition 1253.B example. 1254If we just used 1255.B <example> 1256to qualify 1257.I bar, 1258though, then it would only be active in 1259.B example 1260and not in 1261.B INITIAL, 1262while in the first example it's active in both, because in the first 1263example the 1264.B example 1265start condition is an 1266.I inclusive 1267.B (%s) 1268start condition. 1269.PP 1270Also note that the special start-condition specifier 1271.B <*> 1272matches every start condition. 
Thus, the above example could also 1273have been written: 1274.nf 1275 1276 %x example 1277 %% 1278 1279 <example>foo do_something(); 1280 1281 <*>bar something_else(); 1282 1283.fi 1284.PP 1285The default rule (to 1286.B ECHO 1287any unmatched character) remains active in start conditions. It 1288is equivalent to: 1289.nf 1290 1291 <*>.|\\n ECHO; 1292 1293.fi 1294.PP 1295.B BEGIN(0) 1296returns to the original state where only the rules with 1297no start conditions are active. This state can also be 1298referred to as the start-condition "INITIAL", so 1299.B BEGIN(INITIAL) 1300is equivalent to 1301.B BEGIN(0). 1302(The parentheses around the start condition name are not required but 1303are considered good style.) 1304.PP 1305.B BEGIN 1306actions can also be given as indented code at the beginning 1307of the rules section. For example, the following will cause 1308the scanner to enter the "SPECIAL" start condition whenever 1309.B yylex() 1310is called and the global variable 1311.I enter_special 1312is true: 1313.nf 1314 1315 int enter_special; 1316 1317 %x SPECIAL 1318 %% 1319 if ( enter_special ) 1320 BEGIN(SPECIAL); 1321 1322 <SPECIAL>blahblahblah 1323 ...more rules follow... 1324 1325.fi 1326.PP 1327To illustrate the uses of start conditions, 1328here is a scanner which provides two different interpretations 1329of a string like "123.456". By default it will treat it as 1330three tokens, the integer "123", a dot ('.'), and the integer "456". 
1331But if the string is preceded earlier in the line by the string 1332"expect-floats" 1333it will treat it as a single token, the floating-point number 1334123.456: 1335.nf 1336 1337 %{ 1338 #include <stdlib.h> 1339 %} 1340 %s expect 1341 1342 %% 1343 expect-floats BEGIN(expect); 1344 1345 <expect>[0-9]+"."[0-9]+ { 1346 printf( "found a float, = %f\\n", 1347 atof( yytext ) ); 1348 } 1349 <expect>\\n { 1350 /* that's the end of the line, so 1351 * we need another "expect-floats" 1352 * before we'll recognize any more 1353 * numbers 1354 */ 1355 BEGIN(INITIAL); 1356 } 1357 1358 [0-9]+ { 1359 printf( "found an integer, = %d\\n", 1360 atoi( yytext ) ); 1361 } 1362 1363 "." printf( "found a dot\\n" ); 1364 1365.fi 1366Here is a scanner which recognizes (and discards) C comments while 1367maintaining a count of the current input line. 1368.nf 1369 1370 %x comment 1371 %% 1372 int line_num = 1; 1373 1374 "/*" BEGIN(comment); 1375 1376 <comment>[^*\\n]* /* eat anything that's not a '*' */ 1377 <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */ 1378 <comment>\\n ++line_num; 1379 <comment>"*"+"/" BEGIN(INITIAL); 1380 1381.fi 1382This scanner goes to a bit of trouble to match as much 1383text as possible with each rule. In general, when attempting to write 1384a high-speed scanner try to match as much as possible in each rule, as 1385it's a big win. 1386.PP 1387Note that start-condition names are really integer values and 1388can be stored as such. Thus, the above could be extended in the 1389following fashion: 1390.nf 1391 1392 %x comment foo 1393 %% 1394 int line_num = 1; 1395 int comment_caller; 1396 1397 "/*" { 1398 comment_caller = INITIAL; 1399 BEGIN(comment); 1400 } 1401 1402 ... 
1403 1404 <foo>"/*" { 1405 comment_caller = foo; 1406 BEGIN(comment); 1407 } 1408 1409 <comment>[^*\\n]* /* eat anything that's not a '*' */ 1410 <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */ 1411 <comment>\\n ++line_num; 1412 <comment>"*"+"/" BEGIN(comment_caller); 1413 1414.fi 1415Furthermore, you can access the current start condition using 1416the integer-valued 1417.B YY_START 1418macro. For example, the above assignments to 1419.I comment_caller 1420could instead be written 1421.nf 1422 1423 comment_caller = YY_START; 1424 1425.fi 1426Flex provides 1427.B YYSTATE 1428as an alias for 1429.B YY_START 1430(since that is what's used by AT&T 1431.I lex). 1432.PP 1433Note that start conditions do not have their own name-space; %s's and %x's 1434declare names in the same fashion as #define's. 1435.PP 1436Finally, here's an example of how to match C-style quoted strings using 1437exclusive start conditions, including expanded escape sequences (but 1438not including checking for a string that's too long): 1439.nf 1440 1441 %x str 1442 1443 %% 1444 char string_buf[MAX_STR_CONST]; 1445 char *string_buf_ptr; 1446 1447 1448 \\" string_buf_ptr = string_buf; BEGIN(str); 1449 1450 <str>\\" { /* saw closing quote - all done */ 1451 BEGIN(INITIAL); 1452 *string_buf_ptr = '\\0'; 1453 /* return string constant token type and 1454 * value to parser 1455 */ 1456 } 1457 1458 <str>\\n { 1459 /* error - unterminated string constant */ 1460 /* generate error message */ 1461 } 1462 1463 <str>\\\\[0-7]{1,3} { 1464 /* octal escape sequence */ 1465 unsigned int result; 1466 1467 (void) sscanf( yytext + 1, "%o", &result ); 1468 1469 if ( result > 0xff ) 1470 { /* error, constant is out-of-bounds */ } 1471 1472 *string_buf_ptr++ = result; 1473 } 1474 1475 <str>\\\\[0-9]+ { 1476 /* generate error - bad escape sequence; something 1477 * like '\\48' or '\\0777777' 1478 */ 1479 } 1480 1481 <str>\\\\n *string_buf_ptr++ = '\\n'; 1482 <str>\\\\t *string_buf_ptr++ = '\\t'; 1483 <str>\\\\r 
*string_buf_ptr++ = '\\r'; 1484 <str>\\\\b *string_buf_ptr++ = '\\b'; 1485 <str>\\\\f *string_buf_ptr++ = '\\f'; 1486 1487 <str>\\\\(.|\\n) *string_buf_ptr++ = yytext[1]; 1488 1489 <str>[^\\\\\\n\\"]+ { 1490 char *yptr = yytext; 1491 1492 while ( *yptr ) 1493 *string_buf_ptr++ = *yptr++; 1494 } 1495 1496.fi 1497.PP 1498Often, such as in some of the examples above, you wind up writing a 1499whole bunch of rules all preceded by the same start condition(s). Flex 1500makes this a little easier and cleaner by introducing a notion of 1501start condition 1502.I scope. 1503A start condition scope is begun with: 1504.nf 1505 1506 <SCs>{ 1507 1508.fi 1509where 1510.I SCs 1511is a list of one or more start conditions. Inside the start condition 1512scope, every rule automatically has the prefix 1513.I <SCs> 1514applied to it, until a 1515.I '}' 1516which matches the initial 1517.I '{'. 1518So, for example, 1519.nf 1520 1521 <ESC>{ 1522 "\\\\n" return '\\n'; 1523 "\\\\r" return '\\r'; 1524 "\\\\f" return '\\f'; 1525 "\\\\0" return '\\0'; 1526 } 1527 1528.fi 1529is equivalent to: 1530.nf 1531 1532 <ESC>"\\\\n" return '\\n'; 1533 <ESC>"\\\\r" return '\\r'; 1534 <ESC>"\\\\f" return '\\f'; 1535 <ESC>"\\\\0" return '\\0'; 1536 1537.fi 1538Start condition scopes may be nested. 1539.PP 1540Three routines are available for manipulating stacks of start conditions: 1541.TP 1542.B void yy_push_state(int new_state) 1543pushes the current start condition onto the top of the start condition 1544stack and switches to 1545.I new_state 1546as though you had used 1547.B BEGIN new_state 1548(recall that start condition names are also integers). 1549.TP 1550.B void yy_pop_state() 1551pops the top of the stack and switches to it via 1552.B BEGIN. 1553.TP 1554.B int yy_top_state() 1555returns the top of the stack without altering the stack's contents. 1556.PP 1557The start condition stack grows dynamically and so has no built-in 1558size limitation. If memory is exhausted, program execution aborts. 
1559.PP 1560To use start condition stacks, your scanner must include a 1561.B %option stack 1562directive (see Options below). 1563.SH MULTIPLE INPUT BUFFERS 1564Some scanners (such as those which support "include" files) 1565require reading from several input streams. As 1566.I flex 1567scanners do a large amount of buffering, one cannot control 1568where the next input will be read from by simply writing a 1569.B YY_INPUT 1570which is sensitive to the scanning context. 1571.B YY_INPUT 1572is only called when the scanner reaches the end of its buffer, which 1573may be a long time after scanning a statement such as an "include" 1574which requires switching the input source. 1575.PP 1576To negotiate these sorts of problems, 1577.I flex 1578provides a mechanism for creating and switching between multiple 1579input buffers. An input buffer is created by using: 1580.nf 1581 1582 YY_BUFFER_STATE yy_create_buffer( FILE *file, int size ) 1583 1584.fi 1585which takes a 1586.I FILE 1587pointer and a size and creates a buffer associated with the given 1588file and large enough to hold 1589.I size 1590characters (when in doubt, use 1591.B YY_BUF_SIZE 1592for the size). It returns a 1593.B YY_BUFFER_STATE 1594handle, which may then be passed to other routines (see below). The 1595.B YY_BUFFER_STATE 1596type is a pointer to an opaque 1597.B struct yy_buffer_state 1598structure, so you may safely initialize YY_BUFFER_STATE variables to 1599.B ((YY_BUFFER_STATE) 0) 1600if you wish, and also refer to the opaque structure in order to 1601correctly declare input buffers in source files other than that 1602of your scanner. Note that the 1603.I FILE 1604pointer in the call to 1605.B yy_create_buffer 1606is only used as the value of 1607.I yyin 1608seen by 1609.B YY_INPUT; 1610if you redefine 1611.B YY_INPUT 1612so it no longer uses 1613.I yyin, 1614then you can safely pass a nil 1615.I FILE 1616pointer to 1617.B yy_create_buffer. 
1618You select a particular buffer to scan from using: 1619.nf 1620 1621 void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer ) 1622 1623.fi 1624switches the scanner's input buffer so subsequent tokens will 1625come from 1626.I new_buffer. 1627Note that 1628.B yy_switch_to_buffer() 1629may be used by yywrap() to set things up for continued scanning, instead 1630of opening a new file and pointing 1631.I yyin 1632at it. Note also that switching input sources via either 1633.B yy_switch_to_buffer() 1634or 1635.B yywrap() 1636does 1637.I not 1638change the start condition. 1639.nf 1640 1641 void yy_delete_buffer( YY_BUFFER_STATE buffer ) 1642 1643.fi 1644is used to reclaim the storage associated with a buffer. ( 1645.B buffer 1646can be nil, in which case the routine does nothing.) 1647You can also clear the current contents of a buffer using: 1648.nf 1649 1650 void yy_flush_buffer( YY_BUFFER_STATE buffer ) 1651 1652.fi 1653This function discards the buffer's contents, 1654so the next time the scanner attempts to match a token from the 1655buffer, it will first fill the buffer anew using 1656.B YY_INPUT. 1657.PP 1658.B yy_new_buffer() 1659is an alias for 1660.B yy_create_buffer(), 1661provided for compatibility with the C++ use of 1662.I new 1663and 1664.I delete 1665for creating and destroying dynamic objects. 1666.PP 1667Finally, the 1668.B YY_CURRENT_BUFFER 1669macro returns a 1670.B YY_BUFFER_STATE 1671handle to the current buffer. 1672.PP 1673Here is an example of using these features for writing a scanner 1674which expands include files (the 1675.B <<EOF>> 1676feature is discussed below): 1677.nf 1678 1679 /* the "incl" state is used for picking up the name 1680 * of an include file 1681 */ 1682 %x incl 1683 1684 %{ 1685 #define MAX_INCLUDE_DEPTH 10 1686 YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH]; 1687 int include_stack_ptr = 0; 1688 %} 1689 1690 %% 1691 include BEGIN(incl); 1692 1693 [a-z]+ ECHO; 1694 [^a-z\\n]*\\n? 
ECHO; 1695 1696 <incl>[ \\t]* /* eat the whitespace */ 1697 <incl>[^ \\t\\n]+ { /* got the include file name */ 1698 if ( include_stack_ptr >= MAX_INCLUDE_DEPTH ) 1699 { 1700 fprintf( stderr, "Includes nested too deeply" ); 1701 exit( 1 ); 1702 } 1703 1704 include_stack[include_stack_ptr++] = 1705 YY_CURRENT_BUFFER; 1706 1707 yyin = fopen( yytext, "r" ); 1708 1709 if ( ! yyin ) 1710 error( ... ); 1711 1712 yy_switch_to_buffer( 1713 yy_create_buffer( yyin, YY_BUF_SIZE ) ); 1714 1715 BEGIN(INITIAL); 1716 } 1717 1718 <<EOF>> { 1719 if ( --include_stack_ptr < 0 ) 1720 { 1721 yyterminate(); 1722 } 1723 1724 else 1725 { 1726 yy_delete_buffer( YY_CURRENT_BUFFER ); 1727 yy_switch_to_buffer( 1728 include_stack[include_stack_ptr] ); 1729 } 1730 } 1731 1732.fi 1733Three routines are available for setting up input buffers for 1734scanning in-memory strings instead of files. All of them create 1735a new input buffer for scanning the string, and return a corresponding 1736.B YY_BUFFER_STATE 1737handle (which you should delete with 1738.B yy_delete_buffer() 1739when done with it). They also switch to the new buffer using 1740.B yy_switch_to_buffer(), 1741so the next call to 1742.B yylex() 1743will start scanning the string. 1744.TP 1745.B yy_scan_string(const char *str) 1746scans a NUL-terminated string. 1747.TP 1748.B yy_scan_bytes(const char *bytes, int len) 1749scans 1750.I len 1751bytes (including possibly NUL's) 1752starting at location 1753.I bytes. 1754.PP 1755Note that both of these functions create and scan a 1756.I copy 1757of the string or bytes. (This may be desirable, since 1758.B yylex() 1759modifies the contents of the buffer it is scanning.) You can avoid the 1760copy by using: 1761.TP 1762.B yy_scan_buffer(char *base, yy_size_t size) 1763which scans in place the buffer starting at 1764.I base, 1765consisting of 1766.I size 1767bytes, the last two bytes of which 1768.I must 1769be 1770.B YY_END_OF_BUFFER_CHAR 1771(ASCII NUL). 
1772These last two bytes are not scanned; thus, scanning 1773consists of 1774.B base[0] 1775through 1776.B base[size-3], 1777inclusive. 1778.IP 1779If you fail to set up 1780.I base 1781in this manner (i.e., forget the final two 1782.B YY_END_OF_BUFFER_CHAR 1783bytes), then 1784.B yy_scan_buffer() 1785returns a nil pointer instead of creating a new input buffer. 1786.IP 1787The type 1788.B yy_size_t 1789is an integral type to which you can cast an integer expression 1790reflecting the size of the buffer. 1791.SH END-OF-FILE RULES 1792The special rule "<<EOF>>" indicates 1793actions which are to be taken when an end-of-file is 1794encountered and yywrap() returns non-zero (i.e., indicates 1795no further files to process). The action must finish 1796by doing one of four things: 1797.IP - 1798assigning 1799.I yyin 1800to a new input file (in previous versions of flex, after doing the 1801assignment you had to call the special action 1802.B YY_NEW_FILE; 1803this is no longer necessary); 1804.IP - 1805executing a 1806.I return 1807statement; 1808.IP - 1809executing the special 1810.B yyterminate() 1811action; 1812.IP - 1813or, switching to a new buffer using 1814.B yy_switch_to_buffer() 1815as shown in the example above. 1816.PP 1817<<EOF>> rules may not be used with other 1818patterns; they may only be qualified with a list of start 1819conditions. If an unqualified <<EOF>> rule is given, it 1820applies to 1821.I all 1822start conditions which do not already have <<EOF>> actions. To 1823specify an <<EOF>> rule for only the initial start condition, use 1824.nf 1825 1826 <INITIAL><<EOF>> 1827 1828.fi 1829.PP 1830These rules are useful for catching things like unclosed comments. 1831An example: 1832.nf 1833 1834 %x quote 1835 %% 1836 1837 ...other rules for dealing with quotes... 
1838 1839 <quote><<EOF>> { 1840 error( "unterminated quote" ); 1841 yyterminate(); 1842 } 1843 <<EOF>> { 1844 if ( *++filelist ) 1845 yyin = fopen( *filelist, "r" ); 1846 else 1847 yyterminate(); 1848 } 1849 1850.fi 1851.SH MISCELLANEOUS MACROS 1852The macro 1853.B YY_USER_ACTION 1854can be defined to provide an action 1855which is always executed prior to the matched rule's action. For example, 1856it could be #define'd to call a routine to convert yytext to lower-case. 1857When 1858.B YY_USER_ACTION 1859is invoked, the variable 1860.I yy_act 1861gives the number of the matched rule (rules are numbered starting with 1). 1862Suppose you want to profile how often each of your rules is matched. The 1863following would do the trick: 1864.nf 1865 1866 #define YY_USER_ACTION ++ctr[yy_act] 1867 1868.fi 1869where 1870.I ctr 1871is an array to hold the counts for the different rules. Note that 1872the macro 1873.B YY_NUM_RULES 1874gives the total number of rules (including the default rule, even if 1875you use 1876.B \-s), 1877so a correct declaration for 1878.I ctr 1879is: 1880.nf 1881 1882 int ctr[YY_NUM_RULES]; 1883 1884.fi 1885.PP 1886The macro 1887.B YY_USER_INIT 1888may be defined to provide an action which is always executed before 1889the first scan (and before the scanner's internal initializations are done). 1890For example, it could be used to call a routine to read 1891in a data table or open a logging file. 1892.PP 1893The macro 1894.B yy_set_interactive(is_interactive) 1895can be used to control whether the current buffer is considered 1896.I interactive. 1897An interactive buffer is processed more slowly, 1898but must be used when the scanner's input source is indeed 1899interactive to avoid problems due to waiting to fill buffers 1900(see the discussion of the 1901.B \-I 1902flag below). A non-zero value 1903in the macro invocation marks the buffer as interactive, a zero 1904value as non-interactive. 
Note that use of this macro overrides 1905.B %option always-interactive 1906or 1907.B %option never-interactive 1908(see Options below). 1909.B yy_set_interactive() 1910must be invoked prior to beginning to scan the buffer that is 1911(or is not) to be considered interactive. 1912.PP 1913The macro 1914.B yy_set_bol(at_bol) 1915can be used to control whether the current buffer's scanning 1916context for the next token match is done as though at the 1917beginning of a line. A non-zero macro argument makes rules anchored with 1918'^' active, while a zero argument makes '^' rules inactive. 1919.PP 1920The macro 1921.B YY_AT_BOL() 1922returns true if the next token scanned from the current buffer 1923will have '^' rules active, false otherwise. 1924.PP 1925In the generated scanner, the actions are all gathered in one large 1926switch statement and separated using 1927.B YY_BREAK, 1928which may be redefined. By default, it is simply a "break", to separate 1929each rule's action from the following rule's. 1930Redefining 1931.B YY_BREAK 1932allows, for example, C++ users to 1933#define YY_BREAK to do nothing (while being very careful that every 1934rule ends with a "break" or a "return"!) to avoid suffering from 1935unreachable statement warnings where because a rule's action ends with 1936"return", the 1937.B YY_BREAK 1938is inaccessible. 1939.SH VALUES AVAILABLE TO THE USER 1940This section summarizes the various values available to the user 1941in the rule actions. 1942.IP - 1943.B char *yytext 1944holds the text of the current token. It may be modified but not lengthened 1945(you cannot append characters to the end). 1946.IP 1947If the special directive 1948.B %array 1949appears in the first section of the scanner description, then 1950.B yytext 1951is instead declared 1952.B char yytext[YYLMAX], 1953where 1954.B YYLMAX 1955is a macro definition that you can redefine in the first section 1956if you don't like the default value (generally 8KB). 
Using 1957.B %array 1958results in somewhat slower scanners, but the value of 1959.B yytext 1960becomes immune to calls to 1961.I input() 1962and 1963.I unput(), 1964which potentially destroy its value when 1965.B yytext 1966is a character pointer. The opposite of 1967.B %array 1968is 1969.B %pointer, 1970which is the default. 1971.IP 1972You cannot use 1973.B %array 1974when generating C++ scanner classes 1975(the 1976.B \-+ 1977flag). 1978.IP - 1979.B int yyleng 1980holds the length of the current token. 1981.IP - 1982.B FILE *yyin 1983is the file which by default 1984.I flex 1985reads from. It may be redefined but doing so only makes sense before 1986scanning begins or after an EOF has been encountered. Changing it in 1987the midst of scanning will have unexpected results since 1988.I flex 1989buffers its input; use 1990.B yyrestart() 1991instead. 1992Once scanning terminates because an end-of-file 1993has been seen, you can point 1994.I yyin 1995at the new input file and then call the scanner again to continue scanning. 1996.IP - 1997.B void yyrestart( FILE *new_file ) 1998may be called to point 1999.I yyin 2000at the new input file. The switch-over to the new file is immediate 2001(any previously buffered-up input is lost). Note that calling 2002.B yyrestart() 2003with 2004.I yyin 2005as an argument thus throws away the current input buffer and continues 2006scanning the same input file. 2007.IP - 2008.B FILE *yyout 2009is the file to which 2010.B ECHO 2011actions are done. It can be reassigned by the user. 2012.IP - 2013.B YY_CURRENT_BUFFER 2014returns a 2015.B YY_BUFFER_STATE 2016handle to the current buffer. 2017.IP - 2018.B YY_START 2019returns an integer value corresponding to the current start 2020condition. You can subsequently use this value with 2021.B BEGIN 2022to return to that start condition. 2023.SH INTERFACING WITH YACC 2024One of the main uses of 2025.I flex 2026is as a companion to the 2027.I yacc 2028parser-generator. 
2029.I yacc 2030parsers expect to call a routine named 2031.B yylex() 2032to find the next input token. The routine is supposed to 2033return the type of the next token as well as putting any associated 2034value in the global 2035.B yylval. 2036To use 2037.I flex 2038with 2039.I yacc, 2040one specifies the 2041.B \-d 2042option to 2043.I yacc 2044to instruct it to generate the file 2045.B y.tab.h 2046containing definitions of all the 2047.B %tokens 2048appearing in the 2049.I yacc 2050input. This file is then included in the 2051.I flex 2052scanner. For example, if one of the tokens is "TOK_NUMBER", 2053part of the scanner might look like: 2054.nf 2055 2056 %{ 2057 #include "y.tab.h" 2058 %} 2059 2060 %% 2061 2062 [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER; 2063 2064.fi |
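The calling convention described above (yylex() returns a token type, leaves the associated value in yylval, and returns 0 at end of input) can be sketched in plain C without either tool. Everything here is a stand-in for illustration: the yylex() below is hand-written rather than generated by flex, the token value 258 is an assumption about what "yacc -d" would emit into y.tab.h, and sum_numbers() is a hypothetical consumer playing the part of the parser.

```c
#include <ctype.h>
#include <stdlib.h>

/* Token code that "yacc -d" would normally place in y.tab.h;
 * the value 258 is an assumption for this sketch. */
#define TOK_NUMBER 258

int yylval;                       /* semantic value of the last token */
static const char *input = "";    /* hypothetical in-memory input */

/* Hand-written stand-in for the flex-generated yylex(). */
int yylex(void)
{
    while (*input != '\0' && !isdigit((unsigned char) *input))
        ++input;                  /* skip anything that is not a digit */

    if (*input == '\0')
        return 0;                 /* 0 tells the caller: end of input */

    char *end;
    yylval = (int) strtol(input, &end, 10);
    input = end;
    return TOK_NUMBER;
}

/* What a parser does with the scanner: call yylex() until it returns 0,
 * reading yylval after each token. */
int sum_numbers(const char *text)
{
    int sum = 0;

    input = text;
    while (yylex() == TOK_NUMBER)
        sum += yylval;
    return sum;
}
```

In a real build the stand-in yylex() is replaced by the scanner that flex generates from rules such as the [0-9]+ rule shown above, and the token codes come from y.tab.h rather than a local #define.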
34.SH OPTIONS 35.I flex 36has the following options: 37.TP 38.B \-b | 2065.SH OPTIONS 2066.I flex 2067has the following options: 2068.TP 2069.B \-b |
39generate backing-up information to | 2070Generate backing-up information to |
40.I lex.backup. | 2071.I lex.backup. |
41This is a list of scanner states which require backing up and the input 42characters on which they do so. By adding rules one can remove 43backing-up states. If all backing-up states are eliminated and | 2072This is a list of scanner states which require backing up 2073and the input characters on which they do so. By adding rules one 2074can remove backing-up states. If 2075.I all 2076backing-up states are eliminated and |
44.B \-Cf 45or 46.B \-CF | 2077.B \-Cf 2078or 2079.B \-CF |
47is used, the generated scanner will run faster. | 2080is used, the generated scanner will run faster (see the 2081.B \-p 2082flag). Only users who wish to squeeze every last cycle out of their 2083scanners need worry about this option. (See the section on Performance 2084Considerations below.) |
48.TP 49.B \-c 50is a do-nothing, deprecated option included for POSIX compliance. | 2085.TP 2086.B \-c 2087is a do-nothing, deprecated option included for POSIX compliance. |
51.IP 52.B NOTE: 53in previous releases of 54.I flex 55.B \-c 56specified table-compression options. This functionality is 57now given by the 58.B \-C 59flag. To ease the impact of this change, when 60.I flex 61encounters 62.B \-c, 63it currently issues a warning message and assumes that 64.B \-C 65was desired instead. In the future this "promotion" of 66.B \-c 67to 68.B \-C 69will go away in the name of full POSIX compliance (unless 70the POSIX meaning is removed first). | |
71.TP 72.B \-d 73makes the generated scanner run in 74.I debug 75mode. Whenever a pattern is recognized and the global 76.B yy_flex_debug | 2088.TP 2089.B \-d 2090makes the generated scanner run in 2091.I debug 2092mode. Whenever a pattern is recognized and the global 2093.B yy_flex_debug |
77is non-zero (which is the default), the scanner will 78write to | 2094is non-zero (which is the default), 2095the scanner will write to |
79.I stderr 80a line of the form: 81.nf 82 83 --accepting rule at line 53 ("the matched text") 84 85.fi 86The line number refers to the location of the rule in the file 87defining the scanner (i.e., the file that was fed to flex). Messages 88are also generated when the scanner backs up, accepts the 89default rule, reaches the end of its input buffer (or encounters | 2096.I stderr 2097a line of the form: 2098.nf 2099 2100 --accepting rule at line 53 ("the matched text") 2101 2102.fi 2103The line number refers to the location of the rule in the file 2104defining the scanner (i.e., the file that was fed to flex). Messages 2105are also generated when the scanner backs up, accepts the 2106default rule, reaches the end of its input buffer (or encounters |
90a NUL; the two look the same as far as the scanner's concerned), | 2107a NUL; at this point, the two look the same as far as the scanner's concerned), |
91or reaches an end-of-file. 92.TP 93.B \-f 94specifies 95.I fast scanner. 96No table compression is done and stdio is bypassed. 97The result is large but fast. This option is equivalent to 98.B \-Cfr 99(see below). 100.TP 101.B \-h 102generates a "help" summary of 103.I flex's 104options to | 2108or reaches an end-of-file. 2109.TP 2110.B \-f 2111specifies 2112.I fast scanner. 2113No table compression is done and stdio is bypassed. 2114The result is large but fast. This option is equivalent to 2115.B \-Cfr 2116(see below). 2117.TP 2118.B \-h 2119generates a "help" summary of 2120.I flex's 2121options to |
105.I stderr | 2122.I stdout |
106and then exits. | 2123and then exits. |
2124.B \-? 2125and 2126.B \-\-help 2127are synonyms for 2128.B \-h. |
|
107.TP 108.B \-i 109instructs 110.I flex 111to generate a 112.I case-insensitive 113scanner. The case of letters given in the 114.I flex 115input patterns will 116be ignored, and tokens in the input will be matched regardless of case. The 117matched text given in 118.I yytext 119will have the preserved case (i.e., it will not be folded). 120.TP 121.B \-l | 2129.TP 2130.B \-i 2131instructs 2132.I flex 2133to generate a 2134.I case-insensitive 2135scanner. The case of letters given in the 2136.I flex 2137input patterns will 2138be ignored, and tokens in the input will be matched regardless of case. The 2139matched text given in 2140.I yytext 2141will have the preserved case (i.e., it will not be folded). 2142.TP 2143.B \-l |
122turns on maximum compatibility with the original AT&T lex implementation, 123at a considerable performance cost. This option is incompatible with 124.B \-+, \-f, \-F, \-Cf, | 2144turns on maximum compatibility with the original AT&T 2145.I lex 2146implementation. Note that this does not mean 2147.I full 2148compatibility. Use of this option costs a considerable amount of 2149performance, and it cannot be used with the 2150.B \-+, -f, -F, -Cf, |
125or | 2151or |
126.B \-CF. 127See 128.I lexdoc(1) 129for details. | 2152.B -CF 2153options. For details on the compatibilities it provides, see the section 2154"Incompatibilities With Lex And POSIX" below. This option also results 2155in the name 2156.B YY_FLEX_LEX_COMPAT 2157being #define'd in the generated scanner. |
130.TP 131.B \-n 132is another do-nothing, deprecated option included only for 133POSIX compliance. 134.TP 135.B \-p 136generates a performance report to stderr. The report 137consists of comments regarding features of the 138.I flex | 2158.TP 2159.B \-n 2160is another do-nothing, deprecated option included only for 2161POSIX compliance. 2162.TP 2163.B \-p 2164generates a performance report to stderr. The report 2165consists of comments regarding features of the 2166.I flex |
139input file which will cause a loss of performance in the resulting scanner. 140If you give the flag twice, you will also get comments regarding | 2167input file which will cause a serious loss of performance in the resulting 2168scanner. If you give the flag twice, you will also get comments regarding |
141features that lead to minor performance losses. | 2169features that lead to minor performance losses. |
2170.IP 2171Note that the use of 2172.B REJECT, 2173.B %option yylineno, 2174and variable trailing context (see the Deficiencies / Bugs section below) 2175entails a substantial performance penalty; use of 2176.I yymore(), 2177the 2178.B ^ 2179operator, 2180and the 2181.B \-I 2182flag entail minor performance penalties. |
|
142.TP 143.B \-s 144causes the 145.I default rule 146(that unmatched scanner input is echoed to 147.I stdout) 148to be suppressed. If the scanner encounters input that does not | 2183.TP 2184.B \-s 2185causes the 2186.I default rule 2187(that unmatched scanner input is echoed to 2188.I stdout) 2189to be suppressed. If the scanner encounters input that does not |
149match any of its rules, it aborts with an error. | 2190match any of its rules, it aborts with an error. This option is 2191useful for finding holes in a scanner's rule set. |
150.TP 151.B \-t 152instructs 153.I flex 154to write the scanner it generates to standard output instead 155of 156.B lex.yy.c. 157.TP 158.B \-v 159specifies that 160.I flex 161should write to 162.I stderr 163a summary of statistics regarding the scanner it generates. | 2192.TP 2193.B \-t 2194instructs 2195.I flex 2196to write the scanner it generates to standard output instead 2197of 2198.B lex.yy.c. 2199.TP 2200.B \-v 2201specifies that 2202.I flex 2203should write to 2204.I stderr 2205a summary of statistics regarding the scanner it generates. |
2206Most of the statistics are meaningless to the casual 2207.I flex 2208user, but the first line identifies the version of 2209.I flex 2210(same as reported by 2211.B \-V), 2212and the next line the flags used when generating the scanner, including 2213those that are on by default. |
|
164.TP 165.B \-w 166suppresses warning messages. 167.TP 168.B \-B 169instructs 170.I flex 171to generate a 172.I batch | 2214.TP 2215.B \-w 2216suppresses warning messages. 2217.TP 2218.B \-B 2219instructs 2220.I flex 2221to generate a 2222.I batch |
173scanner instead of an | 2223scanner, the opposite of |
174.I interactive | 2224.I interactive |
175scanner (see | 2225scanners generated by |
176.B \-I | 2226.B \-I |
177below). See 178.I lexdoc(1) 179for details. Scanners using | 2227(see below). In general, you use 2228.B \-B 2229when you are 2230.I certain 2231that your scanner will never be used interactively, and you want to 2232squeeze a 2233.I little 2234more performance out of it. If your goal is instead to squeeze out a 2235.I lot 2236more performance, you should be using the |
180.B \-Cf 181or 182.B \-CF | 2237.B \-Cf 2238or 2239.B \-CF |
183compression options automatically specify this option, too. | 2240options (discussed below), which turn on 2241.B \-B 2242automatically anyway. |
184.TP 185.B \-F 186specifies that the 187.ul 188fast | 2243.TP 2244.B \-F 2245specifies that the 2246.ul 2247fast |
189scanner table representation should be used (and stdio bypassed). 190This representation is about as fast as the full table representation | 2248scanner table representation should be used (and stdio 2249bypassed). This representation is 2250about as fast as the full table representation |
191.B (-f), 192and for some sets of patterns will be considerably smaller (and for | 2251.B (-f), 2252and for some sets of patterns will be considerably smaller (and for |
193others, larger). It cannot be used with the 194.B \-+ 195option. See 196.B lexdoc(1) 197for more details. | 2253others, larger). In general, if the pattern set contains both "keywords" 2254and a catch-all, "identifier" rule, such as in the set: 2255.nf 2256 2257 "case" return TOK_CASE; 2258 "switch" return TOK_SWITCH; 2259 ... 2260 "default" return TOK_DEFAULT; 2261 [a-z]+ return TOK_ID; 2262 2263.fi 2264then you're better off using the full table representation. If only 2265the "identifier" rule is present and you then use a hash table or some such 2266to detect the keywords, you're better off using 2267.B -F. |
198.IP 199This option is equivalent to 200.B \-CFr | 2268.IP 2269This option is equivalent to 2270.B \-CFr |
201(see below). | 2271(see below). It cannot be used with 2272.B \-+. |
202.TP 203.B \-I 204instructs 205.I flex 206to generate an 207.I interactive | 2273.TP 2274.B \-I 2275instructs 2276.I flex 2277to generate an 2278.I interactive |
208scanner, that is, a scanner which stops immediately rather than 209looking ahead if it knows 210that the currently scanned text cannot be part of a longer rule's match. 211This is the opposite of 212.I batch 213scanners (see 214.B \-B 215above). See 216.B lexdoc(1) 217for details. | 2279scanner. An interactive scanner is one that only looks ahead to decide 2280what token has been matched if it absolutely must. It turns out that 2281always looking one extra character ahead, even if the scanner has already 2282seen enough text to disambiguate the current token, is a bit faster than 2283only looking ahead when necessary. But scanners that always look ahead 2284give dreadful interactive performance; for example, when a user types 2285a newline, it is not recognized as a newline token until they enter 2286.I another 2287token, which often means typing in another whole line. |
218.IP | 2288.IP |
219Note, 220.B \-I 221cannot be used in conjunction with 222.I full | 2289.I Flex 2290scanners default to 2291.I interactive 2292unless you use the 2293.B \-Cf |
223or | 2294or |
224.I fast tables, 225i.e., the 226.B \-f, \-F, \-Cf, 227or | |
228.B \-CF | 2295.B \-CF |
229flags. For other table compression options, | 2296table-compression options (see below). That's because if you're looking 2297for high-performance you should be using one of these options, so if you 2298didn't, 2299.I flex 2300assumes you'd rather trade off a bit of run-time performance for intuitive 2301interactive behavior. Note also that you 2302.I cannot 2303use |
230.B \-I | 2304.B \-I |
231is the default. | 2305in conjunction with 2306.B \-Cf 2307or 2308.B \-CF. 2309Thus, this option is not really needed; it is on by default for all those 2310cases in which it is allowed. 2311.IP 2312You can force a scanner to 2313.I not 2314be interactive by using 2315.B \-B 2316(see above). |
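The look-ahead trade-off described for \-I and \-B can be illustrated with a toy model. This is only a sketch, not flex's generated code: the token set (a run of digits, or a single newline) and the function name `chars_needed` are hypothetical, chosen to show why an always-look-ahead scanner cannot return a newline token until one more character arrives.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Toy model (hypothetical, not flex output): tokens are either a run of
 * digits or a single '\n'.  Returns how many input characters must be
 * available before the first token can be returned to the caller.
 */
size_t chars_needed(const char *input, int always_look_ahead)
{
    size_t i = 0;

    if (input[0] == '\n') {
        /* A newline cannot be part of a longer match, so an interactive
         * scanner stops here; a batch scanner still reads one more char. */
        return always_look_ahead ? 2 : 1;
    }

    /* A digit run is only known to be complete once the character that
     * ends it has been seen, so both kinds of scanner must look ahead. */
    while (isdigit((unsigned char)input[i]))
        i++;
    return i + 1;
}
```

In this model `chars_needed("\n", 1)` is 2, which corresponds to the user having to type into the next line before the newline token is recognized.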
232.TP 233.B \-L 234instructs 235.I flex 236not to generate 237.B #line | 2317.TP 2318.B \-L 2319instructs 2320.I flex 2321not to generate 2322.B #line |
238directives in 239.B lex.yy.c. 240The default is to generate such directives so error 241messages in the actions will be correctly 242located with respect to the original | 2323directives. Without this option, |
243.I flex | 2324.I flex |
244input file, and not to 245the fairly meaningless line numbers of 246.B lex.yy.c. | 2325peppers the generated scanner 2326with #line directives so error messages in the actions will be correctly 2327located with respect to either the original 2328.I flex 2329input file (if the errors are due to code in the input file), or 2330.B lex.yy.c 2331(if the errors are 2332.I flex's 2333fault -- you should report these sorts of errors to the email address 2334given below). |
247.TP 248.B \-T 249makes 250.I flex 251run in 252.I trace 253mode. It will generate a lot of messages to 254.I stderr 255concerning 256the form of the input and the resultant non-deterministic and deterministic 257finite automata. This option is mostly for use in maintaining 258.I flex. 259.TP 260.B \-V 261prints the version number to | 2335.TP 2336.B \-T 2337makes 2338.I flex 2339run in 2340.I trace 2341mode. It will generate a lot of messages to 2342.I stderr 2343concerning 2344the form of the input and the resultant non-deterministic and deterministic 2345finite automata. This option is mostly for use in maintaining 2346.I flex. 2347.TP 2348.B \-V 2349prints the version number to |
262.I stderr | 2350.I stdout |
263and exits. | 2351and exits. |
2352.B \-\-version 2353is a synonym for 2354.B \-V. |
|
264.TP 265.B \-7 266instructs 267.I flex | 2355.TP 2356.B \-7 2357instructs 2358.I flex |
268to generate a 7-bit scanner, which can save considerable table space, 269especially when using | 2359to generate a 7-bit scanner, i.e., one which can only recognize 7-bit 2360characters in its input. The advantage of using 2361.B \-7 2362is that the scanner's tables can be up to half the size of those generated 2363using the 2364.B \-8 2365option (see below). The disadvantage is that such scanners often hang 2366or crash if their input contains an 8-bit character. 2367.IP 2368Note, however, that unless you generate your scanner using the |
270.B \-Cf 271or 272.B \-CF | 2369.B \-Cf 2370or 2371.B \-CF |
273(and, at most sites, | 2372table compression options, use of |
274.B \-7 | 2373.B \-7 |
275is on by default for these options. To see if this is the case, use the 276.B -v 277verbose flag and check the flag summary it reports). | 2374will save only a small amount of table space, and make your scanner 2375considerably less portable. 2376.I Flex's 2377default behavior is to generate an 8-bit scanner unless you use the 2378.B \-Cf 2379or 2380.B \-CF, 2381in which case 2382.I flex 2383defaults to generating 7-bit scanners unless your site was always 2384configured to generate 8-bit scanners (as will often be the case 2385with non-USA sites). You can tell whether flex generated a 7-bit 2386or an 8-bit scanner by inspecting the flag summary in the 2387.B \-v 2388output as described above. 2389.IP 2390Note that if you use 2391.B \-Cfe 2392or 2393.B \-CFe 2394(those table compression options, but also using equivalence classes as 2395discussed below), flex still defaults to generating an 8-bit 2396scanner, since usually with these compression options full 8-bit tables 2397are not much more expensive than 7-bit tables. |
278.TP 279.B \-8 280instructs 281.I flex | 2398.TP 2399.B \-8 2400instructs 2401.I flex |
282to generate an 8-bit scanner. This is the default except for the | 2402to generate an 8-bit scanner, i.e., one which can recognize 8-bit 2403characters. This flag is only needed for scanners generated using |
283.B \-Cf | 2404.B \-Cf |
284and 285.B \-CF 286compression options, for which the default is site-dependent, and 287can be checked by inspecting the flag summary generated by the 288.B \-v 289option. | 2405or 2406.B \-CF, 2407as otherwise flex defaults to generating an 8-bit scanner anyway. 2408.IP 2409See the discussion of 2410.B \-7 2411above for flex's default behavior and the tradeoffs between 7-bit 2412and 8-bit scanners. |
290.TP 291.B \-+ 292specifies that you want flex to generate a C++ | 2413.TP 2414.B \-+ 2415specifies that you want flex to generate a C++ |
293scanner class. See the section on Generating C++ Scanners in 294.I lexdoc(1) 295for details. | 2416scanner class. See the section on Generating C++ Scanners below for 2417details. |
296.TP 297.B \-C[aefFmr] | 2418.TP 2419.B \-C[aefFmr] |
298controls the degree of table compression and scanner optimization. | 2420controls the degree of table compression and, more generally, trade-offs 2421between small scanners and fast scanners. |
299.IP 300.B \-Ca | 2422.IP 2423.B \-Ca |
301trade off larger tables in the generated scanner for faster performance 302because the elements of the tables are better aligned for memory access 303and computation. This option can double the size of the tables used by 304your scanner. | 2424("align") instructs flex to trade off larger tables in the 2425generated scanner for faster performance because the elements of 2426the tables are better aligned for memory access and computation. On some 2427RISC architectures, fetching and manipulating longwords is more efficient 2428than with smaller-sized units such as shortwords. This option can 2429double the size of the tables used by your scanner. |
305.IP 306.B \-Ce 307directs 308.I flex 309to construct 310.I equivalence classes, 311i.e., sets of characters | 2430.IP 2431.B \-Ce 2432directs 2433.I flex 2434to construct 2435.I equivalence classes, 2436i.e., sets of characters |
312which have identical lexical properties. 313Equivalence classes usually give | 2437which have identical lexical properties (for example, if the only 2438appearance of digits in the 2439.I flex 2440input is in the character class 2441"[0-9]" then the digits '0', '1', ..., '9' will all be put 2442in the same equivalence class). Equivalence classes usually give |
314dramatic reductions in the final table/object file sizes (typically 315a factor of 2-5) and are pretty cheap performance-wise (one array 316look-up per character scanned). 317.IP 318.B \-Cf 319specifies that the 320.I full 321scanner tables should be generated - 322.I flex 323should not compress the 324tables by taking advantage of similar transition functions for 325different states. 326.IP 327.B \-CF | 2443dramatic reductions in the final table/object file sizes (typically 2444a factor of 2-5) and are pretty cheap performance-wise (one array 2445look-up per character scanned). 2446.IP 2447.B \-Cf 2448specifies that the 2449.I full 2450scanner tables should be generated - 2451.I flex 2452should not compress the 2453tables by taking advantage of similar transition functions for 2454different states. 2455.IP 2456.B \-CF |
328specifies that the alternate fast scanner representation (described in 329.B lexdoc(1)) | 2457specifies that the alternate fast scanner representation (described 2458above under the 2459.B \-F 2460flag) |
330should be used. This option cannot be used with 331.B \-+. 332.IP 333.B \-Cm 334directs 335.I flex 336to construct 337.I meta-equivalence classes, 338which are sets of equivalence classes (or characters, if equivalence 339classes are not being used) that are commonly used together. Meta-equivalence 340classes are often a big win when using compressed tables, but they 341have a moderate performance impact (one or two "if" tests and one 342array look-up per character scanned). 343.IP 344.B \-Cr 345causes the generated scanner to 346.I bypass | 2461should be used. This option cannot be used with 2462.B \-+. 2463.IP 2464.B \-Cm 2465directs 2466.I flex 2467to construct 2468.I meta-equivalence classes, 2469which are sets of equivalence classes (or characters, if equivalence 2470classes are not being used) that are commonly used together. Meta-equivalence 2471classes are often a big win when using compressed tables, but they 2472have a moderate performance impact (one or two "if" tests and one 2473array look-up per character scanned). 2474.IP 2475.B \-Cr 2476causes the generated scanner to 2477.I bypass |
347using stdio for input. In general this option results in a minor 348performance gain only worthwhile if used in conjunction with | 2478use of the standard I/O library (stdio) for input. Instead of calling 2479.B fread() 2480or 2481.B getc(), 2482the scanner will use the 2483.B read() 2484system call, resulting in a performance gain which varies from system 2485to system, but in general is probably negligible unless you are also using |
349.B \-Cf 350or 351.B \-CF. | 2486.B \-Cf 2487or 2488.B \-CF. |
352It can cause surprising behavior if you use stdio yourself to 353read from | 2489Using 2490.B \-Cr 2491can cause strange behavior if, for example, you read from |
354.I yyin | 2492.I yyin |
355prior to calling the scanner. | 2493using stdio prior to calling the scanner (because the scanner will miss 2494whatever text your previous reads left in the stdio input buffer). |
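The \-Cr hazard described above (stdio holding text that a later read() never sees) can be reproduced outside flex. The sketch below is an illustration under assumptions: it uses a POSIX system, a temporary file standing in for yyin, and a hypothetical helper name `bytes_left_for_read`. It is not code that flex generates.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Writes two lines to a temporary file, fetches the first with stdio,
 * then asks read() for whatever remains on the same descriptor.
 * Returns the byte count read() reports, or -1 on error.
 */
long bytes_left_for_read(void)
{
    char line[64];
    char rest[64];
    ssize_t n;
    FILE *fp = tmpfile();

    if (fp == NULL)
        return -1;

    fputs("first line\nsecond line\n", fp);
    rewind(fp);                       /* flush and seek back to the start */

    if (fgets(line, sizeof line, fp) == NULL) {
        fclose(fp);
        return -1;
    }

    /* stdio normally pulled far more than one line into its own buffer,
     * so the descriptor's offset is already past "second line". */
    n = read(fileno(fp), rest, sizeof rest);
    fclose(fp);
    return (long)n;
}
```

With a typical fully buffered stream, read() sees none of the 12 bytes of the second line, which is the same text a \-Cr scanner would miss.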
356.IP | 2495.IP |
2496.B \-Cr 2497has no effect if you define 2498.B YY_INPUT 2499(see The Generated Scanner above). 2500.IP |
|
357A lone 358.B \-C 359specifies that the scanner tables should be compressed but neither 360equivalence classes nor meta-equivalence classes should be used. 361.IP 362The options 363.B \-Cf 364or 365.B \-CF 366and 367.B \-Cm 368do not make sense together - there is no opportunity for meta-equivalence 369classes if the table is not being compressed. Otherwise the options | 2501A lone 2502.B \-C 2503specifies that the scanner tables should be compressed but neither 2504equivalence classes nor meta-equivalence classes should be used. 2505.IP 2506The options 2507.B \-Cf 2508or 2509.B \-CF 2510and 2511.B \-Cm 2512do not make sense together - there is no opportunity for meta-equivalence 2513classes if the table is not being compressed. Otherwise the options |
370may be freely mixed. | 2514may be freely mixed, and are cumulative. |
371.IP 372The default setting is 373.B \-Cem, 374which specifies that 375.I flex 376should generate equivalence classes 377and meta-equivalence classes. This setting provides the highest 378degree of table compression. You can trade off --- 7 unchanged lines hidden (view full) --- 386 -Ce 387 -C 388 -C{f,F}e 389 -C{f,F} 390 -C{f,F}a 391 fastest & largest 392 393.fi | 2515.IP 2516The default setting is 2517.B \-Cem, 2518which specifies that 2519.I flex 2520should generate equivalence classes 2521and meta-equivalence classes. This setting provides the highest 2522degree of table compression. You can trade off --- 7 unchanged lines hidden (view full) --- 2530 -Ce 2531 -C 2532 -C{f,F}e 2533 -C{f,F} 2534 -C{f,F}a 2535 fastest & largest 2536 2537.fi |
2538Note that scanners with the smallest tables are usually generated and 2539compiled the quickest, so 2540during development you will usually want to use the default, maximal 2541compression. |
|
394.IP | 2542.IP |
395.B \-C 396options are cumulative. | 2543.B \-Cfe 2544is often a good compromise between speed and size for production 2545scanners. |
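The equivalence classes built by \-Ce can be pictured with a small partitioning routine. This is a simplified sketch and not flex's actual algorithm: it assumes the only patterns are character classes given as strings of member characters, supports at most 32 of them, and uses the hypothetical names `build_ecs` and `NCHARS`.

```c
#include <string.h>

#define NCHARS 256

/*
 * Two characters land in the same equivalence class when they belong to
 * exactly the same set of character classes.  classes[i] lists the
 * members of class i (n must be at most 32 in this sketch).  On return,
 * ec[c] holds the class number of character c; the result is the number
 * of distinct classes.
 */
int build_ecs(const char *const classes[], int n, int ec[NCHARS])
{
    unsigned long sig[NCHARS];
    int count = 0;
    int c, d, i;

    /* Membership signature: bit i is set if the char is in class i. */
    memset(sig, 0, sizeof sig);
    for (i = 0; i < n; i++) {
        const unsigned char *p = (const unsigned char *)classes[i];
        for (; *p != '\0'; p++)
            sig[*p] |= 1UL << i;
    }

    /* Characters with equal signatures share one class number. */
    for (c = 0; c < NCHARS; c++) {
        int found = -1;
        for (d = 0; d < c; d++) {
            if (sig[d] == sig[c]) {
                found = ec[d];
                break;
            }
        }
        ec[c] = (found >= 0) ? found : count++;
    }
    return count;
}
```

With the two classes "[0-9]" and "[a-z]" this yields three classes (digits, lowercase letters, everything else), so a transition table needs 3 columns instead of 256, at the cost of one `ec[]` look-up per character scanned.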
397.TP | 2546.TP |
2547.B \-ooutput 2548directs flex to write the scanner to the file 2549.B output 2550instead of 2551.B lex.yy.c. 2552If you combine 2553.B \-o 2554with the 2555.B \-t 2556option, then the scanner is written to 2557.I stdout 2558but its 2559.B #line 2560directives (see the 2561.B \\-L 2562option above) refer to the file 2563.B output. 2564.TP |
|
398.B \-Pprefix 399changes the default 400.I "yy" 401prefix used by 402.I flex | 2565.B \-Pprefix 2566changes the default 2567.I "yy" 2568prefix used by 2569.I flex |
403to be 404.I prefix 405instead. See 406.I lexdoc(1) 407for a description of all the global variables and file names that 408this affects. | 2570for all globally-visible variable and function names to instead be 2571.I prefix. 2572For example, 2573.B \-Pfoo 2574changes the name of 2575.B yytext 2576to 2577.B footext. 2578It also changes the name of the default output file from 2579.B lex.yy.c 2580to 2581.B lex.foo.c. 2582Here are all of the names affected: 2583.nf 2584 2585 yy_create_buffer 2586 yy_delete_buffer 2587 yy_flex_debug 2588 yy_init_buffer 2589 yy_flush_buffer 2590 yy_load_buffer_state 2591 yy_switch_to_buffer 2592 yyin 2593 yyleng 2594 yylex 2595 yylineno 2596 yyout 2597 yyrestart 2598 yytext 2599 yywrap 2600 2601.fi 2602(If you are using a C++ scanner, then only 2603.B yywrap 2604and 2605.B yyFlexLexer 2606are affected.) 2607Within your scanner itself, you can still refer to the global variables 2608and functions using either version of their name; but externally, they 2609have the modified name. 2610.IP 2611This option lets you easily link together multiple 2612.I flex 2613programs into the same executable. Note, though, that using this 2614option also renames 2615.B yywrap(), 2616so you now 2617.I must 2618either 2619provide your own (appropriately-named) version of the routine for your 2620scanner, or use 2621.B %option noyywrap, 2622as linking with 2623.B \-ll 2624no longer provides one for you by default. |
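As a minimal sketch of the "appropriately-named" routine mentioned above: assuming a scanner generated with \-Pfoo (the prefix is only an example), the end-of-file hook is named foowrap() rather than yywrap(), and a version that reports no further input could look like this.

```c
/*
 * Assumes the scanner was generated with -Pfoo, so the end-of-file hook
 * the scanner calls is foowrap() instead of yywrap().
 */
int foowrap(void)
{
    return 1;   /* nonzero: there are no more input files to scan */
}
```

Declaring `%option noyywrap` in the scanner specification removes the need for this routine altogether.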
409.TP 410.B \-Sskeleton_file 411overrides the default skeleton file from which 412.I flex 413constructs its scanners. You'll never need this option unless you are doing 414.I flex 415maintenance or development. | 2625.TP 2626.B \-Sskeleton_file 2627overrides the default skeleton file from which 2628.I flex 2629constructs its scanners. You'll never need this option unless you are doing 2630.I flex 2631maintenance or development. |
416.SH SUMMARY OF FLEX REGULAR EXPRESSIONS 417The patterns in the input are written using an extended set of regular 418expressions. These are: | 2632.PP 2633.I flex 2634also provides a mechanism for controlling options within the 2635scanner specification itself, rather than from the flex command-line. 2636This is done by including 2637.B %option 2638directives in the first section of the scanner specification. 2639You can specify multiple options with a single 2640.B %option 2641directive, and multiple directives in the first section of your flex input 2642file. 2643.PP 2644Most options are given simply as names, optionally preceded by the 2645word "no" (with no intervening whitespace) to negate their meaning. 2646A number are equivalent to flex flags or their negation: |
419.nf 420 | 2647.nf 2648 |
421 x match the character 'x' 422 . any character except newline 423 [xyz] a "character class"; in this case, the pattern 424 matches either an 'x', a 'y', or a 'z' 425 [abj-oZ] a "character class" with a range in it; matches 426 an 'a', a 'b', any letter from 'j' through 'o', 427 or a 'Z' 428 [^A-Z] a "negated character class", i.e., any character 429 but those in the class. In this case, any 430 character EXCEPT an uppercase letter. 431 [^A-Z\\n] any character EXCEPT an uppercase letter or 432 a newline 433 r* zero or more r's, where r is any regular expression 434 r+ one or more r's 435 r? zero or one r's (that is, "an optional r") 436 r{2,5} anywhere from two to five r's 437 r{2,} two or more r's 438 r{4} exactly 4 r's 439 {name} the expansion of the "name" definition 440 (see above) 441 "[xyz]\\"foo" 442 the literal string: [xyz]"foo 443 \\X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', 444 then the ANSI-C interpretation of \\x. 445 Otherwise, a literal 'X' (used to escape 446 operators such as '*') 447 \\123 the character with octal value 123 448 \\x2a the character with hexadecimal value 2a 449 (r) match an r; parentheses are used to override 450 precedence (see below) | 2649 7bit -7 option 2650 8bit -8 option 2651 align -Ca option 2652 backup -b option 2653 batch -B option 2654 c++ -+ option |
451 | 2655 |
2656 caseful or 2657 case-sensitive opposite of -i (default) |
|
452 | 2658 |
453 rs the regular expression r followed by the 454 regular expression s; called "concatenation" | 2659 case-insensitive or 2660 caseless -i option |
455 | 2661 |
2662 debug -d option 2663 default opposite of -s option 2664 ecs -Ce option 2665 fast -F option 2666 full -f option 2667 interactive -I option 2668 lex-compat -l option 2669 meta-ecs -Cm option 2670 perf-report -p option 2671 read -Cr option 2672 stdout -t option 2673 verbose -v option 2674 warn opposite of -w option 2675 (use "%option nowarn" for -w) |
|
456 | 2676 |
457 r|s either an r or an s | 2677 array equivalent to "%array" 2678 pointer equivalent to "%pointer" (default) |
458 | 2679 |
2680.fi 2681Some 2682.B %option's 2683provide features otherwise not available: 2684.TP 2685.B always-interactive 2686instructs flex to generate a scanner which always considers its input 2687"interactive". Normally, on each new input file the scanner calls 2688.B isatty() 2689in an attempt to determine whether 2690the scanner's input source is interactive and thus should be read a 2691character at a time. When this option is used, however, then no 2692such call is made. 2693.TP 2694.B main 2695directs flex to provide a default 2696.B main() 2697program for the scanner, which simply calls 2698.B yylex(). 2699This option implies 2700.B noyywrap 2701(see below). 2702.TP 2703.B never-interactive 2704instructs flex to generate a scanner which never considers its input 2705"interactive" (again, no call made to 2706.B isatty()). 2707This is the opposite of 2708.B always-interactive. 2709.TP 2710.B stack 2711enables the use of start condition stacks (see Start Conditions above). 2712.TP 2713.B stdinit 2714if set (i.e., 2715.B %option stdinit) 2716initializes 2717.I yyin 2718and 2719.I yyout 2720to 2721.I stdin 2722and 2723.I stdout, 2724instead of the default of 2725.I nil. 2726Some existing 2727.I lex 2728programs depend on this behavior, even though it is not compliant with 2729ANSI C, which does not require 2730.I stdin 2731and 2732.I stdout 2733to be compile-time constant. 2734.TP 2735.B yylineno 2736directs 2737.I flex 2738to generate a scanner that maintains the number of the current line 2739read from its input in the global variable 2740.B yylineno. 2741This option is implied by 2742.B %option lex-compat. 2743.TP 2744.B yywrap 2745if unset (i.e., 2746.B %option noyywrap), 2747makes the scanner not call 2748.B yywrap() 2749upon an end-of-file, but simply assume that there are no more 2750files to scan (until the user points 2751.I yyin 2752at a new file and calls 2753.B yylex() 2754again). 
2755.PP 2756.I flex 2757scans your rule actions to determine whether you use the 2758.B REJECT 2759or 2760.B yymore() 2761features. The 2762.B reject 2763and 2764.B yymore 2765options are available to override its decision as to whether you use the 2766options, either by setting them (e.g., 2767.B %option reject) 2768to indicate the feature is indeed used, or 2769unsetting them to indicate it actually is not used 2770(e.g., 2771.B %option noyymore). 2772.PP 2773Three options take string-delimited values, offset with '=': 2774.nf |
|
459 | 2775 |
460 r/s an r but only if it is followed by an s. The 461 s is not part of the matched text. This type 462 of pattern is called "trailing context". 463 ^r an r, but only at the beginning of a line 464 r$ an r, but only at the end of a line. Equivalent 465 to "r/\\n". | 2776 %option outfile="ABC" |
466 | 2777 |
2778.fi 2779is equivalent to 2780.B -oABC, 2781and 2782.nf |
|
467 | 2783 |
468 <s>r an r, but only in start condition s (see 469 below for discussion of start conditions) 470 <s1,s2,s3>r 471 same, but in any of start conditions s1, 472 s2, or s3 473 <*>r an r in any start condition, even an exclusive one. | 2784 %option prefix="XYZ" |
474 | 2785 |
2786.fi 2787is equivalent to 2788.B -PXYZ. 2789Finally, 2790.nf |
|
475 | 2791 |
476 <<EOF>> an end-of-file 477 <s1,s2><<EOF>> 478 an end-of-file when in start condition s1 or s2 | 2792 %option yyclass="foo" |
479 480.fi | 2793 2794.fi |
481The regular expressions listed above are grouped according to 482precedence, from highest precedence at the top to lowest at the bottom. 483Those grouped together have equal precedence. | 2795only applies when generating a C++ scanner ( 2796.B \-+ 2797option). It informs 2798.I flex 2799that you have derived 2800.B foo 2801as a subclass of 2802.B yyFlexLexer, 2803so 2804.I flex 2805will place your actions in the member function 2806.B foo::yylex() 2807instead of 2808.B yyFlexLexer::yylex(). 2809It also generates a 2810.B yyFlexLexer::yylex() 2811member function that emits a run-time error (by invoking 2812.B yyFlexLexer::LexerError()) 2813if called. 2814See Generating C++ Scanners, below, for additional information. |
484.PP | 2815.PP |
485Some notes on patterns: 486.IP - 487Negated character classes 488.I match newlines 489unless "\\n" (or an equivalent escape sequence) is one of the 490characters explicitly present in the negated character class 491(e.g., "[^A-Z\\n]"). 492.IP - 493A rule can have at most one instance of trailing context (the '/' operator 494or the '$' operator). The start condition, '^', and "<<EOF>>" patterns 495can only occur at the beginning of a pattern, and, as well as with '/' and '$', 496cannot be grouped inside parentheses. The following are all illegal: | 2816A number of options are available for lint purists who want to suppress 2817the appearance of unneeded routines in the generated scanner. Each of the 2818following, if unset 2819(e.g., 2820.B %option nounput 2821), results in the corresponding routine not appearing in 2822the generated scanner: |
497.nf 498 | 2823.nf 2824 |
499 foo/bar$ 500 foo|(bar$) 501 foo|^bar 502 <sc1>foo<sc2>bar | 2825 input, unput 2826 yy_push_state, yy_pop_state, yy_top_state 2827 yy_scan_buffer, yy_scan_bytes, yy_scan_string |
503 504.fi | 2828 2829.fi |
505.SH SUMMARY OF SPECIAL ACTIONS 506In addition to arbitrary C code, the following can appear in actions: 507.IP - 508.B ECHO 509copies yytext to the scanner's output. 510.IP - 511.B BEGIN 512followed by the name of a start condition places the scanner in the 513corresponding start condition. 514.IP - | 2830(though 2831.B yy_push_state() 2832and friends won't appear anyway unless you use 2833.B %option stack). 2834.SH PERFORMANCE CONSIDERATIONS 2835The main design goal of 2836.I flex 2837is that it generate high-performance scanners. It has been optimized 2838for dealing well with large sets of rules. Aside from the effects on 2839scanner speed of the table compression 2840.B \-C 2841options outlined above, 2842there are a number of options/actions which degrade performance. These 2843are, from most expensive to least: 2844.nf 2845 2846 REJECT 2847 %option yylineno 2848 arbitrary trailing context 2849 2850 pattern sets that require backing up 2851 %array 2852 %option interactive 2853 %option always-interactive 2854 2855 '^' beginning-of-line operator 2856 yymore() 2857 2858.fi 2859with the first three all being quite expensive and the last two 2860being quite cheap. Note also that 2861.B unput() 2862is implemented as a routine call that potentially does quite a bit of 2863work, while 2864.B yyless() 2865is a quite-cheap macro; so if just putting back some excess text you 2866scanned, use 2867.B yyless(). 2868.PP |
515.B REJECT | 2869.B REJECT |
516directs the scanner to proceed on to the "second best" rule which matched the 517input (or a prefix of the input). 518.B yytext 519and 520.B yyleng 521are set up appropriately. Note that 522.B REJECT 523is a particularly expensive feature in terms of scanner performance; 524if it is used in 525.I any 526of the scanner's actions it will slow down 527.I all 528of the scanner's matching. Furthermore, 529.B REJECT 530cannot be used with the 531.B \-f | 2870should be avoided at all costs when performance is important. 2871It is a particularly expensive option. 2872.PP 2873Getting rid of backing up is messy and often may be an enormous 2874amount of work for a complicated scanner. In principle, one begins 2875by using the 2876.B \-b 2877flag to generate a 2878.I lex.backup 2879file. For example, on the input 2880.nf 2881 2882 %% 2883 foo return TOK_KEYWORD; 2884 foobar return TOK_KEYWORD; 2885 2886.fi 2887the file looks like: 2888.nf 2889 2890 State #6 is non-accepting - 2891 associated rule line numbers: 2892 2 3 2893 out-transitions: [ o ] 2894 jam-transitions: EOF [ \\001-n p-\\177 ] 2895 2896 State #8 is non-accepting - 2897 associated rule line numbers: 2898 3 2899 out-transitions: [ a ] 2900 jam-transitions: EOF [ \\001-` b-\\177 ] 2901 2902 State #9 is non-accepting - 2903 associated rule line numbers: 2904 3 2905 out-transitions: [ r ] 2906 jam-transitions: EOF [ \\001-q s-\\177 ] 2907 2908 Compressed tables always back up. 2909 2910.fi 2911The first few lines tell us that there's a scanner state in 2912which it can make a transition on an 'o' but not on any other 2913character, and that in that state the currently scanned text does not match 2914any rule. The state occurs when trying to match the rules found 2915at lines 2 and 3 in the input file. 2916If the scanner is in that state and then reads 2917something other than an 'o', it will have to back up to find 2918a rule which is matched. 
With 2919a bit of headscratching one can see that this must be the 2920state it's in when it has seen "fo". When this has happened, 2921if anything other than another 'o' is seen, the scanner will 2922have to back up to simply match the 'f' (by the default rule). 2923.PP 2924The comment regarding State #8 indicates there's a problem 2925when "foob" has been scanned. Indeed, on any character other 2926than an 'a', the scanner will have to back up to accept "foo". 2927Similarly, the comment for State #9 concerns when "fooba" has 2928been scanned and an 'r' does not follow. 2929.PP 2930The final comment reminds us that there's no point going to 2931all the trouble of removing backing up from the rules unless 2932we're using 2933.B \-Cf |
532or | 2934or |
533.B \-F 534options. 535.IP 536Note also that unlike the other special actions, | 2935.B \-CF, 2936since there's no performance gain doing so with compressed scanners. 2937.PP 2938The way to remove the backing up is to add "error" rules: 2939.nf 2940 2941 %% 2942 foo return TOK_KEYWORD; 2943 foobar return TOK_KEYWORD; 2944 2945 fooba | 2946 foob | 2947 fo { 2948 /* false alarm, not really a keyword */ 2949 return TOK_ID; 2950 } 2951 2952.fi 2953.PP 2954Eliminating backing up among a list of keywords can also be 2955done using a "catch-all" rule: 2956.nf 2957 2958 %% 2959 foo return TOK_KEYWORD; 2960 foobar return TOK_KEYWORD; 2961 2962 [a-z]+ return TOK_ID; 2963 2964.fi 2965This is usually the best solution when appropriate. 2966.PP 2967Backing up messages tend to cascade. 2968With a complicated set of rules it's not uncommon to get hundreds 2969of messages. If one can decipher them, though, it often 2970only takes a dozen or so rules to eliminate the backing up (though 2971it's easy to make a mistake and have an error rule accidentally match 2972a valid token. A possible future 2973.I flex 2974feature will be to automatically add rules to eliminate backing up). 2975.PP 2976It's important to keep in mind that you gain the benefits of eliminating 2977backing up only if you eliminate 2978.I every 2979instance of backing up. Leaving just one means you gain nothing. 2980.PP 2981.I Variable 2982trailing context (where both the leading and trailing parts do not have 2983a fixed length) entails almost the same performance loss as |
537.B REJECT | 2984.B REJECT |
538is a 539.I branch; 540code immediately following it in the action will | 2985(i.e., substantial). So when possible a rule like: 2986.nf 2987 2988 %% 2989 mouse|rat/(cat|dog) run(); 2990 2991.fi 2992is better written: 2993.nf 2994 2995 %% 2996 mouse/cat|dog run(); 2997 rat/cat|dog run(); 2998 2999.fi 3000or as 3001.nf 3002 3003 %% 3004 mouse|rat/cat run(); 3005 mouse|rat/dog run(); 3006 3007.fi 3008Note that here the special '|' action does |
541.I not | 3009.I not |
542be executed. 543.IP - 544.B yymore() 545tells the scanner that the next time it matches a rule, the corresponding 546token should be 547.I appended 548onto the current value of | 3010provide any savings, and can even make things worse (see 3011Deficiencies / Bugs below). 3012.LP 3013Another area where the user can increase a scanner's performance 3014(and one that's easier to implement) arises from the fact that 3015the longer the tokens matched, the faster the scanner will run. 3016This is because with long tokens the processing of most input 3017characters takes place in the (short) inner scanning loop, and 3018does not often have to go through the additional work of setting up 3019the scanning environment (e.g., 3020.B yytext) 3021for the action. Recall the scanner for C comments: 3022.nf 3023 3024 %x comment 3025 %% 3026 int line_num = 1; 3027 3028 "/*" BEGIN(comment); 3029 3030 <comment>[^*\\n]* 3031 <comment>"*"+[^*/\\n]* 3032 <comment>\\n ++line_num; 3033 <comment>"*"+"/" BEGIN(INITIAL); 3034 3035.fi 3036This could be sped up by writing it as: 3037.nf 3038 3039 %x comment 3040 %% 3041 int line_num = 1; 3042 3043 "/*" BEGIN(comment); 3044 3045 <comment>[^*\\n]* 3046 <comment>[^*\\n]*\\n ++line_num; 3047 <comment>"*"+[^*/\\n]* 3048 <comment>"*"+[^*/\\n]*\\n ++line_num; 3049 <comment>"*"+"/" BEGIN(INITIAL); 3050 3051.fi 3052Now instead of each newline requiring the processing of another 3053action, recognizing the newlines is "distributed" over the other rules 3054to keep the matched text as long as possible. Note that 3055.I adding 3056rules does 3057.I not 3058slow down the scanner! The speed of the scanner is independent 3059of the number of rules or (modulo the considerations given at the 3060beginning of this section) how complicated the rules are with 3061regard to operators such as '*' and '|'. 
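The effect of distributing the newlines can be checked by counting how many actions each rule set would run on the same comment body. The following is only a rough sketch, not code produced by flex: the matcher functions and count_actions() are hypothetical stand-ins that hand-code the longest-match lengths of the <comment> patterns shown above, and the simulation assumes the scanner is already in the comment start condition.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Each helper returns the length of the longest match of one <comment>
// pattern at s, or 0 if the pattern does not match there.  These are
// hand-written stand-ins, not scanner code generated by flex.
static std::size_t plain(const char* s)           // [^*\n]*
{ return std::strcspn(s, "*\n"); }

static std::size_t stars_text(const char* s)      // "*"+[^*/\n]*
{
    std::size_t k = std::strspn(s, "*");
    return k ? k + std::strcspn(s + k, "*/\n") : 0;
}

static std::size_t newline(const char* s)         // \n
{ return *s == '\n' ? 1 : 0; }

static std::size_t close_comment(const char* s)   // "*"+"/"
{
    std::size_t k = std::strspn(s, "*");
    return (k && s[k] == '/') ? k + 1 : 0;
}

static std::size_t plain_nl(const char* s)        // [^*\n]*\n
{
    std::size_t n = plain(s);
    return s[n] == '\n' ? n + 1 : 0;
}

static std::size_t stars_text_nl(const char* s)   // "*"+[^*/\n]*\n
{
    std::size_t n = stars_text(s);
    return (n && s[n] == '\n') ? n + 1 : 0;
}

// Counts rule matches (one action each) from just after the opening
// delimiter up to and including the rule that closes the comment.
// 'distributed' selects the second rule set, where the newline is
// folded into the neighbouring rules.
static int count_actions(const char* s, bool distributed)
{
    int actions = 0;
    while (*s) {
        if (close_comment(s))          // always the longest match when it applies
            return actions + 1;
        std::size_t best = std::max(plain(s), stars_text(s));
        if (distributed)
            best = std::max(best, std::max(plain_nl(s), stars_text_nl(s)));
        else
            best = std::max(best, newline(s));
        if (best == 0)                 // guard against looping on no match
            break;
        ++actions;
        s += best;
    }
    return actions;
}
```

Calling count_actions() on the same two-line comment body with distributed set to false and then to true shows fewer matches for the second rule set, since each newline no longer needs an action of its own.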
3062.PP 3063A final example in speeding up a scanner: suppose you want to scan 3064through a file containing identifiers and keywords, one per line 3065and with no other extraneous characters, and recognize all the 3066keywords. A natural first approach is: 3067.nf 3068 3069 %% 3070 asm | 3071 auto | 3072 break | 3073 ... etc ... 3074 volatile | 3075 while /* it's a keyword */ 3076 3077 .|\\n /* it's not a keyword */ 3078 3079.fi 3080To eliminate the back-tracking, introduce a catch-all rule: 3081.nf 3082 3083 %% 3084 asm | 3085 auto | 3086 break | 3087 ... etc ... 3088 volatile | 3089 while /* it's a keyword */ 3090 3091 [a-z]+ | 3092 .|\\n /* it's not a keyword */ 3093 3094.fi 3095Now, if it's guaranteed that there's exactly one word per line, 3096then we can reduce the total number of matches by a half by 3097merging in the recognition of newlines with that of the other 3098tokens: 3099.nf 3100 3101 %% 3102 asm\\n | 3103 auto\\n | 3104 break\\n | 3105 ... etc ... 3106 volatile\\n | 3107 while\\n /* it's a keyword */ 3108 3109 [a-z]+\\n | 3110 .|\\n /* it's not a keyword */ 3111 3112.fi 3113One has to be careful here, as we have now reintroduced backing up 3114into the scanner. In particular, while 3115.I we 3116know that there will never be any characters in the input stream 3117other than letters or newlines, 3118.I flex 3119can't figure this out, and it will plan for possibly needing to back up 3120when it has scanned a token like "auto" and then the next character 3121is something other than a newline or a letter. Previously it would 3122then just match the "auto" rule and be done, but now it has no "auto" 3123rule, only a "auto\\n" rule. 
To eliminate the possibility of backing up, 3124we could either duplicate all rules but without final newlines, or, 3125since we never expect to encounter such an input and therefore don't 3126care how it's classified, we can introduce one more catch-all rule, this 3127one which doesn't include a newline: 3128.nf 3129 3130 %% 3131 asm\\n | 3132 auto\\n | 3133 break\\n | 3134 ... etc ... 3135 volatile\\n | 3136 while\\n /* it's a keyword */ 3137 3138 [a-z]+\\n | 3139 [a-z]+ | 3140 .|\\n /* it's not a keyword */ 3141 3142.fi 3143Compiled with 3144.B \-Cf, 3145this is about as fast as one can get a 3146.I flex 3147scanner to go for this particular problem. 3148.PP 3149A final note: 3150.I flex 3151is slow when matching NUL's, particularly when a token contains 3152multiple NUL's. 3153It's best to write rules which match 3154.I short 3155amounts of text if it's anticipated that the text will often include NUL's. 3156.PP 3157Another final note regarding performance: as mentioned above in the section 3158How the Input is Matched, dynamically resizing |
549.B yytext | 3159.B yytext |
550rather than replacing it. 551.IP - 552.B yyless(n) 553returns all but the first 554.I n 555characters of the current token back to the input stream, where they 556will be rescanned when the scanner looks for the next match. 557.B yytext | 3160to accommodate huge tokens is a slow process because it presently requires that 3161the (huge) token be rescanned from the beginning. Thus if performance is 3162vital, you should attempt to match "large" quantities of text but not 3163"huge" quantities, where the cutoff between the two is at about 8K 3164characters/token. 3165.SH GENERATING C++ SCANNERS 3166.I flex 3167provides two different ways to generate scanners for use with C++. The 3168first way is to simply compile a scanner generated by 3169.I flex 3170using a C++ compiler instead of a C compiler. You should not encounter 3171any compilation errors (please report any you find to the email address 3172given in the Author section below). You can then use C++ code in your 3173rule actions instead of C code. Note that the default input source for 3174your scanner remains 3175.I yyin, 3176and default echoing is still done to 3177.I yyout. 3178Both of these remain 3179.I FILE * 3180variables and not C++ 3181.I streams. 3182.PP 3183You can also use 3184.I flex 3185to generate a C++ scanner class, using the 3186.B \-+ 3187option (or, equivalently, 3188.B %option c++), 3189which is automatically specified if the name of the flex 3190executable ends in a '+', such as 3191.I flex++. 3192When using this option, flex defaults to generating the scanner to the file 3193.B lex.yy.cc 3194instead of 3195.B lex.yy.c. 3196The generated scanner includes the header file 3197.I FlexLexer.h, 3198which defines the interface to two C++ classes. 3199.PP 3200The first class, 3201.B FlexLexer, 3202provides an abstract base class defining the general scanner class 3203interface. 
It provides the following member functions: 3204.TP 3205.B const char* YYText() 3206returns the text of the most recently matched token, the equivalent of 3207.B yytext. 3208.TP 3209.B int YYLeng() 3210returns the length of the most recently matched token, the equivalent of 3211.B yyleng. 3212.TP 3213.B int lineno() const 3214returns the current input line number 3215(see 3216.B %option yylineno), 3217or 3218.B 1 3219if 3220.B %option yylineno 3221was not used. 3222.TP 3223.B void set_debug( int flag ) 3224sets the debugging flag for the scanner, equivalent to assigning to 3225.B yy_flex_debug 3226(see the Options section above). Note that you must build the scanner 3227using 3228.B %option debug 3229to include debugging information in it. 3230.TP 3231.B int debug() const 3232returns the current setting of the debugging flag. 3233.PP 3234Also provided are member functions equivalent to 3235.B yy_switch_to_buffer(), 3236.B yy_create_buffer() 3237(though the first argument is an 3238.B istream* 3239object pointer and not a 3240.B FILE*), 3241.B yy_flush_buffer(), 3242.B yy_delete_buffer(), |
558and | 3243and |
559.B yyleng 560are adjusted appropriately (e.g., 561.B yyleng 562will now be equal to 563.I n 564). | 3244.B yyrestart() 3245(again, the first argument is an 3246.B istream* 3247object pointer). 3248.PP 3249The second class defined in 3250.I FlexLexer.h 3251is 3252.B yyFlexLexer, 3253which is derived from 3254.B FlexLexer. 3255It defines the following additional member functions: 3256.TP 3257.B 3258yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 ) 3259constructs a 3260.B yyFlexLexer 3261object using the given streams for input and output. If not specified, 3262the streams default to 3263.B cin 3264and 3265.B cout, 3266respectively. 3267.TP 3268.B virtual int yylex() 3269performs the same role as 3270.B yylex() 3271does for ordinary flex scanners: it scans the input stream, consuming 3272tokens, until a rule's action returns a value. If you derive a subclass 3273.B S 3274from 3275.B yyFlexLexer 3276and want to access the member functions and variables of 3277.B S 3278inside 3279.B yylex(), 3280then you need to use 3281.B %option yyclass="S" 3282to inform 3283.I flex 3284that you will be using that subclass instead of 3285.B yyFlexLexer. 3286In this case, rather than generating 3287.B yyFlexLexer::yylex(), 3288.I flex 3289generates 3290.B S::yylex() 3291(and also generates a dummy 3292.B yyFlexLexer::yylex() 3293that calls 3294.B yyFlexLexer::LexerError() 3295if called). 3296.TP 3297.B 3298virtual void switch_streams(istream* new_in = 0, 3299.B 3300ostream* new_out = 0) 3301reassigns 3302.B yyin 3303to 3304.B new_in 3305(if non-nil) 3306and 3307.B yyout 3308to 3309.B new_out 3310(ditto), deleting the previous input buffer if 3311.B yyin 3312is reassigned. 3313.TP 3314.B 3315int yylex( istream* new_in, ostream* new_out = 0 ) 3316first switches the input streams via 3317.B switch_streams( new_in, new_out ) 3318and then returns the value of 3319.B yylex(). 
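The way the stream-switching members described above fit together can be sketched with a small stand-in class. MockLexer below is hypothetical and is not the class that flex generates: it only mirrors the documented shapes of the constructor, switch_streams(), and the two yylex() overloads, it uses std::istream and std::ostream, and it leaves out all buffer management (including the deletion of the previous input buffer). Its toy yylex() simply echoes one whitespace-delimited word per call.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical stand-in with the same member shapes as described above.
// It is NOT the flex-generated yyFlexLexer and does no real scanning.
class MockLexer {
public:
    // Null arguments fall back to cin and cout, as the text describes
    // for the yyFlexLexer constructor.
    explicit MockLexer(std::istream* in = 0, std::ostream* out = 0)
        : in_(in ? in : &std::cin), out_(out ? out : &std::cout) {}

    // A null pointer means "leave that stream unchanged".
    void switch_streams(std::istream* new_in = 0, std::ostream* new_out = 0)
    {
        if (new_in)
            in_ = new_in;
        if (new_out)
            out_ = new_out;
    }

    // Toy scanner: echoes one word and returns 1, or returns 0 once the
    // input is exhausted.
    int yylex()
    {
        std::string word;
        if (!(*in_ >> word))
            return 0;
        *out_ << word << '\n';
        return 1;
    }

    // Switches streams first, then scans, like the two-argument yylex().
    int yylex(std::istream* new_in, std::ostream* new_out = 0)
    {
        switch_streams(new_in, new_out);
        return yylex();
    }

private:
    std::istream* in_;
    std::ostream* out_;
};
```

In this sketch, calling yylex( &other_input ) after the first input runs dry continues with the new input while output keeps going to the stream chosen earlier, because the omitted second argument is null.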
3320.PP 3321In addition, 3322.B yyFlexLexer 3323defines the following protected virtual functions which you can redefine 3324in derived classes to tailor the scanner: 3325.TP 3326.B 3327virtual int LexerInput( char* buf, int max_size ) 3328reads up to 3329.B max_size 3330characters into 3331.B buf 3332and returns the number of characters read. To indicate end-of-input, 3333return 0 characters. Note that "interactive" scanners (see the 3334.B \-B 3335and 3336.B \-I 3337flags) define the macro 3338.B YY_INTERACTIVE. 3339If you redefine 3340.B LexerInput() 3341and need to take different actions depending on whether or not 3342the scanner might be scanning an interactive input source, you can 3343test for the presence of this name via 3344.B #ifdef. 3345.TP 3346.B 3347virtual void LexerOutput( const char* buf, int size ) 3348writes out 3349.B size 3350characters from the buffer 3351.B buf, 3352which, while NUL-terminated, may also contain "internal" NUL's if 3353the scanner's rules can match text with NUL's in them. 3354.TP 3355.B 3356virtual void LexerError( const char* msg ) 3357reports a fatal error message. The default version of this function 3358writes the message to the stream 3359.B cerr 3360and exits. 3361.PP 3362Note that a 3363.B yyFlexLexer 3364object contains its 3365.I entire 3366scanning state. Thus you can use such objects to create reentrant 3367scanners. You can instantiate multiple instances of the same 3368.B yyFlexLexer 3369class, and you can also combine multiple C++ scanner classes together 3370in the same program using the 3371.B \-P 3372option discussed above. 3373.PP 3374Finally, note that the 3375.B %array 3376feature is not available to C++ scanner classes; you must use 3377.B %pointer 3378(the default). 3379.PP 3380Here is an example of a simple C++ scanner: 3381.nf 3382 3383 // An example of using the flex C++ scanner class. 
3384 3385 %{ 3386 int mylineno = 0; 3387 %} 3388 3389 string \\"[^\\n"]+\\" 3390 3391 ws [ \\t]+ 3392 3393 alpha [A-Za-z] 3394 dig [0-9] 3395 name ({alpha}|{dig}|\\$)({alpha}|{dig}|[_.\\-/$])* 3396 num1 [-+]?{dig}+\\.?([eE][-+]?{dig}+)? 3397 num2 [-+]?{dig}*\\.{dig}+([eE][-+]?{dig}+)? 3398 number {num1}|{num2} 3399 3400 %% 3401 3402 {ws} /* skip blanks and tabs */ 3403 3404 "/*" { 3405 int c; 3406 3407 while((c = yyinput()) != 0) 3408 { 3409 if(c == '\\n') 3410 ++mylineno; 3411 3412 else if(c == '*') 3413 { 3414 if((c = yyinput()) == '/') 3415 break; 3416 else 3417 unput(c); 3418 } 3419 } 3420 } 3421 3422 {number} cout << "number " << YYText() << '\\n'; 3423 3424 \\n mylineno++; 3425 3426 {name} cout << "name " << YYText() << '\\n'; 3427 3428 {string} cout << "string " << YYText() << '\\n'; 3429 3430 %% 3431 3432 int main( int /* argc */, char** /* argv */ ) 3433 { 3434 FlexLexer* lexer = new yyFlexLexer; 3435 while(lexer->yylex() != 0) 3436 ; 3437 return 0; 3438 } 3439.fi 3440If you want to create multiple (different) lexer classes, you use the 3441.B \-P 3442flag (or the 3443.B prefix= 3444option) to rename each 3445.B yyFlexLexer 3446to some other 3447.B xxFlexLexer. 3448You then can include 3449.B <FlexLexer.h> 3450in your other sources once per lexer class, first renaming 3451.B yyFlexLexer 3452as follows: 3453.nf 3454 3455 #undef yyFlexLexer 3456 #define yyFlexLexer xxFlexLexer 3457 #include <FlexLexer.h> 3458 3459 #undef yyFlexLexer 3460 #define yyFlexLexer zzFlexLexer 3461 #include <FlexLexer.h> 3462 3463.fi 3464if, for example, you used 3465.B %option prefix="xx" 3466for one of your scanners and 3467.B %option prefix="zz" 3468for the other. 3469.PP 3470IMPORTANT: the present form of the scanning class is 3471.I experimental 3472and may change considerably between major releases. 
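The #undef/#define sequence shown earlier in this section relies on ordinary preprocessor behaviour: whatever yyFlexLexer is defined as at the point where the header text is processed becomes the name of the declared class. The sketch below illustrates only that mechanism. FAKE_HEADER_BODY, STR, and STR2 are hypothetical names, and the macro stands in for the text of a header so that the example fits in one file; it is not the contents of FlexLexer.h.

```cpp
#include <cassert>
#include <cstring>

// Two-level stringification so the argument is macro-expanded first.
#define STR2(x) #x
#define STR(x)  STR2(x)

// Stand-in for a header that declares a class named yyFlexLexer.
// The name is substituted when the macro is expanded, just as it would
// be when a real header is included after the #define.
#define FAKE_HEADER_BODY                                              \
    struct yyFlexLexer {                                              \
        static const char* class_name() { return STR(yyFlexLexer); }  \
    };

#undef yyFlexLexer
#define yyFlexLexer xxFlexLexer
FAKE_HEADER_BODY   // declares struct xxFlexLexer

#undef yyFlexLexer
#define yyFlexLexer zzFlexLexer
FAKE_HEADER_BODY   // declares struct zzFlexLexer

#undef yyFlexLexer
```

After this, xxFlexLexer and zzFlexLexer are two distinct types, which is what lets scanners built with different prefix= settings be used side by side in one program.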
3473.SH INCOMPATIBILITIES WITH LEX AND POSIX 3474.I flex 3475is a rewrite of the AT&T Unix 3476.I lex 3477tool (the two implementations do not share any code, though), 3478with some extensions and incompatibilities, both of which 3479are of concern to those who wish to write scanners acceptable 3480to either implementation. Flex is fully compliant with the POSIX 3481.I lex 3482specification, except that when using 3483.B %pointer 3484(the default), a call to 3485.B unput() 3486destroys the contents of 3487.B yytext, 3488which is counter to the POSIX specification. 3489.PP 3490In this section we discuss all of the known areas of incompatibility 3491between flex, AT&T lex, and the POSIX specification. 3492.PP 3493.I flex's 3494.B \-l 3495option turns on maximum compatibility with the original AT&T 3496.I lex 3497implementation, at the cost of a major loss in the generated scanner's 3498performance. We note below which incompatibilities can be overcome 3499using the 3500.B \-l 3501option. 3502.PP 3503.I flex 3504is fully compatible with 3505.I lex 3506with the following exceptions: |
565.IP - | 3507.IP - |
566.B unput(c) 567puts the character 568.I c 569back onto the input stream. It will be the next character scanned. | 3508The undocumented 3509.I lex 3510scanner internal variable 3511.B yylineno 3512is not supported unless 3513.B \-l 3514or 3515.B %option yylineno 3516is used. 3517.IP 3518.B yylineno 3519should be maintained on a per-buffer basis, rather than a per-scanner 3520(single global variable) basis. 3521.IP 3522.B yylineno 3523is not part of the POSIX specification. |
570.IP - | 3524.IP - |
3525The |
|
571.B input() | 3526.B input() |
572reads the next character from the input stream (this routine is called 573.B yyinput() 574if the scanner is compiled using 575.B C++). 576.IP - 577.B yyterminate() 578can be used in lieu of a return statement in an action. It terminates 579the scanner and returns a 0 to the scanner's caller, indicating "all done". | 3527routine is not redefinable, though it may be called to read characters 3528following whatever has been matched by a rule. If 3529.B input() 3530encounters an end-of-file the normal 3531.B yywrap() 3532processing is done. A ``real'' end-of-file is returned by 3533.B input() 3534as 3535.I EOF. |
580.IP | 3536.IP |
581By default, 582.B yyterminate() 583is also called when an end-of-file is encountered. It is a macro and 584may be redefined. | 3537Input is instead controlled by defining the 3538.B YY_INPUT 3539macro. 3540.IP 3541The 3542.I flex 3543restriction that 3544.B input() 3545cannot be redefined is in accordance with the POSIX specification, 3546which simply does not specify any way of controlling the 3547scanner's input other than by making an initial assignment to 3548.I yyin. |
585.IP - | 3549.IP - |
586.B YY_NEW_FILE 587is an action available only in <<EOF>> rules. It means "Okay, I've 588set up a new input file, continue scanning". It is no longer required; 589you can just assign 590.I yyin 591to point to a new file in the <<EOF>> action. | 3550The 3551.B unput() 3552routine is not redefinable. This restriction is in accordance with POSIX. |
592.IP - | 3553.IP - |
593.B yy_create_buffer( file, size ) 594takes a 595.I FILE 596pointer and an integer 597.I size. 598It returns a YY_BUFFER_STATE 599handle to a new input buffer large enough to accomodate 600.I size 601characters and associated with the given file. When in doubt, use 602.B YY_BUF_SIZE 603for the size. | 3554.I flex 3555scanners are not as reentrant as 3556.I lex 3557scanners. In particular, if you have an interactive scanner and 3558an interrupt handler which long-jumps out of the scanner, and 3559the scanner is subsequently called again, you may get the following 3560message: 3561.nf 3562 3563 fatal flex scanner internal error--end of buffer missed 3564 3565.fi 3566To reenter the scanner, first use 3567.nf 3568 3569 yyrestart( yyin ); 3570 3571.fi 3572Note that this call will throw away any buffered input; usually this 3573isn't a problem with an interactive scanner. 3574.IP 3575Also note that flex C++ scanner classes 3576.I are 3577reentrant, so if using C++ is an option for you, you should use 3578them instead. See "Generating C++ Scanners" above for details. |
604.IP - | 3579.IP - |
605.B yy_switch_to_buffer( new_buffer ) 606switches the scanner's processing to scan for tokens from 607the given buffer, which must be a YY_BUFFER_STATE. | 3580.B output() 3581is not supported. 3582Output from the 3583.B ECHO 3584macro is done to the file-pointer 3585.I yyout 3586(default 3587.I stdout). 3588.IP 3589.B output() 3590is not part of the POSIX specification. |
608.IP - | 3591.IP - |
609.B yy_delete_buffer( buffer ) 610deletes the given buffer. 611.SH VALUES AVAILABLE TO THE USER | 3592.I lex 3593does not support exclusive start conditions (%x), though they 3594are in the POSIX specification. |
612.IP - | 3595.IP - |
613.B char *yytext 614holds the text of the current token. It may be modified but not lengthened 615(you cannot append characters to the end). Modifying the last character 616may affect the activity of rules anchored using '^' during the next scan; 617see 618.B lexdoc(1) 619for details. | 3596When definitions are expanded, 3597.I flex 3598encloses them in parentheses. 3599With lex, the following: 3600.nf 3601 3602 NAME [A-Z][A-Z0-9]* 3603 %% 3604 foo{NAME}? printf( "Found it\\n" ); 3605 %% 3606 3607.fi 3608will not match the string "foo" because when the macro 3609is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?" 3610and the precedence is such that the '?' is associated with 3611"[A-Z0-9]*". With 3612.I flex, 3613the rule will be expanded to 3614"foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match. |
620.IP | 3615.IP |
621If the special directive 622.B %array 623appears in the first section of the scanner description, then 624.B yytext 625is instead declared 626.B char yytext[YYLMAX], 627where 628.B YYLMAX 629is a macro definition that you can redefine in the first section 630if you don't like the default value (generally 8KB). Using 631.B %array 632results in somewhat slower scanners, but the value of 633.B yytext 634becomes immune to calls to 635.I input() | 3616Note that if the definition begins with 3617.B ^ 3618or ends with 3619.B $ 3620then it is 3621.I not 3622expanded with parentheses, to allow these operators to appear in 3623definitions without losing their special meanings. But the 3624.B <s>, /, |
636and | 3625and |
637.I unput(), 638which potentially destroy its value when 639.B yytext 640is a character pointer. The opposite of 641.B %array 642is 643.B %pointer, 644which is the default. | 3626.B <<EOF>> 3627operators cannot be used in a 3628.I flex 3629definition. |
645.IP | 3630.IP |
646You cannot use 647.B %array 648when generating C++ scanner classes 649(the 650.B \-+ 651flag). | 3631Using 3632.B \-l 3633results in the 3634.I lex 3635behavior of no parentheses around the definition. 3636.IP 3637The POSIX specification is that the definition be enclosed in parentheses. |
652.IP - | 3638.IP - |
653.B int yyleng 654holds the length of the current token. 655.IP - 656.B FILE *yyin 657is the file which by default | 3639Some implementations of 3640.I lex 3641allow a rule's action to begin on a separate line, if the rule's pattern 3642has trailing whitespace: 3643.nf 3644 3645 %% 3646 foo|bar<space here> 3647 { foobar_action(); } 3648 3649.fi |
658.I flex | 3650.I flex |
659reads from. It may be redefined but doing so only makes sense before 660scanning begins or after an EOF has been encountered. Changing it in 661the midst of scanning will have unexpected results since 662.I flex 663buffers its input; use 664.B yyrestart() 665instead. 666Once scanning terminates because an end-of-file 667has been seen, 668.B 669you can assign 670.I yyin 671at the new input file and then call the scanner again to continue scanning. | 3651does not support this feature. |
672.IP - | 3652.IP - |
673.B void yyrestart( FILE *new_file ) 674may be called to point 675.I yyin 676at the new input file. The switch-over to the new file is immediate 677(any previously buffered-up input is lost). Note that calling 678.B yyrestart() 679with 680.I yyin 681as an argument thus throws away the current input buffer and continues 682scanning the same input file. | 3653The 3654.I lex 3655.B %r 3656(generate a Ratfor scanner) option is not supported. It is not part 3657of the POSIX specification. |
683.IP - | 3658.IP - |
684.B FILE *yyout 685is the file to which 686.B ECHO 687actions are done. It can be reassigned by the user. | 3659After a call to 3660.B unput(), 3661.I yytext 3662is undefined until the next token is matched, unless the scanner 3663was built using 3664.B %array. 3665This is not the case with 3666.I lex 3667or the POSIX specification. The 3668.B \-l 3669option does away with this incompatibility. |
688.IP - | 3670.IP - |
689.B YY_CURRENT_BUFFER 690returns a 691.B YY_BUFFER_STATE 692handle to the current buffer. | 3671The precedence of the 3672.B {} 3673(numeric range) operator is different. 3674.I lex 3675interprets "abc{1,3}" as "match one, two, or 3676three occurrences of 'abc'", whereas 3677.I flex 3678interprets it as "match 'ab' 3679followed by one, two, or three occurrences of 'c'". The latter is 3680in agreement with the POSIX specification. |
693.IP - | 3681.IP - |
694.B YY_START 695returns an integer value corresponding to the current start 696condition. You can subsequently use this value with 697.B BEGIN 698to return to that start condition. 699.SH MACROS AND FUNCTIONS YOU CAN REDEFINE | 3682The precedence of the 3683.B ^ 3684operator is different. 3685.I lex 3686interprets "^foo|bar" as "match either 'foo' at the beginning of a line, 3687or 'bar' anywhere", whereas 3688.I flex 3689interprets it as "match either 'foo' or 'bar' if they come at the beginning 3690of a line". The latter is in agreement with the POSIX specification. |
700.IP - | 3691.IP - |
701.B YY_DECL 702controls how the scanning routine is declared. 703By default, it is "int yylex()", or, if prototypes are being 704used, "int yylex(void)". This definition may be changed by redefining 705the "YY_DECL" macro. Note that 706if you give arguments to the scanning routine using a 707K&R-style/non-prototyped function declaration, you must terminate 708the definition with a semi-colon (;). | 3692The special table-size declarations such as 3693.B %a 3694supported by 3695.I lex 3696are not required by 3697.I flex 3698scanners; 3699.I flex 3700ignores them. |
709.IP - | 3701.IP - |
710The nature of how the scanner 711gets its input can be controlled by redefining the 712.B YY_INPUT 713macro. 714YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its 715action is to place up to 716.I max_size 717characters in the character array 718.I buf 719and return in the integer variable 720.I result 721either the 722number of characters read or the constant YY_NULL (0 on Unix systems) 723to indicate EOF. The default YY_INPUT reads from the 724global file-pointer "yyin". 725A sample redefinition of YY_INPUT (in the definitions 726section of the input file): | 3702The name 3703.bd 3704FLEX_SCANNER 3705is #define'd so scanners may be written for use with either 3706.I flex 3707or 3708.I lex. 3709Scanners also include 3710.B YY_FLEX_MAJOR_VERSION 3711and 3712.B YY_FLEX_MINOR_VERSION 3713indicating which version of 3714.I flex 3715generated the scanner 3716(for example, for the 2.5 release, these defines would be 2 and 5 3717respectively). 3718.PP 3719The following 3720.I flex 3721features are not included in 3722.I lex 3723or the POSIX specification: |
727.nf 728 | 3724.nf 3725 |
729 %{ 730 #undef YY_INPUT 731 #define YY_INPUT(buf,result,max_size) \\ 732 { \\ 733 int c = getchar(); \\ 734 result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \\ 735 } 736 %} | 3726 C++ scanners 3727 %option 3728 start condition scopes 3729 start condition stacks 3730 interactive/non-interactive scanners 3731 yy_scan_string() and friends 3732 yyterminate() 3733 yy_set_interactive() 3734 yy_set_bol() 3735 YY_AT_BOL() 3736 <<EOF>> 3737 <*> 3738 YY_DECL 3739 YY_START 3740 YY_USER_ACTION 3741 YY_USER_INIT 3742 #line directives 3743 %{}'s around actions 3744 multiple actions on a line |
737 738.fi | 3745 3746.fi |
739.IP - 740When the scanner receives an end-of-file indication from YY_INPUT, 741it then checks the function 742.B yywrap() 743function. If 744.B yywrap() 745returns false (zero), then it is assumed that the 746function has gone ahead and set up 747.I yyin 748to point to another input file, and scanning continues. If it returns 749true (non-zero), then the scanner terminates, returning 0 to its 750caller. 751.IP 752The default 753.B yywrap() 754always returns 1. 755.IP - 756YY_USER_ACTION 757can be redefined to provide an action 758which is always executed prior to the matched rule's action. 759.IP - 760The macro 761.B YY_USER_INIT 762may be redefined to provide an action which is always executed before 763the first scan. 764.IP - 765In the generated scanner, the actions are all gathered in one large 766switch statement and separated using 767.B YY_BREAK, 768which may be redefined. By default, it is simply a "break", to separate 769each rule's action from the following rule's. 770.SH FILES 771.TP 772.B \-ll 773library with which to link scanners to obtain the default versions 774of 775.I yywrap() 776and/or 777.I main(). 778.TP 779.I lex.yy.c 780generated scanner (called 781.I lexyy.c 782on some systems). 783.TP 784.I lex.yy.cc 785generated C++ scanner class, when using 786.B -+. 787.TP 788.I <FlexLexer.h> 789header file defining the C++ scanner base class, 790.B FlexLexer, 791and its derived class, 792.B yyFlexLexer. 793.TP 794.I flex.skl 795skeleton scanner. This file is only used when building flex, not when 796flex executes. 797.TP 798.I lex.backup 799backing-up information for 800.B \-b 801flag (called 802.I lex.bck 803on some systems). 804.SH "SEE ALSO" | 3747plus almost all of the flex flags. 
3748The last feature in the list refers to the fact that with 3749.I flex 3750you can put multiple actions on the same line, separated with 3751semi-colons, while with 3752.I lex, 3753the following 3754.nf 3755 3756 foo handle_foo(); ++num_foos_seen; 3757 3758.fi 3759is (rather surprisingly) truncated to 3760.nf 3761 3762 foo handle_foo(); 3763 3764.fi 3765.I flex 3766does not truncate the action. Actions that are not enclosed in 3767braces are simply terminated at the end of the line. 3768.SH DIAGNOSTICS |
805.PP | 3769.PP |
806lexdoc(1), lex(1), yacc(1), sed(1), awk(1). | 3770.I warning, rule cannot be matched 3771indicates that the given rule 3772cannot be matched because it follows other rules that will 3773always match the same text as it. For 3774example, in the following "foo" cannot be matched because it comes after 3775an identifier "catch-all" rule: 3776.nf 3777 3778 [a-z]+ got_identifier(); 3779 foo got_foo(); 3780 3781.fi 3782Using 3783.B REJECT 3784in a scanner suppresses this warning. |
807.PP | 3785.PP |
808M. E. Lesk and E. Schmidt, 809.I LEX \- Lexical Analyzer Generator 810.SH DIAGNOSTICS | 3786.I warning, 3787.B \-s 3788.I 3789option given but default rule can be matched 3790means that it is possible (perhaps only in a particular start condition) 3791that the default rule (match any single character) is the only one 3792that will match a particular input. Since 3793.B \-s 3794was given, presumably this is not intended. |
811.PP 812.I reject_used_but_not_detected undefined 813or | 3795.PP 3796.I reject_used_but_not_detected undefined 3797or |
814.PP | |
815.I yymore_used_but_not_detected undefined - 816These errors can occur at compile time. They indicate that the 817scanner uses 818.B REJECT 819or 820.B yymore() 821but that 822.I flex 823failed to notice the fact, meaning that 824.I flex 825scanned the first two sections looking for occurrences of these actions 826and failed to find any, but somehow you snuck some in (via a #include | 3798.I yymore_used_but_not_detected undefined - 3799These errors can occur at compile time. They indicate that the 3800scanner uses 3801.B REJECT 3802or 3803.B yymore() 3804but that 3805.I flex 3806failed to notice the fact, meaning that 3807.I flex 3808scanned the first two sections looking for occurrences of these actions 3809and failed to find any, but somehow you snuck some in (via a #include |
827file, for example). Make an explicit reference to the action in your 828.I flex 829input file. (Note that previously 830.I flex 831supported a 832.B %used/%unused 833mechanism for dealing with this problem; this feature is still supported 834but now deprecated, and will go away soon unless the author hears from 835people who can argue compellingly that they need it.) | 3810file, for example). Use 3811.B %option reject 3812or 3813.B %option yymore 3814to indicate to flex that you really do use these features. |
836.PP 837.I flex scanner jammed - 838a scanner compiled with 839.B \-s 840has encountered an input string which wasn't matched by | 3815.PP 3816.I flex scanner jammed - 3817a scanner compiled with 3818.B \-s 3819has encountered an input string which wasn't matched by |
841any of its rules. | 3820any of its rules. This error can also occur due to internal problems. |
842.PP | 3821.PP |
843.I warning, rule cannot be matched 844indicates that the given rule 845cannot be matched because it follows other rules that will 846always match the same text as it. See 847.I lexdoc(1) 848for an example. 849.PP 850.I warning, 851.B \-s 852.I 853option given but default rule can be matched 854means that it is possible (perhaps only in a particular start condition) 855that the default rule (match any single character) is the only one 856that will match a particular input. Since 857.PP 858.I scanner input buffer overflowed - 859a scanner rule matched more text than the available dynamic memory. 860.PP | |
861.I token too large, exceeds YYLMAX - 862your scanner uses 863.B %array 864and one of its rules matched a string longer than the 865.B YYLMAX 866constant (8K bytes by default). You can increase the value by 867#define'ing 868.B YYLMAX --- 5 unchanged lines hidden (view full) --- 874.I use the character 'x' - 875Your scanner specification includes recognizing the 8-bit character 876.I 'x' 877and you did not specify the \-8 flag, and your scanner defaulted to 7-bit 878because you used the 879.B \-Cf 880or 881.B \-CF | 3822.I token too large, exceeds YYLMAX - 3823your scanner uses 3824.B %array 3825and one of its rules matched a string longer than the 3826.B YYLMAX 3827constant (8K bytes by default). You can increase the value by 3828#define'ing 3829.B YYLMAX --- 5 unchanged lines hidden (view full) --- 3835.I use the character 'x' - 3836Your scanner specification includes recognizing the 8-bit character 3837.I 'x' 3838and you did not specify the \-8 flag, and your scanner defaulted to 7-bit 3839because you used the 3840.B \-Cf 3841or 3842.B \-CF |
882table compression options. | 3843table compression options. See the discussion of the 3844.B \-7 3845flag for details. |
883.PP 884.I flex scanner push-back overflow - 885you used 886.B unput() 887to push back so much text that the scanner's buffer could not hold 888both the pushed-back text and the current token in 889.B yytext. 890Ideally the scanner should dynamically resize the buffer in this case, but at --- 11 unchanged lines hidden --- 902This can occur in a scanner which is reentered after a long-jump 903has jumped out (or over) the scanner's activation frame. Before 904reentering the scanner, use: 905.nf 906 907 yyrestart( yyin ); 908 909.fi | 3846.PP 3847.I flex scanner push-back overflow - 3848you used 3849.B unput() 3850to push back so much text that the scanner's buffer could not hold 3851both the pushed-back text and the current token in 3852.B yytext. 3853Ideally the scanner should dynamically resize the buffer in this case, but at --- 11 unchanged lines hidden --- 3865This can occur in a scanner which is reentered after a long-jump 3866has jumped out (or over) the scanner's activation frame. Before 3867reentering the scanner, use: 3868.nf 3869 3870 yyrestart( yyin ); 3871 3872.fi |
910or use C++ scanner classes (the 911.B \-+ 912option), which are fully reentrant. 913.SH AUTHOR 914Vern Paxson, with the help of many ideas and much inspiration from 915Van Jacobson. Original version by Jef Poskanzer. | 3873or, as noted above, switch to using the C++ scanner class. |
916.PP | 3874.PP |
917See lexdoc(1) for additional credits and the address to send comments to. | 3875.I too many start conditions in <> construct! - 3876you listed more start conditions in a <> construct than exist (so 3877you must have listed at least one of them twice). 3878.SH FILES 3879.TP 3880.B \-ll 3881library with which scanners must be linked. 3882.TP 3883.I lex.yy.c 3884generated scanner (called 3885.I lexyy.c 3886on some systems). 3887.TP 3888.I lex.yy.cc 3889generated C++ scanner class, when using 3890.B -+. 3891.TP 3892.I <FlexLexer.h> 3893header file defining the C++ scanner base class, 3894.B FlexLexer, 3895and its derived class, 3896.B yyFlexLexer. 3897.TP 3898.I flex.skl 3899skeleton scanner. This file is only used when building flex, not when 3900flex executes. 3901.TP 3902.I lex.backup 3903backing-up information for 3904.B \-b 3905flag (called 3906.I lex.bck 3907on some systems). |
918.SH DEFICIENCIES / BUGS 919.PP 920Some trailing context 921patterns cannot be properly matched and generate 922warning messages ("dangerous trailing context"). These are 923patterns where the ending of the 924first part of the rule matches the beginning of the second 925part, such as "zx*/xy*", where the 'x*' matches the 'x' at --- 15 unchanged lines hidden --- 941 %% 942 abc | 943 xyz/def 944 945.fi 946.PP 947Use of 948.B unput() | 3908.SH DEFICIENCIES / BUGS 3909.PP 3910Some trailing context 3911patterns cannot be properly matched and generate 3912warning messages ("dangerous trailing context"). These are 3913patterns where the ending of the 3914first part of the rule matches the beginning of the second 3915part, such as "zx*/xy*", where the 'x*' matches the 'x' at --- 15 unchanged lines hidden --- 3931 %% 3932 abc | 3933 xyz/def 3934 3935.fi 3936.PP 3937Use of 3938.B unput() |
949or 950.B input() | |
951invalidates yytext and yyleng, unless the 952.B %array 953directive 954or the 955.B \-l 956option has been used. 957.PP | 3939invalidates yytext and yyleng, unless the 3940.B %array 3941directive 3942or the 3943.B \-l 3944option has been used. 3945.PP |
958Use of unput() to push back more text than was matched can 959result in the pushed-back text matching a beginning-of-line ('^') 960rule even though it didn't come at the beginning of the line 961(though this is rare!). 962.PP | |
963Pattern-matching of NUL's is substantially slower than matching other 964characters. 965.PP 966Dynamic resizing of the input buffer is slow, as it entails rescanning 967all the text matched so far by the current (generally huge) token. 968.PP | 3946Pattern-matching of NUL's is substantially slower than matching other 3947characters. 3948.PP 3949Dynamic resizing of the input buffer is slow, as it entails rescanning 3950all the text matched so far by the current (generally huge) token. 3951.PP |
969.I flex 970does not generate correct #line directives for code internal 971to the scanner; thus, bugs in 972.I flex.skl 973yield bogus line numbers. 974.PP | |
975Due to both buffering of input and read-ahead, you cannot intermix 976calls to <stdio.h> routines, such as, for example, 977.B getchar(), 978with 979.I flex 980rules and expect it to work. Call 981.B input() 982instead. --- 11 unchanged lines hidden --- 994.B \-f 995or 996.B \-F 997options. 998.PP 999The 1000.I flex 1001internal algorithms need documentation. | 3952Due to both buffering of input and read-ahead, you cannot intermix 3953calls to <stdio.h> routines, such as, for example, 3954.B getchar(), 3955with 3956.I flex 3957rules and expect it to work. Call 3958.B input() 3959instead. --- 11 unchanged lines hidden --- 3971.B \-f 3972or 3973.B \-F 3974options. 3975.PP 3976The 3977.I flex 3978internal algorithms need documentation. |
3979.SH SEE ALSO 3980.PP 3981lex(1), yacc(1), sed(1), awk(1). 3982.PP 3983John Levine, Tony Mason, and Doug Brown, 3984.I Lex & Yacc, 3985O'Reilly and Associates. Be sure to get the 2nd edition. 3986.PP 3987M. E. Lesk and E. Schmidt, 3988.I LEX \- Lexical Analyzer Generator 3989.PP 3990Alfred Aho, Ravi Sethi and Jeffrey Ullman, 3991.I Compilers: Principles, Techniques and Tools, 3992Addison-Wesley (1986). Describes the pattern-matching techniques used by 3993.I flex 3994(deterministic finite automata). 3995.SH AUTHOR 3996Vern Paxson, with the help of many ideas and much inspiration from 3997Van Jacobson. Original version by Jef Poskanzer. The fast table 3998representation is a partial implementation of a design done by Van 3999Jacobson. The implementation was done by Kevin Gong and Vern Paxson. 4000.PP 4001Thanks to the many 4002.I flex 4003beta-testers, feedbackers, and contributors, especially Francois Pinard, 4004Casey Leedom, 4005Robert Abramovitz, 4006Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai, 4007Neal Becker, Nelson H.F. Beebe, benson@odi.com, 4008Karl Berry, Peter A. Bigot, Simon Blanchard, 4009Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher, 4010Brian Clapper, J.T. Conklin, 4011Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David 4012Daniels, Chris G. Demetriou, Theo Deraadt, 4013Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin, 4014Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl, 4015Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz, 4016Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel, 4017Jan Hajic, Charles Hemphill, NORO Hideo, 4018Jarkko Hietaniemi, Scott Hofmann, 4019Jeff Honig, Dana Hudes, Eric Hughes, John Interrante, 4020Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones, 4021Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane, 4022Amir Katz, ken@ken.hilco.com, Kevin B. 
Kenny, 4023Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht, 4024Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle, 4025David Loffredo, Mike Long, 4026Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall, 4027Bengt Martensson, Chris Metcalf, 4028Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum, 4029G.T. Nicol, Landon Noll, James Nordby, Marc Nozell, 4030Richard Ohnemus, Karsten Pahnke, 4031Sven Panne, Roland Pesch, Walter Pelissero, Gaumond 4032Pierre, Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha, 4033Frederic Raimbault, Pat Rankin, Rick Richardson, 4034Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini, 4035Andreas Scherer, Darrell Schiebel, Raf Schietekat, 4036Doug Schmidt, Philippe Schnoebelen, Andreas Schwab, 4037Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Strvmquist, 4038Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor, 4039Chris Thewalt, Richard M. Timoney, Jodi Tsai, 4040Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams, Ken 4041Yap, Ron Zellar, Nathan Zelle, David Zuhn, 4042and those whose names have slipped my marginal 4043mail-archiving skills but whose contributions are appreciated all the 4044same. 4045.PP 4046Thanks to Keith Bostic, Jon Forrest, Noah Friedman, 4047John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T. 4048Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various 4049distribution headaches. 4050.PP 4051Thanks to Esmond Pitt and Earle Horton for 8-bit character support; to 4052Benson Margulies and Fred Burke for C++ support; to Kent Williams and Tom 4053Epperly for C++ class support; to Ove Ewerlid for support of NUL's; and to 4054Eric Hughes for support of multiple buffers. 4055.PP 4056This work was primarily done when I was with the Real Time Systems Group 4057at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there 4058for the support I received. 4059.PP 4060Send comments to vern@ee.lbl.gov. |
|