\input texinfo
@c %**start of header
@setfilename flex.info
@settitle Flex - a scanner generator
@c @finalout
@c @setchapternewpage odd
@c %**end of header

@set EDITION 2.5
@set UPDATED March 1995
@set VERSION 2.5

@c FIXME - Reread a printed copy with a red pen and patience.
@c FIXME - Modify all "See ..." references and replace with @xref's.

@ifinfo
@format
START-INFO-DIR-ENTRY
* Flex: (flex).         A fast scanner generator.
END-INFO-DIR-ENTRY
@end format
@end ifinfo

@c Define new indices for commands, filenames, and options.
@c @defcodeindex cm
@c @defcodeindex fl
@c @defcodeindex op

@c Put everything in one index (arbitrarily chosen to be the concept index).
@c @syncodeindex cm cp
@c @syncodeindex fl cp
@syncodeindex fn cp
@syncodeindex ky cp
@c @syncodeindex op cp
@syncodeindex pg cp
@syncodeindex vr cp

@ifinfo
This file documents Flex.

Copyright (c) 1990 The Regents of the University of California.
All rights reserved.

This code is derived from software contributed to Berkeley by
Vern Paxson.

The United States Government has rights in this work pursuant
to contract no. DE-AC03-76SF00098 between the United States
Department of Energy and the University of California.

Redistribution and use in source and binary forms with or without
modification are permitted provided that: (1) source distributions
retain this entire copyright notice and comment, and (2)
distributions including binaries display the following
acknowledgement:  ``This product includes software developed by the
University of California, Berkeley and its contributors'' in the
documentation or other materials provided with the distribution and
in all advertising materials mentioning features or use of this
software.
Neither the name of the University nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE.

@ignore
Permission is granted to process this file through TeX and print the
results, provided the printed document carries copying permission
notice identical to this one except for the removal of this paragraph
(this paragraph not being relevant to the printed manual).

@end ignore
@end ifinfo

@titlepage
@title Flex, version @value{VERSION}
@subtitle A fast scanner generator
@subtitle Edition @value{EDITION}, @value{UPDATED}
@author Vern Paxson

@page
@vskip 0pt plus 1filll
Copyright @copyright{} 1990 The Regents of the University of California.
All rights reserved.

This code is derived from software contributed to Berkeley by
Vern Paxson.

The United States Government has rights in this work pursuant
to contract no. DE-AC03-76SF00098 between the United States
Department of Energy and the University of California.

Redistribution and use in source and binary forms with or without
modification are permitted provided that: (1) source distributions
retain this entire copyright notice and comment, and (2)
distributions including binaries display the following
acknowledgement:  ``This product includes software developed by the
University of California, Berkeley and its contributors'' in the
documentation or other materials provided with the distribution and
in all advertising materials mentioning features or use of this
software.
Neither the name of the University nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE.
@end titlepage

@ifinfo

@node Top, Name, (dir), (dir)
@top flex

@cindex scanner generator

This manual documents @code{flex}.  It covers release @value{VERSION}.

@menu
* Name::                        Name
* Synopsis::                    Synopsis
* Overview::                    Overview
* Description::                 Description
* Examples::                    Some simple examples
* Format::                      Format of the input file
* Patterns::                    Patterns
* Matching::                    How the input is matched
* Actions::                     Actions
* Generated scanner::           The generated scanner
* Start conditions::            Start conditions
* Multiple buffers::            Multiple input buffers
* End-of-file rules::           End-of-file rules
* Miscellaneous::               Miscellaneous macros
* User variables::              Values available to the user
* YACC interface::              Interfacing with @code{yacc}
* Options::                     Options
* Performance::                 Performance considerations
* C++::                         Generating C++ scanners
* Incompatibilities::           Incompatibilities with @code{lex} and POSIX
* Diagnostics::                 Diagnostics
* Files::                       Files
* Deficiencies::                Deficiencies / Bugs
* See also::                    See also
* Author::                      Author
@c * Index::                    Index
@end menu

@end ifinfo

@node Name, Synopsis, Top, Top
@section Name

flex - fast lexical analyzer generator

@node Synopsis, Overview, Name, Top
@section Synopsis

@example
flex [-bcdfhilnpstvwBFILTV78+? -C[aefFmr] -ooutput -Pprefix -Sskeleton]
[--help --version] [@var{filename} @dots{}]
@end example

@node Overview, Description, Synopsis, Top
@section Overview

This manual describes @code{flex}, a tool for generating programs
that perform pattern-matching on text.  The manual
includes both tutorial and reference sections:

@table @asis
@item Description
a brief overview of the tool

@item Some Simple Examples

@item Format Of The Input File

@item Patterns
the extended regular expressions used by flex

@item How The Input Is Matched
the rules for determining what has been matched

@item Actions
how to specify what to do when a pattern is matched

@item The Generated Scanner
details regarding the scanner that flex produces;
how to control the input source

@item Start Conditions
introducing context into your scanners, and
managing "mini-scanners"

@item Multiple Input Buffers
how to manipulate multiple input sources; how to
scan from strings instead of files

@item End-of-file Rules
special rules for matching the end of the input

@item Miscellaneous Macros
a summary of macros available to the actions

@item Values Available To The User
a summary of values available to the actions

@item Interfacing With Yacc
connecting flex scanners together with yacc parsers

@item Options
flex command-line options, and the "%option"
directive

@item Performance Considerations
how to make your scanner go as fast as possible

@item Generating C++ Scanners
the (experimental) facility for generating C++
scanner classes

@item Incompatibilities With Lex And POSIX
how flex differs from AT&T lex and the POSIX lex
standard

@item Diagnostics
those error messages produced by flex (or scanners
it generates) whose meanings might not be apparent

@item Files
files used by flex

@item Deficiencies / Bugs
known problems with flex

@item See Also
other documentation, related tools

@item Author
includes contact information
@end table

@node Description, Examples, Overview, Top
@section Description

@code{flex} is a tool for generating @dfn{scanners}: programs which
recognize lexical patterns in text.  @code{flex} reads the given
input files, or its standard input if no file names are
given, for a description of a scanner to generate.  The
description is in the form of pairs of regular expressions
and C code, called @dfn{rules}.  @code{flex} generates as output a C
source file, @file{lex.yy.c}, which defines a routine @samp{yylex()}.
This file is compiled and linked with the @samp{-lfl} library to
produce an executable.  When the executable is run, it
analyzes its input for occurrences of the regular
expressions.  Whenever it finds one, it executes the
corresponding C code.

@node Examples, Format, Description, Top
@section Some simple examples

Here are some simple examples to give the flavor of how one
uses @code{flex}.  The following @code{flex} input specifies a scanner
which, whenever it encounters the string "username", will
replace it with the user's login name:

@example
%%
username    printf( "%s", getlogin() );
@end example

By default, any text not matched by a @code{flex} scanner is
copied to the output, so the net effect of this scanner is
to copy its input file to its output with each occurrence
of "username" expanded.  In this input, there is just one
rule.  "username" is the @var{pattern} and the "printf" is the
@var{action}.  The "%%" marks the beginning of the rules.

Here's another simple example:

@example
        int num_lines = 0, num_chars = 0;

%%
\n      ++num_lines; ++num_chars;
.       ++num_chars;

%%
int main( void )
        @{
        yylex();
        printf( "# of lines = %d, # of chars = %d\n",
                num_lines, num_chars );
        return 0;
        @}
@end example

This scanner counts the number of characters and the
number of lines in its input (it produces no output other
than the final report on the counts).  The first line
declares two globals, "num_lines" and "num_chars", which
are accessible both inside @samp{yylex()} and in the @samp{main()}
routine declared after the second "%%".  There are two rules,
one which matches a newline ("\n") and increments both the
line count and the character count, and one which matches
any character other than a newline (indicated by the "."
regular expression).

A somewhat more complicated example:

@example
/* scanner for a toy Pascal-like language */

%@{
/* need this for the calls to atoi() and atof() below */
#include <stdlib.h>
%@}

DIGIT    [0-9]
ID       [a-z][a-z0-9]*

%%

@{DIGIT@}+    @{
            printf( "An integer: %s (%d)\n", yytext,
                    atoi( yytext ) );
            @}

@{DIGIT@}+"."@{DIGIT@}*        @{
            printf( "A float: %s (%g)\n", yytext,
                    atof( yytext ) );
            @}

if|then|begin|end|procedure|function        @{
            printf( "A keyword: %s\n", yytext );
            @}

@{ID@}        printf( "An identifier: %s\n", yytext );

"+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );

"@{"[^@}\n]*"@}"     /* eat up one-line comments */

[ \t\n]+          /* eat up whitespace */

.           printf( "Unrecognized character: %s\n", yytext );

%%

int main( int argc, char **argv )
    @{
    ++argv, --argc;  /* skip over program name */
    if ( argc > 0 )
            yyin = fopen( argv[0], "r" );
    else
            yyin = stdin;

    yylex();
    return 0;
    @}
@end example

This is the beginnings of a simple scanner for a language
like Pascal.  It identifies different types of @var{tokens} and
reports on what it has seen.

The details of this example will be explained in the
following sections.

@node Format, Patterns, Examples, Top
@section Format of the input file

The @code{flex} input file consists of three sections, separated
by a line with just @samp{%%} in it:

@example
definitions
%%
rules
%%
user code
@end example

The @dfn{definitions} section contains declarations of simple
@dfn{name} definitions to simplify the scanner specification,
and declarations of @dfn{start conditions}, which are explained
in a later section.
Name definitions have the form:

@example
name definition
@end example

The "name" is a word beginning with a letter or an
underscore ('_') followed by zero or more letters, digits, '_',
or '-' (dash).  The definition is taken to begin at the
first non-white-space character following the name and
to continue to the end of the line.  The definition can
subsequently be referred to using "@{name@}", which will
expand to "(definition)".  For example,

@example
DIGIT    [0-9]
ID       [a-z][a-z0-9]*
@end example

@noindent
defines "DIGIT" to be a regular expression which matches a
single digit, and "ID" to be a regular expression which
matches a letter followed by zero-or-more
letters-or-digits.  A subsequent reference to

@example
@{DIGIT@}+"."@{DIGIT@}*
@end example

@noindent
is identical to

@example
([0-9])+"."([0-9])*
@end example

@noindent
and matches one-or-more digits followed by a '.' followed
by zero-or-more digits.

The @var{rules} section of the @code{flex} input contains a series of
rules of the form:

@example
pattern   action
@end example

@noindent
where the pattern must be unindented and the action must
begin on the same line.

See below for a further description of patterns and
actions.
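
The parenthesized expansion of name definitions described above
matters when a definition contains alternation.  The following is a
minimal sketch only; the definition name "VOWEL" and the action are
hypothetical and chosen purely for illustration:

@example
VOWEL   a|e|i|o|u

%%
b@{VOWEL@}t    printf( "<%s>", yytext );
@end example

@noindent
Because "@{VOWEL@}" expands to "(a|e|i|o|u)", the pattern is
equivalent to "b(a|e|i|o|u)t" and matches "bat", "bet", "bit",
"bot", and "but".  Without the parentheses it would read
"ba|e|i|o|ut", which, since concatenation binds more tightly than
alternation, would match "ba", "e", "i", "o", or "ut" instead.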

Finally, the user code section is simply copied to
@file{lex.yy.c} verbatim.  It is used for companion routines
which call or are called by the scanner.  The presence of
this section is optional; if it is missing, the second @samp{%%}
in the input file may be skipped, too.

In the definitions and rules sections, any @emph{indented} text or
text enclosed in @samp{%@{} and @samp{%@}} is copied verbatim to the
output (with the @samp{%@{@}}'s removed).  The @samp{%@{@}}'s must
appear unindented on lines by themselves.

In the rules section, any indented or %@{@} text appearing
before the first rule may be used to declare variables
which are local to the scanning routine and (after the
declarations) code which is to be executed whenever the
scanning routine is entered.  Other indented or %@{@} text
in the rule section is still copied to the output, but its
meaning is not well-defined and it may well cause
compile-time errors (this feature is present for @code{POSIX} compliance;
see below for other such features).

In the definitions section (but not in the rules section),
an unindented comment (i.e., a line beginning with "/*")
is also copied verbatim to the output up to the next "*/".

@node Patterns, Matching, Format, Top
@section Patterns

The patterns in the input are written using an extended
set of regular expressions.  These are:

@table @samp
@item x
match the character @samp{x}
@item .
any character (byte) except newline
@item [xyz]
a "character class"; in this case, the pattern
matches either an @samp{x}, a @samp{y}, or a @samp{z}
@item [abj-oZ]
a "character class" with a range in it; matches
an @samp{a}, a @samp{b}, any letter from @samp{j} through @samp{o},
or a @samp{Z}
@item [^A-Z]
a "negated character class", i.e., any character
but those in the class.  In this case, any
character EXCEPT an uppercase letter.
@item [^A-Z\n]
any character EXCEPT an uppercase letter or
a newline
@item @var{r}*
zero or more @var{r}'s, where @var{r} is any regular expression
@item @var{r}+
one or more @var{r}'s
@item @var{r}?
zero or one @var{r}'s (that is, "an optional @var{r}")
@item @var{r}@{2,5@}
anywhere from two to five @var{r}'s
@item @var{r}@{2,@}
two or more @var{r}'s
@item @var{r}@{4@}
exactly 4 @var{r}'s
@item @{@var{name}@}
the expansion of the "@var{name}" definition
(see above)
@item "[xyz]\"foo"
the literal string: @samp{[xyz]"foo}
@item \@var{x}
if @var{x} is an @samp{a}, @samp{b}, @samp{f}, @samp{n}, @samp{r}, @samp{t}, or @samp{v},
then the ANSI-C interpretation of \@var{x}.
Otherwise, a literal @samp{@var{x}} (used to escape
operators such as @samp{*})
@item \0
a NUL character (ASCII code 0)
@item \123
the character with octal value 123
@item \x2a
the character with hexadecimal value @code{2a}
@item (@var{r})
match an @var{r}; parentheses are used to override
precedence (see below)
@item @var{r}@var{s}
the regular expression @var{r} followed by the
regular expression @var{s}; called "concatenation"
@item @var{r}|@var{s}
either an @var{r} or an @var{s}
@item @var{r}/@var{s}
an @var{r} but only if it is followed by an @var{s}.  The text
matched by @var{s} is included when determining whether this rule is
the @dfn{longest match}, but is then returned to the input before
the action is executed.  So the action only sees the text matched
by @var{r}.  This type of pattern is called @dfn{trailing context}.
(There are some combinations of @samp{@var{r}/@var{s}} that @code{flex}
cannot match correctly; see notes in the Deficiencies / Bugs section
below regarding "dangerous trailing context".)
@item ^@var{r}
an @var{r}, but only at the beginning of a line (i.e.,
when just starting to scan, or right after a
newline has been scanned).
@item @var{r}$
an @var{r}, but only at the end of a line (i.e., just
before a newline).  Equivalent to "@var{r}/\n".

Note that flex's notion of "newline" is exactly
whatever the C compiler used to compile flex
interprets '\n' as; in particular, on some DOS
systems you must either filter out \r's in the
input yourself, or explicitly use @var{r}/\r\n for "r$".
@item <@var{s}>@var{r}
an @var{r}, but only in start condition @var{s} (see
below for discussion of start conditions)
@item <@var{s1},@var{s2},@var{s3}>@var{r}
same, but in any of start conditions @var{s1},
@var{s2}, or @var{s3}
@item <*>@var{r}
an @var{r} in any start condition, even an exclusive one.
@item <<EOF>>
an end-of-file
@item <@var{s1},@var{s2}><<EOF>>
an end-of-file when in start condition @var{s1} or @var{s2}
@end table

Note that inside of a character class, all regular
expression operators lose their special meaning except escape
('\') and the character class operators, '-', ']', and, at
the beginning of the class, '^'.

The regular expressions listed above are grouped according
to precedence, from highest precedence at the top to
lowest at the bottom.  Those grouped together have equal
precedence.  For example,

@example
foo|bar*
@end example

@noindent
is the same as

@example
(foo)|(ba(r*))
@end example

@noindent
since the '*' operator has higher precedence than
concatenation, and concatenation higher than alternation ('|').
This pattern therefore matches @emph{either} the string "foo" @emph{or}
the string "ba" followed by zero-or-more r's.
To match
"foo" or zero-or-more "bar"'s, use:

@example
foo|(bar)*
@end example

@noindent
and to match zero-or-more "foo"'s-or-"bar"'s:

@example
(foo|bar)*
@end example

In addition to characters and ranges of characters,
character classes can also contain character class
@dfn{expressions}.  These are expressions enclosed inside @samp{[:} and @samp{:]}
delimiters (which themselves must appear between the '['
and ']' of the character class; other elements may occur
inside the character class, too).  The valid expressions
are:

@example
[:alnum:] [:alpha:] [:blank:]
[:cntrl:] [:digit:] [:graph:]
[:lower:] [:print:] [:punct:]
[:space:] [:upper:] [:xdigit:]
@end example

These expressions all designate a set of characters
equivalent to the corresponding standard C @samp{isXXX} function.  For
example, @samp{[:alnum:]} designates those characters for which
@samp{isalnum()} returns true - i.e., any alphabetic or numeric character.
Some systems don't provide @samp{isblank()}, so flex defines
@samp{[:blank:]} as a blank or a tab.

For example, the following character classes are all
equivalent:

@example
[[:alnum:]]
[[:alpha:][:digit:]]
[[:alpha:]0-9]
[a-zA-Z0-9]
@end example

If your scanner is case-insensitive (the @samp{-i} flag), then
@samp{[:upper:]} and @samp{[:lower:]} are equivalent to @samp{[:alpha:]}.

Some notes on patterns:

@itemize -
@item
A negated character class such as the example
"[^A-Z]" above @emph{will match a newline} unless "\n" (or an
equivalent escape sequence) is one of the
characters explicitly present in the negated character
class (e.g., "[^A-Z\n]").  This is unlike how many
other regular expression tools treat negated
character classes, but unfortunately the inconsistency
is historically entrenched.
Matching newlines
means that a pattern like [^"]* can match the
entire input unless there's another quote in the
input.

@item
A rule can have at most one instance of trailing
context (the '/' operator or the '$' operator).
The start condition, '^', and "<<EOF>>" patterns
can only occur at the beginning of a pattern, and,
like '/' and '$', cannot be grouped
inside parentheses.  A '^' which does not occur at
the beginning of a rule or a '$' which does not
occur at the end of a rule loses its special
properties and is treated as a normal character.

The following are illegal:

@example
foo/bar$
<sc1>foo<sc2>bar
@end example

Note that the first of these can be written
"foo/bar\n".

The following will result in '$' or '^' being
treated as a normal character:

@example
foo|(bar$)
foo|^bar
@end example

If what's wanted is a "foo" or a
bar-followed-by-a-newline, the following could be used (the special
'|' action is explained below):

@example
foo      |
bar$     /* action goes here */
@end example

A similar trick will work for matching a foo or a
bar-at-the-beginning-of-a-line.
@end itemize

@node Matching, Actions, Patterns, Top
@section How the input is matched

When the generated scanner is run, it analyzes its input
looking for strings which match any of its patterns.  If
it finds more than one match, it takes the one matching
the most text (for trailing context rules, this includes
the length of the trailing part, even though it will then
be returned to the input).  If it finds two or more
matches of the same length, the rule listed first in the
@code{flex} input file is chosen.
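
As a minimal sketch of these two rules (longest match first, then the
earliest-listed rule on a tie), consider the following scanner.  The
patterns and messages are illustrative only and are not taken from any
particular program:

@example
%%
if        printf( "keyword " );
[a-z]+    printf( "identifier " );
@end example

@noindent
On the input "if", both patterns match two characters, so the
first-listed rule is chosen and the scanner writes "keyword".  On
the input "iffy", the second pattern matches four characters
against two for the first, so the longer match wins and the scanner
writes "identifier".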

Once the match is determined, the text corresponding to
the match (called the @var{token}) is made available in the
global character pointer @code{yytext}, and its length in the
global integer @code{yyleng}.  The @var{action} corresponding to the
matched pattern is then executed (a more detailed
description of actions follows), and then the remaining input is
scanned for another match.

If no match is found, then the @dfn{default rule} is executed:
the next character in the input is considered matched and
copied to the standard output.  Thus, the simplest legal
@code{flex} input is:

@example
%%
@end example

which generates a scanner that simply copies its input
(one character at a time) to its output.

Note that @code{yytext} can be defined in two different ways:
either as a character @emph{pointer} or as a character @emph{array}.
You can control which definition @code{flex} uses by including
one of the special directives @samp{%pointer} or @samp{%array} in the
first (definitions) section of your flex input.  The
default is @samp{%pointer}, unless you use the @samp{-l} lex
compatibility option, in which case @code{yytext} will be an array.  The
advantage of using @samp{%pointer} is substantially faster
scanning and no buffer overflow when matching very large
tokens (unless you run out of dynamic memory).  The
disadvantage is that you are restricted in how your actions can
modify @code{yytext} (see the next section), and calls to the
@samp{unput()} function destroy the present contents of @code{yytext},
which can be a considerable porting headache when moving
between different @code{lex} versions.

The advantage of @samp{%array} is that you can then modify @code{yytext}
to your heart's content, and calls to @samp{unput()} do not
destroy @code{yytext} (see below).
Furthermore, existing @code{lex}
programs sometimes access @code{yytext} externally using
declarations of the form:
@example
extern char yytext[];
@end example
This definition is erroneous when used with @samp{%pointer}, but
correct for @samp{%array}.

@samp{%array} defines @code{yytext} to be an array of @code{YYLMAX} characters,
which defaults to a fairly large value.  You can change
the size by simply #define'ing @code{YYLMAX} to a different value
in the first section of your @code{flex} input.  As mentioned
above, with @samp{%pointer} yytext grows dynamically to
accommodate large tokens.  While this means your @samp{%pointer} scanner
can accommodate very large tokens (such as matching entire
blocks of comments), bear in mind that each time the
scanner must resize @code{yytext} it also must rescan the entire
token from the beginning, so matching such tokens can
prove slow.  @code{yytext} presently does @emph{not} dynamically grow if
a call to @samp{unput()} results in too much text being pushed
back; instead, a run-time error results.

Also note that you cannot use @samp{%array} with C++ scanner
classes (the @code{c++} option; see below).

@node Actions, Generated scanner, Matching, Top
@section Actions

Each pattern in a rule has a corresponding action, which
can be any arbitrary C statement.  The pattern ends at the
first non-escaped whitespace character; the remainder of
the line is its action.  If the action is empty, then when
the pattern is matched the input token is simply
discarded.  For example, here is the specification for a
program which deletes all occurrences of "zap me" from its
input:

@example
%%
"zap me"
@end example

(It will copy all other characters in the input to the
output since they will be matched by the default rule.)

Here is a program which compresses multiple blanks and
tabs down to a single blank, and throws away whitespace
found at the end of a line:

@example
%%
[ \t]+        putchar( ' ' );
[ \t]+$       /* ignore this token */
@end example

If the action contains a '@{', then the action spans till
the balancing '@}' is found, and the action may cross
multiple lines.  @code{flex} knows about C strings and comments and
won't be fooled by braces found within them, but also
allows actions to begin with @samp{%@{} and will consider the
action to be all the text up to the next @samp{%@}} (regardless of
ordinary braces inside the action).

An action consisting solely of a vertical bar ('|') means
"same as the action for the next rule."  See below for an
illustration.

Actions can include arbitrary C code, including @code{return}
statements to return a value to whatever routine called
@samp{yylex()}.  Each time @samp{yylex()} is called it continues
processing tokens from where it last left off until it either
reaches the end of the file or executes a return.

Actions are free to modify @code{yytext} except for lengthening
it (adding characters to its end--these will overwrite
later characters in the input stream).  This however does
not apply when using @samp{%array} (see above); in that case,
@code{yytext} may be freely modified in any way.

Actions are free to modify @code{yyleng} except they should not
do so if the action also includes use of @samp{yymore()} (see
below).

There are a number of special directives which can be
included within an action:

@itemize -
@item
@samp{ECHO} copies yytext to the scanner's output.

@item
@code{BEGIN} followed by the name of a start condition
places the scanner in the corresponding start
condition (see below).

@item
@code{REJECT} directs the scanner to proceed on to the
"second best" rule which matched the input (or a
prefix of the input).  The rule is chosen as
described above in "How the Input is Matched", and
@code{yytext} and @code{yyleng} are set up appropriately.  It may
either be one which matched as much text as the
originally chosen rule but came later in the @code{flex}
input file, or one which matched less text.  For
example, the following will both count the words in
the input and call the routine special() whenever
"frob" is seen:

@example
        int word_count = 0;
%%

frob        special(); REJECT;
[^ \t\n]+   ++word_count;
@end example

Without the @code{REJECT}, any "frob"'s in the input would
not be counted as words, since the scanner normally
executes only one action per token.  Multiple
@code{REJECT}s are allowed, each one finding the next
best choice to the currently active rule.  For
example, when the following scanner scans the token
"abcd", it will write "abcdabcaba" to the output:

@example
%%
a        |
ab       |
abc      |
abcd     ECHO; REJECT;
.|\n     /* eat up any unmatched character */
@end example

(The first three rules share the fourth's action
since they use the special '|' action.)  @code{REJECT} is
a particularly expensive feature in terms of
scanner performance; if it is used in @emph{any} of the
scanner's actions it will slow down @emph{all} of the
scanner's matching.  Furthermore, @code{REJECT} cannot be used
with the @samp{-Cf} or @samp{-CF} options (see below).

Note also that unlike the other special actions,
@code{REJECT} is a @emph{branch}; code immediately following it
in the action will @emph{not} be executed.

@item
@samp{yymore()} tells the scanner that the next time it
matches a rule, the corresponding token should be
@emph{appended} onto the current value of @code{yytext} rather
than replacing it.
For example, given the input
"mega-kludge" the following will write
"mega-mega-kludge" to the output:

@example
%%
mega-    ECHO; yymore();
kludge   ECHO;
@end example

First "mega-" is matched and echoed to the output.
Then "kludge" is matched, but the previous "mega-"
is still hanging around at the beginning of @code{yytext}
so the @samp{ECHO} for the "kludge" rule will actually
write "mega-kludge".
@end itemize

Two notes regarding use of @samp{yymore()}.  First, @samp{yymore()}
depends on the value of @code{yyleng} correctly reflecting the
size of the current token, so you must not modify @code{yyleng}
if you are using @samp{yymore()}.  Second, the presence of
@samp{yymore()} in the scanner's action entails a minor
performance penalty in the scanner's matching speed.

@itemize -
@item
@samp{yyless(n)} returns all but the first @var{n} characters of
the current token back to the input stream, where
they will be rescanned when the scanner looks for
the next match.  @code{yytext} and @code{yyleng} are adjusted
appropriately (e.g., @code{yyleng} will now be equal to @var{n}).
For example, on the input "foobar" the
following will write out "foobarbar":

@example
%%
foobar    ECHO; yyless(3);
[a-z]+    ECHO;
@end example

An argument of 0 to @code{yyless} will cause the entire
current input string to be scanned again.  Unless
you've changed how the scanner will subsequently
process its input (using @code{BEGIN}, for example), this
will result in an endless loop.

Note that @code{yyless} is a macro and can only be used in the
flex input file, not from other source files.

@item
@samp{unput(c)} puts the character @code{c} back onto the input
stream.  It will be the next character scanned.
The following action will take the current token
and cause it to be rescanned enclosed in
parentheses.
954 955@example 956@{ 957int i; 958/* Copy yytext because unput() trashes yytext */ 959char *yycopy = strdup( yytext ); 960unput( ')' ); 961for ( i = yyleng - 1; i >= 0; --i ) 962 unput( yycopy[i] ); 963unput( '(' ); 964free( yycopy ); 965@} 966@end example 967 968Note that since each @samp{unput()} puts the given 969character back at the @emph{beginning} of the input stream, 970pushing back strings must be done back-to-front. 971An important potential problem when using @samp{unput()} is that 972if you are using @samp{%pointer} (the default), a call to @samp{unput()} 973@emph{destroys} the contents of @code{yytext}, starting with its 974rightmost character and devouring one character to the left 975with each call. If you need the value of yytext preserved 976after a call to @samp{unput()} (as in the above example), you 977must either first copy it elsewhere, or build your scanner 978using @samp{%array} instead (see How The Input Is Matched). 979 980Finally, note that you cannot put back @code{EOF} to attempt to 981mark the input stream with an end-of-file. 982 983@item 984@samp{input()} reads the next character from the input 985stream. For example, the following is one way to 986eat up C comments: 987 988@example 989%% 990"/*" @{ 991 register int c; 992 993 for ( ; ; ) 994 @{ 995 while ( (c = input()) != '*' && 996 c != EOF ) 997 ; /* eat up text of comment */ 998 999 if ( c == '*' ) 1000 @{ 1001 while ( (c = input()) == '*' ) 1002 ; 1003 if ( c == '/' ) 1004 break; /* found the end */ 1005 @} 1006 1007 if ( c == EOF ) 1008 @{ 1009 error( "EOF in comment" ); 1010 break; 1011 @} 1012 @} 1013 @} 1014@end example 1015 1016(Note that if the scanner is compiled using @samp{C++}, 1017then @samp{input()} is instead referred to as @samp{yyinput()}, 1018in order to avoid a name clash with the @samp{C++} stream 1019by the name of @code{input}.) 
1020 1021@item YY_FLUSH_BUFFER 1022flushes the scanner's internal buffer so that the next time the scanner 1023attempts to match a token, it will first refill the buffer using 1024@code{YY_INPUT} (see The Generated Scanner, below). This action is 1025a special case of the more general @samp{yy_flush_buffer()} function, 1026described below in the section Multiple Input Buffers. 1027 1028@item 1029@samp{yyterminate()} can be used in lieu of a return 1030statement in an action. It terminates the scanner 1031and returns a 0 to the scanner's caller, indicating 1032"all done". By default, @samp{yyterminate()} is also 1033called when an end-of-file is encountered. It is a 1034macro and may be redefined. 1035@end itemize 1036 1037@node Generated scanner, Start conditions, Actions, Top 1038@section The generated scanner 1039 1040The output of @code{flex} is the file @file{lex.yy.c}, which contains 1041the scanning routine @samp{yylex()}, a number of tables used by 1042it for matching tokens, and a number of auxiliary routines 1043and macros. By default, @samp{yylex()} is declared as follows: 1044 1045@example 1046int yylex() 1047 @{ 1048 @dots{} various definitions and the actions in here @dots{} 1049 @} 1050@end example 1051 1052(If your environment supports function prototypes, then it 1053will be "int yylex( void )".) This definition may be 1054changed by defining the "YY_DECL" macro. For example, you 1055could use: 1056 1057@example 1058#define YY_DECL float lexscan( a, b ) float a, b; 1059@end example 1060 1061to give the scanning routine the name @code{lexscan}, returning a 1062float, and taking two floats as arguments. Note that if 1063you give arguments to the scanning routine using a 1064K&R-style/non-prototyped function declaration, you must 1065terminate the definition with a semi-colon (@samp{;}). 1066 1067Whenever @samp{yylex()} is called, it scans tokens from the 1068global input file @code{yyin} (which defaults to stdin). 
It 1069continues until it either reaches an end-of-file (at which 1070point it returns the value 0) or one of its actions 1071executes a @code{return} statement. 1072 1073If the scanner reaches an end-of-file, subsequent calls are undefined 1074unless either @code{yyin} is pointed at a new input file (in which case 1075scanning continues from that file), or @samp{yyrestart()} is called. 1076@samp{yyrestart()} takes one argument, a @samp{FILE *} pointer (which 1077can be nil, if you've set up @code{YY_INPUT} to scan from a source 1078other than @code{yyin}), and initializes @code{yyin} for scanning from 1079that file. Essentially there is no difference between just assigning 1080@code{yyin} to a new input file or using @samp{yyrestart()} to do so; 1081the latter is available for compatibility with previous versions of 1082@code{flex}, and because it can be used to switch input files in the 1083middle of scanning. It can also be used to throw away the current 1084input buffer, by calling it with an argument of @code{yyin}; but 1085better is to use @code{YY_FLUSH_BUFFER} (see above). Note that 1086@samp{yyrestart()} does @emph{not} reset the start condition to 1087@code{INITIAL} (see Start Conditions, below). 1088 1089 1090If @samp{yylex()} stops scanning due to executing a @code{return} 1091statement in one of the actions, the scanner may then be called 1092again and it will resume scanning where it left off. 1093 1094By default (and for purposes of efficiency), the scanner 1095uses block-reads rather than simple @samp{getc()} calls to read 1096characters from @code{yyin}. The nature of how it gets its input 1097can be controlled by defining the @code{YY_INPUT} macro. 1098YY_INPUT's calling sequence is 1099"YY_INPUT(buf,result,max_size)". 
Its action is to place 1100up to @var{max_size} characters in the character array @var{buf} and 1101return in the integer variable @var{result} either the number of 1102characters read or the constant YY_NULL (0 on Unix 1103systems) to indicate EOF. The default YY_INPUT reads from 1104the global file-pointer "yyin". 1105 1106A sample definition of YY_INPUT (in the definitions 1107section of the input file): 1108 1109@example 1110%@{ 1111#define YY_INPUT(buf,result,max_size) \ 1112 @{ \ 1113 int c = getchar(); \ 1114 result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \ 1115 @} 1116%@} 1117@end example 1118 1119This definition will change the input processing to occur 1120one character at a time. 1121 1122When the scanner receives an end-of-file indication from 1123YY_INPUT, it then checks the @samp{yywrap()} function. If 1124@samp{yywrap()} returns false (zero), then it is assumed that the 1125function has gone ahead and set up @code{yyin} to point to 1126another input file, and scanning continues. If it returns 1127true (non-zero), then the scanner terminates, returning 0 1128to its caller. Note that in either case, the start 1129condition remains unchanged; it does @emph{not} revert to @code{INITIAL}. 1130 1131If you do not supply your own version of @samp{yywrap()}, then you 1132must either use @samp{%option noyywrap} (in which case the scanner 1133behaves as though @samp{yywrap()} returned 1), or you must link with 1134@samp{-lfl} to obtain the default version of the routine, which always 1135returns 1. 1136 1137Three routines are available for scanning from in-memory 1138buffers rather than files: @samp{yy_scan_string()}, 1139@samp{yy_scan_bytes()}, and @samp{yy_scan_buffer()}. See the discussion 1140of them below in the section Multiple Input Buffers. 1141 1142The scanner writes its @samp{ECHO} output to the @code{yyout} global 1143(default, stdout), which may be redefined by the user 1144simply by assigning it to some other @code{FILE} pointer. 
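
As a rough sketch of how these pieces fit together, the following
complete input file defines two token types and drives @samp{yylex()}
from a small @code{main()}.  The token codes @code{TOK_NUMBER} and
@code{TOK_WORD} are names made up for this illustration; they are not
provided by @code{flex}.  The sketch assumes @samp{%option noyywrap},
so no @samp{yywrap()} needs to be supplied or linked in.

@example
%option noyywrap
%@{
#include <stdio.h>

/* hypothetical token codes, chosen only for this sketch */
#define TOK_NUMBER 1
#define TOK_WORD   2
%@}

%%
[0-9]+      return TOK_NUMBER;
[a-zA-Z]+   return TOK_WORD;
.|\n        /* silently skip anything else */
%%

int main( void )
    @{
    int tok;

    /* each call resumes where the previous one left off;
     * yylex() returns 0 once end-of-file is reached
     */
    while ( (tok = yylex()) != 0 )
        printf( "token %d: %s\n", tok, yytext );

    return 0;
    @}
@end example

Because every action that recognizes a token executes a @code{return},
the caller sees one token per call, which is the same calling pattern a
parser would use.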
1145 1146@node Start conditions, Multiple buffers, Generated scanner, Top 1147@section Start conditions 1148 1149@code{flex} provides a mechanism for conditionally activating 1150rules. Any rule whose pattern is prefixed with "<sc>" 1151will only be active when the scanner is in the start 1152condition named "sc". For example, 1153 1154@example 1155<STRING>[^"]* @{ /* eat up the string body ... */ 1156 @dots{} 1157 @} 1158@end example 1159 1160@noindent 1161will be active only when the scanner is in the "STRING" 1162start condition, and 1163 1164@example 1165<INITIAL,STRING,QUOTE>\. @{ /* handle an escape ... */ 1166 @dots{} 1167 @} 1168@end example 1169 1170@noindent 1171will be active only when the current start condition is 1172either "INITIAL", "STRING", or "QUOTE". 1173 1174Start conditions are declared in the definitions (first) 1175section of the input using unindented lines beginning with 1176either @samp{%s} or @samp{%x} followed by a list of names. The former 1177declares @emph{inclusive} start conditions, the latter @emph{exclusive} 1178start conditions. A start condition is activated using 1179the @code{BEGIN} action. Until the next @code{BEGIN} action is 1180executed, rules with the given start condition will be active 1181and rules with other start conditions will be inactive. 1182If the start condition is @emph{inclusive}, then rules with no 1183start conditions at all will also be active. If it is 1184@emph{exclusive}, then @emph{only} rules qualified with the start 1185condition will be active. A set of rules contingent on the 1186same exclusive start condition describe a scanner which is 1187independent of any of the other rules in the @code{flex} input. 1188Because of this, exclusive start conditions make it easy 1189to specify "mini-scanners" which scan portions of the 1190input that are syntactically different from the rest 1191(e.g., comments). 
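
A minimal sketch of such a mini-scanner, assuming an input language
in which @samp{#} starts a comment that runs to the end of the line
(the start condition name @code{linecomment} is made up for this
illustration):

@example
%x linecomment
%%
"#"                  BEGIN(linecomment);

<linecomment>[^\n]+  /* discard the comment text */
<linecomment>\n      BEGIN(INITIAL);
@end example

Since @code{linecomment} is exclusive, none of the scanner's other
rules can match while a comment is being skipped, other than the
default rule.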

If the distinction between inclusive and exclusive start
conditions is still a little vague, here's a simple
example illustrating the connection between the two.  The set
of rules:

@example
%s example
%%

<example>foo   do_something();

bar            something_else();
@end example

@noindent
is equivalent to

@example
%x example
%%

<example>foo   do_something();

<INITIAL,example>bar    something_else();
@end example

Without the @samp{<INITIAL,example>} qualifier, the @samp{bar} pattern
in the second example wouldn't be active (i.e., couldn't match) when
in start condition @samp{example}.  If we just used @samp{<example>}
to qualify @samp{bar}, though, then it would only be active in
@samp{example} and not in @code{INITIAL}, while in the first example
it's active in both, because in the first example the @samp{example}
start condition is an @emph{inclusive} (@samp{%s}) start condition.

Also note that the special start-condition specifier @samp{<*>}
matches every start condition.  Thus, the above example
could also have been written:

@example
%x example
%%

<example>foo   do_something();

<*>bar    something_else();
@end example

The default rule (to @samp{ECHO} any unmatched character) remains
active in start conditions.  It is equivalent to:

@example
<*>.|\n     ECHO;
@end example

@samp{BEGIN(0)} returns to the original state where only the
rules with no start conditions are active.  This state can
also be referred to as the start-condition "INITIAL", so
@samp{BEGIN(INITIAL)} is equivalent to @samp{BEGIN(0)}.  (The
parentheses around the start condition name are not required but
are considered good style.)

@code{BEGIN} actions can also be given as indented code at the
beginning of the rules section.
For example, the
following will cause the scanner to enter the "SPECIAL" start
condition whenever @samp{yylex()} is called and the global
variable @code{enter_special} is true:

@example
        int enter_special;

%x SPECIAL
%%
        if ( enter_special )
            BEGIN(SPECIAL);

<SPECIAL>blahblahblah
@dots{}more rules follow@dots{}
@end example

To illustrate the uses of start conditions, here is a
scanner which provides two different interpretations of a
string like "123.456".  By default it will treat it as
three tokens, the integer "123", a dot ('.'), and the
integer "456".  But if the string is preceded earlier in
the line by the string "expect-floats" it will treat it as
a single token, the floating-point number 123.456:

@example
%@{
#include <stdlib.h>
%@}
%s expect

%%
expect-floats        BEGIN(expect);

<expect>[0-9]+"."[0-9]+      @{
            printf( "found a float, = %f\n",
                    atof( yytext ) );
            @}
<expect>\n           @{
            /* that's the end of the line, so
             * we need another "expect-floats"
             * before we'll recognize any more
             * numbers
             */
            BEGIN(INITIAL);
            @}

[0-9]+      @{
            printf( "found an integer, = %d\n",
                    atoi( yytext ) );
            @}

"."         printf( "found a dot\n" );
@end example

Here is a scanner which recognizes (and discards) C
comments while maintaining a count of the current input line.

@example
%x comment
%%
        int line_num = 1;

"/*"         BEGIN(comment);

<comment>[^*\n]*        /* eat anything that's not a '*' */
<comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
<comment>\n             ++line_num;
<comment>"*"+"/"        BEGIN(INITIAL);
@end example

This scanner goes to a bit of trouble to match as much
text as possible with each rule.
In general, when
attempting to write a high-speed scanner, try to match as
much as possible in each rule, as it's a big win.

Note that start condition names are really integer values
and can be stored as such.  Thus, the above could be
extended in the following fashion:

@example
%x comment foo
%%
        int line_num = 1;
        int comment_caller;

"/*"         @{
             comment_caller = INITIAL;
             BEGIN(comment);
             @}

@dots{}

<foo>"/*"    @{
             comment_caller = foo;
             BEGIN(comment);
             @}

<comment>[^*\n]*        /* eat anything that's not a '*' */
<comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
<comment>\n             ++line_num;
<comment>"*"+"/"        BEGIN(comment_caller);
@end example

Furthermore, you can access the current start condition
using the integer-valued @code{YY_START} macro.  For example, the
above assignments to @code{comment_caller} could instead be
written

@example
comment_caller = YY_START;
@end example

Flex provides @code{YYSTATE} as an alias for @code{YY_START} (since that
is what's used by AT&T @code{lex}).

Note that start conditions do not have their own
name-space; %s's and %x's declare names in the same fashion as
#define's.
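
One practical consequence, sketched below, is that a start condition
name can collide with an ordinary C identifier of the same spelling.
The @code{SC_} prefix used here is only a naming convention assumed
for this illustration, not something @code{flex} requires; the sketch
also assumes @samp{%option noyywrap} so that it is self-contained.

@example
%option noyywrap
%x SC_COMMENT
%@{
#include <stdio.h>

/* An ordinary C variable.  Had the start condition also been
 * named "comment", the two names would have clashed.
 */
int comment = 0;
%@}

%%
"/*"                BEGIN(SC_COMMENT);

<SC_COMMENT>"*/"    ++comment; BEGIN(INITIAL);
<SC_COMMENT>.|\n    /* discard the comment text */
%%

int main( void )
    @{
    yylex();
    printf( "%d comments\n", comment );
    return 0;
    @}
@end example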

Finally, here's an example of how to match C-style quoted
strings using exclusive start conditions, including
expanded escape sequences (but not including checking for
a string that's too long):

@example
%x str

%%
        char string_buf[MAX_STR_CONST];
        char *string_buf_ptr;

\"      string_buf_ptr = string_buf; BEGIN(str);

<str>\"        @{ /* saw closing quote - all done */
        BEGIN(INITIAL);
        *string_buf_ptr = '\0';
        /* return string constant token type and
         * value to parser
         */
        @}

<str>\n        @{
        /* error - unterminated string constant */
        /* generate error message */
        @}

<str>\\[0-7]@{1,3@} @{
        /* octal escape sequence */
        unsigned int result;

        (void) sscanf( yytext + 1, "%o", &result );

        if ( result > 0xff )
                @{
                /* error, constant is out-of-bounds */
                @}
        else
                *string_buf_ptr++ = result;
        @}

<str>\\[0-9]+ @{
        /* generate error - bad escape sequence; something
         * like '\48' or '\0777777'
         */
        @}

<str>\\n  *string_buf_ptr++ = '\n';
<str>\\t  *string_buf_ptr++ = '\t';
<str>\\r  *string_buf_ptr++ = '\r';
<str>\\b  *string_buf_ptr++ = '\b';
<str>\\f  *string_buf_ptr++ = '\f';

<str>\\(.|\n)  *string_buf_ptr++ = yytext[1];

<str>[^\\\n\"]+        @{
        char *yptr = yytext;

        while ( *yptr )
                *string_buf_ptr++ = *yptr++;
        @}
@end example

Often, such as in some of the examples above, you wind up
writing a whole bunch of rules all preceded by the same
start condition(s).  Flex makes this a little easier and
cleaner by introducing a notion of start condition @dfn{scope}.
A start condition scope is begun with:

@example
<SCs>@{
@end example

@noindent
where SCs is a list of one or more start conditions.
1451Inside the start condition scope, every rule automatically 1452has the prefix @samp{<SCs>} applied to it, until a @samp{@}} which 1453matches the initial @samp{@{}. So, for example, 1454 1455@example 1456<ESC>@{ 1457 "\\n" return '\n'; 1458 "\\r" return '\r'; 1459 "\\f" return '\f'; 1460 "\\0" return '\0'; 1461@} 1462@end example 1463 1464@noindent 1465is equivalent to: 1466 1467@example 1468<ESC>"\\n" return '\n'; 1469<ESC>"\\r" return '\r'; 1470<ESC>"\\f" return '\f'; 1471<ESC>"\\0" return '\0'; 1472@end example 1473 1474Start condition scopes may be nested. 1475 1476Three routines are available for manipulating stacks of 1477start conditions: 1478 1479@table @samp 1480@item void yy_push_state(int new_state) 1481pushes the current start condition onto the top of 1482the start condition stack and switches to @var{new_state} 1483as though you had used @samp{BEGIN new_state} (recall that 1484start condition names are also integers). 1485 1486@item void yy_pop_state() 1487pops the top of the stack and switches to it via 1488@code{BEGIN}. 1489 1490@item int yy_top_state() 1491returns the top of the stack without altering the 1492stack's contents. 1493@end table 1494 1495The start condition stack grows dynamically and so has no 1496built-in size limitation. If memory is exhausted, program 1497execution aborts. 1498 1499To use start condition stacks, your scanner must include a 1500@samp{%option stack} directive (see Options below). 1501 1502@node Multiple buffers, End-of-file rules, Start conditions, Top 1503@section Multiple input buffers 1504 1505Some scanners (such as those which support "include" 1506files) require reading from several input streams. As 1507@code{flex} scanners do a large amount of buffering, one cannot 1508control where the next input will be read from by simply 1509writing a @code{YY_INPUT} which is sensitive to the scanning 1510context. 
@code{YY_INPUT} is only called when the scanner reaches 1511the end of its buffer, which may be a long time after 1512scanning a statement such as an "include" which requires 1513switching the input source. 1514 1515To negotiate these sorts of problems, @code{flex} provides a 1516mechanism for creating and switching between multiple 1517input buffers. An input buffer is created by using: 1518 1519@example 1520YY_BUFFER_STATE yy_create_buffer( FILE *file, int size ) 1521@end example 1522 1523@noindent 1524which takes a @code{FILE} pointer and a size and creates a buffer 1525associated with the given file and large enough to hold 1526@var{size} characters (when in doubt, use @code{YY_BUF_SIZE} for the 1527size). It returns a @code{YY_BUFFER_STATE} handle, which may 1528then be passed to other routines (see below). The 1529@code{YY_BUFFER_STATE} type is a pointer to an opaque @code{struct} 1530@code{yy_buffer_state} structure, so you may safely initialize 1531YY_BUFFER_STATE variables to @samp{((YY_BUFFER_STATE) 0)} if you 1532wish, and also refer to the opaque structure in order to 1533correctly declare input buffers in source files other than 1534that of your scanner. Note that the @code{FILE} pointer in the 1535call to @code{yy_create_buffer} is only used as the value of @code{yyin} 1536seen by @code{YY_INPUT}; if you redefine @code{YY_INPUT} so it no longer 1537uses @code{yyin}, then you can safely pass a nil @code{FILE} pointer to 1538@code{yy_create_buffer}. You select a particular buffer to scan 1539from using: 1540 1541@example 1542void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer ) 1543@end example 1544 1545switches the scanner's input buffer so subsequent tokens 1546will come from @var{new_buffer}. Note that 1547@samp{yy_switch_to_buffer()} may be used by @samp{yywrap()} to set 1548things up for continued scanning, instead of opening a new 1549file and pointing @code{yyin} at it. 
Note also that switching 1550input sources via either @samp{yy_switch_to_buffer()} or @samp{yywrap()} 1551does @emph{not} change the start condition. 1552 1553@example 1554void yy_delete_buffer( YY_BUFFER_STATE buffer ) 1555@end example 1556 1557@noindent 1558is used to reclaim the storage associated with a buffer. 1559You can also clear the current contents of a buffer using: 1560 1561@example 1562void yy_flush_buffer( YY_BUFFER_STATE buffer ) 1563@end example 1564 1565This function discards the buffer's contents, so the next time the 1566scanner attempts to match a token from the buffer, it will first fill 1567the buffer anew using @code{YY_INPUT}. 1568 1569@samp{yy_new_buffer()} is an alias for @samp{yy_create_buffer()}, 1570provided for compatibility with the C++ use of @code{new} and @code{delete} 1571for creating and destroying dynamic objects. 1572 1573Finally, the @code{YY_CURRENT_BUFFER} macro returns a 1574@code{YY_BUFFER_STATE} handle to the current buffer. 1575 1576Here is an example of using these features for writing a 1577scanner which expands include files (the @samp{<<EOF>>} feature 1578is discussed below): 1579 1580@example 1581/* the "incl" state is used for picking up the name 1582 * of an include file 1583 */ 1584%x incl 1585 1586%@{ 1587#define MAX_INCLUDE_DEPTH 10 1588YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH]; 1589int include_stack_ptr = 0; 1590%@} 1591 1592%% 1593include BEGIN(incl); 1594 1595[a-z]+ ECHO; 1596[^a-z\n]*\n? ECHO; 1597 1598<incl>[ \t]* /* eat the whitespace */ 1599<incl>[^ \t\n]+ @{ /* got the include file name */ 1600 if ( include_stack_ptr >= MAX_INCLUDE_DEPTH ) 1601 @{ 1602 fprintf( stderr, "Includes nested too deeply" ); 1603 exit( 1 ); 1604 @} 1605 1606 include_stack[include_stack_ptr++] = 1607 YY_CURRENT_BUFFER; 1608 1609 yyin = fopen( yytext, "r" ); 1610 1611 if ( ! 
yyin ) 1612 error( @dots{} ); 1613 1614 yy_switch_to_buffer( 1615 yy_create_buffer( yyin, YY_BUF_SIZE ) ); 1616 1617 BEGIN(INITIAL); 1618 @} 1619 1620<<EOF>> @{ 1621 if ( --include_stack_ptr < 0 ) 1622 @{ 1623 yyterminate(); 1624 @} 1625 1626 else 1627 @{ 1628 yy_delete_buffer( YY_CURRENT_BUFFER ); 1629 yy_switch_to_buffer( 1630 include_stack[include_stack_ptr] ); 1631 @} 1632 @} 1633@end example 1634 1635Three routines are available for setting up input buffers 1636for scanning in-memory strings instead of files. All of 1637them create a new input buffer for scanning the string, 1638and return a corresponding @code{YY_BUFFER_STATE} handle (which 1639you should delete with @samp{yy_delete_buffer()} when done with 1640it). They also switch to the new buffer using 1641@samp{yy_switch_to_buffer()}, so the next call to @samp{yylex()} will 1642start scanning the string. 1643 1644@table @samp 1645@item yy_scan_string(const char *str) 1646scans a NUL-terminated string. 1647 1648@item yy_scan_bytes(const char *bytes, int len) 1649scans @code{len} bytes (including possibly NUL's) starting 1650at location @var{bytes}. 1651@end table 1652 1653Note that both of these functions create and scan a @emph{copy} 1654of the string or bytes. (This may be desirable, since 1655@samp{yylex()} modifies the contents of the buffer it is 1656scanning.) You can avoid the copy by using: 1657 1658@table @samp 1659@item yy_scan_buffer(char *base, yy_size_t size) 1660which scans in place the buffer starting at @var{base}, 1661consisting of @var{size} bytes, the last two bytes of 1662which @emph{must} be @code{YY_END_OF_BUFFER_CHAR} (ASCII NUL). 1663These last two bytes are not scanned; thus, 1664scanning consists of @samp{base[0]} through @samp{base[size-2]}, 1665inclusive. 1666 1667If you fail to set up @var{base} in this manner (i.e., 1668forget the final two @code{YY_END_OF_BUFFER_CHAR} bytes), 1669then @samp{yy_scan_buffer()} returns a nil pointer instead 1670of creating a new input buffer. 
1671 1672The type @code{yy_size_t} is an integral type to which you 1673can cast an integer expression reflecting the size 1674of the buffer. 1675@end table 1676 1677@node End-of-file rules, Miscellaneous, Multiple buffers, Top 1678@section End-of-file rules 1679 1680The special rule "<<EOF>>" indicates actions which are to 1681be taken when an end-of-file is encountered and yywrap() 1682returns non-zero (i.e., indicates no further files to 1683process). The action must finish by doing one of four 1684things: 1685 1686@itemize - 1687@item 1688assigning @code{yyin} to a new input file (in previous 1689versions of flex, after doing the assignment you 1690had to call the special action @code{YY_NEW_FILE}; this is 1691no longer necessary); 1692 1693@item 1694executing a @code{return} statement; 1695 1696@item 1697executing the special @samp{yyterminate()} action; 1698 1699@item 1700or, switching to a new buffer using 1701@samp{yy_switch_to_buffer()} as shown in the example 1702above. 1703@end itemize 1704 1705<<EOF>> rules may not be used with other patterns; they 1706may only be qualified with a list of start conditions. If 1707an unqualified <<EOF>> rule is given, it applies to @emph{all} 1708start conditions which do not already have <<EOF>> 1709actions. To specify an <<EOF>> rule for only the initial 1710start condition, use 1711 1712@example 1713<INITIAL><<EOF>> 1714@end example 1715 1716These rules are useful for catching things like unclosed 1717comments. 
An example:

@example
%x quote
%%

@dots{}other rules for dealing with quotes@dots{}

<quote><<EOF>>   @{
         error( "unterminated quote" );
         yyterminate();
         @}
<<EOF>>  @{
         if ( *++filelist )
             yyin = fopen( *filelist, "r" );
         else
            yyterminate();
         @}
@end example

@node Miscellaneous, User variables, End-of-file rules, Top
@section Miscellaneous macros

The macro @code{YY_USER_ACTION} can be defined to provide an
action which is always executed prior to the matched
rule's action.  For example, it could be #define'd to call
a routine to convert yytext to lower-case.  When
@code{YY_USER_ACTION} is invoked, the variable @code{yy_act} gives the
number of the matched rule (rules are numbered starting
with 1).  Suppose you want to profile how often each of
your rules is matched.  The following would do the trick:

@example
#define YY_USER_ACTION ++ctr[yy_act]
@end example

where @code{ctr} is an array to hold the counts for the different
rules.  Note that the macro @code{YY_NUM_RULES} gives the total number
of rules (including the default rule, even if you use @samp{-s}), so
a correct declaration for @code{ctr} is:

@example
int ctr[YY_NUM_RULES];
@end example

The macro @code{YY_USER_INIT} may be defined to provide an action
which is always executed before the first scan (and before
the scanner's internal initializations are done).  For
example, it could be used to call a routine to read in a
data table or open a logging file.

The macro @samp{yy_set_interactive(is_interactive)} can be used
to control whether the current buffer is considered
@emph{interactive}.
An interactive buffer is processed more slowly,
but must be used when the scanner's input source is indeed
interactive to avoid problems due to waiting to fill
buffers (see the discussion of the @samp{-I} flag below).  A
non-zero value in the macro invocation marks the buffer as
interactive, a zero value as non-interactive.  Note that
use of this macro overrides @samp{%option always-interactive} or
@samp{%option never-interactive} (see Options below).
@samp{yy_set_interactive()} must be invoked prior to beginning to
scan the buffer that is (or is not) to be considered
interactive.

The macro @samp{yy_set_bol(at_bol)} can be used to control
whether the current buffer's scanning context for the next
token match is done as though at the beginning of a line.
A non-zero macro argument makes rules anchored with '^'
active, while a zero argument makes '^' rules inactive.

The macro @samp{YY_AT_BOL()} returns true if the next token
scanned from the current buffer will have '^' rules
active, false otherwise.

In the generated scanner, the actions are all gathered in
one large switch statement and separated using @code{YY_BREAK},
which may be redefined.  By default, it is simply a
"break", to separate each rule's action from the following
rule's.  Redefining @code{YY_BREAK} allows, for example, C++
users to #define YY_BREAK to do nothing (while being very
careful that every rule ends with a "break" or a
"return"!) to avoid unreachable-statement warnings in cases
where, because a rule's action ends with "return",
the @code{YY_BREAK} is inaccessible.

@node User variables, YACC interface, Miscellaneous, Top
@section Values available to the user

This section summarizes the various values available to
the user in the rule actions.

@itemize -
@item
@samp{char *yytext} holds the text of the current token.
1811It may be modified but not lengthened (you cannot 1812append characters to the end). 1813 1814If the special directive @samp{%array} appears in the 1815first section of the scanner description, then 1816@code{yytext} is instead declared @samp{char yytext[YYLMAX]}, 1817where @code{YYLMAX} is a macro definition that you can 1818redefine in the first section if you don't like the 1819default value (generally 8KB). Using @samp{%array} 1820results in somewhat slower scanners, but the value 1821of @code{yytext} becomes immune to calls to @samp{input()} and 1822@samp{unput()}, which potentially destroy its value when 1823@code{yytext} is a character pointer. The opposite of 1824@samp{%array} is @samp{%pointer}, which is the default. 1825 1826You cannot use @samp{%array} when generating C++ scanner 1827classes (the @samp{-+} flag). 1828 1829@item 1830@samp{int yyleng} holds the length of the current token. 1831 1832@item 1833@samp{FILE *yyin} is the file which by default @code{flex} reads 1834from. It may be redefined but doing so only makes 1835sense before scanning begins or after an EOF has 1836been encountered. Changing it in the midst of 1837scanning will have unexpected results since @code{flex} 1838buffers its input; use @samp{yyrestart()} instead. Once 1839scanning terminates because an end-of-file has been 1840seen, you can assign @code{yyin} at the new input file and 1841then call the scanner again to continue scanning. 1842 1843@item 1844@samp{void yyrestart( FILE *new_file )} may be called to 1845point @code{yyin} at the new input file. The switch-over 1846to the new file is immediate (any previously 1847buffered-up input is lost). Note that calling 1848@samp{yyrestart()} with @code{yyin} as an argument thus throws 1849away the current input buffer and continues 1850scanning the same input file. 1851 1852@item 1853@samp{FILE *yyout} is the file to which @samp{ECHO} actions are 1854done. It can be reassigned by the user. 
1855 1856@item 1857@code{YY_CURRENT_BUFFER} returns a @code{YY_BUFFER_STATE} handle 1858to the current buffer. 1859 1860@item 1861@code{YY_START} returns an integer value corresponding to 1862the current start condition. You can subsequently 1863use this value with @code{BEGIN} to return to that start 1864condition. 1865@end itemize 1866 1867@node YACC interface, Options, User variables, Top 1868@section Interfacing with @code{yacc} 1869 1870One of the main uses of @code{flex} is as a companion to the @code{yacc} 1871parser-generator. @code{yacc} parsers expect to call a routine 1872named @samp{yylex()} to find the next input token. The routine 1873is supposed to return the type of the next token as well 1874as putting any associated value in the global @code{yylval}. To 1875use @code{flex} with @code{yacc}, one specifies the @samp{-d} option to @code{yacc} to 1876instruct it to generate the file @file{y.tab.h} containing 1877definitions of all the @samp{%tokens} appearing in the @code{yacc} input. 1878This file is then included in the @code{flex} scanner. For 1879example, if one of the tokens is "TOK_NUMBER", part of the 1880scanner might look like: 1881 1882@example 1883%@{ 1884#include "y.tab.h" 1885%@} 1886 1887%% 1888 1889[0-9]+ yylval = atoi( yytext ); return TOK_NUMBER; 1890@end example 1891 1892@node Options, Performance, YACC interface, Top 1893@section Options 1894@code{flex} has the following options: 1895 1896@table @samp 1897@item -b 1898Generate backing-up information to @file{lex.backup}. 1899This is a list of scanner states which require 1900backing up and the input characters on which they 1901do so. By adding rules one can remove backing-up 1902states. If @emph{all} backing-up states are eliminated 1903and @samp{-Cf} or @samp{-CF} is used, the generated scanner will 1904run faster (see the @samp{-p} flag). Only users who wish 1905to squeeze every last cycle out of their scanners 1906need worry about this option. 
(See the section on 1907Performance Considerations below.) 1908 1909@item -c 1910is a do-nothing, deprecated option included for 1911POSIX compliance. 1912 1913@item -d 1914makes the generated scanner run in @dfn{debug} mode. 1915Whenever a pattern is recognized and the global 1916@code{yy_flex_debug} is non-zero (which is the default), 1917the scanner will write to @code{stderr} a line of the 1918form: 1919 1920@example 1921--accepting rule at line 53 ("the matched text") 1922@end example 1923 1924The line number refers to the location of the rule 1925in the file defining the scanner (i.e., the file 1926that was fed to flex). Messages are also generated 1927when the scanner backs up, accepts the default 1928rule, reaches the end of its input buffer (or 1929encounters a NUL; at this point, the two look the 1930same as far as the scanner's concerned), or reaches 1931an end-of-file. 1932 1933@item -f 1934specifies @dfn{fast scanner}. No table compression is 1935done and stdio is bypassed. The result is large 1936but fast. This option is equivalent to @samp{-Cfr} (see 1937below). 1938 1939@item -h 1940generates a "help" summary of @code{flex's} options to 1941@code{stdout} and then exits. @samp{-?} and @samp{--help} are synonyms 1942for @samp{-h}. 1943 1944@item -i 1945instructs @code{flex} to generate a @emph{case-insensitive} 1946scanner. The case of letters given in the @code{flex} input 1947patterns will be ignored, and tokens in the input 1948will be matched regardless of case. The matched 1949text given in @code{yytext} will have the preserved case 1950(i.e., it will not be folded). 1951 1952@item -l 1953turns on maximum compatibility with the original 1954AT&T @code{lex} implementation. Note that this does not 1955mean @emph{full} compatibility. Use of this option costs 1956a considerable amount of performance, and it cannot 1957be used with the @samp{-+, -f, -F, -Cf}, or @samp{-CF} options. 
1958For details on the compatibilities it provides, see 1959the section "Incompatibilities With Lex And POSIX" 1960below. This option also results in the name 1961@code{YY_FLEX_LEX_COMPAT} being #define'd in the generated 1962scanner. 1963 1964@item -n 1965is another do-nothing, deprecated option included 1966only for POSIX compliance. 1967 1968@item -p 1969generates a performance report to stderr. The 1970report consists of comments regarding features of 1971the @code{flex} input file which will cause a serious loss 1972of performance in the resulting scanner. If you 1973give the flag twice, you will also get comments 1974regarding features that lead to minor performance 1975losses. 1976 1977Note that the use of @code{REJECT}, @samp{%option yylineno} and 1978variable trailing context (see the Deficiencies / Bugs section below) 1979entails a substantial performance penalty; use of @samp{yymore()}, 1980the @samp{^} operator, and the @samp{-I} flag entail minor performance 1981penalties. 1982 1983@item -s 1984causes the @dfn{default rule} (that unmatched scanner 1985input is echoed to @code{stdout}) to be suppressed. If 1986the scanner encounters input that does not match 1987any of its rules, it aborts with an error. This 1988option is useful for finding holes in a scanner's 1989rule set. 1990 1991@item -t 1992instructs @code{flex} to write the scanner it generates to 1993standard output instead of @file{lex.yy.c}. 1994 1995@item -v 1996specifies that @code{flex} should write to @code{stderr} a 1997summary of statistics regarding the scanner it 1998generates. Most of the statistics are meaningless to 1999the casual @code{flex} user, but the first line identifies 2000the version of @code{flex} (same as reported by @samp{-V}), and 2001the next line the flags used when generating the 2002scanner, including those that are on by default. 2003 2004@item -w 2005suppresses warning messages. 
2006 2007@item -B 2008instructs @code{flex} to generate a @emph{batch} scanner, the 2009opposite of @emph{interactive} scanners generated by @samp{-I} 2010(see below). In general, you use @samp{-B} when you are 2011@emph{certain} that your scanner will never be used 2012interactively, and you want to squeeze a @emph{little} more 2013performance out of it. If your goal is instead to 2014squeeze out a @emph{lot} more performance, you should be 2015using the @samp{-Cf} or @samp{-CF} options (discussed below), 2016which turn on @samp{-B} automatically anyway. 2017 2018@item -F 2019specifies that the @dfn{fast} scanner table 2020representation should be used (and stdio bypassed). This 2021representation is about as fast as the full table 2022representation @samp{(-f)}, and for some sets of patterns 2023will be considerably smaller (and for others, 2024larger). In general, if the pattern set contains 2025both "keywords" and a catch-all, "identifier" rule, 2026such as in the set: 2027 2028@example 2029"case" return TOK_CASE; 2030"switch" return TOK_SWITCH; 2031... 2032"default" return TOK_DEFAULT; 2033[a-z]+ return TOK_ID; 2034@end example 2035 2036@noindent 2037then you're better off using the full table 2038representation. If only the "identifier" rule is 2039present and you then use a hash table or some such to 2040detect the keywords, you're better off using @samp{-F}. 2041 2042This option is equivalent to @samp{-CFr} (see below). It 2043cannot be used with @samp{-+}. 2044 2045@item -I 2046instructs @code{flex} to generate an @emph{interactive} scanner. 2047An interactive scanner is one that only looks ahead 2048to decide what token has been matched if it 2049absolutely must. It turns out that always looking one 2050extra character ahead, even if the scanner has 2051already seen enough text to disambiguate the 2052current token, is a bit faster than only looking ahead 2053when necessary. 
But scanners that always look 2054ahead give dreadful interactive performance; for 2055example, when a user types a newline, it is not 2056recognized as a newline token until they enter 2057@emph{another} token, which often means typing in another 2058whole line. 2059 2060@code{Flex} scanners default to @emph{interactive} unless you use 2061the @samp{-Cf} or @samp{-CF} table-compression options (see 2062below). That's because if you're looking for 2063high-performance you should be using one of these 2064options, so if you didn't, @code{flex} assumes you'd 2065rather trade off a bit of run-time performance for 2066intuitive interactive behavior. Note also that you 2067@emph{cannot} use @samp{-I} in conjunction with @samp{-Cf} or @samp{-CF}. 2068Thus, this option is not really needed; it is on by 2069default for all those cases in which it is allowed. 2070 2071You can force a scanner to @emph{not} be interactive by 2072using @samp{-B} (see above). 2073 2074@item -L 2075instructs @code{flex} not to generate @samp{#line} directives. 2076Without this option, @code{flex} peppers the generated 2077scanner with #line directives so error messages in 2078the actions will be correctly located with respect 2079to either the original @code{flex} input file (if the 2080errors are due to code in the input file), or 2081@file{lex.yy.c} (if the errors are @code{flex's} fault -- you 2082should report these sorts of errors to the email 2083address given below). 2084 2085@item -T 2086makes @code{flex} run in @code{trace} mode. It will generate a 2087lot of messages to @code{stderr} concerning the form of 2088the input and the resultant non-deterministic and 2089deterministic finite automata. This option is 2090mostly for use in maintaining @code{flex}. 2091 2092@item -V 2093prints the version number to @code{stdout} and exits. 2094@samp{--version} is a synonym for @samp{-V}. 

@item -7
instructs @code{flex} to generate a 7-bit scanner, i.e.,
one which can only recognize 7-bit characters in
its input.  The advantage of using @samp{-7} is that the
scanner's tables can be up to half the size of
those generated using the @samp{-8} option (see below).
The disadvantage is that such scanners often hang
or crash if their input contains an 8-bit
character.

Note, however, that unless you generate your
scanner using the @samp{-Cf} or @samp{-CF} table compression options,
use of @samp{-7} will save only a small amount of table
space, and make your scanner considerably less
portable.  @code{Flex's} default behavior is to generate
an 8-bit scanner unless you use @samp{-Cf} or @samp{-CF}, in
which case @code{flex} defaults to generating 7-bit
scanners unless your site was always configured to
generate 8-bit scanners (as will often be the case
with non-USA sites).  You can tell whether flex
generated a 7-bit or an 8-bit scanner by inspecting
the flag summary in the @samp{-v} output as described
above.

Note that if you use @samp{-Cfe} or @samp{-CFe} (those table
compression options, but also using equivalence
classes as discussed below), flex still
defaults to generating an 8-bit scanner, since
usually with these compression options full 8-bit
tables are not much more expensive than 7-bit
tables.

@item -8
instructs @code{flex} to generate an 8-bit scanner, i.e.,
one which can recognize 8-bit characters.  This
flag is only needed for scanners generated using
@samp{-Cf} or @samp{-CF}, as otherwise flex defaults to
generating an 8-bit scanner anyway.

See the discussion of @samp{-7} above for flex's default
behavior and the tradeoffs between 7-bit and 8-bit
scanners.

@item -+
specifies that you want flex to generate a C++
scanner class.
See the section on Generating C++
Scanners below for details.

@item -C[aefFmr]
controls the degree of table compression and, more
generally, trade-offs between small scanners and
fast scanners.

@samp{-Ca} ("align") instructs flex to accept larger
tables in the generated scanner in exchange for faster
performance, because the elements of the tables are better
aligned for memory access and computation.  On some
RISC architectures, fetching and manipulating
long-words is more efficient than with smaller-sized
units such as shortwords.  This option can double
the size of the tables used by your scanner.

@samp{-Ce} directs @code{flex} to construct @dfn{equivalence classes},
i.e., sets of characters which have identical
lexical properties (for example, if the only appearance
of digits in the @code{flex} input is in the character
class "[0-9]" then the digits '0', '1', @dots{}, '9'
will all be put in the same equivalence class).
Equivalence classes usually give dramatic
reductions in the final table/object file sizes
(typically a factor of 2-5) and are pretty cheap
performance-wise (one array look-up per character
scanned).

@samp{-Cf} specifies that the @emph{full} scanner tables should
be generated - @code{flex} should not compress the tables
by taking advantage of similar transition
functions for different states.

@samp{-CF} specifies that the alternate fast scanner
representation (described above under the @samp{-F} flag)
should be used.  This option cannot be used with
@samp{-+}.

@samp{-Cm} directs @code{flex} to construct @dfn{meta-equivalence
classes}, which are sets of equivalence classes (or
characters, if equivalence classes are not being
used) that are commonly used together.
2184Meta-equivalence classes are often a big win when using 2185compressed tables, but they have a moderate 2186performance impact (one or two "if" tests and one array 2187look-up per character scanned). 2188 2189@samp{-Cr} causes the generated scanner to @emph{bypass} use of 2190the standard I/O library (stdio) for input. 2191Instead of calling @samp{fread()} or @samp{getc()}, the scanner 2192will use the @samp{read()} system call, resulting in a 2193performance gain which varies from system to 2194system, but in general is probably negligible unless 2195you are also using @samp{-Cf} or @samp{-CF}. Using @samp{-Cr} can cause 2196strange behavior if, for example, you read from 2197@code{yyin} using stdio prior to calling the scanner 2198(because the scanner will miss whatever text your 2199previous reads left in the stdio input buffer). 2200 2201@samp{-Cr} has no effect if you define @code{YY_INPUT} (see The 2202Generated Scanner above). 2203 2204A lone @samp{-C} specifies that the scanner tables should 2205be compressed but neither equivalence classes nor 2206meta-equivalence classes should be used. 2207 2208The options @samp{-Cf} or @samp{-CF} and @samp{-Cm} do not make sense 2209together - there is no opportunity for 2210meta-equivalence classes if the table is not being 2211compressed. Otherwise the options may be freely 2212mixed, and are cumulative. 2213 2214The default setting is @samp{-Cem}, which specifies that 2215@code{flex} should generate equivalence classes and 2216meta-equivalence classes. This setting provides the 2217highest degree of table compression. 
You can obtain
faster-executing scanners at the cost of larger
tables, with the following generally being true:

@example
slowest & smallest
      -Cem
      -Cm
      -Ce
      -C
      -C@{f,F@}e
      -C@{f,F@}
      -C@{f,F@}a
fastest & largest
@end example

Note that scanners with the smallest tables are
usually generated and compiled the quickest, so
during development you will usually want to use the
default, maximal compression.

@samp{-Cfe} is often a good compromise between speed and
size for production scanners.

@item -ooutput
directs flex to write the scanner to the file @file{output}
instead of @file{lex.yy.c}.  If you combine @samp{-o} with
the @samp{-t} option, then the scanner is written to
@code{stdout} but its @samp{#line} directives (see the @samp{-L} option
above) refer to the file @file{output}.

@item -Pprefix
changes the default @samp{yy} prefix used by @code{flex} for all
globally-visible variable and function names to
instead be @var{prefix}.  For example, @samp{-Pfoo} changes the
name of @code{yytext} to @code{footext}.  It also changes the
name of the default output file from @file{lex.yy.c} to
@file{lex.foo.c}.  Here are all of the names affected:

@example
yy_create_buffer
yy_delete_buffer
yy_flex_debug
yy_init_buffer
yy_flush_buffer
yy_load_buffer_state
yy_switch_to_buffer
yyin
yyleng
yylex
yylineno
yyout
yyrestart
yytext
yywrap
@end example

(If you are using a C++ scanner, then only @code{yywrap}
and @code{yyFlexLexer} are affected.)  Within your scanner
itself, you can still refer to the global variables
and functions using either version of their name;
but externally, they have the modified name.

This option lets you easily link together multiple
@code{flex} programs into the same executable.
Note, 2282though, that using this option also renames 2283@samp{yywrap()}, so you now @emph{must} either provide your own 2284(appropriately-named) version of the routine for 2285your scanner, or use @samp{%option noyywrap}, as linking 2286with @samp{-lfl} no longer provides one for you by 2287default. 2288 2289@item -Sskeleton_file 2290overrides the default skeleton file from which @code{flex} 2291constructs its scanners. You'll never need this 2292option unless you are doing @code{flex} maintenance or 2293development. 2294@end table 2295 2296@code{flex} also provides a mechanism for controlling options 2297within the scanner specification itself, rather than from 2298the flex command-line. This is done by including @samp{%option} 2299directives in the first section of the scanner 2300specification. You can specify multiple options with a single 2301@samp{%option} directive, and multiple directives in the first 2302section of your flex input file. Most options are given 2303simply as names, optionally preceded by the word "no" 2304(with no intervening whitespace) to negate their meaning. 
A number are equivalent to flex flags or their negation:

@example
7bit            -7 option
8bit            -8 option
align           -Ca option
backup          -b option
batch           -B option
c++             -+ option

caseful or
case-sensitive  opposite of -i (default)

case-insensitive or
caseless        -i option

debug           -d option
default         opposite of -s option
ecs             -Ce option
fast            -F option
full            -f option
interactive     -I option
lex-compat      -l option
meta-ecs        -Cm option
perf-report     -p option
read            -Cr option
stdout          -t option
verbose         -v option
warn            opposite of -w option
                (use "%option nowarn" for -w)

array           equivalent to "%array"
pointer         equivalent to "%pointer" (default)
@end example

Some @samp{%option}s provide features otherwise not available:

@table @samp
@item always-interactive
instructs flex to generate a scanner which always
considers its input "interactive".  Normally, on
each new input file the scanner calls @samp{isatty()} in
an attempt to determine whether the scanner's input
source is interactive and thus should be read a
character at a time.  When this option is used,
however, no such call is made.

@item main
directs flex to provide a default @samp{main()} program
for the scanner, which simply calls @samp{yylex()}.  This
option implies @code{noyywrap} (see below).

@item never-interactive
instructs flex to generate a scanner which never
considers its input "interactive" (again, no call
made to @samp{isatty()}).  This is the opposite of
@samp{always-interactive}.

@item stack
enables the use of start condition stacks (see
Start Conditions above).

@item stdinit
if unset (i.e., @samp{%option nostdinit}) initializes @code{yyin}
and @code{yyout} to nil @code{FILE} pointers, instead of @code{stdin}
and @code{stdout}.
2371 2372@item yylineno 2373directs @code{flex} to generate a scanner that maintains the number 2374of the current line read from its input in the global variable 2375@code{yylineno}. This option is implied by @samp{%option lex-compat}. 2376 2377@item yywrap 2378if unset (i.e., @samp{%option noyywrap}), makes the 2379scanner not call @samp{yywrap()} upon an end-of-file, but 2380simply assume that there are no more files to scan 2381(until the user points @code{yyin} at a new file and calls 2382@samp{yylex()} again). 2383@end table 2384 2385@code{flex} scans your rule actions to determine whether you use 2386the @code{REJECT} or @samp{yymore()} features. The @code{reject} and @code{yymore} 2387options are available to override its decision as to 2388whether you use the options, either by setting them (e.g., 2389@samp{%option reject}) to indicate the feature is indeed used, or 2390unsetting them to indicate it actually is not used (e.g., 2391@samp{%option noyymore}). 2392 2393Three options take string-delimited values, offset with '=': 2394 2395@example 2396%option outfile="ABC" 2397@end example 2398 2399@noindent 2400is equivalent to @samp{-oABC}, and 2401 2402@example 2403%option prefix="XYZ" 2404@end example 2405 2406@noindent 2407is equivalent to @samp{-PXYZ}. 2408 2409Finally, 2410 2411@example 2412%option yyclass="foo" 2413@end example 2414 2415@noindent 2416only applies when generating a C++ scanner (@samp{-+} option). It 2417informs @code{flex} that you have derived @samp{foo} as a subclass of 2418@code{yyFlexLexer} so @code{flex} will place your actions in the member 2419function @samp{foo::yylex()} instead of @samp{yyFlexLexer::yylex()}. 2420It also generates a @samp{yyFlexLexer::yylex()} member function that 2421emits a run-time error (by invoking @samp{yyFlexLexer::LexerError()}) 2422if called. See Generating C++ Scanners, below, for additional 2423information. 
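
As a rough illustration of how these directives combine, the following
sketch sets several options from within the specification instead of on
the command line.  The output file name @file{words.c} and the rules
themselves are made up for the example; @samp{nodefault} is the negated
form of the @samp{default} option listed above.

@example
%option main nodefault
%option outfile="words.c"

%%

[a-zA-Z]+    printf( "word: %s\n", yytext );
[0-9]+       printf( "number: %s\n", yytext );
.|\n         /* ignore everything else */
@end example

Because the last rule matches any single character, the scanner should
never fall through to the suppressed default rule, and @samp{%option main}
supplies a @samp{main()} (and implies @code{noyywrap}), so no driver code
is needed.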

A number of options are available for lint purists who
want to suppress the appearance of unneeded routines in
the generated scanner.  Each of the following, if unset,
results in the corresponding routine not appearing in the
generated scanner:

@example
input, unput
yy_push_state, yy_pop_state, yy_top_state
yy_scan_buffer, yy_scan_bytes, yy_scan_string
@end example

@noindent
(though @samp{yy_push_state()} and friends won't appear anyway
unless you use @samp{%option stack}).

@node Performance, C++, Options, Top
@section Performance considerations

The main design goal of @code{flex} is that it generate
high-performance scanners.  It has been optimized for dealing
well with large sets of rules.  Aside from the effects on
scanner speed of the table compression @samp{-C} options outlined
above, there are a number of options/actions which degrade
performance.  These are, from most expensive to least:

@example
REJECT
%option yylineno
arbitrary trailing context

pattern sets that require backing up
%array
%option interactive
%option always-interactive

'^' beginning-of-line operator
yymore()
@end example

@noindent
with the first three all being quite expensive and the
last two being quite cheap.  Note also that @samp{unput()} is
implemented as a routine call that potentially does quite
a bit of work, while @samp{yyless()} is a quite-cheap macro; so
if just putting back some excess text you scanned, use
@samp{yyless()}.

@code{REJECT} should be avoided at all costs when performance is
important.  It is a particularly expensive option.

Getting rid of backing up is messy and often may be an
enormous amount of work for a complicated scanner.  In
principle, one begins by using the @samp{-b} flag to generate a
@file{lex.backup} file.
For example, on the input 2479 2480@example 2481%% 2482foo return TOK_KEYWORD; 2483foobar return TOK_KEYWORD; 2484@end example 2485 2486@noindent 2487the file looks like: 2488 2489@example 2490State #6 is non-accepting - 2491 associated rule line numbers: 2492 2 3 2493 out-transitions: [ o ] 2494 jam-transitions: EOF [ \001-n p-\177 ] 2495 2496State #8 is non-accepting - 2497 associated rule line numbers: 2498 3 2499 out-transitions: [ a ] 2500 jam-transitions: EOF [ \001-` b-\177 ] 2501 2502State #9 is non-accepting - 2503 associated rule line numbers: 2504 3 2505 out-transitions: [ r ] 2506 jam-transitions: EOF [ \001-q s-\177 ] 2507 2508Compressed tables always back up. 2509@end example 2510 2511The first few lines tell us that there's a scanner state 2512in which it can make a transition on an 'o' but not on any 2513other character, and that in that state the currently 2514scanned text does not match any rule. The state occurs 2515when trying to match the rules found at lines 2 and 3 in 2516the input file. If the scanner is in that state and then 2517reads something other than an 'o', it will have to back up 2518to find a rule which is matched. With a bit of 2519head-scratching one can see that this must be the state it's in 2520when it has seen "fo". When this has happened, if 2521anything other than another 'o' is seen, the scanner will 2522have to back up to simply match the 'f' (by the default 2523rule). 2524 2525The comment regarding State #8 indicates there's a problem 2526when "foob" has been scanned. Indeed, on any character 2527other than an 'a', the scanner will have to back up to 2528accept "foo". Similarly, the comment for State #9 2529concerns when "fooba" has been scanned and an 'r' does not 2530follow. 2531 2532The final comment reminds us that there's no point going 2533to all the trouble of removing backing up from the rules 2534unless we're using @samp{-Cf} or @samp{-CF}, since there's no 2535performance gain doing so with compressed scanners. 
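
One way to work through such a report is sketched below.  The file name
@file{scanner.l} is hypothetical; only the @samp{-b} and @samp{-Cf} flags
described above are used.

@example
flex -b -Cf scanner.l     # writes lex.yy.c and lex.backup
cat lex.backup            # lists the states that still back up
@end example

Rerunning these two commands after each change to the rules shows whether
any backing-up states remain.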

The way to remove the backing up is to add "error" rules:

@example
%%
foo        return TOK_KEYWORD;
foobar     return TOK_KEYWORD;

fooba      |
foob       |
fo         @{
           /* false alarm, not really a keyword */
           return TOK_ID;
           @}
@end example

Eliminating backing up among a list of keywords can also
be done using a "catch-all" rule:

@example
%%
foo        return TOK_KEYWORD;
foobar     return TOK_KEYWORD;

[a-z]+     return TOK_ID;
@end example

This is usually the best solution when appropriate.

Backing up messages tend to cascade.  With a complicated
set of rules it's not uncommon to get hundreds of
messages.  If one can decipher them, though, it often only
takes a dozen or so rules to eliminate the backing up,
though it's easy to make a mistake and have an error rule
accidentally match a valid token.  (A possible future @code{flex}
feature will be to automatically add rules to eliminate
backing up.)

It's important to keep in mind that you gain the benefits
of eliminating backing up only if you eliminate @emph{every}
instance of backing up.  Leaving just one means you gain
nothing.

@emph{Variable} trailing context (where both the leading and
trailing parts do not have a fixed length) entails almost
the same performance loss as @code{REJECT} (i.e., substantial).
So when possible a rule like:

@example
%%
mouse|rat/(cat|dog)   run();
@end example

@noindent
is better written:

@example
%%
mouse/cat|dog         run();
rat/cat|dog           run();
@end example

@noindent
or as

@example
%%
mouse|rat/cat         run();
mouse|rat/dog         run();
@end example

Note that here the special '|' action does @emph{not} provide any
savings, and can even make things worse (see Deficiencies
/ Bugs below).
2610 2611Another area where the user can increase a scanner's 2612performance (and one that's easier to implement) arises from 2613the fact that the longer the tokens matched, the faster 2614the scanner will run. This is because with long tokens 2615the processing of most input characters takes place in the 2616(short) inner scanning loop, and does not often have to go 2617through the additional work of setting up the scanning 2618environment (e.g., @code{yytext}) for the action. Recall the 2619scanner for C comments: 2620 2621@example 2622%x comment 2623%% 2624 int line_num = 1; 2625 2626"/*" BEGIN(comment); 2627 2628<comment>[^*\n]* 2629<comment>"*"+[^*/\n]* 2630<comment>\n ++line_num; 2631<comment>"*"+"/" BEGIN(INITIAL); 2632@end example 2633 2634This could be sped up by writing it as: 2635 2636@example 2637%x comment 2638%% 2639 int line_num = 1; 2640 2641"/*" BEGIN(comment); 2642 2643<comment>[^*\n]* 2644<comment>[^*\n]*\n ++line_num; 2645<comment>"*"+[^*/\n]* 2646<comment>"*"+[^*/\n]*\n ++line_num; 2647<comment>"*"+"/" BEGIN(INITIAL); 2648@end example 2649 2650Now instead of each newline requiring the processing of 2651another action, recognizing the newlines is "distributed" 2652over the other rules to keep the matched text as long as 2653possible. Note that @emph{adding} rules does @emph{not} slow down the 2654scanner! The speed of the scanner is independent of the 2655number of rules or (modulo the considerations given at the 2656beginning of this section) how complicated the rules are 2657with regard to operators such as '*' and '|'. 2658 2659A final example in speeding up a scanner: suppose you want 2660to scan through a file containing identifiers and 2661keywords, one per line and with no other extraneous 2662characters, and recognize all the keywords. 
A natural first
approach is:

@example
%%
asm      |
auto     |
break    |
@dots{} etc @dots{}
volatile |
while    /* it's a keyword */

.|\n     /* it's not a keyword */
@end example

To eliminate the backing up, introduce a catch-all
rule:

@example
%%
asm      |
auto     |
break    |
@dots{} etc @dots{}
volatile |
while    /* it's a keyword */

[a-z]+   |
.|\n     /* it's not a keyword */
@end example

Now, if it's guaranteed that there's exactly one word per
line, then we can reduce the total number of matches by a
half by merging in the recognition of newlines with that
of the other tokens:

@example
%%
asm\n      |
auto\n     |
break\n    |
@dots{} etc @dots{}
volatile\n |
while\n    /* it's a keyword */

[a-z]+\n   |
.|\n       /* it's not a keyword */
@end example

One has to be careful here, as we have now reintroduced
backing up into the scanner.  In particular, while @emph{we} know
that there will never be any characters in the input
stream other than letters or newlines, @code{flex} can't figure
this out, and it will plan for possibly needing to back up
when it has scanned a token like "auto" and then the next
character is something other than a newline or a letter.
Previously it would then just match the "auto" rule and be
done, but now it has no "auto" rule, only an "auto\n" rule.
To eliminate the possibility of backing up, we could
either duplicate all rules but without final newlines, or,
since we never expect to encounter such an input and
therefore don't care how it's classified, we can introduce one
more catch-all rule, this one which doesn't include a
newline:

@example
%%
asm\n      |
auto\n     |
break\n    |
@dots{} etc @dots{}
volatile\n |
while\n    /* it's a keyword */

[a-z]+\n   |
[a-z]+     |
.|\n       /* it's not a keyword */
@end example

Compiled with @samp{-Cf}, this is about as fast as one can get a
@code{flex} scanner to go for this particular problem.

A final note: @code{flex} is slow when matching NUL's,
particularly when a token contains multiple NUL's.  It's best to
write rules which match @emph{short} amounts of text if it's
anticipated that the text will often include NUL's.

Another final note regarding performance: as mentioned
above in the section How the Input is Matched, dynamically
resizing @code{yytext} to accommodate huge tokens is a slow
process because it presently requires that the (huge) token
be rescanned from the beginning.  Thus if performance is
vital, you should attempt to match "large" quantities of
text but not "huge" quantities, where the cutoff between
the two is at about 8K characters/token.

@node C++, Incompatibilities, Performance, Top
@section Generating C++ scanners

@code{flex} provides two different ways to generate scanners for
use with C++.  The first way is to simply compile a
scanner generated by @code{flex} using a C++ compiler instead of a C
compiler.  You should not encounter any compilation
errors (please report any you find to the email address
given in the Author section below).  You can then use C++
code in your rule actions instead of C code.
Note that 2768the default input source for your scanner remains @code{yyin}, 2769and default echoing is still done to @code{yyout}. Both of these 2770remain @samp{FILE *} variables and not C++ @code{streams}. 2771 2772You can also use @code{flex} to generate a C++ scanner class, using 2773the @samp{-+} option, (or, equivalently, @samp{%option c++}), which 2774is automatically specified if the name of the flex executable ends 2775in a @samp{+}, such as @code{flex++}. When using this option, flex 2776defaults to generating the scanner to the file @file{lex.yy.cc} instead 2777of @file{lex.yy.c}. The generated scanner includes the header file 2778@file{FlexLexer.h}, which defines the interface to two C++ classes. 2779 2780The first class, @code{FlexLexer}, provides an abstract base 2781class defining the general scanner class interface. It 2782provides the following member functions: 2783 2784@table @samp 2785@item const char* YYText() 2786returns the text of the most recently matched 2787token, the equivalent of @code{yytext}. 2788 2789@item int YYLeng() 2790returns the length of the most recently matched 2791token, the equivalent of @code{yyleng}. 2792 2793@item int lineno() const 2794returns the current input line number (see @samp{%option yylineno}), 2795or 1 if @samp{%option yylineno} was not used. 2796 2797@item void set_debug( int flag ) 2798sets the debugging flag for the scanner, equivalent to assigning to 2799@code{yy_flex_debug} (see the Options section above). Note that you 2800must build the scanner using @samp{%option debug} to include debugging 2801information in it. 2802 2803@item int debug() const 2804returns the current setting of the debugging flag. 
@end table

Also provided are member functions equivalent to
@samp{yy_switch_to_buffer()}, @samp{yy_create_buffer()} (though the
first argument is an @samp{istream*} object pointer and not a
@samp{FILE*}), @samp{yy_flush_buffer()}, @samp{yy_delete_buffer()},
and @samp{yyrestart()} (again, the first argument is an @samp{istream*}
object pointer).

The second class defined in @file{FlexLexer.h} is @code{yyFlexLexer},
which is derived from @code{FlexLexer}.  It defines the following
additional member functions:

@table @samp
@item yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )
constructs a @code{yyFlexLexer} object using the given
streams for input and output.  If not specified,
the streams default to @code{cin} and @code{cout}, respectively.

@item virtual int yylex()
performs the same role as @samp{yylex()} does for ordinary
flex scanners: it scans the input stream, consuming
tokens, until a rule's action returns a value.  If you derive a subclass
@var{S}
from @code{yyFlexLexer}
and want to access the member functions and variables of
@var{S}
inside @samp{yylex()},
then you need to use @samp{%option yyclass="@var{S}"}
to inform @code{flex}
that you will be using that subclass instead of @code{yyFlexLexer}.
In this case, rather than generating @samp{yyFlexLexer::yylex()},
@code{flex} generates @samp{@var{S}::yylex()}
(and also generates a dummy @samp{yyFlexLexer::yylex()}
that calls @samp{yyFlexLexer::LexerError()}
if called).

@item virtual void switch_streams(istream* new_in = 0, ostream* new_out = 0)
reassigns @code{yyin} to @code{new_in}
(if non-nil)
and @code{yyout} to @code{new_out}
(ditto), deleting the previous input buffer if @code{yyin}
is reassigned.

@item int yylex( istream* new_in = 0, ostream* new_out = 0 )
first switches the input streams via @samp{switch_streams( new_in, new_out )}
and then returns the value of @samp{yylex()}.
@end table

In addition, @code{yyFlexLexer} defines the following protected
virtual functions which you can redefine in derived
classes to tailor the scanner:

@table @samp
@item virtual int LexerInput( char* buf, int max_size )
reads up to @samp{max_size} characters into @var{buf} and
returns the number of characters read.  To indicate
end-of-input, return 0 characters.  Note that
"interactive" scanners (see the @samp{-B} and @samp{-I} flags)
define the macro @code{YY_INTERACTIVE}.  If you redefine
@code{LexerInput()} and need to take different actions
depending on whether or not the scanner might be
scanning an interactive input source, you can test
for the presence of this name via @samp{#ifdef}.

@item virtual void LexerOutput( const char* buf, int size )
writes out @var{size} characters from the buffer @var{buf},
which, while NUL-terminated, may also contain
"internal" NUL's if the scanner's rules can match
text with NUL's in them.

@item virtual void LexerError( const char* msg )
reports a fatal error message.  The default version
of this function writes the message to the stream
@code{cerr} and exits.
@end table

Note that a @code{yyFlexLexer} object contains its @emph{entire}
scanning state.  Thus you can use such objects to create
reentrant scanners.  You can instantiate multiple instances of
the same @code{yyFlexLexer} class, and you can also combine
multiple C++ scanner classes together in the same program
using the @samp{-P} option discussed above.
Finally, note that the @samp{%array} feature is not available to
C++ scanner classes; you must use @samp{%pointer} (the default).

Here is an example of a simple C++ scanner:

@example
    // An example of using the flex C++ scanner class.

%@{
int mylineno = 0;
%@}

string  \"[^\n"]+\"

ws      [ \t]+

alpha   [A-Za-z]
dig     [0-9]
name    (@{alpha@}|@{dig@}|\$)(@{alpha@}|@{dig@}|[_.\-/$])*
num1    [-+]?@{dig@}+\.?([eE][-+]?@{dig@}+)?
num2    [-+]?@{dig@}*\.@{dig@}+([eE][-+]?@{dig@}+)?
number  @{num1@}|@{num2@}

%%

@{ws@}    /* skip blanks and tabs */

"/*"    @{
        int c;

        while((c = yyinput()) != 0)
            @{
            if(c == '\n')
                ++mylineno;

            else if(c == '*')
                @{
                if((c = yyinput()) == '/')
                    break;
                else
                    unput(c);
                @}
            @}
        @}

@{number@}  cout << "number " << YYText() << '\n';

\n        mylineno++;

@{name@}    cout << "name " << YYText() << '\n';

@{string@}  cout << "string " << YYText() << '\n';

%%

int main( int /* argc */, char** /* argv */ )
    @{
    FlexLexer* lexer = new yyFlexLexer;
    while(lexer->yylex() != 0)
        ;
    return 0;
    @}
@end example

If you want to create multiple (different) lexer classes,
you use the @samp{-P} flag (or the @samp{prefix=} option) to rename each
@code{yyFlexLexer} to some other @code{xxFlexLexer}.  You then can
include @samp{<FlexLexer.h>} in your other sources once per lexer
class, first renaming @code{yyFlexLexer} as follows:

@example
#undef yyFlexLexer
#define yyFlexLexer xxFlexLexer
#include <FlexLexer.h>

#undef yyFlexLexer
#define yyFlexLexer zzFlexLexer
#include <FlexLexer.h>
@end example

if, for example, you used @samp{%option prefix="xx"} for one of
your scanners and @samp{%option prefix="zz"} for the other.

IMPORTANT: the present form of the scanning class is
@emph{experimental} and may change considerably between major
releases.

@node Incompatibilities, Diagnostics, C++, Top
@section Incompatibilities with @code{lex} and POSIX

@code{flex} is a rewrite of the AT&T Unix @code{lex} tool (the two
implementations do not share any code, though), with some
extensions and incompatibilities, both of which are of
concern to those who wish to write scanners acceptable to
either implementation.  Flex is fully compliant with the
POSIX @code{lex} specification, except that when using @samp{%pointer}
(the default), a call to @samp{unput()} destroys the contents of
@code{yytext}, which is counter to the POSIX specification.

In this section we discuss all of the known areas of
incompatibility between flex, AT&T lex, and the POSIX
specification.

@code{flex}'s @samp{-l} option turns on maximum compatibility with the
original AT&T @code{lex} implementation, at the cost of a major
loss in the generated scanner's performance.  We note
below which incompatibilities can be overcome using the @samp{-l}
option.

@code{flex} is fully compatible with @code{lex} with the following
exceptions:

@itemize -
@item
The undocumented @code{lex} scanner internal variable @code{yylineno}
is not supported unless @samp{-l} or @samp{%option yylineno} is used.
@code{yylineno} should be maintained on a per-buffer basis, rather
than a per-scanner (single global variable) basis.  @code{yylineno} is
not part of the POSIX specification.

@item
The @samp{input()} routine is not redefinable, though it
may be called to read characters following whatever
has been matched by a rule.  If @samp{input()} encounters
an end-of-file the normal @samp{yywrap()} processing is
done.  A ``real'' end-of-file is returned by
@samp{input()} as @code{EOF}.

Input is instead controlled by defining the
@code{YY_INPUT} macro.

The @code{flex} restriction that @samp{input()} cannot be
redefined is in accordance with the POSIX
specification, which simply does not specify any way of
controlling the scanner's input other than by making
an initial assignment to @code{yyin}.

@item
The @samp{unput()} routine is not redefinable.  This
restriction is in accordance with POSIX.

@item
@code{flex} scanners are not as reentrant as @code{lex} scanners.
In particular, if you have an interactive scanner
and an interrupt handler which long-jumps out of
the scanner, and the scanner is subsequently called
again, you may get the following message:

@example
fatal flex scanner internal error--end of buffer missed
@end example

To reenter the scanner, first use

@example
yyrestart( yyin );
@end example

Note that this call will throw away any buffered
input; usually this isn't a problem with an
interactive scanner.

Also note that flex C++ scanner classes @emph{are}
reentrant, so if using C++ is an option for you, you
should use them instead.  See "Generating C++
Scanners" above for details.

@item
@samp{output()} is not supported.  Output from the @samp{ECHO}
macro is done to the file-pointer @code{yyout} (default
@code{stdout}).

@samp{output()} is not part of the POSIX specification.

@item
@code{lex} does not support exclusive start conditions
(%x), though they are in the POSIX specification.

@item
When definitions are expanded, @code{flex} encloses them
in parentheses.  With lex, the following:

@example
NAME    [A-Z][A-Z0-9]*
%%
foo@{NAME@}?      printf( "Found it\n" );
%%
@end example

will not match the string "foo" because when the
macro is expanded the rule is equivalent to
"foo[A-Z][A-Z0-9]*?" and the precedence is such that the
'?' is associated with "[A-Z0-9]*".  With @code{flex}, the
rule will be expanded to "foo([A-Z][A-Z0-9]*)?" and
so the string "foo" will match.

Note that if the definition begins with @samp{^} or ends
with @samp{$} then it is @emph{not} expanded with parentheses, to
allow these operators to appear in definitions
without losing their special meanings.  But the
@samp{<s>}, @samp{/}, and @samp{<<EOF>>} operators cannot be used in a
@code{flex} definition.

Using @samp{-l} results in the @code{lex} behavior of no
parentheses around the definition.

The POSIX specification is that the definition be enclosed in
parentheses.

@item
Some implementations of @code{lex} allow a rule's action to begin on
a separate line, if the rule's pattern has trailing whitespace:

@example
%%
foo|bar<space here>
  @{ foobar_action(); @}
@end example

@code{flex} does not support this feature.

@item
The @code{lex} @samp{%r} (generate a Ratfor scanner) option is
not supported.  It is not part of the POSIX
specification.

@item
After a call to @samp{unput()}, @code{yytext} is undefined until
the next token is matched, unless the scanner was
built using @samp{%array}.  This is not the case with @code{lex}
or the POSIX specification.  The @samp{-l} option does
away with this incompatibility.

@item
The precedence of the @samp{@{@}} (numeric range) operator
is different.  @code{lex} interprets "abc@{1,3@}" as "match
one, two, or three occurrences of 'abc'", whereas
@code{flex} interprets it as "match 'ab' followed by one,
two, or three occurrences of 'c'".
The latter is
in agreement with the POSIX specification.

@item
The precedence of the @samp{^} operator is different.  @code{lex}
interprets "^foo|bar" as "match either 'foo' at the
beginning of a line, or 'bar' anywhere", whereas
@code{flex} interprets it as "match either 'foo' or 'bar'
if they come at the beginning of a line".  The
latter is in agreement with the POSIX specification.

@item
The special table-size declarations such as @samp{%a}
supported by @code{lex} are not required by @code{flex} scanners;
@code{flex} ignores them.

@item
The name @code{FLEX_SCANNER} is #define'd so scanners may
be written for use with either @code{flex} or @code{lex}.
Scanners also include @code{YY_FLEX_MAJOR_VERSION} and
@code{YY_FLEX_MINOR_VERSION} indicating which version of
@code{flex} generated the scanner (for example, for the
2.5 release, these defines would be 2 and 5
respectively).
@end itemize

The following @code{flex} features are not included in @code{lex} or the
POSIX specification:

@example
C++ scanners
%option
start condition scopes
start condition stacks
interactive/non-interactive scanners
yy_scan_string() and friends
yyterminate()
yy_set_interactive()
yy_set_bol()
YY_AT_BOL()
<<EOF>>
<*>
YY_DECL
YY_START
YY_USER_ACTION
YY_USER_INIT
#line directives
%@{@}'s around actions
multiple actions on a line
@end example

@noindent
plus almost all of the flex flags.
The last feature in
the list refers to the fact that with @code{flex} you can put
multiple actions on the same line, separated with
semicolons, while with @code{lex}, the following

@example
foo    handle_foo(); ++num_foos_seen;
@end example

@noindent
is (rather surprisingly) truncated to

@example
foo    handle_foo();
@end example

@code{flex} does not truncate the action.  Actions that are not
enclosed in braces are simply terminated at the end of the
line.

@node Diagnostics, Files, Incompatibilities, Top
@section Diagnostics

@table @samp
@item warning, rule cannot be matched
indicates that the given
rule cannot be matched because it follows other rules that
will always match the same text as it.  For example, in
the following "foo" cannot be matched because it comes
after an identifier "catch-all" rule:

@example
[a-z]+    got_identifier();
foo       got_foo();
@end example

Using @code{REJECT} in a scanner suppresses this warning.

@item warning, -s option given but default rule can be matched
means that it is possible (perhaps only in a particular
start condition) that the default rule (match any single
character) is the only one that will match a particular
input.  Since @samp{-s} was given, presumably this is not
intended.

@item reject_used_but_not_detected undefined
@itemx yymore_used_but_not_detected undefined
These errors can
occur at compile time.  They indicate that the scanner
uses @code{REJECT} or @samp{yymore()} but that @code{flex} failed to notice the
fact, meaning that @code{flex} scanned the first two sections
looking for occurrences of these actions and failed to
find any, but somehow you snuck some in (via a #include
file, for example).  Use @samp{%option reject} or @samp{%option yymore}
to indicate to flex that you really do use these features.

@item flex scanner jammed
a scanner compiled with @samp{-s} has
encountered an input string which wasn't matched by any of
its rules.  This error can also occur due to internal
problems.

@item token too large, exceeds YYLMAX
your scanner uses @samp{%array}
and one of its rules matched a string longer than the
@code{YYLMAX} constant (8K bytes by default).  You can increase the
value by #define'ing @code{YYLMAX} in the definitions section of
your @code{flex} input.

@item scanner requires -8 flag to use the character '@var{x}'
Your
scanner specification includes recognizing the 8-bit
character @var{x} and you did not specify the -8 flag, and your
scanner defaulted to 7-bit because you used the @samp{-Cf} or @samp{-CF}
table compression options.  See the discussion of the @samp{-7}
flag for details.

@item flex scanner push-back overflow
you used @samp{unput()} to push
back so much text that the scanner's buffer could not hold
both the pushed-back text and the current token in @code{yytext}.
Ideally the scanner should dynamically resize the buffer
in this case, but at present it does not.

@item input buffer overflow, can't enlarge buffer because scanner uses REJECT
the scanner was working on matching an
extremely large token and needed to expand the input
buffer.  This doesn't work with scanners that use @code{REJECT}.

@item fatal flex scanner internal error--end of buffer missed
This can occur in a scanner which is reentered after a
long-jump has jumped out (or over) the scanner's
activation frame.  Before reentering the scanner, use:

@example
yyrestart( yyin );
@end example

@noindent
or, as noted above, switch to using the C++ scanner class.

@item too many start conditions in <> construct!
you listed
more start conditions in a <> construct than exist (so you
must have listed at least one of them twice).
@end table

@node Files, Deficiencies, Diagnostics, Top
@section Files

@table @file
@item -lfl
library with which scanners must be linked.

@item lex.yy.c
generated scanner (called @file{lexyy.c} on some systems).

@item lex.yy.cc
generated C++ scanner class, when using @samp{-+}.

@item <FlexLexer.h>
header file defining the C++ scanner base class,
@code{FlexLexer}, and its derived class, @code{yyFlexLexer}.

@item flex.skl
skeleton scanner.  This file is only used when
building flex, not when flex executes.

@item lex.backup
backing-up information for @samp{-b} flag (called @file{lex.bck}
on some systems).
@end table

@node Deficiencies, See also, Files, Top
@section Deficiencies / Bugs

Some trailing context patterns cannot be properly matched
and generate warning messages ("dangerous trailing
context").  These are patterns where the ending of the first
part of the rule matches the beginning of the second part,
such as "zx*/xy*", where the 'x*' matches the 'x' at the
beginning of the trailing context.  (Note that the POSIX
draft states that the text matched by such patterns is
undefined.)

For some trailing context rules, parts which are actually
fixed-length are not recognized as such, leading to the
abovementioned performance loss.  In particular, parts
using '|' or @{n@} (such as "foo@{3@}") are always considered
variable-length.

Combining trailing context with the special '|' action can
result in @emph{fixed} trailing context being turned into the
more expensive @emph{variable} trailing context.
For example, in
the following:

@example
%%
abc      |
xyz/def
@end example

Use of @samp{unput()} invalidates yytext and yyleng, unless the
@samp{%array} directive or the @samp{-l} option has been used.

Pattern-matching of NUL's is substantially slower than
matching other characters.

Dynamic resizing of the input buffer is slow, as it
entails rescanning all the text matched so far by the
current (generally huge) token.

Due to both buffering of input and read-ahead, you cannot
intermix calls to <stdio.h> routines, such as, for
example, @samp{getchar()}, with @code{flex} rules and expect it to work.
Call @samp{input()} instead.

The total number of table entries listed by the @samp{-v} flag excludes the
number of table entries needed to determine what rule has
been matched.  The number of entries is equal to the
number of DFA states if the scanner does not use @code{REJECT}, and
somewhat greater than the number of states if it does.

@code{REJECT} cannot be used with the @samp{-f} or @samp{-F} options.

The @code{flex} internal algorithms need documentation.

@node See also, Author, Deficiencies, Top
@section See also

@code{lex}(1), @code{yacc}(1), @code{sed}(1), @code{awk}(1).

John Levine, Tony Mason, and Doug Brown: Lex & Yacc;
O'Reilly and Associates.  Be sure to get the 2nd edition.

M. E. Lesk and E. Schmidt, LEX - Lexical Analyzer Generator.

Alfred Aho, Ravi Sethi and Jeffrey Ullman: Compilers:
Principles, Techniques and Tools; Addison-Wesley (1986).
Describes the pattern-matching techniques used by @code{flex}
(deterministic finite automata).

@node Author, , See also, Top
@section Author

Vern Paxson, with the help of many ideas and much inspiration from
Van Jacobson.  Original version by Jef Poskanzer.
The fast table
representation is a partial implementation of a design done by Van
Jacobson.  The implementation was done by Kevin Gong and Vern Paxson.

Thanks to the many @code{flex} beta-testers, feedbackers, and
contributors, especially Francois Pinard, Casey Leedom, Stan
Adermann, Terry Allen, David Barker-Plummer, John Basrai, Nelson
H.F. Beebe, @samp{benson@@odi.com}, Karl Berry, Peter A. Bigot,
Simon Blanchard, Keith Bostic, Frederic Brehm, Ian Brockbank, Kin
Cho, Nick Christopher, Brian Clapper, J.T. Conklin, Jason Coughlin,
Bill Cox, Nick Cropper, Dave Curtis, Scott David Daniels, Chris
G. Demetriou, Theo Deraadt, Mike Donahue, Chuck Doucette, Tom Epperly,
Leo Eskin, Chris Faylor, Chris Flatters, Jon Forrest, Joe Gayda, Kaveh
R. Ghazi, Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer
Griebel, Jan Hajic, Charles Hemphill, NORO Hideo, Jarkko Hietaniemi,
Scott Hofmann, Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
Amir Katz, @samp{ken@@ken.hilco.com}, Kevin B. Kenny, Steve Kirsch,
Winfried Koenig, Marq Kole, Ronald Lamprecht, Greg Lee, Rohan Lenard,
Craig Leres, John Levine, Steve Liddle, Mike Long, Mohamed el Lozy,
Brian Madsen, Malte, Joe Marshall, Bengt Martensson, Chris Metcalf,
Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
G.T.
Nicol, Landon Noll, James Nordby, Marc Nozell, Richard Ohnemus,
Karsten Pahnke, Sven Panne, Roland Pesch, Walter Pelissero, Gaumond
Pierre, Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha, Frederic
Raimbault, Pat Rankin, Rick Richardson, Kevin Rodgers, Kai Uwe Rommel,
Jim Roskind, Alberto Santini, Andreas Scherer, Darrell Schiebel, Raf
Schietekat, Doug Schmidt, Philippe Schnoebelen, Andreas Schwab, Alex
Siegel, Eckehard Stolz, Jan-Erik Strvmquist, Mike Stump, Paul Stuart,
Dave Tallman, Ian Lance Taylor, Chris Thewalt, Richard M. Timoney,
Jodi Tsai, Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms,
Kent Williams, Ken Yap, Ron Zellar, Nathan Zelle, David Zuhn, and
those whose names have slipped my marginal mail-archiving skills but
whose contributions are appreciated all the same.

Thanks to Keith Bostic, Jon Forrest, Noah Friedman, John Gilmore,
Craig Leres, John Levine, Bob Mulcahy, G.T. Nicol, Francois Pinard,
Rich Salz, and Richard Stallman for help with various distribution
headaches.

Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
to Benson Margulies and Fred Burke for C++ support; to Kent Williams
and Tom Epperly for C++ class support; to Ove Ewerlid for support of
NUL's; and to Eric Hughes for support of multiple buffers.

This work was primarily done when I was with the Real Time Systems
Group at the Lawrence Berkeley Laboratory in Berkeley, CA.  Many thanks
to all there for the support I received.

Send comments to @samp{vern@@ee.lbl.gov}.

@c @node Index, , Top, Top
@c @unnumbered Index
@c
@c @printindex cp

@contents
@bye

@c Local variables:
@c texinfo-column-for-description: 32
@c End: