flex.texi revision 1.1.1.5
\input texinfo.tex @c -*-texinfo-*-
@c %**start of header
@setfilename flex.info
@include version.texi
@settitle Lexical Analysis With Flex, for Flex @value{VERSION}
@set authors Vern Paxson, Will Estes and John Millaway
@c "Macro Hooks" index
@defindex hk
@c "Options" index
@defindex op
@dircategory Programming
@direntry
* flex: (flex).      Fast lexical analyzer generator (lex replacement).
@end direntry
@c %**end of header

@copying

The flex manual is placed under the same licensing conditions as the
rest of flex:

Copyright @copyright{} 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2012
The Flex Project.

Copyright @copyright{} 1990, 1997 The Regents of the University of California.
All rights reserved.

This code is derived from software contributed to Berkeley by
Vern Paxson.

The United States Government has rights in this work pursuant
to contract no. DE-AC03-76SF00098 between the United States
Department of Energy and the University of California.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

@enumerate
@item
Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

@item
Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
@end enumerate

Neither the name of the University nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.

THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE.
@end copying

@titlepage
@title Lexical Analysis with Flex
@subtitle Edition @value{EDITION}, @value{UPDATED}
@author @value{authors}
@page
@vskip 0pt plus 1filll
@insertcopying
@end titlepage
@contents
@ifnottex
@node Top, Copyright, (dir), (dir)
@top flex

This manual describes @code{flex}, a tool for generating programs that
perform pattern-matching on text. The manual includes both tutorial and
reference sections.

This edition of @cite{The flex Manual} documents @code{flex} version
@value{VERSION}. It was last updated on @value{UPDATED}.

This manual was written by @value{authors}.

@menu
* Copyright::
* Reporting Bugs::
* Introduction::
* Simple Examples::
* Format::
* Patterns::
* Matching::
* Actions::
* Generated Scanner::
* Start Conditions::
* Multiple Input Buffers::
* EOF::
* Misc Macros::
* User Values::
* Yacc::
* Scanner Options::
* Performance::
* Cxx::
* Reentrant::
* Lex and Posix::
* Memory Management::
* Serialized Tables::
* Diagnostics::
* Limitations::
* Bibliography::
* FAQ::
* Appendices::
* Indices::

@detailmenu
 --- The Detailed Node Listing ---

Format of the Input File

* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::

Scanner Options

* Options for Specifying Filenames::
* Options Affecting Scanner Behavior::
* Code-Level And API Options::
* Options for Scanner Speed and Size::
* Debugging Options::
* Miscellaneous Options::

Reentrant C Scanners

* Reentrant Uses::
* Reentrant Overview::
* Reentrant Example::
* Reentrant Detail::
* Reentrant Functions::

The Reentrant API in Detail

* Specify Reentrant::
* Extra Reentrant Argument::
* Global Replacement::
* Init and Destroy Functions::
* Accessor Methods::
* Extra Data::
* About yyscan_t::

Memory Management

* The Default Memory Management::
* Overriding The Default Memory Management::
* A Note About yytext And Memory::

Serialized Tables

* Creating Serialized Tables::
* Loading and Unloading Serialized Tables::
* Tables File Format::

FAQ

* When was flex born?::
* How do I expand backslash-escape sequences in C-style quoted strings?::
* Why do flex scanners call fileno if it is not ANSI compatible?::
* Does flex support recursive pattern definitions?::
* How do I skip huge chunks of input (tens of megabytes) while using flex?::
* Flex is not matching my patterns in the same order that I defined them.::
* My actions are executing out of order or sometimes not at all.::
* How can I have multiple input sources feed into the same scanner at the same time?::
* Can I build nested parsers that work with the same input file?::
* How can I match text only at the end of a file?::
* How can I make REJECT cascade across start condition boundaries?::
* Why cant I use fast or full tables with interactive mode?::
* How much faster is -F or -f than -C?::
* If I have a simple grammar cant I just parse it with flex?::
* Why doesn't yyrestart() set the start state back to INITIAL?::
* How can I match C-style comments?::
* The period isn't working the way I expected.::
* Can I get the flex manual in another format?::
* Does there exist a "faster" NDFA->DFA algorithm?::
* How does flex compile the DFA so quickly?::
* How can I use more than 8192 rules?::
* How do I abandon a file in the middle of a scan and switch to a new file?::
* How do I execute code only during initialization (only before the first scan)?::
* How do I execute code at termination?::
* Where else can I find help?::
* Can I include comments in the "rules" section of the file?::
* I get an error about undefined yywrap().::
* How can I change the matching pattern at run time?::
* How can I expand macros in the input?::
* How can I build a two-pass scanner?::
* How do I match any string not matched in the preceding rules?::
* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
* Is there a way to make flex treat NULL like a regular character?::
* Whenever flex can not match the input it says "flex scanner jammed".::
* Why doesn't flex have non-greedy operators like perl does?::
* Memory leak - 16386 bytes allocated by malloc.::
* How do I track the byte offset for lseek()?::
* How do I use my own I/O classes in a C++ scanner?::
* How do I skip as many chars as possible?::
* deleteme00::
* Are certain equivalent patterns faster than others?::
* Is backing up a big deal?::
* Can I fake multi-byte character support?::
* deleteme01::
* Can you discuss some flex internals?::
* unput() messes up yy_at_bol::
* The | operator is not doing what I want::
* Why can't flex understand this variable trailing context pattern?::
* The ^ operator isn't working::
* Trailing context is getting confused with trailing optional patterns::
* Is flex GNU or not?::
* ERASEME53::
* I need to scan if-then-else blocks and while loops::
* ERASEME55::
* ERASEME56::
* ERASEME57::
* Is there a repository for flex scanners?::
* How can I conditionally compile or preprocess my flex input file?::
* Where can I find grammars for lex and yacc?::
* I get an end-of-buffer message for each character scanned.::
* unnamed-faq-62::
* unnamed-faq-63::
* unnamed-faq-64::
* unnamed-faq-65::
* unnamed-faq-66::
* unnamed-faq-67::
* unnamed-faq-68::
* unnamed-faq-69::
* unnamed-faq-70::
* unnamed-faq-71::
* unnamed-faq-72::
* unnamed-faq-73::
* unnamed-faq-74::
* unnamed-faq-75::
* unnamed-faq-76::
* unnamed-faq-77::
* unnamed-faq-78::
* unnamed-faq-79::
* unnamed-faq-80::
* unnamed-faq-81::
* unnamed-faq-82::
* unnamed-faq-83::
* unnamed-faq-84::
* unnamed-faq-85::
* unnamed-faq-86::
* unnamed-faq-87::
* unnamed-faq-88::
* unnamed-faq-90::
* unnamed-faq-91::
* unnamed-faq-92::
* unnamed-faq-93::
* unnamed-faq-94::
* unnamed-faq-95::
* unnamed-faq-96::
* unnamed-faq-97::
* unnamed-faq-98::
* unnamed-faq-99::
* unnamed-faq-100::
* unnamed-faq-101::
* What is the difference between YYLEX_PARAM and YY_DECL?::
* Why do I get "conflicting types for yylex" error?::
* How do I access the values set in a Flex action from within a Bison action?::

Appendices

* Makefiles and Flex::
* Bison Bridge::
* M4 Dependency::
* Common Patterns::

Indices

* Concept Index::
* Index of Functions and Macros::
* Index of Variables::
* Index of Data Types::
* Index of Hooks::
* Index of Scanner Options::

@end detailmenu
@end menu
@end ifnottex
@node Copyright, Reporting Bugs, Top, Top
@chapter Copyright

@cindex copyright of flex
@cindex distributing flex
@insertcopying

@node Reporting Bugs, Introduction, Copyright, Top
@chapter Reporting Bugs

@cindex bugs, reporting
@cindex reporting bugs

If you find a bug in @code{flex}, please report it using
GitHub's issue tracking facility at @url{https://github.com/westes/flex/issues/}

@node Introduction, Simple Examples, Reporting Bugs, Top
@chapter Introduction

@cindex scanner, definition of
@code{flex} is a tool for generating @dfn{scanners}. A scanner is a
program which recognizes lexical patterns in text. The @code{flex}
program reads the given input files, or its standard input if no file
names are given, for a description of a scanner to generate. The
description is in the form of pairs of regular expressions and C code,
called @dfn{rules}. @code{flex} generates as output a C source file,
@file{lex.yy.c} by default, which defines a routine @code{yylex()}.
This file can be compiled and linked with the flex runtime library to
produce an executable. When the executable is run, it analyzes its
input for occurrences of the regular expressions. Whenever it finds
one, it executes the corresponding C code.

@node Simple Examples, Format, Introduction, Top
@chapter Some Simple Examples

Here are some simple examples to give the flavor of how one uses
@code{flex}.

@cindex username expansion
The following @code{flex} input specifies a scanner which, when it
encounters the string @samp{username}, replaces it with the user's
login name:

@example
@verbatim
    %%
    username    printf( "%s", getlogin() );
@end verbatim
@end example

@cindex default rule
@cindex rules, default
By default, any text not matched by a @code{flex} scanner is copied to
the output, so the net effect of this scanner is to copy its input file
to its output with each occurrence of @samp{username} expanded. In this
input, there is just one rule. @samp{username} is the @dfn{pattern} and
the @samp{printf} is the @dfn{action}. The @samp{%%} symbol marks the
beginning of the rules.

Here's another simple example:

@cindex counting characters and lines
@example
@verbatim
            int num_lines = 0, num_chars = 0;

    %%
    \n      ++num_lines; ++num_chars;
    .       ++num_chars;

    %%

    int main()
            {
            yylex();
            printf( "# of lines = %d, # of chars = %d\n",
                    num_lines, num_chars );
            }
@end verbatim
@end example

This scanner counts the number of characters and the number of lines in
its input. It produces no output other than the final report on the
character and line counts. The first line declares two globals,
@code{num_lines} and @code{num_chars}, which are accessible both inside
@code{yylex()} and in the @code{main()} routine declared after the
second @samp{%%}.
There are two rules, one which matches a newline
(@samp{\n}) and increments both the line count and the character count,
and one which matches any character other than a newline (indicated by
the @samp{.} regular expression).

A somewhat more complicated example:

@cindex Pascal-like language
@example
@verbatim
    /* scanner for a toy Pascal-like language */

    %{
    /* need this for the calls to atoi() and atof() below */
    #include <stdlib.h>
    %}

    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*

    %%

    {DIGIT}+    {
                printf( "An integer: %s (%d)\n", yytext,
                        atoi( yytext ) );
                }

    {DIGIT}+"."{DIGIT}*        {
                printf( "A float: %s (%g)\n", yytext,
                        atof( yytext ) );
                }

    if|then|begin|end|procedure|function        {
                printf( "A keyword: %s\n", yytext );
                }

    {ID}        printf( "An identifier: %s\n", yytext );

    "+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );

    "{"[^{}\n]*"}"     /* eat up one-line comments */

    [ \t\n]+          /* eat up whitespace */

    .           printf( "Unrecognized character: %s\n", yytext );

    %%

    int main( int argc, char **argv )
        {
        ++argv, --argc;  /* skip over program name */
        if ( argc > 0 )
                yyin = fopen( argv[0], "r" );
        else
                yyin = stdin;

        yylex();
        }
@end verbatim
@end example

This is the beginning of a simple scanner for a language like Pascal.
It identifies different types of @dfn{tokens} and reports on what it has
seen.

The details of this example will be explained in the following
sections.

@node Format, Patterns, Simple Examples, Top
@chapter Format of the Input File


@cindex format of flex input
@cindex input, format of
@cindex file format
@cindex sections of flex input

The @code{flex} input file consists of three sections, separated by a
line containing only @samp{%%}.

@cindex format of input file
@example
@verbatim
    definitions
    %%
    rules
    %%
    user code
@end verbatim
@end example

@menu
* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::
@end menu

@node Definitions Section, Rules Section, Format, Format
@section Format of the Definitions Section

@cindex input file, Definitions section
@cindex Definitions, in flex input
The @dfn{definitions section} contains declarations of simple @dfn{name}
definitions to simplify the scanner specification, and declarations of
@dfn{start conditions}, which are explained in a later section.

@cindex aliases, how to define
@cindex pattern aliases, how to define
Name definitions have the form:

@example
@verbatim
    name definition
@end verbatim
@end example

The @samp{name} is a word beginning with a letter or an underscore
(@samp{_}) followed by zero or more letters, digits, @samp{_}, or
@samp{-} (dash). The definition is taken to begin at the first
non-whitespace character following the name and to continue to the end of
the line. The definition can subsequently be referred to using
@samp{@{name@}}, which will expand to @samp{(definition)}. For example,

@cindex pattern aliases, defining
@cindex defining pattern aliases
@example
@verbatim
    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*
@end verbatim
@end example

defines @samp{DIGIT} to be a regular expression which matches a single
digit, and @samp{ID} to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits.
A subsequent reference to

@cindex pattern aliases, use of
@example
@verbatim
    {DIGIT}+"."{DIGIT}*
@end verbatim
@end example

is identical to

@example
@verbatim
    ([0-9])+"."([0-9])*
@end verbatim
@end example

and matches one-or-more digits followed by a @samp{.} followed by
zero-or-more digits.

@cindex comments in flex input
An unindented comment (i.e., a line
beginning with @samp{/*}) is copied verbatim to the output up
to the next @samp{*/}.

@cindex %@{ and %@}, in Definitions Section
@cindex embedding C code in flex input
@cindex C code in flex input
Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}}
is also copied verbatim to the output (with the %@{ and %@} symbols
removed). The %@{ and %@} symbols must appear unindented on lines by
themselves.

@cindex %top

A @code{%top} block is similar to a @samp{%@{} ... @samp{%@}} block, except
that the code in a @code{%top} block is relocated to the @emph{top} of the
generated file, before any flex definitions @footnote{Actually,
@code{yyIN_HEADER} is defined before the @samp{%top} block.}.
The @code{%top} block is useful when you want certain preprocessor macros to be
defined or certain files to be included before the generated code.
The single characters @samp{@{} and @samp{@}} are used to delimit the
@code{%top} block, as shown in the example below:

@example
@verbatim
    %top{
        /* This code goes at the "top" of the generated file. */
        #include <stdint.h>
        #include <inttypes.h>
    }
@end verbatim
@end example

Multiple @code{%top} blocks are allowed, and their order is preserved.
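
As an illustration only, the following minimal sketch puts the pieces of
a definitions section together: a @code{%top} block, a @samp{%@{} ...
@samp{%@}} block, and two name definitions, one of which refers to the
other. The names @samp{HEXDIGIT} and @samp{HEXNUM} and the two rules are
invented for this example and are not part of @code{flex}; the
@code{%option noyywrap} line is an assumption made so that the result
links without the flex library.

@example
@verbatim
%option noyywrap

%top{
    /* Relocated to the top of the generated file. */
    #include <stdint.h>
}

%{
/* Copied verbatim to the generated file. */
#include <stdio.h>
%}

HEXDIGIT    [0-9a-fA-F]
HEXNUM      0[xX]{HEXDIGIT}+

%%
{HEXNUM}    printf( "hex number: %s\n", yytext );
.|\n        /* discard everything else */
%%
int main( void )
{
    yylex();
    return 0;
}
@end verbatim
@end example

Following the expansion rule given earlier in this section,
@samp{@{HEXNUM@}} in the rule is treated as
@samp{(0[xX]([0-9a-fA-F])+)}.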

@node Rules Section, User Code Section, Definitions Section, Format
@section Format of the Rules Section

@cindex input file, Rules Section
@cindex rules, in flex input
The @dfn{rules} section of the @code{flex} input contains a series of
rules of the form:

@example
@verbatim
    pattern   action
@end verbatim
@end example

where the pattern must be unindented and the action must begin
on the same line.
@xref{Patterns}, for a further description of patterns and actions.

In the rules section, any indented or %@{ %@} enclosed text appearing
before the first rule may be used to declare variables which are local
to the scanning routine and (after the declarations) code which is to be
executed whenever the scanning routine is entered. Other indented or
%@{ %@} text in the rule section is still copied to the output, but its
meaning is not well-defined and it may well cause compile-time errors
(this feature is present for @acronym{POSIX} compliance. @xref{Lex and
Posix}, for other such features).

Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}}
is copied verbatim to the output (with the %@{ and %@} symbols removed).
The %@{ and %@} symbols must appear unindented on lines by themselves.

@node User Code Section, Comments in the Input, Rules Section, Format
@section Format of the User Code Section

@cindex input file, user code Section
@cindex user code, in flex input
The user code section is simply copied to @file{lex.yy.c} verbatim. It
is used for companion routines which call or are called by the scanner.
The presence of this section is optional; if it is missing, the second
@samp{%%} in the input file may be skipped, too.
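
The layout described in the last two sections can be sketched as
follows. This is only an illustration: the patterns and the variable
name @code{depth} are invented for the example, and @code{%option
noyywrap} is assumed so that the result links without the flex library.
The indented text before the first rule ends up inside @code{yylex()},
and everything after the second @samp{%%} is copied to the generated
file unchanged.

@example
@verbatim
%option noyywrap
%{
#include <stdio.h>
%}

%%
    /* Indented text before the first rule: the declaration is local
       to yylex(), and the initialization runs each time yylex() is
       entered. */
    int depth = 0;

"("     ++depth;
")"     --depth;
\n      printf( "nesting depth at end of line: %d\n", depth );
.       /* ignore everything else */

%%
/* User code section: copied verbatim to the generated file. */
int main( void )
{
    yylex();
    return 0;
}
@end verbatim
@end example

Because none of the actions returns, @code{yylex()} is entered only once
here, so @code{depth} keeps its value across the whole input.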

@node Comments in the Input, , User Code Section, Format
@section Comments in the Input

@cindex comments, syntax of
Flex supports C-style comments, that is, anything between @samp{/*} and
@samp{*/} is
considered a comment. Whenever flex encounters a comment, it copies the
entire comment verbatim to the generated source code. Comments may
appear just about anywhere, but with the following exceptions:

@itemize
@cindex comments, in rules section
@item
Comments may not appear in the Rules Section wherever flex is expecting
a regular expression. This means comments may not appear at the
beginning of a line, or immediately following a list of scanner states.
@item
Comments may not appear on an @samp{%option} line in the Definitions
Section.
@end itemize

If you want to follow a simple rule, then always begin a comment on a
new line, with one or more whitespace characters before the initial
@samp{/*}. This rule will work anywhere in the input file.

All the comments in the following example are valid:

@cindex comments, valid uses of
@cindex comments in the input
@example
@verbatim
%{
/* code block */
%}

/* Definitions Section */
%x STATE_X

%%
    /* Rules Section */
ruleA   /* after regex */ { /* code block */ } /* after code block */
        /* Rules Section (indented) */
<STATE_X>{
ruleC   ECHO;
ruleD   ECHO;
%{
/* code block */
%}
}
%%
/* User Code Section */

@end verbatim
@end example

@node Patterns, Matching, Format, Top
@chapter Patterns

@cindex patterns, in rules section
@cindex regular expressions, in patterns
The patterns in the input (see @ref{Rules Section}) are written using an
extended set of regular expressions. These are:

@cindex patterns, syntax
@cindex patterns, syntax
@table @samp
@item x
match the character 'x'

@item .
any character (byte) except newline

@cindex [] in patterns
@cindex character classes in patterns, syntax of
@cindex POSIX, character classes in patterns, syntax of
@item [xyz]
a @dfn{character class}; in this case, the pattern
matches either an 'x', a 'y', or a 'z'

@cindex ranges in patterns
@item [abj-oZ]
a "character class" with a range in it; matches
an 'a', a 'b', any letter from 'j' through 'o',
or a 'Z'

@cindex ranges in patterns, negating
@cindex negating ranges in patterns
@item [^A-Z]
a "negated character class", i.e., any character
but those in the class. In this case, any
character EXCEPT an uppercase letter.

@item [^A-Z\n]
any character EXCEPT an uppercase letter or
a newline

@item [a-z]@{-@}[aeiou]
the lowercase consonants

@item r*
zero or more r's, where r is any regular expression

@item r+
one or more r's

@item r?
zero or one r's (that is, ``an optional r'')

@cindex braces in patterns
@item r@{2,5@}
anywhere from two to five r's

@item r@{2,@}
two or more r's

@item r@{4@}
exactly 4 r's

@cindex pattern aliases, expansion of
@item @{name@}
the expansion of the @samp{name} definition
(@pxref{Format}).

@cindex literal text in patterns, syntax of
@cindex verbatim text in patterns, syntax of
@item "[xyz]\"foo"
the literal string: @samp{[xyz]"foo}

@cindex escape sequences in patterns, syntax of
@item \X
if X is @samp{a}, @samp{b}, @samp{f}, @samp{n}, @samp{r}, @samp{t}, or
@samp{v}, then the ANSI-C interpretation of @samp{\x}.
Otherwise, a
literal @samp{X} (used to escape operators such as @samp{*})

@cindex NULL character in patterns, syntax of
@item \0
a NUL character (ASCII code 0)

@cindex octal characters in patterns
@item \123
the character with octal value 123

@item \x2a
the character with hexadecimal value 2a

@item (r)
match an @samp{r}; parentheses are used to override precedence (see below)

@item (?r-s:pattern)
apply option @samp{r} and omit option @samp{s} while interpreting pattern.
Options may be zero or more of the characters @samp{i}, @samp{s}, or @samp{x}.

@samp{i} means case-insensitive. @samp{-i} means case-sensitive.

@samp{s} alters the meaning of the @samp{.} syntax to match any single byte whatsoever.
@samp{-s} alters the meaning of @samp{.} to match any byte except @samp{\n}.

@samp{x} ignores comments and whitespace in patterns. Whitespace is ignored unless
it is backslash-escaped, contained within @samp{""}s, or appears inside a
character class.

The following are all valid:

@verbatim
(?:foo)         same as  (foo)
(?i:ab7)        same as  ([aA][bB]7)
(?-i:ab)        same as  (ab)
(?s:.)          same as  [\x00-\xFF]
(?-s:.)         same as  [^\n]
(?ix-s: a . b)  same as  ([Aa][^\n][bB])
(?x:a  b)       same as  ("ab")
(?x:a\ b)       same as  ("a b")
(?x:a" "b)      same as  ("a b")
(?x:a[ ]b)      same as  ("a b")
(?x:a
    /* comment */
    b
    c)          same as  (abc)
@end verbatim

@item (?# comment )
omit everything within @samp{()}. The first @samp{)}
character encountered ends the pattern. It is not possible for the comment
to contain a @samp{)} character. The comment may span lines.

@cindex concatenation, in patterns
@item rs
the regular expression @samp{r} followed by the regular expression @samp{s}; called
@dfn{concatenation}

@item r|s
either an @samp{r} or an @samp{s}

@cindex trailing context, in patterns
@item r/s
an @samp{r} but only if it is followed by an @samp{s}. The text matched by @samp{s} is
included when determining whether this rule is the longest match, but is
then returned to the input before the action is executed. So the action
only sees the text matched by @samp{r}. This type of pattern is called
@dfn{trailing context}. (There are some combinations of @samp{r/s} that flex
cannot match correctly. @xref{Limitations}, regarding dangerous trailing
context.)

@cindex beginning of line, in patterns
@cindex BOL, in patterns
@item ^r
an @samp{r}, but only at the beginning of a line (i.e.,
when just starting to scan, or right after a
newline has been scanned).

@cindex end of line, in patterns
@cindex EOL, in patterns
@item r$
an @samp{r}, but only at the end of a line (i.e., just before a
newline). Equivalent to @samp{r/\n}.

@cindex newline, matching in patterns
Note that @code{flex}'s notion of ``newline'' is exactly
whatever the C compiler used to compile @code{flex}
interprets @samp{\n} as; in particular, on some DOS
systems you must either filter out @samp{\r}s in the
input yourself, or explicitly use @samp{r/\r\n} for @samp{r$}.

@cindex start conditions, in patterns
@item <s>r
an @samp{r}, but only in start condition @code{s} (see @ref{Start
Conditions} for discussion of start conditions).

@item <s1,s2,s3>r
same, but in any of start conditions @code{s1}, @code{s2}, or @code{s3}.

@item <*>r
an @samp{r} in any start condition, even an exclusive one.

@cindex end of file, in patterns
@cindex EOF in patterns, syntax of
@item <<EOF>>
an end-of-file.

@item <s1,s2><<EOF>>
an end-of-file when in start condition @code{s1} or @code{s2}
@end table

Note that inside of a character class, all regular expression operators
lose their special meaning except escape (@samp{\}) and the character class
operators, @samp{-}, @samp{]]}, and, at the beginning of the class, @samp{^}.

@cindex patterns, precedence of operators
The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence (see special note on the
precedence of the repeat operator, @samp{@{@}}, under the documentation
for the @samp{--posix} POSIX compliance option). For example,

@cindex patterns, grouping and precedence
@example
@verbatim
    foo|bar*
@end verbatim
@end example

is the same as

@example
@verbatim
    (foo)|(ba(r*))
@end verbatim
@end example

since the @samp{*} operator has higher precedence than concatenation,
and concatenation higher than alternation (@samp{|}). This pattern
therefore matches @emph{either} the string @samp{foo} @emph{or} the
string @samp{ba} followed by zero-or-more @samp{r}'s. To match
@samp{foo} or zero-or-more repetitions of the string @samp{bar}, use:

@example
@verbatim
    foo|(bar)*
@end verbatim
@end example

And to match a sequence of zero or more repetitions of @samp{foo} and
@samp{bar}:

@cindex patterns, repetitions with grouping
@example
@verbatim
    (foo|bar)*
@end verbatim
@end example

@cindex character classes in patterns
In addition to characters and ranges of characters, character classes
can also contain @dfn{character class expressions}. These are
expressions enclosed inside @samp{[:} and @samp{:]} delimiters (which
themselves must appear between the @samp{[} and @samp{]} of the
character class.
Other elements may occur inside the character class,
too). The valid expressions are:

@cindex patterns, valid character classes
@example
@verbatim
    [:alnum:] [:alpha:] [:blank:]
    [:cntrl:] [:digit:] [:graph:]
    [:lower:] [:print:] [:punct:]
    [:space:] [:upper:] [:xdigit:]
@end verbatim
@end example

These expressions all designate a set of characters equivalent to the
corresponding standard C @code{isXXX} function. For example,
@samp{[:alnum:]} designates those characters for which @code{isalnum()}
returns true - i.e., any alphabetic or numeric character. Some systems
don't provide @code{isblank()}, so flex defines @samp{[:blank:]} as a
blank or a tab.

For example, the following character classes are all equivalent:

@cindex character classes, equivalence of
@cindex patterns, character class equivalence
@example
@verbatim
    [[:alnum:]]
    [[:alpha:][:digit:]]
    [[:alpha:][0-9]]
    [a-zA-Z0-9]
@end verbatim
@end example

A word of caution. Character classes are expanded immediately when seen in the @code{flex} input.
This means the character classes are sensitive to the locale in which @code{flex}
is executed, and the resulting scanner will not be sensitive to the runtime locale.
This may or may not be desirable.


@itemize
@cindex case-insensitive, effect on character classes
@item If your scanner is case-insensitive (the @samp{-i} flag), then
@samp{[:upper:]} and @samp{[:lower:]} are equivalent to
@samp{[:alpha:]}.

@anchor{case and character ranges}
@item Character classes with ranges, such as @samp{[a-Z]}, should be used with
caution in a case-insensitive scanner if the range spans upper or lowercase
characters. Flex does not know if you want to fold all upper and lowercase
characters together, or if you want the literal numeric range specified (with
no case folding).
When in doubt, flex will assume that you meant the literal
numeric range, and will issue a warning. The exception to this rule is a
character range such as @samp{[a-z]} or @samp{[S-W]} where it is obvious that you
want case-folding to occur. Here are some examples with the @samp{-i} flag
enabled:

@multitable {@samp{[a-zA-Z]}} {ambiguous} {@samp{[A-Z\[\\\]_`a-t]}} {@samp{[@@A-Z\[\\\]_`abc]}}
@item Range @tab Result @tab Literal Range @tab Alternate Range
@item @samp{[a-t]} @tab ok @tab @samp{[a-tA-T]} @tab
@item @samp{[A-T]} @tab ok @tab @samp{[a-tA-T]} @tab
@item @samp{[A-t]} @tab ambiguous @tab @samp{[A-Z\[\\\]_`a-t]} @tab @samp{[a-tA-T]}
@item @samp{[_-@{]} @tab ambiguous @tab @samp{[_`a-z@{]} @tab @samp{[_`a-zA-Z@{]}
@item @samp{[@@-C]} @tab ambiguous @tab @samp{[@@ABC]} @tab @samp{[@@A-Z\[\\\]_`abc]}
@end multitable

@cindex end of line, in negated character classes
@cindex EOL, in negated character classes
@item
A negated character class such as the example @samp{[^A-Z]} above
@emph{will} match a newline unless @samp{\n} (or an equivalent escape
sequence) is one of the characters explicitly present in the negated
character class (e.g., @samp{[^A-Z\n]}). This is unlike how many other
regular expression tools treat negated character classes, but
unfortunately the inconsistency is historically entrenched. Matching
newlines means that a pattern like @samp{[^"]*} can match the entire
input unless there's another quote in the input.

Flex allows negation of character class expressions by prepending @samp{^} to
the POSIX character class name.

@example
@verbatim
    [:^alnum:] [:^alpha:] [:^blank:]
    [:^cntrl:] [:^digit:] [:^graph:]
    [:^lower:] [:^print:] [:^punct:]
    [:^space:] [:^upper:] [:^xdigit:]
@end verbatim
@end example

Flex will issue a warning if the expressions @samp{[:^upper:]} and
@samp{[:^lower:]} appear in a case-insensitive scanner, since their meaning is
unclear. The current behavior is to skip them entirely, but this may change
without notice in future revisions of flex.

@item

The @samp{@{-@}} operator computes the difference of two character classes. For
example, @samp{[a-c]@{-@}[b-z]} represents all the characters in the class
@samp{[a-c]} that are not in the class @samp{[b-z]} (which in this case, is
just the single character @samp{a}). The @samp{@{-@}} operator is left
associative, so @samp{[abc]@{-@}[b]@{-@}[c]} is the same as @samp{[a]}. Be careful
not to accidentally create an empty set, which will never match.

@item

The @samp{@{+@}} operator computes the union of two character classes. For
example, @samp{[a-z]@{+@}[0-9]} is the same as @samp{[a-z0-9]}. This operator
is useful when preceded by the result of a difference operation, as in,
@samp{[[:alpha:]]@{-@}[[:lower:]]@{+@}[q]}, which is equivalent to
@samp{[A-Zq]} in the "C" locale.

@cindex trailing context, limits of
@cindex ^ as non-special character in patterns
@cindex $ as normal character in patterns
@item
A rule can have at most one instance of trailing context (the @samp{/} operator
or the @samp{$} operator). The start condition, @samp{^}, and @samp{<<EOF>>} patterns
can only occur at the beginning of a pattern, and, as well as with @samp{/} and @samp{$},
cannot be grouped inside parentheses. A @samp{^} which does not occur at
the beginning of a rule or a @samp{$} which does not occur at the end of
a rule loses its special properties and is treated as a normal character.

@item
The following are invalid:

@cindex patterns, invalid trailing context
@example
@verbatim
    foo/bar$
    <sc1>foo<sc2>bar
@end verbatim
@end example

Note that the first of these can be written @samp{foo/bar\n}.

@item
The following will result in @samp{$} or @samp{^} being treated as a normal character:

@cindex patterns, special characters treated as non-special
@example
@verbatim
    foo|(bar$)
    foo|^bar
@end verbatim
@end example

If the desired meaning is a @samp{foo} or a
@samp{bar}-followed-by-a-newline, the following could be used (the
special @code{|} action is explained below, @pxref{Actions}):

@cindex patterns, end of line
@example
@verbatim
    foo      |
    bar$     /* action goes here */
@end verbatim
@end example

A similar trick will work for matching a @samp{foo} or a
@samp{bar}-at-the-beginning-of-a-line.
@end itemize

@node Matching, Actions, Patterns, Top
@chapter How the Input Is Matched

@cindex patterns, matching
@cindex input, matching
@cindex trailing context, matching
@cindex matching, and trailing context
@cindex matching, length of
@cindex matching, multiple matches
When the generated scanner is run, it analyzes its input looking for
strings which match any of its patterns. If it finds more than one
match, it takes the one matching the most text (for trailing context
rules, this includes the length of the trailing part, even though it
will then be returned to the input). If it finds two or more matches of
the same length, the rule listed first in the @code{flex} input file is
chosen.
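
As a minimal sketch of these two rules (the messages printed are
illustrative only, and @code{%option noyywrap} is used merely to keep the
sketch self-contained), consider the following scanner:

@example
@verbatim
    %option noyywrap
    %%
    if          printf( "keyword IF\n" );
    [a-z]+      printf( "identifier: %s\n", yytext );
    [ \t\n]+    /* skip whitespace */
    %%
    int main( void )
        {
        yylex();
        return 0;
        }
@end verbatim
@end example

Given the input @samp{if iffy}, the first rule is chosen for @samp{if}
because both rules match two characters and it is listed first, while
@samp{iffy} is handled by the second rule because that match is longer
than the two-character match of the first rule.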

@cindex token
@cindex yytext
@cindex yyleng
Once the match is determined, the text corresponding to the match
(called the @dfn{token}) is made available in the global character
pointer @code{yytext}, and its length in the global integer
@code{yyleng}. The @dfn{action} corresponding to the matched pattern is
then executed (@pxref{Actions}), and then the remaining input is scanned
for another match.

@cindex default rule
If no match is found, then the @dfn{default rule} is executed: the next
character in the input is considered matched and copied to the standard
output. Thus, the simplest valid @code{flex} input is:

@cindex minimal scanner
@example
@verbatim
    %%
@end verbatim
@end example

which generates a scanner that simply copies its input (one character at
a time) to its output.

@cindex yytext, two types of
@cindex %array, use of
@cindex %pointer, use of
@vindex yytext
Note that @code{yytext} can be defined in two different ways: either as
a character @emph{pointer} or as a character @emph{array}. You can
control which definition @code{flex} uses by including one of the
special directives @code{%pointer} or @code{%array} in the first
(definitions) section of your flex input. The default is
@code{%pointer}, unless you use the @samp{-l} lex compatibility option,
in which case @code{yytext} will be an array. The advantage of using
@code{%pointer} is substantially faster scanning and no buffer overflow
when matching very large tokens (unless you run out of dynamic memory).
The disadvantage is that you are restricted in how your actions can
modify @code{yytext} (@pxref{Actions}), and calls to the @code{unput()}
function destroy the present contents of @code{yytext}, which can be a
considerable porting headache when moving between different @code{lex}
versions.

@cindex %array, advantages of
The advantage of @code{%array} is that you can then modify @code{yytext}
to your heart's content, and calls to @code{unput()} do not destroy
@code{yytext} (@pxref{Actions}). Furthermore, existing @code{lex}
programs sometimes access @code{yytext} externally using declarations of
the form:

@example
@verbatim
    extern char yytext[];
@end verbatim
@end example

This definition is erroneous when used with @code{%pointer}, but correct
for @code{%array}.

The @code{%array} declaration defines @code{yytext} to be an array of
@code{YYLMAX} characters, which defaults to a fairly large value. You
can change the size by simply #define'ing @code{YYLMAX} to a different
value in the first section of your @code{flex} input. As mentioned
above, with @code{%pointer} @code{yytext} grows dynamically to accommodate
large tokens. While this means your @code{%pointer} scanner can
accommodate very large tokens (such as matching entire blocks of
comments), bear in mind that each time the scanner must resize
@code{yytext} it also must rescan the entire token from the beginning,
so matching such tokens can prove slow. @code{yytext} presently does
@emph{not} dynamically grow if a call to @code{unput()} results in too
much text being pushed back; instead, a run-time error results.

@cindex %array, with C++
Also note that you cannot use @code{%array} with C++ scanner classes
(@pxref{Cxx}).

@node Actions, Generated Scanner, Matching, Top
@chapter Actions

@cindex actions
Each pattern in a rule has a corresponding @dfn{action}, which can be
any arbitrary C statement. The pattern ends at the first non-escaped
whitespace character; the remainder of the line is its action. If the
action is empty, then when the pattern is matched the input token is
simply discarded.
For example, here is the specification for a program
which deletes all occurrences of @samp{zap me} from its input:

@cindex deleting lines from input
@example
@verbatim
    %%
    "zap me"
@end verbatim
@end example

This example will copy all other characters in the input to the output
since they will be matched by the default rule.

Here is a program which compresses multiple blanks and tabs down to a
single blank, and throws away whitespace found at the end of a line:

@cindex whitespace, compressing
@cindex compressing whitespace
@example
@verbatim
    %%
    [ \t]+        putchar( ' ' );
    [ \t]+$       /* ignore this token */
@end verbatim
@end example

@cindex %@{ and %@}, in Rules Section
@cindex actions, use of @{ and @}
@cindex actions, embedded C strings
@cindex C-strings, in actions
@cindex comments, in actions
If the action contains a @samp{@{}, then the action spans till the
balancing @samp{@}} is found, and the action may cross multiple lines.
@code{flex} knows about C strings and comments and won't be fooled by
braces found within them, but also allows actions to begin with
@samp{%@{} and will consider the action to be all the text up to the
next @samp{%@}} (regardless of ordinary braces inside the action).

@cindex |, in actions
An action consisting solely of a vertical bar (@samp{|}) means ``same as the
action for the next rule''. See below for an illustration.

Actions can include arbitrary C code, including @code{return} statements
to return a value to whatever routine called @code{yylex()}. Each time
@code{yylex()} is called it continues processing tokens from where it
last left off until it either reaches the end of the file or executes a
return.
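
The following sketch shows actions that return a value and a caller that
invokes @code{yylex()} repeatedly. The token codes @code{TOK_NUMBER} and
@code{TOK_WORD} are hypothetical names chosen for this illustration; they
are not defined by @code{flex}.

@example
@verbatim
    %option noyywrap
    %{
    enum { TOK_NUMBER = 1, TOK_WORD };  /* hypothetical token codes */
    %}
    %%
    [0-9]+      return TOK_NUMBER;
    [a-zA-Z]+   return TOK_WORD;
    .|\n        /* discard anything else */
    %%
    int main( void )
        {
        int tok;

        /* yylex() resumes where it left off on each call and
         * returns 0 once the end of the input is reached.
         */
        while ( (tok = yylex()) != 0 )
            printf( "token %d: %s\n", tok, yytext );
        return 0;
        }
@end verbatim
@end example

Nonzero token codes are used here so that the caller can tell a returned
token apart from the 0 that signals the end of the input.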

@cindex yytext, modification of
Actions are free to modify @code{yytext} except for lengthening it
(adding characters to its end--these will overwrite later characters in
the input stream). This however does not apply when using @code{%array}
(@pxref{Matching}). In that case, @code{yytext} may be freely modified
in any way.

@cindex yyleng, modification of
@cindex yymore, and yyleng
Actions are free to modify @code{yyleng} except they should not do so if
the action also includes use of @code{yymore()} (see below).

@cindex preprocessor macros, for use in actions
There are a number of special directives which can be included within an
action:

@table @code
@item ECHO
@cindex ECHO
copies @code{yytext} to the scanner's output.

@item BEGIN
@cindex BEGIN
followed by the name of a start condition places the scanner in the
corresponding start condition (see below).

@item REJECT
@cindex REJECT
directs the scanner to proceed on to the ``second best'' rule which
matched the input (or a prefix of the input). The rule is chosen as
described above in @ref{Matching}, and @code{yytext} and @code{yyleng}
set up appropriately. It may either be one which matched as much text
as the originally chosen rule but came later in the @code{flex} input
file, or one which matched less text. For example, the following will
both count the words in the input and call the routine @code{special()}
whenever @samp{frob} is seen:

@example
@verbatim
            int word_count = 0;
    %%

    frob        special(); REJECT;
    [^ \t\n]+   ++word_count;
@end verbatim
@end example

Without the @code{REJECT}, any occurrences of @samp{frob} in the input
would not be counted as words, since the scanner normally executes only
one action per token.
Multiple uses of @code{REJECT} are allowed, each
one finding the next best choice to the currently active rule. For
example, when the following scanner scans the token @samp{abcd}, it will
write @samp{abcdabcaba} to the output:

@cindex REJECT, calling multiple times
@cindex |, use of
@example
@verbatim
    %%
    a        |
    ab       |
    abc      |
    abcd     ECHO; REJECT;
    .|\n     /* eat up any unmatched character */
@end verbatim
@end example

The first three rules share the fourth's action since they use the
special @samp{|} action.

@code{REJECT} is a particularly expensive feature in terms of scanner
performance; if it is used in @emph{any} of the scanner's actions it
will slow down @emph{all} of the scanner's matching. Furthermore,
@code{REJECT} cannot be used with the @samp{-Cf} or @samp{-CF} options
(@pxref{Scanner Options}).

Note also that unlike the other special actions, @code{REJECT} is a
@emph{branch}. Code immediately following it in the action will
@emph{not} be executed.

@item yymore()
@cindex yymore()
tells the scanner that the next time it matches a rule, the
corresponding token should be @emph{appended} onto the current value of
@code{yytext} rather than replacing it. For example, given the input
@samp{mega-kludge} the following will write @samp{mega-mega-kludge} to
the output:

@cindex yymore(), mega-kludge
@cindex yymore() to append token to previous token
@example
@verbatim
    %%
    mega-    ECHO; yymore();
    kludge   ECHO;
@end verbatim
@end example

First @samp{mega-} is matched and echoed to the output. Then @samp{kludge}
is matched, but the previous @samp{mega-} is still hanging around at the
beginning of @code{yytext} so the @code{ECHO} for the @samp{kludge} rule
will actually write @samp{mega-kludge}.
@end table

@cindex yymore, performance penalty of
Two notes regarding use of @code{yymore()}. First, @code{yymore()}
depends on the value of @code{yyleng} correctly reflecting the size of
the current token, so you must not modify @code{yyleng} if you are using
@code{yymore()}. Second, the presence of @code{yymore()} in the
scanner's action entails a minor performance penalty in the scanner's
matching speed.

@cindex yyless()
@code{yyless(n)} returns all but the first @code{n} characters of the
current token back to the input stream, where they will be rescanned
when the scanner looks for the next match. @code{yytext} and
@code{yyleng} are adjusted appropriately (e.g., @code{yyleng} will now
be equal to @code{n}). For example, on the input @samp{foobar} the
following will write out @samp{foobarbar}:

@cindex yyless(), pushing back characters
@cindex pushing back characters with yyless
@example
@verbatim
    %%
    foobar    ECHO; yyless(3);
    [a-z]+    ECHO;
@end verbatim
@end example

An argument of 0 to @code{yyless()} will cause the entire current input
string to be scanned again. Unless you've changed how the scanner will
subsequently process its input (using @code{BEGIN}, for example), this
will result in an endless loop.

Note that @code{yyless()} is a macro and can only be used in the flex
input file, not from other source files.

@cindex unput()
@cindex pushing back characters with unput
@code{unput(c)} puts the character @code{c} back onto the input stream.
It will be the next character scanned. The following action will take
the current token and cause it to be rescanned enclosed in parentheses.

@cindex unput(), pushing back characters
@cindex pushing back characters with unput()
@example
@verbatim
    {
    int i;
    /* Copy yytext because unput() trashes yytext */
    char *yycopy = strdup( yytext );
    unput( ')' );
    for ( i = yyleng - 1; i >= 0; --i )
        unput( yycopy[i] );
    unput( '(' );
    free( yycopy );
    }
@end verbatim
@end example

Note that since each @code{unput()} puts the given character back at the
@emph{beginning} of the input stream, pushing back strings must be done
back-to-front.

@cindex %pointer, and unput()
@cindex unput(), and %pointer
An important potential problem when using @code{unput()} is that if you
are using @code{%pointer} (the default), a call to @code{unput()}
@emph{destroys} the contents of @code{yytext}, starting with its
rightmost character and devouring one character to the left with each
call. If you need the value of @code{yytext} preserved after a call to
@code{unput()} (as in the above example), you must either first copy it
elsewhere, or build your scanner using @code{%array} instead
(@pxref{Matching}).

@cindex pushing back EOF
@cindex EOF, pushing back
Finally, note that you cannot put back @samp{EOF} to attempt to mark the
input stream with an end-of-file.

@cindex input()
@code{input()} reads the next character from the input stream.
For
example, the following is one way to eat up C comments:

@cindex comments, discarding
@cindex discarding C comments
@example
@verbatim
    %%
    "/*"        {
                int c;

                for ( ; ; )
                    {
                    while ( (c = input()) != '*' &&
                            c != EOF )
                        ;    /* eat up text of comment */

                    if ( c == '*' )
                        {
                        while ( (c = input()) == '*' )
                            ;
                        if ( c == '/' )
                            break;    /* found the end */
                        }

                    if ( c == EOF )
                        {
                        error( "EOF in comment" );
                        break;
                        }
                    }
                }
@end verbatim
@end example

@cindex input(), and C++
@cindex yyinput()
(Note that if the scanner is compiled using @code{C++}, then
@code{input()} is instead referred to as @b{yyinput()}, in order to
avoid a name clash with the @code{C++} stream by the name of
@code{input}.)

@cindex flushing the internal buffer
@cindex YY_FLUSH_BUFFER
@code{YY_FLUSH_BUFFER;} flushes the scanner's internal buffer so that
the next time the scanner attempts to match a token, it will first
refill the buffer using @code{YY_INPUT()} (@pxref{Generated Scanner}).
This action is a special case of the more general
@code{yy_flush_buffer()} function, described below (@pxref{Multiple
Input Buffers}).

@cindex yyterminate()
@cindex terminating with yyterminate()
@cindex exiting with yyterminate()
@cindex halting with yyterminate()
@code{yyterminate()} can be used in lieu of a return statement in an
action. It terminates the scanner and returns a 0 to the scanner's
caller, indicating ``all done''. By default, @code{yyterminate()} is
also called when an end-of-file is encountered. It is a macro and may
be redefined.
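
As a sketch of such a redefinition, the scanner below makes
@code{yylex()} return a distinct code at end-of-file instead of 0. The
name @code{TOK_EOF} and its value are assumptions made for this
illustration only.

@example
@verbatim
    %option noyywrap
    %{
    #define TOK_EOF 256                     /* hypothetical code */
    #define yyterminate() return TOK_EOF    /* replaces the default */
    %}
    %%
    .|\n    ECHO;
    %%
    int main( void )
        {
        /* No action returns, so the first value yylex() hands
         * back is the one produced by yyterminate() at EOF.
         */
        while ( yylex() != TOK_EOF )
            ;
        return 0;
        }
@end verbatim
@end example

The redefinition is placed in the definitions section so that it is seen
before the scanner supplies its own default definition.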

@node Generated Scanner, Start Conditions, Actions, Top
@chapter The Generated Scanner

@cindex yylex(), in generated scanner
The output of @code{flex} is the file @file{lex.yy.c}, which contains
the scanning routine @code{yylex()}, a number of tables used by it for
matching tokens, and a number of auxiliary routines and macros. By
default, @code{yylex()} is declared as follows:

@example
@verbatim
    int yylex()
        {
        ... various definitions and the actions in here ...
        }
@end verbatim
@end example

@cindex yylex(), overriding
(If your environment supports function prototypes, then it will be
@code{int yylex( void )}.) This definition may be changed by defining
the @code{YY_DECL} macro. For example, you could use:

@cindex yylex, overriding the prototype of
@example
@verbatim
    #define YY_DECL float lexscan( a, b ) float a, b;
@end verbatim
@end example

to give the scanning routine the name @code{lexscan}, returning a float,
and taking two floats as arguments. Note that if you give arguments to
the scanning routine using a K&R-style/non-prototyped function
declaration, you must terminate the definition with a semi-colon (;).

@code{flex} generates @samp{C99} function definitions by
default. Flex used to have the ability to generate obsolete, er,
@samp{traditional}, function definitions. This was to support
bootstrapping gcc on old systems. Unfortunately, traditional
definitions prevent us from using any standard data types smaller than
int (such as short, char, or bool) as function arguments. Furthermore,
support for traditional definitions added extra complexity to the
skeleton file. For this reason, current versions of @code{flex} generate
standard C99 code only, leaving K&R-style functions to the historians.
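
A prototype-style @code{YY_DECL} might look like the following sketch.
The routine name @code{lexscan} is borrowed from the example above; the
@code{word_count} parameter is an assumption made for illustration.

@example
@verbatim
    %option noyywrap
    %{
    /* Prototype-style definition: no trailing semi-colon. */
    #define YY_DECL int lexscan( int *word_count )
    %}
    %%
    [a-zA-Z]+   ++*word_count;
    .|\n        /* ignore everything else */
    %%
    int main( void )
        {
        int words = 0;

        lexscan( &words );      /* scans until end-of-file */
        printf( "%d words\n", words );
        return 0;
        }
@end verbatim
@end example

Because the arguments named in @code{YY_DECL} become parameters of the
scanning routine, the actions can refer to them directly.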

@cindex stdin, default for yyin
@cindex yyin
Whenever @code{yylex()} is called, it scans tokens from the global input
file @file{yyin} (which defaults to stdin). It continues until it
either reaches an end-of-file (at which point it returns the value 0) or
one of its actions executes a @code{return} statement.

@cindex EOF and yyrestart()
@cindex end-of-file, and yyrestart()
@cindex yyrestart()
If the scanner reaches an end-of-file, subsequent calls are undefined
unless either @file{yyin} is pointed at a new input file (in which case
scanning continues from that file), or @code{yyrestart()} is called.
@code{yyrestart()} takes one argument, a @code{FILE *} pointer (which
can be NULL, if you've set up @code{YY_INPUT} to scan from a source other
than @code{yyin}), and initializes @file{yyin} for scanning from that
file. Essentially there is no difference between just assigning
@file{yyin} to a new input file or using @code{yyrestart()} to do so;
the latter is available for compatibility with previous versions of
@code{flex}, and because it can be used to switch input files in the
middle of scanning. It can also be used to throw away the current input
buffer, by calling it with an argument of @file{yyin}; but it would be
better to use @code{YY_FLUSH_BUFFER} (@pxref{Actions}). Note that
@code{yyrestart()} does @emph{not} reset the start condition to
@code{INITIAL} (@pxref{Start Conditions}).

@cindex RETURN, within actions
If @code{yylex()} stops scanning due to executing a @code{return}
statement in one of the actions, the scanner may then be called again
and it will resume scanning where it left off.

@cindex YY_INPUT
By default (and for purposes of efficiency), the scanner uses
block-reads rather than simple @code{getc()} calls to read characters
from @file{yyin}.
The nature of how it gets its input can be controlled
by defining the @code{YY_INPUT} macro. The calling sequence for
@code{YY_INPUT()} is @code{YY_INPUT(buf,result,max_size)}. Its action
is to place up to @code{max_size} characters in the character array
@code{buf} and return in the integer variable @code{result} either the
number of characters read or the constant @code{YY_NULL} (0 on Unix
systems) to indicate @samp{EOF}. The default @code{YY_INPUT} reads from
the global file-pointer @file{yyin}.

@cindex YY_INPUT, overriding
Here is a sample definition of @code{YY_INPUT} (in the definitions
section of the input file):

@example
@verbatim
    %{
    #define YY_INPUT(buf,result,max_size) \
        { \
        int c = getchar(); \
        result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
        }
    %}
@end verbatim
@end example

This definition will change the input processing to occur one character
at a time.

@cindex yywrap()
When the scanner receives an end-of-file indication from @code{YY_INPUT}, it
then checks the @code{yywrap()} function. If @code{yywrap()} returns
false (zero), then it is assumed that the function has gone ahead and
set up @file{yyin} to point to another input file, and scanning
continues. If it returns true (non-zero), then the scanner terminates,
returning 0 to its caller. Note that in either case, the start
condition remains unchanged; it does @emph{not} revert to
@code{INITIAL}.

@cindex yywrap, default for
@cindex noyywrap, %option
@cindex %option noyywrap
If you do not supply your own version of @code{yywrap()}, then you must
either use @code{%option noyywrap} (in which case the scanner behaves as
though @code{yywrap()} returned 1), or you must link with @samp{-lfl} to
obtain the default version of the routine, which always returns 1.
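
As a sketch of a user-supplied @code{yywrap()}, the scanner below copies
each file named on the command line to the output in turn. The variables
@code{file_list}, @code{file_count} and @code{file_index} are
hypothetical helpers introduced for this illustration, and a file that
cannot be opened simply ends the scan.

@example
@verbatim
    %{
    #include <stdio.h>

    /* Hypothetical bookkeeping, filled in by main(). */
    static char **file_list;
    static int    file_count;
    static int    file_index;
    %}
    %%
    .|\n    ECHO;
    %%
    int yywrap( void )
        {
        if ( yyin && yyin != stdin )
            fclose( yyin );
        if ( file_index >= file_count )
            return 1;           /* no more input: terminate */
        yyin = fopen( file_list[file_index++], "r" );
        return yyin == NULL;    /* 0 means "keep scanning" */
        }

    int main( int argc, char **argv )
        {
        file_list  = argv + 1;
        file_count = argc - 1;

        /* Open the first file, if any; otherwise read stdin. */
        if ( file_count > 0 && yywrap() != 0 )
            return 1;
        yylex();
        return 0;
        }
@end verbatim
@end example

Since this scanner defines @code{yywrap()} itself, it uses neither
@code{%option noyywrap} nor @samp{-lfl}.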

For scanning from in-memory buffers (e.g., scanning strings), see
@ref{Scanning Strings}. @xref{Multiple Input Buffers}.

@cindex ECHO, and yyout
@cindex yyout
@cindex stdout, as default for yyout
The scanner writes its @code{ECHO} output to the @file{yyout} global
(default, @file{stdout}), which may be redefined by the user simply by
assigning it to some other @code{FILE} pointer.

@node Start Conditions, Multiple Input Buffers, Generated Scanner, Top
@chapter Start Conditions

@cindex start conditions
@code{flex} provides a mechanism for conditionally activating rules.
Any rule whose pattern is prefixed with @samp{<sc>} will only be active
when the scanner is in the @dfn{start condition} named @code{sc}. For
example,

@example
@verbatim
    <STRING>[^"]*        { /* eat up the string body ... */
                ...
                }
@end verbatim
@end example

will be active only when the scanner is in the @code{STRING} start
condition, and

@cindex start conditions, multiple
@example
@verbatim
    <INITIAL,STRING,QUOTE>\.        { /* handle an escape ... */
                ...
                }
@end verbatim
@end example

will be active only when the current start condition is either
@code{INITIAL}, @code{STRING}, or @code{QUOTE}.

@cindex start conditions, inclusive v.s.@: exclusive
Start conditions are declared in the definitions (first) section of the
input using unindented lines beginning with either @samp{%s} or
@samp{%x} followed by a list of names. The former declares
@dfn{inclusive} start conditions, the latter @dfn{exclusive} start
conditions. A start condition is activated using the @code{BEGIN}
action. Until the next @code{BEGIN} action is executed, rules with the
given start condition will be active and rules with other start
conditions will be inactive.
If the start condition is inclusive, then
rules with no start conditions at all will also be active. If it is
exclusive, then @emph{only} rules qualified with the start condition
will be active. A set of rules contingent on the same exclusive start
condition describes a scanner which is independent of any of the other
rules in the @code{flex} input. Because of this, exclusive start
conditions make it easy to specify ``mini-scanners'' which scan portions
of the input that are syntactically different from the rest (e.g.,
comments).

If the distinction between inclusive and exclusive start conditions
is still a little vague, here's a simple example illustrating the
connection between the two. The set of rules:

@cindex start conditions, inclusive
@example
@verbatim
    %s example
    %%

    <example>foo   do_something();

    bar            something_else();
@end verbatim
@end example

is equivalent to

@cindex start conditions, exclusive
@example
@verbatim
    %x example
    %%

    <example>foo   do_something();

    <INITIAL,example>bar    something_else();
@end verbatim
@end example

Without the @code{<INITIAL,example>} qualifier, the @code{bar} pattern in
the second example wouldn't be active (i.e., couldn't match) when in
start condition @code{example}. If we just used @code{<example>} to
qualify @code{bar}, though, then it would only be active in
@code{example} and not in @code{INITIAL}, while in the first example
it's active in both, because in the first example the @code{example}
start condition is an inclusive @code{(%s)} start condition.

@cindex start conditions, special wildcard condition
Also note that the special start-condition specifier @code{<*>}
matches every start condition.
Thus, the above example could also
have been written:

@cindex start conditions, use of wildcard condition (<*>)
@example
@verbatim
    %x example
    %%

    <example>foo   do_something();

    <*>bar    something_else();
@end verbatim
@end example

The default rule (to @code{ECHO} any unmatched character) remains active
in start conditions. It is equivalent to:

@cindex start conditions, behavior of default rule
@example
@verbatim
    <*>.|\n     ECHO;
@end verbatim
@end example

@cindex BEGIN, explanation
@findex BEGIN
@vindex INITIAL
@code{BEGIN(0)} returns to the original state where only the rules with
no start conditions are active. This state can also be referred to as
the start-condition @code{INITIAL}, so @code{BEGIN(INITIAL)} is
equivalent to @code{BEGIN(0)}. (The parentheses around the start
condition name are not required but are considered good style.)

@code{BEGIN} actions can also be given as indented code at the beginning
of the rules section. For example, the following will cause the scanner
to enter the @code{SPECIAL} start condition whenever @code{yylex()} is
called and the global variable @code{enter_special} is true:

@cindex start conditions, using BEGIN
@example
@verbatim
            int enter_special;

    %x SPECIAL
    %%
            if ( enter_special )
                BEGIN(SPECIAL);

    <SPECIAL>blahblahblah
    ...more rules follow...
@end verbatim
@end example

To illustrate the uses of start conditions, here is a scanner which
provides two different interpretations of a string like @samp{123.456}.
By default it will treat it as three tokens, the integer @samp{123}, a
dot (@samp{.}), and the integer @samp{456}.
But if the string is
preceded earlier in the line by the string @samp{expect-floats} it will
treat it as a single token, the floating-point number @samp{123.456}:

@cindex start conditions, for different interpretations of same input
@example
@verbatim
    %{
    #include <stdlib.h>
    %}
    %s expect

    %%
    expect-floats        BEGIN(expect);

    <expect>[0-9]+"."[0-9]+      {
                printf( "found a float, = %f\n",
                        atof( yytext ) );
                }
    <expect>\n           {
                /* that's the end of the line, so
                 * we need another "expect-floats"
                 * before we'll recognize any more
                 * numbers
                 */
                BEGIN(INITIAL);
                }

    [0-9]+      {
                printf( "found an integer, = %d\n",
                        atoi( yytext ) );
                }

    "."         printf( "found a dot\n" );
@end verbatim
@end example

@cindex comments, example of scanning C comments
Here is a scanner which recognizes (and discards) C comments while
maintaining a count of the current input line.

@cindex recognizing C comments
@example
@verbatim
    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\n]*        /* eat anything that's not a '*' */
    <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
    <comment>\n             ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);
@end verbatim
@end example

This scanner goes to a bit of trouble to match as much
text as possible with each rule. In general, when attempting to write
a high-speed scanner try to match as much as possible in each rule, as
it's a big win.

Note that start condition names are really integer values and
can be stored as such.
Thus, the above could be extended in the
following fashion:

@cindex start conditions, integer values
@cindex using integer values of start condition names
@example
@verbatim
    %x comment foo
    %%
            int line_num = 1;
            int comment_caller;

    "/*"         {
                 comment_caller = INITIAL;
                 BEGIN(comment);
                 }

    ...

    <foo>"/*"    {
                 comment_caller = foo;
                 BEGIN(comment);
                 }

    <comment>[^*\n]*        /* eat anything that's not a '*' */
    <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
    <comment>\n             ++line_num;
    <comment>"*"+"/"        BEGIN(comment_caller);
@end verbatim
@end example

@cindex YY_START, example
Furthermore, you can access the current start condition using the
integer-valued @code{YY_START} macro. For example, the above
assignments to @code{comment_caller} could instead be written

@cindex getting current start state with YY_START
@example
@verbatim
    comment_caller = YY_START;
@end verbatim
@end example

@vindex YY_START
Flex provides @code{YYSTATE} as an alias for @code{YY_START} (since that
is what's used by AT&T @code{lex}).

For historical reasons, start conditions do not have their own
name-space within the generated scanner. The start condition names are
unmodified in the generated scanner and generated header.
@xref{option-header}. @xref{option-prefix}.

Finally, here's an example of how to match C-style quoted strings using
exclusive start conditions, including expanded escape sequences (but
not including checking for a string that's too long):

@cindex matching C-style double-quoted strings
@example
@verbatim
    %x str

    %%
            char string_buf[MAX_STR_CONST];
            char *string_buf_ptr;


    \"      string_buf_ptr = string_buf; BEGIN(str);

    <str>\"        { /* saw closing quote - all done */
            BEGIN(INITIAL);
            *string_buf_ptr = '\0';
            /* return string constant token type and
             * value to parser
             */
            }

    <str>\n        {
            /* error - unterminated string constant */
            /* generate error message */
            }

    <str>\\[0-7]{1,3} {
            /* octal escape sequence */
            unsigned int result;  /* %o expects an unsigned int */

            (void) sscanf( yytext + 1, "%o", &result );

            if ( result > 0xff )
                    {
                    /* error, constant is out-of-bounds */
                    }
            else
                    *string_buf_ptr++ = (char) result;
            }

    <str>\\[0-9]+ {
            /* generate error - bad escape sequence; something
             * like '\48' or '\0777777'
             */
            }

    <str>\\n  *string_buf_ptr++ = '\n';
    <str>\\t  *string_buf_ptr++ = '\t';
    <str>\\r  *string_buf_ptr++ = '\r';
    <str>\\b  *string_buf_ptr++ = '\b';
    <str>\\f  *string_buf_ptr++ = '\f';

    <str>\\(.|\n)  *string_buf_ptr++ = yytext[1];

    <str>[^\\\n\"]+        {
            char *yptr = yytext;

            while ( *yptr )
                    *string_buf_ptr++ = *yptr++;
            }
@end verbatim
@end example

@cindex start condition, applying to multiple patterns
Often, such as in some of the examples above, you wind up writing a
whole bunch of rules all preceded by the same start condition(s). Flex
makes this a little easier and cleaner by introducing a notion of start
condition @dfn{scope}.
A start condition scope is begun with:

@example
@verbatim
    <SCs>{
@end verbatim
@end example

where @code{<SCs>} is a list of one or more start conditions.  Inside the
start condition scope, every rule automatically has the prefix
@code{<SCs>} applied to it, until a @samp{@}} which matches the initial
@samp{@{}.  So, for example,

@cindex extended scope of start conditions
@example
@verbatim
    <ESC>{
        "\\n"   return '\n';
        "\\r"   return '\r';
        "\\f"   return '\f';
        "\\0"   return '\0';
    }
@end verbatim
@end example

is equivalent to:

@example
@verbatim
    <ESC>"\\n"  return '\n';
    <ESC>"\\r"  return '\r';
    <ESC>"\\f"  return '\f';
    <ESC>"\\0"  return '\0';
@end verbatim
@end example

Start condition scopes may be nested.

@cindex stacks, routines for manipulating
@cindex start conditions, use of a stack

The following routines are available for manipulating stacks of start conditions:

@deftypefun  void yy_push_state ( int @code{new_state} )
pushes the current start condition onto the top of the start condition
stack and switches to
@code{new_state}
as though you had used
@code{BEGIN new_state}
(recall that start condition names are also integers).
@end deftypefun

@deftypefun void yy_pop_state ()
pops the top of the stack and switches to it via
@code{BEGIN}.
@end deftypefun

@deftypefun int yy_top_state ()
returns the top of the stack without altering the stack's contents.
@end deftypefun

@cindex memory, for start condition stacks
The start condition stack grows dynamically and so has no built-in size
limitation.  If memory is exhausted, program execution aborts.

To use start condition stacks, your scanner must include a @code{%option
stack} directive (@pxref{Scanner Options}).
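
As a rough sketch of how the stack routines and a start condition scope
fit together, the following scanner strips comments that are allowed to
nest.  Nesting comments are an assumption made purely for illustration
(C comments do not nest), and the condition name @code{COMMENT} and
@code{main()} are hypothetical.  The option @code{noyy_top_state} is
given only because @code{yy_top_state()} is not called here.

@example
@verbatim
%option noyywrap stack noyy_top_state
%x COMMENT

%{
#include <stdio.h>
%}

%%
"/*"             yy_push_state( COMMENT );  /* save INITIAL, enter COMMENT */

<COMMENT>{
    "/*"         yy_push_state( COMMENT );  /* nested comment: push again */
    "*/"         yy_pop_state();            /* back to the saved condition */
    [^*/\n]+     /* eat comment text */
    [*/]         /* eat a lone '*' or '/' */
    \n           /* eat newlines inside comments */
}

.|\n             ECHO;
%%

int main( void )
    {
    yylex();
    return 0;
    }
@end verbatim
@end example

Each @code{yy_pop_state()} returns to whichever condition was active when
the matching @code{yy_push_state()} was called, so the scanner only
resumes echoing text once the outermost comment has been closed.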

@node Multiple Input Buffers, EOF, Start Conditions, Top
@chapter Multiple Input Buffers

@cindex multiple input streams
Some scanners (such as those which support ``include'' files) require
reading from several input streams.  As @code{flex} scanners do a large
amount of buffering, one cannot control where the next input will be
read from by simply writing a @code{YY_INPUT()} which is sensitive to
the scanning context.  @code{YY_INPUT()} is only called when the scanner
reaches the end of its buffer, which may be a long time after scanning a
statement such as an @code{include} statement which requires switching
the input source.

To negotiate these sorts of problems, @code{flex} provides a mechanism
for creating and switching between multiple input buffers.  An input
buffer is created by using:

@cindex memory, allocating input buffers
@deftypefun YY_BUFFER_STATE yy_create_buffer ( FILE *file, int size )
@end deftypefun

which takes a @code{FILE} pointer and a size and creates a buffer
associated with the given file and large enough to hold @code{size}
characters (when in doubt, use @code{YY_BUF_SIZE} for the size).  It
returns a @code{YY_BUFFER_STATE} handle, which may then be passed to
other routines (see below).
@tindex YY_BUFFER_STATE
The @code{YY_BUFFER_STATE} type is a
pointer to an opaque @code{struct yy_buffer_state} structure, so you may
safely initialize @code{YY_BUFFER_STATE} variables to @code{((YY_BUFFER_STATE)
0)} if you wish, and also refer to the opaque structure in order to
correctly declare input buffers in source files other than that of your
scanner.  Note that the @code{FILE} pointer in the call to
@code{yy_create_buffer} is only used as the value of @file{yyin} seen by
@code{YY_INPUT}.
If you redefine @code{YY_INPUT()} so it no longer uses
@file{yyin}, then you can safely pass a NULL @code{FILE} pointer to
@code{yy_create_buffer}. You select a particular buffer to scan from
using:

@deftypefun void yy_switch_to_buffer ( YY_BUFFER_STATE new_buffer )
@end deftypefun

The above function switches the scanner's input buffer so subsequent tokens
will come from @code{new_buffer}.  Note that @code{yy_switch_to_buffer()} may
be used by @code{yywrap()} to set things up for continued scanning, instead of
opening a new file and pointing @file{yyin} at it. If you are looking for a
stack of input buffers, then you want to use @code{yypush_buffer_state()}
instead of this function. Note also that switching input sources via either
@code{yy_switch_to_buffer()} or @code{yywrap()} does @emph{not} change the
start condition.

@cindex memory, deleting input buffers
@deftypefun void yy_delete_buffer ( YY_BUFFER_STATE buffer )
@end deftypefun

is used to reclaim the storage associated with a buffer.  (@code{buffer}
can be NULL, in which case the routine does nothing.)

@cindex pushing an input buffer
@cindex stack, input buffer push
@deftypefun void yypush_buffer_state ( YY_BUFFER_STATE buffer )
@end deftypefun

This function pushes the new buffer state onto an internal stack. The pushed
state becomes the new current state. The stack is maintained by flex and will
grow as required. This function is intended to be used instead of
@code{yy_switch_to_buffer}, when you want to change states, but preserve the
current state for later use.

@cindex popping an input buffer
@cindex stack, input buffer pop
@deftypefun void yypop_buffer_state ( )
@end deftypefun

This function removes the current state from the top of the stack, and deletes
it by calling @code{yy_delete_buffer}.  The next state on the stack, if any,
becomes the new current state.

@cindex clearing an input buffer
@cindex flushing an input buffer
@deftypefun void yy_flush_buffer ( YY_BUFFER_STATE buffer )
@end deftypefun

This function discards the buffer's contents,
so the next time the scanner attempts to match a token from the
buffer, it will first fill the buffer anew using
@code{YY_INPUT()}.

@deftypefun YY_BUFFER_STATE yy_new_buffer ( FILE *file, int size )
@end deftypefun

is an alias for @code{yy_create_buffer()},
provided for compatibility with the C++ use of @code{new} and
@code{delete} for creating and destroying dynamic objects.

@cindex YY_CURRENT_BUFFER, and multiple buffers
Finally, the @code{YY_CURRENT_BUFFER} macro returns a
@code{YY_BUFFER_STATE} handle to the
current buffer. It should not be used as an lvalue.

@cindex EOF, example using multiple input buffers
Here are two examples of using these features for writing a scanner
which expands include files (the
@code{<<EOF>>}
feature is discussed below).

This first example uses @code{yypush_buffer_state} and
@code{yypop_buffer_state}. Flex maintains the stack internally.

@cindex handling include files with multiple input buffers
@example
@verbatim
    /* the "incl" state is used for picking up the name
     * of an include file
     */
    %x incl
    %%
    include             BEGIN(incl);

    [a-z]+              ECHO;
    [^a-z\n]*\n?        ECHO;

    <incl>[ \t]*      /* eat the whitespace */
    <incl>[^ \t\n]+   { /* got the include file name */
            yyin = fopen( yytext, "r" );

            if ( ! yyin )
                error( ...
                );

            yypush_buffer_state(yy_create_buffer( yyin, YY_BUF_SIZE ));

            BEGIN(INITIAL);
            }

    <<EOF>> {
            yypop_buffer_state();

            if ( !YY_CURRENT_BUFFER )
                {
                yyterminate();
                }
            }
@end verbatim
@end example

The second example, below, does the same thing as the previous example did, but
manages its own input buffer stack manually (instead of letting flex do it).

@cindex handling include files with multiple input buffers
@example
@verbatim
    /* the "incl" state is used for picking up the name
     * of an include file
     */
    %x incl

    %{
    #define MAX_INCLUDE_DEPTH 10
    YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
    int include_stack_ptr = 0;
    %}

    %%
    include             BEGIN(incl);

    [a-z]+              ECHO;
    [^a-z\n]*\n?        ECHO;

    <incl>[ \t]*      /* eat the whitespace */
    <incl>[^ \t\n]+   { /* got the include file name */
            if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
                {
                fprintf( stderr, "Includes nested too deeply" );
                exit( 1 );
                }

            include_stack[include_stack_ptr++] =
                YY_CURRENT_BUFFER;

            yyin = fopen( yytext, "r" );

            if ( ! yyin )
                error( ... );

            yy_switch_to_buffer(
                yy_create_buffer( yyin, YY_BUF_SIZE ) );

            BEGIN(INITIAL);
            }

    <<EOF>> {
            if ( --include_stack_ptr < 0 )
                {
                yyterminate();
                }

            else
                {
                yy_delete_buffer( YY_CURRENT_BUFFER );
                yy_switch_to_buffer(
                     include_stack[include_stack_ptr] );
                }
            }
@end verbatim
@end example

@anchor{Scanning Strings}
@cindex strings, scanning strings instead of files
The following routines are available for setting up input buffers for
scanning in-memory strings instead of files.
All of them create a new
input buffer for scanning the string, and return a corresponding
@code{YY_BUFFER_STATE} handle (which you should delete with
@code{yy_delete_buffer()} when done with it).  They also switch to the
new buffer using @code{yy_switch_to_buffer()}, so the next call to
@code{yylex()} will start scanning the string.

@deftypefun YY_BUFFER_STATE yy_scan_string ( const char *str )
scans a NUL-terminated string.
@end deftypefun

@deftypefun YY_BUFFER_STATE yy_scan_bytes ( const char *bytes, int len )
scans @code{len} bytes (including possibly @code{NUL}s) starting at location
@code{bytes}.
@end deftypefun

Note that both of these functions create and scan a @emph{copy} of the
string or bytes.  (This may be desirable, since @code{yylex()} modifies
the contents of the buffer it is scanning.)  You can avoid the copy by
using:

@vindex YY_END_OF_BUFFER_CHAR
@deftypefun YY_BUFFER_STATE yy_scan_buffer (char *base, yy_size_t size)
which scans in place the buffer starting at @code{base}, consisting of
@code{size} bytes, the last two bytes of which @emph{must} be
@code{YY_END_OF_BUFFER_CHAR} (ASCII NUL).  These last two bytes are not
scanned; thus, scanning consists of @code{base[0]} through
@code{base[size-3]}, inclusive.
@end deftypefun

If you fail to set up @code{base} in this manner (i.e., forget the final
two @code{YY_END_OF_BUFFER_CHAR} bytes), then @code{yy_scan_buffer()}
returns a NULL pointer instead of creating a new input buffer.

@deftp  {Data type} yy_size_t
is an integral type to which you can cast an integer expression
reflecting the size of the buffer.
@end deftp

@node EOF, Misc Macros, Multiple Input Buffers, Top
@chapter End-of-File Rules

@cindex EOF, explanation
The special rule @code{<<EOF>>} indicates
actions which are to be taken when an end-of-file is
encountered and @code{yywrap()} returns non-zero (i.e., indicates
no further files to process).  The action must finish
by doing one of the following things:

@itemize
@item
@findex YY_NEW_FILE  (now obsolete)
assigning @file{yyin} to a new input file (in previous versions of
@code{flex}, after doing the assignment you had to call the special
action @code{YY_NEW_FILE}.  This is no longer necessary.)

@item
executing a @code{return} statement;

@item
executing the special @code{yyterminate()} action.

@item
or, switching to a new buffer using @code{yy_switch_to_buffer()} as
shown in the example above.
@end itemize

@code{<<EOF>>} rules may not be used with other patterns; they may only be
qualified with a list of start conditions.  If an unqualified @code{<<EOF>>}
rule is given, it applies to @emph{all} start conditions which do not
already have @code{<<EOF>>} actions.  To specify an @code{<<EOF>>} rule for only the
initial start condition, use:

@example
@verbatim
    <INITIAL><<EOF>>
@end verbatim
@end example

These rules are useful for catching things like unclosed comments.  An
example:

@cindex <<EOF>>, use of
@example
@verbatim
    %x quote
    %%

    ...other rules for dealing with quotes...

    <quote><<EOF>>   {
             error( "unterminated quote" );
             yyterminate();
             }
    <<EOF>>  {
             if ( *++filelist )
                 yyin = fopen( *filelist, "r" );
             else
                 yyterminate();
             }
@end verbatim
@end example

@node Misc Macros, User Values, EOF, Top
@chapter Miscellaneous Macros

@hkindex YY_USER_ACTION
The macro @code{YY_USER_ACTION} can be defined to provide an action
which is always executed prior to the matched rule's action.  For
example, it could be @code{#define}'d to call a routine to convert
@code{yytext} to
lower-case.  When @code{YY_USER_ACTION} is invoked, the variable
@code{yy_act} gives the number of the matched rule (rules are numbered
starting with 1).  Suppose you want to profile how often each of your
rules is matched.  The following would do the trick:

@cindex YY_USER_ACTION to track each time a rule is matched
@example
@verbatim
    #define YY_USER_ACTION ++ctr[yy_act]
@end verbatim
@end example

@vindex YY_NUM_RULES
where @code{ctr} is an array to hold the counts for the different rules.
Note that the macro @code{YY_NUM_RULES} gives the total number of rules
(including the default rule), even if you use @samp{-s}, so a correct
declaration for @code{ctr} is:

@example
@verbatim
    int ctr[YY_NUM_RULES];
@end verbatim
@end example

@hkindex YY_USER_INIT
The macro @code{YY_USER_INIT} may be defined to provide an action which
is always executed before the first scan (and before the scanner's
internal initializations are done).  For example, it could be used to
call a routine to read in a data table or open a logging file.

@findex yy_set_interactive
The macro @code{yy_set_interactive(is_interactive)} can be used to
control whether the current buffer is considered @dfn{interactive}.
An
interactive buffer is processed more slowly, but must be used when the
scanner's input source is indeed interactive to avoid problems due to
waiting to fill buffers (see the discussion of the @samp{-I} flag in
@ref{Scanner Options}).  A non-zero value in the macro invocation marks
the buffer as interactive, a zero value as non-interactive.  Note that
use of this macro overrides @code{%option always-interactive} or
@code{%option never-interactive} (@pxref{Scanner Options}).
@code{yy_set_interactive()} must be invoked prior to beginning to scan
the buffer that is (or is not) to be considered interactive.

@cindex BOL, setting it
@findex yy_set_bol
The macro @code{yy_set_bol(at_bol)} can be used to control whether the
current buffer's scanning context for the next token match is done as
though at the beginning of a line.  A non-zero macro argument makes
rules anchored with @samp{^} active, while a zero argument makes
@samp{^} rules inactive.

@cindex BOL, checking the BOL flag
@findex YY_AT_BOL
The macro @code{YY_AT_BOL()} returns true if the next token scanned from
the current buffer will have @samp{^} rules active, false otherwise.

@cindex actions, redefining YY_BREAK
@hkindex YY_BREAK
In the generated scanner, the actions are all gathered in one large
switch statement and separated using @code{YY_BREAK}, which may be
redefined.  By default, it is simply a @code{break}, to separate each
rule's action from the following rule's.  Redefining @code{YY_BREAK}
allows, for example, C++ users to @code{#define YY_BREAK} to do nothing
(while being very careful that every rule ends with a @code{break} or a
@code{return}!).  This avoids unreachable-statement warnings in cases
where a rule's action ends with @code{return}, which makes the
@code{YY_BREAK} that follows it inaccessible.
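
The rule-profiling idea described at the start of this chapter can be
sketched as a complete scanner.  This is an illustration only: the three
rules and @code{main()} are hypothetical.  Two choices here are
deliberate precautions rather than documented requirements.  The array
is declared one element larger than @code{YY_NUM_RULES} because rule
numbers start at 1, and the macro body ends with a semicolon on the
assumption that the generated scanner expands the macro as a statement
of its own.

@example
@verbatim
%option noyywrap

%{
#include <stdio.h>

/* Rule numbers start at 1, so index 0 is unused; the extra
 * element is a precaution for the highest-numbered rule.
 */
static int ctr[YY_NUM_RULES + 1];

/* Executed before each matched rule's action. */
#define YY_USER_ACTION ++ctr[yy_act];
%}

%%
[0-9]+      /* rule 1: integers */
[a-zA-Z]+   /* rule 2: words */
.|\n        /* rule 3: anything else */
%%

int main( void )
    {
    int i;

    yylex();

    /* Report one count per rule, including the default rule. */
    for ( i = 1; i <= YY_NUM_RULES; ++i )
        printf( "rule %d matched %d times\n", i, ctr[i] );

    return 0;
    }
@end verbatim
@end example

Since the third rule matches every character the first two do not, the
count reported for the default rule should stay at zero for this
particular scanner.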

@node User Values, Yacc, Misc Macros, Top
@chapter Values Available To the User

This chapter summarizes the various values available to the user in the
rule actions.

@table @code
@vindex yytext
@item char *yytext
holds the text of the current token.  It may be modified but not
lengthened (you cannot append characters to the end).

@cindex yytext, default array size
@cindex array, default size for yytext
@vindex YYLMAX
If the special directive @code{%array} appears in the first section of
the scanner description, then @code{yytext} is instead declared
@code{char yytext[YYLMAX]}, where @code{YYLMAX} is a macro definition
that you can redefine in the first section if you don't like the default
value (generally 8KB).  Using @code{%array} results in somewhat slower
scanners, but the value of @code{yytext} becomes immune to calls to
@code{unput()}, which potentially destroy its value when @code{yytext} is
a character pointer.  The opposite of @code{%array} is @code{%pointer},
which is the default.

@cindex C++ and %array
You cannot use @code{%array} when generating C++ scanner classes (the
@samp{-+} flag).

@vindex yyleng
@item int yyleng
holds the length of the current token.

@vindex yyin
@item FILE *yyin
is the file which by default @code{flex} reads from.  It may be
reassigned, but doing so only makes sense before scanning begins or after
an EOF has been encountered.  Changing it in the midst of scanning will
have unexpected results since @code{flex} buffers its input; use
@code{yyrestart()} instead.  Once scanning terminates because an
end-of-file has been seen, you can point @file{yyin} at the new input
file and then call the scanner again to continue scanning.

@findex yyrestart
@item void yyrestart( FILE *new_file )
may be called to point @file{yyin} at the new input file.
The
switch-over to the new file is immediate (any previously buffered-up
input is lost).  Note that calling @code{yyrestart()} with @file{yyin}
as an argument thus throws away the current input buffer and continues
scanning the same input file.

@vindex yyout
@item FILE *yyout
is the file to which @code{ECHO} actions are done.  It can be reassigned
by the user.

@vindex YY_CURRENT_BUFFER
@item YY_CURRENT_BUFFER
returns a @code{YY_BUFFER_STATE} handle to the current buffer.

@vindex YY_START
@item YY_START
returns an integer value corresponding to the current start condition.
You can subsequently use this value with @code{BEGIN} to return to that
start condition.
@end table

@node Yacc, Scanner Options, User Values, Top
@chapter Interfacing with Yacc

@cindex yacc, interface

@vindex yylval, with yacc
One of the main uses of @code{flex} is as a companion to the @code{yacc}
parser-generator.  @code{yacc} parsers expect to call a routine named
@code{yylex()} to find the next input token.  The routine is supposed to
return the type of the next token as well as putting any associated
value in the global @code{yylval}.  To use @code{flex} with @code{yacc},
one specifies the @samp{-d} option to @code{yacc} to instruct it to
generate the file @file{y.tab.h} containing definitions of all the
@code{%token}s appearing in the @code{yacc} input.  This file is then
included in the @code{flex} scanner.
For example, if one of the tokens
is @code{TOK_NUMBER}, part of the scanner might look like:

@cindex yacc interface
@example
@verbatim
    %{
    #include "y.tab.h"
    %}

    %%

    [0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;
@end verbatim
@end example

@node Scanner Options, Performance, Yacc, Top
@chapter Scanner Options

@cindex command-line options
@cindex options, command-line
@cindex arguments, command-line

The various @code{flex} options are categorized by function in the following
menu. If you want to look up a particular option by name, see
@ref{Index of Scanner Options}.

@menu
* Options for Specifying Filenames::
* Options Affecting Scanner Behavior::
* Code-Level And API Options::
* Options for Scanner Speed and Size::
* Debugging Options::
* Miscellaneous Options::
@end menu

Even though there are many scanner options, a typical scanner might only
specify the following options:

@example
@verbatim
%option   8bit reentrant bison-bridge
%option   warn nodefault
%option   yylineno
%option   outfile="scanner.c" header-file="scanner.h"
@end verbatim
@end example

The first line specifies the general type of scanner we want. The second line
specifies that we are being careful. The third line asks flex to track line
numbers. The last line tells flex what to name the files. (The options can be
specified in any order. We just divided them.)

@code{flex} also provides a mechanism for controlling options within the
scanner specification itself, rather than from the flex command-line.
This is done by including @code{%option} directives in the first section
of the scanner specification.  You can specify multiple options with a
single @code{%option} directive, and multiple directives in the first
section of your flex input file.
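
As a small sketch of @code{%option} directives in a complete input file,
the following word counter puts two options on one directive and a third
on a directive of its own.  The output file name @file{wc.c}, the
variable @code{nwords}, and @code{main()} are hypothetical names chosen
for the illustration.

@example
@verbatim
%option noyywrap nodefault
%option outfile="wc.c"

%{
#include <stdio.h>

static long nwords = 0;
%}

%%
[^ \t\n]+    ++nwords;           /* a run of non-blank characters */
[ \t\n]+     /* skip whitespace */
%%

int main( void )
    {
    yylex();
    printf( "%ld words\n", nwords );
    return 0;
    }
@end verbatim
@end example

The same choices could instead be given on the command line as
@samp{--noyywrap --nodefault --outfile=wc.c}.  Because the two rules
between them match every possible input character, suppressing the
default rule with @code{nodefault} is safe here.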

Most options are given simply as names, optionally preceded by the
word @samp{no} (with no intervening whitespace) to negate their meaning.
The names are the same as their long-option equivalents (but without the
leading @samp{--}).

@code{flex} scans your rule actions to determine whether you use the
@code{REJECT} or @code{yymore()} features.  The @code{reject} and
@code{yymore} options are available to override its decision as to
whether you use these features, either by setting them (e.g., @code{%option
reject}) to indicate the feature is indeed used, or unsetting them to
indicate it actually is not used (e.g., @code{%option noyymore}).


A number of options are available for lint purists who want to suppress
the appearance of unneeded routines in the generated scanner.  Each of
the following, if unset (e.g., @code{%option nounput}), results in the
corresponding routine not appearing in the generated scanner:

@example
@verbatim
    input, unput
    yy_push_state, yy_pop_state, yy_top_state
    yy_scan_buffer, yy_scan_bytes, yy_scan_string

    yyget_extra, yyset_extra, yyget_leng, yyget_text,
    yyget_lineno, yyset_lineno, yyget_in, yyset_in,
    yyget_out, yyset_out, yyget_lval, yyset_lval,
    yyget_lloc, yyset_lloc, yyget_debug, yyset_debug
@end verbatim
@end example

(though @code{yy_push_state()} and friends won't appear anyway unless
you use @code{%option stack}).

@node Options for Specifying Filenames, Options Affecting Scanner Behavior, Scanner Options, Scanner Options
@section Options for Specifying Filenames

@table @samp

@anchor{option-header}
@opindex ---header-file
@opindex header-file
@item --header-file=FILE, @code{%option header-file="FILE"}
instructs flex to write a C header to @file{FILE}. This file contains
function prototypes, extern variables, and types used by the scanner.
Only the external API is exported by the header file. Many macros that
are usable from within scanner actions are not exported to the header
file. This is due to namespace problems and the goal of a clean
external API.

While in the header, the macro @code{yyIN_HEADER} is defined, where @samp{yy}
is substituted with the appropriate prefix.

The @samp{--header-file} option is not compatible with the @samp{--c++} option,
since the C++ scanner provides its own header in @file{yyFlexLexer.h}.



@anchor{option-outfile}
@opindex -o
@opindex ---outfile
@opindex outfile
@item -oFILE, --outfile=FILE, @code{%option outfile="FILE"}
directs flex to write the scanner to the file @file{FILE} instead of
@file{lex.yy.c}. If you combine @samp{--outfile} with the @samp{--stdout} option,
then the scanner is written to @file{stdout} but its @code{#line}
directives (see the @samp{-L} option) refer to the file
@file{FILE}.



@anchor{option-stdout}
@opindex -t
@opindex ---stdout
@opindex stdout
@item -t, --stdout, @code{%option stdout}
instructs @code{flex} to write the scanner it generates to standard
output instead of @file{lex.yy.c}.



@opindex ---skel
@item -SFILE, --skel=FILE
overrides the default skeleton file from which
@code{flex}
constructs its scanners.  You'll never need this option unless you are doing
@code{flex}
maintenance or development.

@opindex ---tables-file
@opindex tables-file
@item --tables-file=FILE
writes serialized scanner DFA tables to @file{FILE}. The generated scanner will not
contain the tables, and requires them to be loaded at runtime.
@xref{serialization}.

@opindex ---tables-verify
@opindex tables-verify
@item --tables-verify
This option is for flex development.
We document it here in case you stumble
upon it by accident or in case you suspect some inconsistency in the serialized
tables.  Flex will serialize the scanner dfa tables but will also generate the
in-code tables as it normally does. At runtime, the scanner will verify that
the serialized tables match the in-code tables, instead of loading them.

@end table

@node Options Affecting Scanner Behavior, Code-Level And API Options, Options for Specifying Filenames, Scanner Options
@section Options Affecting Scanner Behavior

@table @samp
@anchor{option-case-insensitive}
@opindex -i
@opindex ---case-insensitive
@opindex case-insensitive
@item -i, --case-insensitive, @code{%option case-insensitive}
instructs @code{flex} to generate a @dfn{case-insensitive} scanner.  The
case of letters given in the @code{flex} input patterns will be ignored,
and tokens in the input will be matched regardless of case.  The matched
text given in @code{yytext} will have the preserved case (i.e., it will
not be folded).  For tricky behavior, see @ref{case and character ranges}.



@anchor{option-lex-compat}
@opindex -l
@opindex ---lex-compat
@opindex lex-compat
@item -l, --lex-compat, @code{%option lex-compat}
turns on maximum compatibility with the original AT&T @code{lex}
implementation.  Note that this does not mean @emph{full} compatibility.
Use of this option costs a considerable amount of performance, and it
cannot be used with the @samp{--c++}, @samp{--full}, @samp{--fast}, @samp{-Cf}, or
@samp{-CF} options.  For details on the compatibilities it provides, see
@ref{Lex and Posix}.  This option also results in the name
@code{YY_FLEX_LEX_COMPAT} being @code{#define}'d in the generated scanner.



@anchor{option-batch}
@opindex -B
@opindex ---batch
@opindex batch
@item -B, --batch, @code{%option batch}
instructs @code{flex} to generate a @dfn{batch} scanner, the opposite of
@emph{interactive} scanners generated by @samp{--interactive} (see below).  In
general, you use @samp{-B} when you are @emph{certain} that your scanner
will never be used interactively, and you want to squeeze a
@emph{little} more performance out of it.  If your goal is instead to
squeeze out a @emph{lot} more performance, you should be using the
@samp{-Cf} or @samp{-CF} options, which turn on @samp{--batch} automatically
anyway.



@anchor{option-interactive}
@opindex -I
@opindex ---interactive
@opindex interactive
@item -I, --interactive, @code{%option interactive}
instructs @code{flex} to generate an @i{interactive} scanner.  An
interactive scanner is one that only looks ahead to decide what token
has been matched if it absolutely must.  It turns out that always
looking one extra character ahead, even if the scanner has already seen
enough text to disambiguate the current token, is a bit faster than only
looking ahead when necessary.  But scanners that always look ahead give
dreadful interactive performance; for example, when a user types a
newline, it is not recognized as a newline token until they enter
@emph{another} token, which often means typing in another whole line.

@code{flex} scanners default to @code{interactive} unless you use the
@samp{-Cf} or @samp{-CF} table-compression options
(@pxref{Performance}).  That's because if you're looking for
high-performance you should be using one of these options, so if you
didn't, @code{flex} assumes you'd rather trade off a bit of run-time
performance for intuitive interactive behavior.
Note also that you
@emph{cannot} use @samp{--interactive} in conjunction with @samp{-Cf} or
@samp{-CF}.  Thus, this option is not really needed; it is on by default
for all those cases in which it is allowed.

You can force a scanner to
@emph{not}
be interactive by using
@samp{--batch}.



@anchor{option-7bit}
@opindex -7
@opindex ---7bit
@opindex 7bit
@item -7, --7bit, @code{%option 7bit}
instructs @code{flex} to generate a 7-bit scanner, i.e., one which can
only recognize 7-bit characters in its input.  The advantage of using
@samp{--7bit} is that the scanner's tables can be up to half the size of
those generated using @samp{--8bit}.  The disadvantage is that such
scanners often hang or crash if their input contains an 8-bit character.

Note, however, that unless you generate your scanner using the
@samp{-Cf} or @samp{-CF} table compression options, use of @samp{--7bit}
will save only a small amount of table space, and make your scanner
considerably less portable.  @code{Flex}'s default behavior is to
generate an 8-bit scanner unless you use the @samp{-Cf} or @samp{-CF} options,
in which case @code{flex} defaults to generating 7-bit scanners unless
your site was always configured to generate 8-bit scanners (as will
often be the case with non-USA sites).  You can tell whether flex
generated a 7-bit or an 8-bit scanner by inspecting the flag summary in
the @samp{--verbose} output as described above.

Note that if you use @samp{-Cfe} or @samp{-CFe}, @code{flex} still
defaults to generating an 8-bit scanner, since usually with these
compression options full 8-bit tables are not much more expensive than
7-bit tables.



@anchor{option-8bit}
@opindex -8
@opindex ---8bit
@opindex 8bit
@item -8, --8bit, @code{%option 8bit}
instructs @code{flex} to generate an 8-bit scanner, i.e., one which can
recognize 8-bit characters.  This flag is only needed for scanners
generated using @samp{-Cf} or @samp{-CF}, as otherwise flex defaults to
generating an 8-bit scanner anyway.

See the discussion of
@samp{--7bit}
above for @code{flex}'s default behavior and the tradeoffs between 7-bit
and 8-bit scanners.



@anchor{option-default}
@opindex ---default
@opindex default
@item --default, @code{%option default}
generates the default rule.



@anchor{option-always-interactive}
@opindex ---always-interactive
@opindex always-interactive
@item --always-interactive, @code{%option always-interactive}
instructs flex to generate a scanner which always considers its input
@emph{interactive}.  Normally, on each new input file the scanner calls
@code{isatty()} in an attempt to determine whether the scanner's input
source is interactive and thus should be read a character at a time.
When this option is used, however, no such call is made.



@opindex ---never-interactive
@item --never-interactive, @code{%option never-interactive}
instructs flex to generate a scanner which never considers its input
interactive.  This is the opposite of @code{always-interactive}.


@anchor{option-posix}
@opindex -X
@opindex ---posix
@opindex posix
@item -X, --posix, @code{%option posix}
turns on maximum compatibility with the POSIX 1003.2-1992 definition of
@code{lex}.  Since @code{flex} was originally designed to implement the
POSIX definition of @code{lex} this generally involves very few changes
in behavior.
At the current writing the known differences between
@code{flex} and the POSIX standard are:

@itemize
@item
In POSIX and AT&T @code{lex}, the repeat operator, @samp{@{@}}, has lower
precedence than concatenation (thus @samp{ab@{3@}} yields @samp{ababab}).
Most POSIX utilities use an Extended Regular Expression (ERE) precedence
that has the precedence of the repeat operator higher than concatenation
(which causes @samp{ab@{3@}} to yield @samp{abbb}). By default, @code{flex}
places the precedence of the repeat operator higher than concatenation
which matches the ERE processing of other POSIX utilities. When either
@samp{--posix} or @samp{-l} is specified, @code{flex} will use the
traditional AT&T and POSIX-compliant precedence for the repeat operator
where concatenation has higher precedence than the repeat operator.
@end itemize


@anchor{option-stack}
@opindex ---stack
@opindex stack
@item --stack, @code{%option stack}
enables the use of
start condition stacks (@pxref{Start Conditions}).



@anchor{option-stdinit}
@opindex ---stdinit
@opindex stdinit
@item --stdinit, @code{%option stdinit}
if set (i.e., @code{%option stdinit}), initializes @code{yyin} and
@code{yyout} to @file{stdin} and @file{stdout}, instead of the default of
@file{NULL}. Some existing @code{lex} programs depend on this behavior,
even though it is not compliant with ANSI C, which does not require
@file{stdin} and @file{stdout} to be compile-time constant. In a
reentrant scanner, however, this is not a problem since initialization
is performed in @code{yylex_init} at runtime.
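
As a minimal sketch of the portable alternative to @code{%option stdinit},
the streams can be assigned explicitly at run time before the first call
to @code{yylex()}. The rule and the @code{main()} shown here are
illustrative only, and the indentation is for presentation:

@example
@verbatim
    %option noyywrap
    %%
    .|\n    ECHO;
    %%
    int main(void)
    {
        /* Without %option stdinit these start out as NULL, so
           assign them at run time, not in a static initializer. */
        yyin  = stdin;
        yyout = stdout;
        return yylex();
    }
@end verbatim
@end example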



@anchor{option-yylineno}
@opindex ---yylineno
@opindex yylineno
@item --yylineno, @code{%option yylineno}
directs @code{flex} to generate a scanner
that maintains the number of the current line read from its input in the
global variable @code{yylineno}. This option is implied by @code{%option
lex-compat}. In a reentrant C scanner, the macro @code{yylineno} is
accessible regardless of the value of @code{%option yylineno}; however, its
value is not modified by @code{flex} unless @code{%option yylineno} is enabled.



@anchor{option-yywrap}
@opindex ---yywrap
@opindex yywrap
@item --yywrap, @code{%option yywrap}
if unset (i.e., @code{--noyywrap}), makes the scanner not call
@code{yywrap()} upon an end-of-file, but simply assume that there are no
more files to scan (until the user points @file{yyin} at a new file and
calls @code{yylex()} again).

@end table

@node Code-Level And API Options, Options for Scanner Speed and Size, Options Affecting Scanner Behavior, Scanner Options
@section Code-Level And API Options

@table @samp

@anchor{option-ansi-definitions}
@opindex ---option-ansi-definitions
@opindex ansi-definitions
@item --ansi-definitions, @code{%option ansi-definitions}
Deprecated and ignored.

@anchor{option-ansi-prototypes}
@opindex ---option-ansi-prototypes
@opindex ansi-prototypes
@item --ansi-prototypes, @code{%option ansi-prototypes}
Deprecated and ignored.

@anchor{option-bison-bridge}
@opindex ---bison-bridge
@opindex bison-bridge
@item --bison-bridge, @code{%option bison-bridge}
instructs flex to generate a C scanner that is
meant to be called by a
@code{GNU bison}
parser. The scanner has minor API changes for
@code{bison}
compatibility.
In particular, the declaration of
@code{yylex}
is modified to take an additional parameter,
@code{yylval}.
@xref{Bison Bridge}.

@anchor{option-bison-locations}
@opindex ---bison-locations
@opindex bison-locations
@item --bison-locations, @code{%option bison-locations}
instructs flex that
@code{GNU bison} @code{%locations} are being used.
This means @code{yylex} will be passed
an additional parameter, @code{yylloc}. This option
implies @code{%option bison-bridge}.
@xref{Bison Bridge}.

@anchor{option-noline}
@opindex -L
@opindex ---noline
@opindex noline
@item -L, --noline, @code{%option noline}
instructs
@code{flex}
not to generate
@code{#line}
directives. Without this option,
@code{flex}
peppers the generated scanner
with @code{#line} directives so error messages in the actions will be correctly
located with respect to either the original
@code{flex}
input file (if the errors are due to code in the input file), or
@file{lex.yy.c}
(if the errors are
@code{flex}'s
fault -- you should report these sorts of errors to the email address
given in @ref{Reporting Bugs}).



@anchor{option-reentrant}
@opindex -R
@opindex ---reentrant
@opindex reentrant
@item -R, --reentrant, @code{%option reentrant}
instructs flex to generate a reentrant C scanner. The generated scanner
may safely be used in a multi-threaded environment. The API for a
reentrant scanner is different from that of a non-reentrant scanner
(@pxref{Reentrant}). Because of the API difference between
reentrant and non-reentrant @code{flex} scanners, non-reentrant flex
code must be modified before it is suitable for use with this option.
This option is not compatible with the @samp{--c++} option.

The option @samp{--reentrant} does not affect the performance of
the scanner.
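
The API difference can be illustrated with a minimal sketch. The rules
and the @code{main()} below are illustrative only, and the indentation is
for presentation; @pxref{Reentrant}, for the authoritative description of
the reentrant interface:

@example
@verbatim
    %option reentrant noyywrap
    %%
    [0-9]+   printf("number: %s\n", yytext);
    .|\n     /* ignore everything else */
    %%
    int main(void)
    {
        yyscan_t scanner;   /* all scanner state lives here */

        if (yylex_init(&scanner) != 0)
            return 1;
        yylex(scanner);     /* the state is passed explicitly */
        yylex_destroy(scanner);
        return 0;
    }
@end verbatim
@end example

Because no global scanner state is involved, each thread can create and
use its own @code{yyscan_t} object in this way.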



@anchor{option-c++}
@opindex -+
@opindex ---c++
@opindex c++
@item -+, --c++, @code{%option c++}
specifies that you want flex to generate a C++
scanner class. @xref{Cxx}, for
details.



@anchor{option-array}
@opindex ---array
@opindex array
@item --array, @code{%option array}
specifies that you want @code{yytext} to be an array instead of a @code{char *}.



@anchor{option-pointer}
@opindex ---pointer
@opindex pointer
@item --pointer, @code{%option pointer}
specifies that @code{yytext} should be a @code{char *}, not an array.
The default is @code{char *}.



@anchor{option-prefix}
@opindex -P
@opindex ---prefix
@opindex prefix
@item -PPREFIX, --prefix=PREFIX, @code{%option prefix="PREFIX"}
changes the default @samp{yy} prefix used by @code{flex} for all
globally-visible variable and function names to instead be
@samp{PREFIX}. For example, @samp{--prefix=foo} changes the name of
@code{yytext} to @code{footext}. It also changes the name of the default
output file from @file{lex.yy.c} to @file{lex.foo.c}. Here is a partial
list of the names affected:

@example
@verbatim
    yy_create_buffer
    yy_delete_buffer
    yy_flex_debug
    yy_init_buffer
    yy_flush_buffer
    yy_load_buffer_state
    yy_switch_to_buffer
    yyin
    yyleng
    yylex
    yylineno
    yyout
    yyrestart
    yytext
    yywrap
    yyalloc
    yyrealloc
    yyfree
@end verbatim
@end example

(If you are using a C++ scanner, then only @code{yywrap} and
@code{yyFlexLexer} are affected.) Within your scanner itself, you can
still refer to the global variables and functions using either version
of their name; but externally, they have the modified name.

This option lets you easily link together multiple
@code{flex}
programs into the same executable.
Note, though, that using this
option also renames
@code{yywrap()},
so you now
@emph{must}
either
provide your own (appropriately-named) version of the routine for your
scanner, or use
@code{%option noyywrap},
as linking with
@samp{-lfl}
no longer provides one for you by default.



@anchor{option-main}
@opindex ---main
@opindex main
@item --main, @code{%option main}
directs flex to provide a default @code{main()} program for the
scanner, which simply calls @code{yylex()}. This option implies
@code{noyywrap} (@pxref{option-yywrap}).



@anchor{option-nounistd}
@opindex ---nounistd
@opindex nounistd
@item --nounistd, @code{%option nounistd}
suppresses inclusion of the non-ANSI header file @file{unistd.h}. This option
is meant to target environments in which @file{unistd.h} does not exist. Be aware
that certain options may cause flex to generate code that relies on functions
normally found in @file{unistd.h} (e.g., @code{isatty()}, @code{read()}).
If you wish to use these functions, you will have to inform your compiler where
to find them.
@xref{option-always-interactive}. @xref{option-read}.



@anchor{option-yyclass}
@opindex ---yyclass
@opindex yyclass
@item --yyclass=NAME, @code{%option yyclass="NAME"}
only applies when generating a C++ scanner (the @samp{--c++} option). It
informs @code{flex} that you have derived @code{NAME} as a subclass of
@code{yyFlexLexer}, so @code{flex} will place your actions in the member
function @code{NAME::yylex()} instead of @code{yyFlexLexer::yylex()}. It
also generates a @code{yyFlexLexer::yylex()} member function that emits
a run-time error (by invoking @code{yyFlexLexer::LexerError()}) if
called. @xref{Cxx}.
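
A minimal sketch of how @code{yyclass} might be used follows. The class
name @code{CountingLexer} and its @code{words} member are hypothetical
names chosen for illustration, and the indentation is for presentation:

@example
@verbatim
    %option c++ noyywrap yyclass="CountingLexer"
    %{
    #include <iostream>
    /* Guarded include: harmless if the generated scanner has
       already pulled in the header. */
    #if !defined(yyFlexLexerOnce)
    #include <FlexLexer.h>
    #endif

    class CountingLexer : public yyFlexLexer {
    public:
        CountingLexer() : words(0) {}
        virtual int yylex();    /* body is generated by flex */
        int words;
    };
    %}
    %%
    [a-zA-Z]+   ++words;        /* actions can use members */
    .|\n        /* ignore everything else */
    %%
    int main()
    {
        CountingLexer lexer;    /* reads from cin by default */
        lexer.yylex();
        std::cout << lexer.words << "\n";
        return 0;
    }
@end verbatim
@end example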

@end table

@node Options for Scanner Speed and Size, Debugging Options, Code-Level And API Options, Scanner Options
@section Options for Scanner Speed and Size

@table @samp

@item -C[aefFmr]
controls the degree of table compression and, more generally, trade-offs
between small scanners and fast scanners.

@table @samp
@opindex -C
@item -C
A lone @samp{-C} specifies that the scanner tables should be compressed
but neither equivalence classes nor meta-equivalence classes should be
used.

@anchor{option-align}
@opindex -Ca
@opindex ---align
@opindex align
@item -Ca, --align, @code{%option align}
(``align'') instructs flex to trade off larger tables in the
generated scanner for faster performance because the elements of
the tables are better aligned for memory access and computation. On some
RISC architectures, fetching and manipulating longwords is more efficient
than with smaller-sized units such as shortwords. This option can
quadruple the size of the tables used by your scanner.

@anchor{option-ecs}
@opindex -Ce
@opindex ---ecs
@opindex ecs
@item -Ce, --ecs, @code{%option ecs}
directs @code{flex} to construct @dfn{equivalence classes}, i.e., sets
of characters which have identical lexical properties (for example, if
the only appearance of digits in the @code{flex} input is in the
character class ``[0-9]'' then the digits '0', '1', ..., '9' will all be
put in the same equivalence class). Equivalence classes usually give
dramatic reductions in the final table/object file sizes (typically a
factor of 2-5) and are pretty cheap performance-wise (one array look-up
per character scanned).

@opindex -Cf
@item -Cf
specifies that the @dfn{full} scanner tables should be generated -
@code{flex} should not compress the tables by taking advantage of
similar transition functions for different states.

@opindex -CF
@item -CF
specifies that the alternate fast scanner representation (described
below under the @samp{--fast} flag) should be used. This option cannot be
used with @samp{--c++}.

@anchor{option-meta-ecs}
@opindex -Cm
@opindex ---meta-ecs
@opindex meta-ecs
@item -Cm, --meta-ecs, @code{%option meta-ecs}
directs
@code{flex}
to construct
@dfn{meta-equivalence classes},
which are sets of equivalence classes (or characters, if equivalence
classes are not being used) that are commonly used together. Meta-equivalence
classes are often a big win when using compressed tables, but they
have a moderate performance impact (one or two @code{if} tests and one
array look-up per character scanned).

@anchor{option-read}
@opindex -Cr
@opindex ---read
@opindex read
@item -Cr, --read, @code{%option read}
causes the generated scanner to @emph{bypass} use of the standard I/O
library (@code{stdio}) for input. Instead of calling @code{fread()} or
@code{getc()}, the scanner will use the @code{read()} system call,
resulting in a performance gain which varies from system to system, but
in general is probably negligible unless you are also using @samp{-Cf}
or @samp{-CF}. Using @samp{-Cr} can cause strange behavior if, for
example, you read from @file{yyin} using @code{stdio} prior to calling
the scanner (because the scanner will miss whatever text your previous
reads left in the @code{stdio} input buffer). @samp{-Cr} has no effect
if you define @code{YY_INPUT()} (@pxref{Generated Scanner}).
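
For illustration, a scanner that supplies its own @code{YY_INPUT()}, as
in the following sketch, obtains its input one character at a time via
@code{getchar()}, so @samp{-Cr} would not change how it reads. The
definition is a sketch in the style described in @ref{Generated Scanner},
and the indentation is for presentation:

@example
@verbatim
    %{
    /* Replace the scanner's input routine: fetch one character
       per call and report YY_NULL at end of input. */
    #define YY_INPUT(buf,result,max_size) \
        { \
        int c = getchar(); \
        result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
        }
    %}
@end verbatim
@end example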
@end table

The options @samp{-Cf} or @samp{-CF} and @samp{-Cm} do not make sense
together - there is no opportunity for meta-equivalence classes if the
table is not being compressed. Otherwise the options may be freely
mixed, and are cumulative.

The default setting is @samp{-Cem}, which specifies that @code{flex}
should generate equivalence classes and meta-equivalence classes. This
setting provides the highest degree of table compression. You can trade
larger tables for faster-executing scanners, with the
following generally being true:

@example
@verbatim
    slowest & smallest
          -Cem
          -Cm
          -Ce
          -C
          -C{f,F}e
          -C{f,F}
          -C{f,F}a
    fastest & largest
@end verbatim
@end example

Note that scanners with the smallest tables are usually generated and
compiled the quickest, so during development you will usually want to
use the default, maximal compression.

@samp{-Cfe} is often a good compromise between speed and size for
production scanners.

@anchor{option-full}
@opindex -f
@opindex ---full
@opindex full
@item -f, --full, @code{%option full}
specifies a
@dfn{fast scanner}.
No table compression is done and @code{stdio} is bypassed.
The result is large but fast. This option is equivalent to
@samp{-Cfr}.


@anchor{option-fast}
@opindex -F
@opindex ---fast
@opindex fast
@item -F, --fast, @code{%option fast}
specifies that the @emph{fast} scanner table representation should be
used (and @code{stdio} bypassed). This representation is about as fast
as the full table representation @samp{--full}, and for some sets of
patterns will be considerably smaller (and for others, larger).
In
general, if the pattern set contains both @emph{keywords} and a
catch-all, @emph{identifier} rule, such as in the set:

@example
@verbatim
    "case"    return TOK_CASE;
    "switch"  return TOK_SWITCH;
    ...
    "default" return TOK_DEFAULT;
    [a-z]+    return TOK_ID;
@end verbatim
@end example

then you're better off using the full table representation. If only
the @emph{identifier} rule is present and you then use a hash table or some such
to detect the keywords, you're better off using
@samp{--fast}.

This option is equivalent to @samp{-CFr}. It cannot be used
with @samp{--c++}.

@end table

@node Debugging Options, Miscellaneous Options, Options for Scanner Speed and Size, Scanner Options
@section Debugging Options

@table @samp

@anchor{option-backup}
@opindex -b
@opindex ---backup
@opindex backup
@item -b, --backup, @code{%option backup}
Generate backing-up information to @file{lex.backup}. This is a list of
scanner states which require backing up and the input characters on
which they do so. By adding rules one can remove backing-up states. If
@emph{all} backing-up states are eliminated and @samp{-Cf} or @samp{-CF}
is used, the generated scanner will run faster (see the @samp{--perf-report} flag).
Only users who wish to squeeze every last cycle out of their scanners
need worry about this option (@pxref{Performance}).



@anchor{option-debug}
@opindex -d
@opindex ---debug
@opindex debug
@item -d, --debug, @code{%option debug}
makes the generated scanner run in @dfn{debug} mode.
Whenever a pattern
is recognized and the global variable @code{yy_flex_debug} is non-zero
(which is the default), the scanner will write to @file{stderr} a line
of the form:

@example
@verbatim
    --accepting rule at line 53 ("the matched text")
@end verbatim
@end example

The line number refers to the location of the rule in the file defining
the scanner (i.e., the file that was fed to flex). Messages are also
generated when the scanner backs up, accepts the default rule, reaches
the end of its input buffer (or encounters a NUL; at this point, the two
look the same as far as the scanner's concerned), or reaches an
end-of-file.



@anchor{option-perf-report}
@opindex -p
@opindex ---perf-report
@opindex perf-report
@item -p, --perf-report, @code{%option perf-report}
generates a performance report to @file{stderr}. The report consists of
comments regarding features of the @code{flex} input file which will
cause a serious loss of performance in the resulting scanner. If you
give the flag twice, you will also get comments regarding features that
lead to minor performance losses.

Note that the use of @code{REJECT} and
variable trailing context (@pxref{Limitations}) entails a substantial
performance penalty; use of @code{yymore()}, the @samp{^} operator, and
the @samp{--interactive} flag entail minor performance penalties.



@anchor{option-nodefault}
@opindex -s
@opindex ---nodefault
@opindex nodefault
@item -s, --nodefault, @code{%option nodefault}
causes the @emph{default rule} (that unmatched scanner input is echoed
to @file{stdout}) to be suppressed. If the scanner encounters input
that does not match any of its rules, it aborts with an error. This
option is useful for finding holes in a scanner's rule set.
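
As a minimal sketch of such a hole, consider the following illustrative
rule set (the rules are made up for this example, and the indentation is
for presentation). Digits and punctuation match neither rule, so with
@code{nodefault} the scanner is expected to abort when it meets them, and
adding @code{warn} should make @code{flex} report the gap when the scanner
is generated:

@example
@verbatim
    %option nodefault warn noyywrap
    %%
    [a-z]+      printf("word: %s\n", yytext);
    [ \t\n]+    /* skip whitespace */
@end verbatim
@end example

Adding a final catch-all rule such as @samp{.} with an explicit action
closes the hole.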



@anchor{option-trace}
@opindex -T
@opindex ---trace
@opindex trace
@item -T, --trace, @code{%option trace}
makes @code{flex} run in @dfn{trace} mode. It will generate a lot of
messages to @file{stderr} concerning the form of the input and the
resultant non-deterministic and deterministic finite automata. This
option is mostly for use in maintaining @code{flex}.



@anchor{option-nowarn}
@opindex -w
@opindex ---nowarn
@opindex nowarn
@item -w, --nowarn, @code{%option nowarn}
suppresses warning messages.



@anchor{option-verbose}
@opindex -v
@opindex ---verbose
@opindex verbose
@item -v, --verbose, @code{%option verbose}
specifies that @code{flex} should write to @file{stderr} a summary of
statistics regarding the scanner it generates. Most of the statistics
are meaningless to the casual @code{flex} user, but the first line
identifies the version of @code{flex} (same as reported by @samp{--version}),
and the next line the flags used when generating the scanner, including
those that are on by default.



@anchor{option-warn}
@opindex ---warn
@opindex warn
@item --warn, @code{%option warn}
warns about certain things. In particular, if the default rule can be
matched but no default rule has been given, @code{flex} will warn you.
We recommend using this option always.

@end table

@node Miscellaneous Options, , Debugging Options, Scanner Options
@section Miscellaneous Options

@table @samp
@opindex -c
@item -c
A do-nothing option included for POSIX compliance.

@opindex -h
@opindex ---help
@item -h, -?, --help
generates a ``help'' summary of @code{flex}'s options to @file{stdout}
and then exits.

@opindex -n
@item -n
Another do-nothing option included for
POSIX compliance.

@opindex -V
@opindex ---version
@item -V, --version
prints the version number to @file{stdout} and exits.

@end table


@node Performance, Cxx, Scanner Options, Top
@chapter Performance Considerations

@cindex performance, considerations
The main design goal of @code{flex} is that it generate high-performance
scanners. It has been optimized for dealing well with large sets of
rules. Aside from the effects on scanner speed of the table compression
@samp{-C} options outlined above, there are a number of options/actions
which degrade performance. These are, from most expensive to least:

@cindex REJECT, performance costs
@cindex yylineno, performance costs
@cindex trailing context, performance costs
@example
@verbatim
    REJECT
    arbitrary trailing context

    pattern sets that require backing up
    %option yylineno
    %array

    %option interactive
    %option always-interactive

    ^ beginning-of-line operator
    yymore()
@end verbatim
@end example

with the first two being quite expensive and the last two being
quite cheap. Note also that @code{unput()} is implemented as a routine
call that potentially does quite a bit of work, while @code{yyless()} is
a quite-cheap macro. So if you are just putting back some excess text
you scanned, use @code{yyless()}.

@code{REJECT} should be avoided at all costs when performance is
important. It is a particularly expensive option.

There is one case when @code{%option yylineno} can be expensive. That is when
your patterns match long tokens that could @emph{possibly} contain a newline
character. There is no performance penalty for rules that cannot possibly
match newlines, since flex does not need to check them for newlines.
In
general, you should avoid rules such as @code{[^f]+}, which match very long
tokens, including newlines, and may possibly match your entire file! A better
approach is to separate @code{[^f]+} into two rules:

@example
@verbatim
%option yylineno
%%
[^f\n]+
\n+
@end verbatim
@end example

The above scanner does not incur a performance penalty.

@cindex patterns, tuning for performance
@cindex performance, backing up
@cindex backing up, example of eliminating
Getting rid of backing up is messy and often may be an enormous amount
of work for a complicated scanner. In principle, one begins by using
the @samp{-b} flag to generate a @file{lex.backup} file. For example,
on the input:

@cindex backing up, eliminating
@example
@verbatim
    %%
    foo        return TOK_KEYWORD;
    foobar     return TOK_KEYWORD;
@end verbatim
@end example

the file looks like:

@example
@verbatim
    State #6 is non-accepting -
     associated rule line numbers:
           2       3
     out-transitions: [ o ]
     jam-transitions: EOF [ \001-n  p-\177 ]

    State #8 is non-accepting -
     associated rule line numbers:
           3
     out-transitions: [ a ]
     jam-transitions: EOF [ \001-`  b-\177 ]

    State #9 is non-accepting -
     associated rule line numbers:
           3
     out-transitions: [ r ]
     jam-transitions: EOF [ \001-q  s-\177 ]

    Compressed tables always back up.
@end verbatim
@end example

The first few lines tell us that there's a scanner state in which it can
make a transition on an 'o' but not on any other character, and that in
that state the currently scanned text does not match any rule. The
state occurs when trying to match the rules found at lines 2 and 3 in
the input file.
If the scanner is in that state and then reads
something other than an 'o', it will have to back up to find a rule
which is matched. With a bit of headscratching one can see that this
must be the state it's in when it has seen @samp{fo}. When this has
happened, if anything other than another @samp{o} is seen, the scanner
will have to back up to simply match the @samp{f} (by the default rule).

The comment regarding State #8 indicates there's a problem when
@samp{foob} has been scanned. Indeed, on any character other than an
@samp{a}, the scanner will have to back up to accept @samp{foo}. Similarly,
the comment for State #9 concerns when @samp{fooba} has been scanned and
an @samp{r} does not follow.

The final comment reminds us that there's no point going to all the
trouble of removing backing up from the rules unless we're using
@samp{-Cf} or @samp{-CF}, since there's no performance gain doing so
with compressed scanners.

@cindex error rules, to eliminate backing up
The way to remove the backing up is to add ``error'' rules:

@cindex backing up, eliminating by adding error rules
@example
@verbatim
    %%
    foo         return TOK_KEYWORD;
    foobar      return TOK_KEYWORD;

    fooba       |
    foob        |
    fo          {
                /* false alarm, not really a keyword */
                return TOK_ID;
                }
@end verbatim
@end example

Eliminating backing up among a list of keywords can also be done using a
``catch-all'' rule:

@cindex backing up, eliminating with catch-all rule
@example
@verbatim
    %%
    foo         return TOK_KEYWORD;
    foobar      return TOK_KEYWORD;

    [a-z]+      return TOK_ID;
@end verbatim
@end example

This is usually the best solution when appropriate.

Backing up messages tend to cascade. With a complicated set of rules
it's not uncommon to get hundreds of messages.
If one can decipher
them, though, it often only takes a dozen or so rules to eliminate the
backing up (though it's easy to make a mistake and have an error rule
accidentally match a valid token). A possible future @code{flex} feature
will be to automatically add rules to eliminate backing up.

It's important to keep in mind that you gain the benefits of eliminating
backing up only if you eliminate @emph{every} instance of backing up.
Leaving just one means you gain nothing.

@emph{Variable} trailing context (where neither the leading nor the
trailing part has a fixed length) entails almost the same performance
loss as @code{REJECT} (i.e., substantial). So when possible a rule
like:

@cindex trailing context, variable length
@example
@verbatim
    %%
    mouse|rat/(cat|dog)   run();
@end verbatim
@end example

is better written:

@example
@verbatim
    %%
    mouse/cat|dog         run();
    rat/cat|dog           run();
@end verbatim
@end example

or as

@example
@verbatim
    %%
    mouse|rat/cat         run();
    mouse|rat/dog         run();
@end verbatim
@end example

Note that here the special '|' action does @emph{not} provide any
savings, and can even make things worse (@pxref{Limitations}).

Another area where the user can increase a scanner's performance (and
one that's easier to implement) arises from the fact that the longer the
tokens matched, the faster the scanner will run. This is because with
long tokens the processing of most input characters takes place in the
(short) inner scanning loop, and does not often have to go through the
additional work of setting up the scanning environment (e.g.,
@code{yytext}) for the action.
Recall the scanner for C comments:

@cindex performance optimization, matching longer tokens
@example
@verbatim
    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\n]*
    <comment>"*"+[^*/\n]*
    <comment>\n             ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);
@end verbatim
@end example

This could be sped up by writing it as:

@example
@verbatim
    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\n]*
    <comment>[^*\n]*\n      ++line_num;
    <comment>"*"+[^*/\n]*
    <comment>"*"+[^*/\n]*\n ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);
@end verbatim
@end example

Now instead of each newline requiring the processing of another action,
recognizing the newlines is distributed over the other rules to keep the
matched text as long as possible. Note that @emph{adding} rules does
@emph{not} slow down the scanner! The speed of the scanner is
independent of the number of rules or (modulo the considerations given
at the beginning of this section) how complicated the rules are with
regard to operators such as @samp{*} and @samp{|}.

@cindex keywords, for performance
@cindex performance, using keywords
A final example in speeding up a scanner: suppose you want to scan
through a file containing identifiers and keywords, one per line
and with no other extraneous characters, and recognize all the
keywords. A natural first approach is:

@cindex performance optimization, recognizing keywords
@example
@verbatim
    %%
    asm      |
    auto     |
    break    |
    ... etc ...
    volatile |
    while    /* it's a keyword */

    .|\n     /* it's not a keyword */
@end verbatim
@end example

To eliminate the back-tracking, introduce a catch-all rule:

@example
@verbatim
    %%
    asm      |
    auto     |
    break    |
    ... etc ...
    volatile |
    while    /* it's a keyword */

    [a-z]+   |
    .|\n     /* it's not a keyword */
@end verbatim
@end example

Now, if it's guaranteed that there's exactly one word per line, then we
can reduce the total number of matches by a half by merging in the
recognition of newlines with that of the other tokens:

@example
@verbatim
    %%
    asm\n    |
    auto\n   |
    break\n  |
    ... etc ...
    volatile\n |
    while\n  /* it's a keyword */

    [a-z]+\n |
    .|\n     /* it's not a keyword */
@end verbatim
@end example

One has to be careful here, as we have now reintroduced backing up
into the scanner. In particular, while
@emph{we}
know that there will never be any characters in the input stream
other than letters or newlines,
@code{flex}
can't figure this out, and it will plan for possibly needing to back up
when it has scanned a token like @samp{auto} and then the next character
is something other than a newline or a letter. Previously it would
then just match the @samp{auto} rule and be done, but now it has no @samp{auto}
rule, only a @samp{auto\n} rule. To eliminate the possibility of backing up,
we could either duplicate all rules but without final newlines, or,
since we never expect to encounter such an input and therefore don't
care how it's classified, we can introduce one more catch-all rule, this
one which doesn't include a newline:

@example
@verbatim
    %%
    asm\n    |
    auto\n   |
    break\n  |
    ... etc ...
    volatile\n |
    while\n  /* it's a keyword */

    [a-z]+\n |
    [a-z]+   |
    .|\n     /* it's not a keyword */
@end verbatim
@end example

Compiled with @samp{-Cf}, this is about as fast as one can get a
@code{flex} scanner to go for this particular problem.

A final note: @code{flex} is slow when matching @code{NUL}s,
particularly when a token contains multiple @code{NUL}s. It's best to
write rules which match @emph{short} amounts of text if it's anticipated
that the text will often include @code{NUL}s.

Another final note regarding performance: as mentioned in
@ref{Matching}, dynamically resizing @code{yytext} to accommodate huge
tokens is a slow process because it presently requires that the (huge)
token be rescanned from the beginning. Thus if performance is vital,
you should attempt to match ``large'' quantities of text but not
``huge'' quantities, where the cutoff between the two is at about 8K
characters per token.

@node Cxx, Reentrant, Performance, Top
@chapter Generating C++ Scanners

@cindex c++, experimental form of scanner class
@cindex experimental form of c++ scanner class
@strong{IMPORTANT}: the present form of the scanning class is @emph{experimental}
and may change considerably between major releases.

@cindex C++
@cindex member functions, C++
@cindex methods, c++
@code{flex} provides two different ways to generate scanners for use
with C++. The first way is to simply compile a scanner generated by
@code{flex} using a C++ compiler instead of a C compiler. You should
not encounter any compilation errors (@pxref{Reporting Bugs}). You can
then use C++ code in your rule actions instead of C code. Note that the
default input source for your scanner remains @file{yyin}, and default
echoing is still done to @file{yyout}. Both of these remain @code{FILE *}
variables and not C++ @emph{streams}.

You can also use @code{flex} to generate a C++ scanner class, using the
@samp{-+} option (or, equivalently, @code{%option c++}), which is
automatically specified if the name of the @code{flex} executable ends
in a '+', such as @code{flex++}.
When using this option, @code{flex}
defaults to generating the scanner to the file @file{lex.yy.cc} instead
of @file{lex.yy.c}. The generated scanner includes the header file
@file{FlexLexer.h}, which defines the interface to two C++ classes.

The first class in @file{FlexLexer.h}, @code{FlexLexer},
provides an abstract base class defining the general scanner class
interface. It provides the following member functions:

@table @code
@findex YYText (C++ only)
@item const char* YYText()
returns the text of the most recently matched token, the equivalent of
@code{yytext}.

@findex YYLeng (C++ only)
@item int YYLeng()
returns the length of the most recently matched token, the equivalent of
@code{yyleng}.

@findex lineno (C++ only)
@item int lineno() const
returns the current input line number (see @code{%option yylineno}), or
@code{1} if @code{%option yylineno} was not used.

@findex set_debug (C++ only)
@item void set_debug( int flag )
sets the debugging flag for the scanner, equivalent to assigning to
@code{yy_flex_debug} (@pxref{Scanner Options}). Note that you must build
the scanner using @code{%option debug} to include debugging information
in it.

@findex debug (C++ only)
@item int debug() const
returns the current setting of the debugging flag.
@end table

Also provided are member functions equivalent to
@code{yy_switch_to_buffer()}, @code{yy_create_buffer()} (though the
first argument is an @code{istream&} object reference and not a
@code{FILE*}), @code{yy_flush_buffer()}, @code{yy_delete_buffer()}, and
@code{yyrestart()} (again, the first argument is an @code{istream&}
object reference).

@tindex yyFlexLexer (C++ only)
@tindex FlexLexer (C++ only)
The second class defined in @file{FlexLexer.h} is @code{yyFlexLexer},
which is derived from @code{FlexLexer}.
It defines the following
additional member functions:

@table @code
@findex yyFlexLexer constructor (C++ only)
@item yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )
@item yyFlexLexer( istream& arg_yyin, ostream& arg_yyout )
constructs a @code{yyFlexLexer} object using the given streams for input
and output. If not specified, the streams default to @code{cin} and
@code{cout}, respectively. @code{yyFlexLexer} does not take ownership of
its stream arguments. It's up to the user to ensure the streams pointed
to remain alive at least as long as the @code{yyFlexLexer} instance.

@findex yylex (C++ version)
@item virtual int yylex()
performs the same role as @code{yylex()} does for ordinary @code{flex}
scanners: it scans the input stream, consuming tokens, until a rule's
action returns a value. If you derive a subclass @code{S} from
@code{yyFlexLexer} and want to access the member functions and variables
of @code{S} inside @code{yylex()}, then you need to use @code{%option
yyclass="S"} to inform @code{flex} that you will be using that subclass
instead of @code{yyFlexLexer}. In this case, rather than generating
@code{yyFlexLexer::yylex()}, @code{flex} generates @code{S::yylex()}
(and also generates a dummy @code{yyFlexLexer::yylex()} that calls
@code{yyFlexLexer::LexerError()} if called).

@findex switch_streams (C++ only)
@item virtual void switch_streams(istream* new_in = 0, ostream* new_out = 0)
@item virtual void switch_streams(istream& new_in, ostream& new_out)
reassigns @code{yyin} to @code{new_in} (if non-null) and @code{yyout} to
@code{new_out} (if non-null), deleting the previous input buffer if
@code{yyin} is reassigned.

@item int yylex( istream* new_in, ostream* new_out = 0 )
@item int yylex( istream& new_in, ostream& new_out )
first switches the input streams via @code{switch_streams( new_in,
new_out )} and then returns the value of @code{yylex()}.
@end table

In addition, @code{yyFlexLexer} defines the following protected virtual
functions which you can redefine in derived classes to tailor the
scanner:

@table @code
@findex LexerInput (C++ only)
@item virtual int LexerInput( char* buf, int max_size )
reads up to @code{max_size} characters into @code{buf} and returns the
number of characters read. To indicate end-of-input, return 0
characters. Note that @code{interactive} scanners (see the @samp{-B}
and @samp{-I} flags in @ref{Scanner Options}) define the macro
@code{YY_INTERACTIVE}. If you redefine @code{LexerInput()} and need to
take different actions depending on whether or not the scanner might be
scanning an interactive input source, you can test for the presence of
this name via @code{#ifdef} statements.

@findex LexerOutput (C++ only)
@item virtual void LexerOutput( const char* buf, int size )
writes out @code{size} characters from the buffer @code{buf}, which, while
@code{NUL}-terminated, may also contain internal @code{NUL}s if the
scanner's rules can match text with @code{NUL}s in them.

@cindex error reporting, in C++
@findex LexerError (C++ only)
@item virtual void LexerError( const char* msg )
reports a fatal error message. The default version of this function
writes the message to the stream @code{cerr} and exits.
@end table

Note that a @code{yyFlexLexer} object contains its @emph{entire}
scanning state. Thus you can use such objects to create reentrant
scanners, but see also @ref{Reentrant}.
You can instantiate multiple
instances of the same @code{yyFlexLexer} class, and you can also combine
multiple C++ scanner classes together in the same program using the
@samp{-P} option discussed above.

Finally, note that the @code{%array} feature is not available to C++
scanner classes; you must use @code{%pointer} (the default).

Here is an example of a simple C++ scanner:

@cindex C++ scanners, use of
@example
@verbatim
        // An example of using the flex C++ scanner class.

    %{
    #include <iostream>
    using namespace std;
    int mylineno = 0;
    %}

    %option noyywrap c++

    string  \"[^\n"]+\"

    ws      [ \t]+

    alpha   [A-Za-z]
    dig     [0-9]
    name    ({alpha}|{dig}|\$)({alpha}|{dig}|[_.\-/$])*
    num1    [-+]?{dig}+\.?([eE][-+]?{dig}+)?
    num2    [-+]?{dig}*\.{dig}+([eE][-+]?{dig}+)?
    number  {num1}|{num2}

    %%

    {ws}    /* skip blanks and tabs */

    "/*"    {
            int c;

            while((c = yyinput()) != 0)
                {
                if(c == '\n')
                    ++mylineno;

                else if(c == '*')
                    {
                    if((c = yyinput()) == '/')
                        break;
                    else
                        unput(c);
                    }
                }
            }

    {number}  cout << "number " << YYText() << '\n';

    \n        mylineno++;

    {name}    cout << "name " << YYText() << '\n';

    {string}  cout << "string " << YYText() << '\n';

    %%

        // This include is required if main() is in another source file.
        //#include <FlexLexer.h>

    int main( int /* argc */, char** /* argv */ )
    {
        FlexLexer* lexer = new yyFlexLexer;
        while(lexer->yylex() != 0)
            ;
        return 0;
    }
@end verbatim
@end example

@cindex C++, multiple different scanners
If you want to create multiple (different) lexer classes, you use the
@samp{-P} flag (or the @code{prefix=} option) to rename each
@code{yyFlexLexer} to some other @samp{xxFlexLexer}.
You then can
include @file{<FlexLexer.h>} in your other sources once per lexer class,
first renaming @code{yyFlexLexer} as follows:

@cindex include files, with C++
@cindex header files, with C++
@cindex C++ scanners, including multiple scanners
@example
@verbatim
    #undef yyFlexLexer
    #define yyFlexLexer xxFlexLexer
    #include <FlexLexer.h>

    #undef yyFlexLexer
    #define yyFlexLexer zzFlexLexer
    #include <FlexLexer.h>
@end verbatim
@end example

if, for example, you used @code{%option prefix="xx"} for one of your
scanners and @code{%option prefix="zz"} for the other.

@node Reentrant, Lex and Posix, Cxx, Top
@chapter Reentrant C Scanners

@cindex reentrant, explanation
@code{flex} has the ability to generate a reentrant C scanner. This is
accomplished by specifying @code{%option reentrant} (@samp{-R}). The generated
scanner is both portable and safe to use in one or more separate threads of
control. The most common use for reentrant scanners is from within
multi-threaded applications. Any thread may create and execute a reentrant
@code{flex} scanner without the need for synchronization with other threads.

@menu
* Reentrant Uses::
* Reentrant Overview::
* Reentrant Example::
* Reentrant Detail::
* Reentrant Functions::
@end menu

@node Reentrant Uses, Reentrant Overview, Reentrant, Reentrant
@section Uses for Reentrant Scanners

However, there are other uses for a reentrant scanner. For example, you
could scan two or more files simultaneously to implement a @code{diff} at
the token level (i.e., instead of at the character level):

@cindex reentrant scanners, multiple interleaved scanners
@example
@verbatim
    /* Example of maintaining more than one active scanner.
     */

    /* Declared outside the loop so the loop condition can see them. */
    int tok1, tok2;

    do {
        tok1 = yylex( scanner_1 );
        tok2 = yylex( scanner_2 );

        if( tok1 != tok2 )
            printf("Files are different.");

    } while ( tok1 && tok2 );
@end verbatim
@end example

Another use for a reentrant scanner is recursion.
(Note that a recursive scanner can also be created using a non-reentrant scanner and
buffer states. @xref{Multiple Input Buffers}.)

The following crude scanner supports the @samp{eval} command by invoking
another instance of itself.

@cindex reentrant scanners, recursive invocation
@example
@verbatim
    /* Example of recursive invocation. */

    %option reentrant

    %%
    "eval(".+")"  {
                      yyscan_t scanner;
                      YY_BUFFER_STATE buf;

                      yylex_init( &scanner );
                      yytext[yyleng-1] = ' ';

                      buf = yy_scan_string( yytext + 5, scanner );
                      yylex( scanner );

                      yy_delete_buffer(buf,scanner);
                      yylex_destroy( scanner );
                  }
    ...
    %%
@end verbatim
@end example

@node Reentrant Overview, Reentrant Example, Reentrant Uses, Reentrant
@section An Overview of the Reentrant API

@cindex reentrant, API explanation
The API for reentrant scanners is different from that for non-reentrant
scanners. Here is a quick overview of the API:

@itemize
@item
@code{%option reentrant} must be specified.

@item
All functions take one additional argument: @code{yyscanner}

@item
All global variables are replaced by their macro equivalents.
(We tell you this because it may be important to you during debugging.)

@item
@code{yylex_init} and @code{yylex_destroy} must be called before and
after @code{yylex}, respectively.

@item
Accessor methods (get/set functions) provide access to common
@code{flex} variables.

@item
User-specific data can be stored in @code{yyextra}.
@end itemize

@node Reentrant Example, Reentrant Detail, Reentrant Overview, Reentrant
@section Reentrant Example

First, an example of a reentrant scanner:
@cindex reentrant, example of
@example
@verbatim
    /* This scanner prints "//" comments. */

    %option reentrant stack noyywrap
    %x COMMENT

    %%

    "//"                 yy_push_state( COMMENT, yyscanner);
    .|\n

    <COMMENT>\n          yy_pop_state( yyscanner );
    <COMMENT>[^\n]+      fprintf( yyout, "%s\n", yytext);

    %%

    int main ( int argc, char * argv[] )
    {
        yyscan_t scanner;

        yylex_init ( &scanner );
        yylex ( scanner );
        yylex_destroy ( scanner );
        return 0;
    }
@end verbatim
@end example

@node Reentrant Detail, Reentrant Functions, Reentrant Example, Reentrant
@section The Reentrant API in Detail

Here are the things you need to do or know to use the reentrant C API of
@code{flex}.

@menu
* Specify Reentrant::
* Extra Reentrant Argument::
* Global Replacement::
* Init and Destroy Functions::
* Accessor Methods::
* Extra Data::
* About yyscan_t::
@end menu

@node Specify Reentrant, Extra Reentrant Argument, Reentrant Detail, Reentrant Detail
@subsection Declaring a Scanner As Reentrant

@code{%option reentrant} (@samp{--reentrant}) must be specified.

Notice that @code{%option reentrant} is specified in the above example
(@pxref{Reentrant Example}). Had this option not been specified,
@code{flex} would have happily generated a non-reentrant scanner without
complaining. You may explicitly specify @code{%option noreentrant}, if
you do @emph{not} want a reentrant scanner, although it is not
necessary. The default is to generate a non-reentrant scanner.

@node Extra Reentrant Argument, Global Replacement, Specify Reentrant, Reentrant Detail
@subsection The Extra Argument

@cindex reentrant, calling functions
@vindex yyscanner (reentrant only)
All functions take one additional argument: @code{yyscanner}.

Notice that the calls to @code{yy_push_state} and @code{yy_pop_state}
both have an argument, @code{yyscanner}, that is not present in a
non-reentrant scanner. Here are the declarations of
@code{yy_push_state} and @code{yy_pop_state} in the reentrant scanner:

@example
@verbatim
    static void yy_push_state  ( int new_state , yyscan_t yyscanner ) ;
    static void yy_pop_state  ( yyscan_t yyscanner  ) ;
@end verbatim
@end example

Notice that the argument @code{yyscanner} appears in the declaration of
both functions. In fact, all @code{flex} functions in a reentrant
scanner have this additional argument. It is always the last argument
in the argument list, it is always of type @code{yyscan_t} (which is
typedef'd to @code{void *}) and it is
always named @code{yyscanner}. As you may have guessed,
@code{yyscanner} is a pointer to an opaque data structure encapsulating
the current state of the scanner. For a list of function declarations,
see @ref{Reentrant Functions}. Note that preprocessor macros, such as
@code{BEGIN}, @code{ECHO}, and @code{REJECT}, do not take this
additional argument.

@node Global Replacement, Init and Destroy Functions, Extra Reentrant Argument, Reentrant Detail
@subsection Global Variables Replaced By Macros

@cindex reentrant, accessing flex variables
All global variables in traditional flex have been replaced by macro equivalents.

Note that in the above example, @code{yyout} and @code{yytext} are
not plain variables. These are macros that will expand to their equivalent lvalue.
All of the familiar @code{flex} globals have been replaced by their macro
equivalents. In particular, @code{yytext}, @code{yyleng}, @code{yylineno},
@code{yyin}, @code{yyout}, @code{yyextra}, @code{yylval}, and @code{yylloc}
are macros. You may safely use these macros in actions as if they were plain
variables. We only tell you this so you don't expect to link to these variables
externally. Currently, each macro expands to a member of an internal struct, e.g.,

@example
@verbatim
#define yytext (((struct yyguts_t*)yyscanner)->yytext_r)
@end verbatim
@end example

One important thing to remember about
@code{yytext}
and friends is that, because
@code{yytext}
is not a global variable in a reentrant
scanner, you cannot access it directly from outside an action or from
other functions. You must use an accessor method, e.g.,
@code{yyget_text},
to accomplish this (see below).

@node Init and Destroy Functions, Accessor Methods, Global Replacement, Reentrant Detail
@subsection Init and Destroy Functions

@cindex memory, considerations for reentrant scanners
@cindex reentrant, initialization
@findex yylex_init
@findex yylex_destroy

@code{yylex_init} and @code{yylex_destroy} must be called before and
after @code{yylex}, respectively.

@example
@verbatim
    int yylex_init ( yyscan_t * ptr_yy_globals ) ;
    int yylex_init_extra ( YY_EXTRA_TYPE user_defined, yyscan_t * ptr_yy_globals ) ;
    int yylex ( yyscan_t yyscanner ) ;
    int yylex_destroy ( yyscan_t yyscanner ) ;
@end verbatim
@end example

The function @code{yylex_init} must be called before calling any other
function. The argument to @code{yylex_init} is the address of an
uninitialized pointer to be filled in by @code{yylex_init}, overwriting
any previous contents.
The function @code{yylex_init_extra} may be used
instead, taking as its first argument a variable of type @code{YY_EXTRA_TYPE}.
@xref{Extra Data}, for more details.

The value stored in @code{ptr_yy_globals} should
thereafter be passed to @code{yylex} and @code{yylex_destroy}. Flex
does not save the argument passed to @code{yylex_init}, so it is safe to
pass the address of a local pointer to @code{yylex_init} so long as it remains
in scope for the duration of all calls to the scanner, up to and including
the call to @code{yylex_destroy}.

The function
@code{yylex} should be familiar to you by now. The reentrant version
takes one argument, which is the value returned (via an argument) by
@code{yylex_init}. Otherwise, it behaves the same as the non-reentrant
version of @code{yylex}.

Both @code{yylex_init} and @code{yylex_init_extra} return 0 (zero) on success,
or non-zero on failure, in which case @code{errno} is set to one of the
following values:

@itemize
@item ENOMEM
Memory allocation error. @xref{memory-management}.
@item EINVAL
Invalid argument.
@end itemize


The function @code{yylex_destroy} should be
called to free resources used by the scanner. After @code{yylex_destroy}
is called, the contents of @code{yyscanner} should not be used. Of
course, there is no need to destroy a scanner if you plan to reuse it.
A @code{flex} scanner (both reentrant and non-reentrant) may be
restarted by calling @code{yyrestart}.
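
The return-value convention described above can be sketched without a
generated scanner. The code below is a stand-alone model, not the
@code{flex} implementation: @code{demo_scan_t}, @code{struct demo_guts},
@code{demo_init} and @code{demo_destroy} are hypothetical stand-ins for
@code{yyscan_t}, the internal state, @code{yylex_init} and
@code{yylex_destroy}, and the single @code{lineno} member is invented
for illustration.

@example
@verbatim
```c
#include <errno.h>
#include <stdlib.h>

typedef void *demo_scan_t;             /* stand-in for yyscan_t */

struct demo_guts { int lineno; };      /* stand-in for the internal state */

/* Hypothetical model of the documented contract: 0 on success,
   non-zero with errno set to EINVAL or ENOMEM on failure. */
int demo_init(demo_scan_t *ptr)
{
    struct demo_guts *g;

    if (ptr == NULL) {
        errno = EINVAL;                /* invalid argument */
        return 1;
    }
    g = calloc(1, sizeof *g);
    if (g == NULL) {
        errno = ENOMEM;                /* memory allocation error */
        return 1;
    }
    g->lineno = 1;
    *ptr = g;                          /* overwrites any previous contents */
    return 0;
}

/* Releases the state; the handle must not be used afterwards. */
int demo_destroy(demo_scan_t scanner)
{
    free(scanner);
    return 0;
}
```
@end verbatim
@end example

A caller of the real API would follow the same shape: test the result of
@code{yylex_init}, consult @code{errno} if it is non-zero, and only pass
the handle on to @code{yylex} and @code{yylex_destroy} after a
successful initialization.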

Below is an example of a program that creates a scanner, uses it, then destroys
it when done:

@example
@verbatim
    int main ()
    {
        yyscan_t scanner;
        int tok;

        yylex_init(&scanner);

        while ((tok=yylex(scanner)) > 0)
            printf("tok=%d  yytext=%s\n", tok, yyget_text(scanner));

        yylex_destroy(scanner);
        return 0;
    }
@end verbatim
@end example

@node Accessor Methods, Extra Data, Init and Destroy Functions, Reentrant Detail
@subsection Accessing Variables with Reentrant Scanners

@cindex reentrant, accessor functions
Accessor methods (get/set functions) provide access to common
@code{flex} variables.

Many scanners that you build will be part of a larger project. Portions
of your project will need access to @code{flex} values, such as
@code{yytext}. In a non-reentrant scanner, these values are global, so
there is no problem accessing them. However, in a reentrant scanner, there are no
global @code{flex} values. You cannot access them directly. Instead,
you must access @code{flex} values using accessor methods (get/set
functions). Each accessor method is named @code{yyget_NAME} or
@code{yyset_NAME}, where @code{NAME} is the name of the @code{flex}
variable you want. For example:

@cindex accessor functions, use of
@example
@verbatim
    /* Set the last character of yytext to NUL. */
    void chop ( yyscan_t scanner )
    {
        int len = yyget_leng( scanner );
        yyget_text( scanner )[len - 1] = '\0';
    }
@end verbatim
@end example

The above code may be called from within an action like this:

@example
@verbatim
    %%
    .+\n    { chop( yyscanner );}
@end verbatim
@end example

You may find that @code{%option header-file} is particularly useful for generating
prototypes of all the accessor functions. @xref{option-header}.

@node Extra Data, About yyscan_t, Accessor Methods, Reentrant Detail
@subsection Extra Data

@cindex reentrant, extra data
@vindex yyextra
User-specific data can be stored in @code{yyextra}.

In a reentrant scanner, it is unwise to use global variables to
communicate with or maintain state between different pieces of your program.
However, you may need access to external data or invoke external functions
from within the scanner actions.
Likewise, you may need to pass information to your scanner
(e.g., open file descriptors, or database connections).
In a non-reentrant scanner, the only way to do this would be through the
use of global variables.
@code{flex} allows you to store arbitrary, ``extra'' data in a scanner.
This data is accessible through the accessor methods
@code{yyget_extra} and @code{yyset_extra}
from outside the scanner, and through the shortcut macro
@code{yyextra}
from within the scanner itself. They are defined as follows:

@tindex YY_EXTRA_TYPE (reentrant only)
@findex yyget_extra
@findex yyset_extra
@example
@verbatim
    #define YY_EXTRA_TYPE  void*
    YY_EXTRA_TYPE  yyget_extra ( yyscan_t scanner );
    void           yyset_extra ( YY_EXTRA_TYPE arbitrary_data , yyscan_t scanner);
@end verbatim
@end example

In addition, an extra form of @code{yylex_init} is provided,
@code{yylex_init_extra}. This function is provided so that the @code{yyextra}
value can be accessed from within the very first @code{yyalloc}, used to allocate
the scanner itself.

By default, @code{YY_EXTRA_TYPE} is defined as type @code{void *}. You
may redefine this type using @code{%option extra-type="your_type"} in
the scanner:

@cindex YY_EXTRA_TYPE, defining your own type
@example
@verbatim
    /* An example of overriding YY_EXTRA_TYPE.
     */
    %{
    #include <sys/stat.h>
    #include <unistd.h>
    %}
    %option reentrant
    %option extra-type="struct stat *"
    %%

    __filesize__     printf( "%ld", (long) yyextra->st_size  );
    __lastmod__      printf( "%ld", (long) yyextra->st_mtime );
    %%
    void scan_file( char* filename )
    {
        yyscan_t scanner;
        struct stat buf;
        FILE *in;

        in = fopen( filename, "r" );
        stat( filename, &buf );

        /* The extra data is a pointer, so pass the address of buf. */
        yylex_init_extra( &buf, &scanner );
        yyset_in( in, scanner );
        yylex( scanner );
        yylex_destroy( scanner );

        fclose( in );
    }
@end verbatim
@end example


@node About yyscan_t,  , Extra Data, Reentrant Detail
@subsection About yyscan_t

@tindex yyscan_t (reentrant only)
@code{yyscan_t} is defined as:

@example
@verbatim
    typedef void* yyscan_t;
@end verbatim
@end example

It is initialized by @code{yylex_init()} to point to
an internal structure. You should never access this value
directly. In particular, you should never attempt to free it
(use @code{yylex_destroy()} instead).
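
The opaque-handle arrangement described in this section can be modelled
in a few lines of plain C. This is a sketch under stated assumptions,
not the code that @code{flex} generates: @code{demo_scan_t},
@code{struct demo_guts}, its members, the @code{demo_text} macro and the
@code{demo_get_*}/@code{demo_set_*} functions are all hypothetical names
chosen to mirror @code{yyscan_t}, the internal struct, @code{yytext} and
the accessor methods.

@example
@verbatim
```c
#include <stddef.h>

typedef void *demo_scan_t;             /* plays the role of yyscan_t */

struct demo_guts {                     /* hypothetical internal state */
    char *text_r;
    int   leng_r;
    void *extra_r;
};

/* Inside "actions", a macro reaches into the struct through a local
   variable named scanner, much as yytext does through yyscanner. */
#define demo_text (((struct demo_guts *)scanner)->text_r)

/* Outside actions, callers go through accessor functions instead. */
char *demo_get_text(demo_scan_t scanner)
{
    return demo_text;
}

int demo_get_leng(demo_scan_t scanner)
{
    return ((struct demo_guts *)scanner)->leng_r;
}

void *demo_get_extra(demo_scan_t scanner)
{
    return ((struct demo_guts *)scanner)->extra_r;
}

void demo_set_extra(void *user_data, demo_scan_t scanner)
{
    ((struct demo_guts *)scanner)->extra_r = user_data;
}
```
@end verbatim
@end example

Because callers only ever hold a @code{void *}, the layout of the
internal struct can change without affecting them, which is one reason
to use the accessor functions rather than casting the handle yourself.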

@node Reentrant Functions,  , Reentrant Detail, Reentrant
@section Functions and Macros Available in Reentrant C Scanners

The following functions are available in a reentrant scanner:

@findex yyget_text
@findex yyget_leng
@findex yyget_in
@findex yyget_out
@findex yyget_lineno
@findex yyset_in
@findex yyset_out
@findex yyset_lineno
@findex yyget_debug
@findex yyset_debug
@findex yyget_extra
@findex yyset_extra

@example
@verbatim
    char *yyget_text ( yyscan_t scanner );
    int yyget_leng ( yyscan_t scanner );
    FILE *yyget_in ( yyscan_t scanner );
    FILE *yyget_out ( yyscan_t scanner );
    int yyget_lineno ( yyscan_t scanner );
    YY_EXTRA_TYPE yyget_extra ( yyscan_t scanner );
    int  yyget_debug ( yyscan_t scanner );

    void yyset_debug ( int flag, yyscan_t scanner );
    void yyset_in  ( FILE * in_str , yyscan_t scanner );
    void yyset_out  ( FILE * out_str , yyscan_t scanner );
    void yyset_lineno ( int line_number , yyscan_t scanner );
    void yyset_extra ( YY_EXTRA_TYPE user_defined , yyscan_t scanner );
@end verbatim
@end example

There are no ``set'' functions for @code{yytext} and @code{yyleng}. This is
intentional.

The following macro shortcuts are available in actions in a reentrant
scanner:

@example
@verbatim
    yytext
    yyleng
    yyin
    yyout
    yylineno
    yyextra
    yy_flex_debug
@end verbatim
@end example

@cindex yylineno, in a reentrant scanner
In a reentrant C scanner, support for @code{yylineno} is always present
(i.e., you may access @code{yylineno}), but the value is never modified by
@code{flex} unless @code{%option yylineno} is enabled. This is to allow
the user to maintain the line count independently of @code{flex}.

@anchor{bison-functions}
The following functions and macros are made available when @code{%option
bison-bridge} (@samp{--bison-bridge}) is specified:

@example
@verbatim
    YYSTYPE * yyget_lval ( yyscan_t scanner );
    void yyset_lval ( YYSTYPE * yylvalp , yyscan_t scanner );
    yylval
@end verbatim
@end example

The following functions and macros are made available
when @code{%option bison-locations} (@samp{--bison-locations}) is specified:

@example
@verbatim
    YYLTYPE *yyget_lloc ( yyscan_t scanner );
    void yyset_lloc ( YYLTYPE * yyllocp , yyscan_t scanner );
    yylloc
@end verbatim
@end example

Support for @code{yylval} assumes that @code{YYSTYPE} is a valid type. Support for
@code{yylloc} assumes that @code{YYLTYPE} is a valid type. Typically, these types are
generated by @code{bison}, and are included in section 1 of the @code{flex}
input.

@node Lex and Posix, Memory Management, Reentrant, Top
@chapter Incompatibilities with Lex and Posix

@cindex POSIX and lex
@cindex lex (traditional) and POSIX

@code{flex} is a rewrite of the AT&T Unix @emph{lex} tool (the two
implementations do not share any code, though), with some extensions and
incompatibilities, both of which are of concern to those who wish to
write scanners acceptable to both implementations. @code{flex} is fully
compliant with the POSIX @code{lex} specification, except that when
using @code{%pointer} (the default), a call to @code{unput()} destroys
the contents of @code{yytext}, which is counter to the POSIX
specification. In this section we discuss all of the known areas of
incompatibility between @code{flex}, AT&T @code{lex}, and the POSIX
specification. @code{flex}'s @samp{-l} option turns on maximum
compatibility with the original AT&T @code{lex} implementation, at the
cost of a major loss in the generated scanner's performance.
We note
below which incompatibilities can be overcome using the @samp{-l}
option. @code{flex} is fully compatible with @code{lex} with the
following exceptions:

@itemize
@item
The undocumented @code{lex} scanner internal variable @code{yylineno} is
not supported unless @samp{-l} or @code{%option yylineno} is used.

@item
@code{yylineno} should be maintained on a per-buffer basis, rather than
a per-scanner (single global variable) basis.

@item
@code{yylineno} is not part of the POSIX specification.

@item
The @code{input()} routine is not redefinable, though it may be called
to read characters following whatever has been matched by a rule. If
@code{input()} encounters an end-of-file the normal @code{yywrap()}
processing is done. A ``real'' end-of-file is returned by
@code{input()} as @code{EOF}.

@item
Input is instead controlled by defining the @code{YY_INPUT()} macro.

@item
The @code{flex} restriction that @code{input()} cannot be redefined is
in accordance with the POSIX specification, which simply does not
specify any way of controlling the scanner's input other than by making
an initial assignment to @file{yyin}.

@item
The @code{unput()} routine is not redefinable. This restriction is in
accordance with POSIX.

@item
@code{flex} scanners are not as reentrant as @code{lex} scanners.
In
particular, if you have an interactive scanner and an interrupt handler
which long-jumps out of the scanner, and the scanner is subsequently
called again, you may get the following message:

@cindex error messages, end of buffer missed
@example
@verbatim
    fatal flex scanner internal error--end of buffer missed
@end verbatim
@end example

To reenter the scanner, first use:

@cindex restarting the scanner
@example
@verbatim
    yyrestart( yyin );
@end verbatim
@end example

Note that this call will throw away any buffered input; usually this
isn't a problem with an interactive scanner. @xref{Reentrant}, for
@code{flex}'s reentrant API.

@item
Also note that @code{flex} C++ scanner classes
@emph{are}
reentrant, so if using C++ is an option for you, you should use
them instead. @xref{Cxx}, and @ref{Reentrant} for details.

@item
@code{output()} is not supported. Output from the @b{ECHO} macro is
done to the file-pointer @code{yyout} (default @file{stdout}).

@item
@code{output()} is not part of the POSIX specification.

@item
@code{lex} does not support exclusive start conditions (%x), though they
are in the POSIX specification.

@item
When definitions are expanded, @code{flex} encloses them in parentheses.
With @code{lex}, the following:

@cindex name definitions, not POSIX
@example
@verbatim
    NAME    [A-Z][A-Z0-9]*
    %%
    foo{NAME}?      printf( "Found it\n" );
    %%
@end verbatim
@end example

will not match the string @samp{foo} because when the macro is expanded
the rule is equivalent to @samp{foo[A-Z][A-Z0-9]*?} and the precedence
is such that the @samp{?} is associated with @samp{[A-Z0-9]*}. With
@code{flex}, the rule will be expanded to @samp{foo([A-Z][A-Z0-9]*)?}
and so the string @samp{foo} will match.

@item
Note that if the definition begins with @samp{^} or ends with @samp{$}
then it is @emph{not} expanded with parentheses, to allow these
operators to appear in definitions without losing their special
meanings. But the @samp{<s>}, @samp{/}, and @code{<<EOF>>} operators
cannot be used in a @code{flex} definition.

@item
Using @samp{-l} results in the @code{lex} behavior of no parentheses
around the definition.

@item
The POSIX specification is that the definition be enclosed in parentheses.

@item
Some implementations of @code{lex} allow a rule's action to begin on a
separate line, if the rule's pattern has trailing whitespace:

@cindex patterns and actions on different lines
@example
@verbatim
    %%
    foo|bar<space here>
      { foobar_action();}
@end verbatim
@end example

@code{flex} does not support this feature.

@item
The @code{lex} @code{%r} (generate a Ratfor scanner) option is not
supported. It is not part of the POSIX specification.

@item
After a call to @code{unput()}, @emph{yytext} is undefined until the
next token is matched, unless the scanner was built using @code{%array}.
This is not the case with @code{lex} or the POSIX specification. The
@samp{-l} option does away with this incompatibility.

@item
The precedence of the @samp{@{,@}} (numeric range) operator is
different. The AT&T and POSIX specifications of @code{lex}
interpret @samp{abc@{1,3@}} as ``match one, two,
or three occurrences of @samp{abc}'', whereas @code{flex} interprets it
as ``match @samp{ab} followed by one, two, or three occurrences of
@samp{c}''. The @samp{-l} and @samp{--posix} options do away with this
incompatibility.

@item
The precedence of the @samp{^} operator is different.
@code{lex}
interprets @samp{^foo|bar} as ``match either @samp{foo} at the beginning of a
line, or @samp{bar} anywhere'', whereas @code{flex} interprets it as ``match
either @samp{foo} or @samp{bar} if they come at the beginning of a
line''.  The latter is in agreement with the POSIX specification.

@item
The special table-size declarations such as @code{%a} supported by
@code{lex} are not required by @code{flex} scanners.  @code{flex}
ignores them.

@item
The name @code{FLEX_SCANNER} is @code{#define}'d so scanners may be
written for use with either @code{flex} or @code{lex}.  Scanners also
define @code{YY_FLEX_MAJOR_VERSION}, @code{YY_FLEX_MINOR_VERSION}
and @code{YY_FLEX_SUBMINOR_VERSION}
indicating which version of @code{flex} generated the scanner.  For
example, for the 2.5.22 release, these defines would be 2, 5 and 22
respectively.  If the version of @code{flex} being used is a beta
version, then the symbol @code{FLEX_BETA} is defined.

@item
The symbols @samp{[[} and @samp{]]} in the code sections of the input
may conflict with the m4 delimiters.  @xref{M4 Dependency}.


@end itemize

@cindex POSIX compliance
@cindex non-POSIX features of flex
The following @code{flex} features are not included in @code{lex} or the
POSIX specification:

@itemize
@item
C++ scanners
@item
%option
@item
start condition scopes
@item
start condition stacks
@item
interactive/non-interactive scanners
@item
yy_scan_string() and friends
@item
yyterminate()
@item
yy_set_interactive()
@item
yy_set_bol()
@item
YY_AT_BOL()
@item
<<EOF>>
@item
<*>
@item
YY_DECL
@item
YY_START
@item
YY_USER_ACTION
@item
YY_USER_INIT
@item
#line directives
@item
%@{@}'s around actions
@item
reentrant C API
@item
multiple actions on a line
@item
almost all of the @code{flex} command-line options
@end itemize

The feature ``multiple actions on a line''
refers to the fact that with @code{flex} you can put multiple actions on
the same line, separated with semi-colons, while with @code{lex}, the
following:

@example
@verbatim
    foo    handle_foo(); ++num_foos_seen;
@end verbatim
@end example

is (rather surprisingly) truncated to

@example
@verbatim
    foo    handle_foo();
@end verbatim
@end example

@code{flex} does not truncate the action.  Actions that are not enclosed
in braces are simply terminated at the end of the line.

@node Memory Management, Serialized Tables, Lex and Posix, Top
@chapter Memory Management

@cindex memory management
@anchor{memory-management}
This chapter describes how flex handles dynamic memory, and how you can
override the default behavior.
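
As a quick orientation before the details, here is a minimal sketch of a
complete scanner that touches the two knobs this chapter discusses most:
the size of the input buffer (@code{YY_BUF_SIZE}) and the final cleanup
call (@code{yylex_destroy}).  The value 4096 is an arbitrary assumption
made for illustration, not a recommendation, and the @code{#undef} is only
a precaution in case a default has already been defined at that point in
the generated file.

@example
@verbatim
%{
/* Assumption: no token in this input comes close to 4096 bytes. */
#undef  YY_BUF_SIZE
#define YY_BUF_SIZE 4096
%}

%option noyywrap

%%
.|\n    ;       /* discard all input */
%%

int main (void)
{
    yylex ();           /* the first call performs the initial allocations */
    yylex_destroy ();   /* release the memory flex allocated */
    return 0;
}
@end verbatim
@end example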

@menu
* The Default Memory Management::
* Overriding The Default Memory Management::
* A Note About yytext And Memory::
@end menu

@node The Default Memory Management, Overriding The Default Memory Management, Memory Management, Memory Management
@section The Default Memory Management

Flex allocates dynamic memory during initialization, and once in a while from
within a call to yylex(). Initialization takes place during the first call to
yylex(). Thereafter, flex may reallocate more memory if it needs to enlarge a
buffer. As of version 2.5.9 Flex will clean up all memory when you call @code{yylex_destroy}.
@xref{faq-memory-leak}.

Flex allocates dynamic memory for five purposes, listed below.@footnote{The
quantities given here are approximate, and may vary due to host architecture,
compiler configuration, or due to future enhancements to flex.}

@table @asis

@item 16kB for the input buffer.
Flex allocates memory for the character buffer used to perform pattern
matching.  Flex must read ahead from the input stream and store it in a large
character buffer.  This buffer is typically the largest chunk of dynamic memory
flex consumes. This buffer will grow if necessary, doubling the size each time.
Flex frees this memory when you call yylex_destroy().  The default size of this
buffer (16384 bytes) is almost always too large.  The ideal size for this
buffer is the length of the longest token expected, in bytes, plus a little more.  Flex will allocate a few
extra bytes for housekeeping. Currently, to override the size of the input buffer
you must @code{#define YY_BUF_SIZE} to whatever number of bytes you want. We don't plan
to change this in the near future, but we reserve the right to do so if we ever add a more robust memory management
API.

@item 64kB for the REJECT state.
This will only be allocated if you use REJECT.
The size is large enough to hold the same number of states as characters in the input buffer. If you override the size of the
input buffer (via @code{YY_BUF_SIZE}), then you automatically override the size of this buffer as well.

@item 100 bytes for the start condition stack.
Flex allocates memory for the start condition stack. This is the stack used
for pushing start states, i.e., with yy_push_state(). It will grow if
necessary.  Since the states are simply integers, this stack doesn't consume
much memory.  This stack is not present if @code{%option stack} is not
specified.  You will rarely need to tune this buffer. The ideal size for this
stack is the maximum depth expected.  The memory for this stack is
automatically destroyed when you call yylex_destroy(). @xref{option-stack}.

@item 40 bytes for each YY_BUFFER_STATE.
Flex allocates memory for each YY_BUFFER_STATE. The buffer state itself
is about 40 bytes, plus an additional large character buffer (described above).
The initial buffer state is created during initialization, and a new one is
created with each call to yy_create_buffer(). You can't tune the size of this,
but you can tune the
character buffer as described above. Any buffer state that you explicitly
create by calling yy_create_buffer() is @emph{NOT} destroyed automatically. You
must call yy_delete_buffer() to free the memory. The exception to this rule is
that flex will delete the current buffer automatically when you call
yylex_destroy(). If you delete the current buffer, be sure to set it to NULL.
That way, flex will not try to delete the buffer a second time (possibly
crashing your program!). At the time of this writing, flex does not provide a
growable stack for the buffer states.  You have to manage that yourself.
@xref{Multiple Input Buffers}.

@item 84 bytes for the reentrant scanner guts.
Flex allocates about 84 bytes for the reentrant scanner structure when
you call yylex_init(). It is destroyed when the user calls yylex_destroy().

@end table


@node Overriding The Default Memory Management, A Note About yytext And Memory, The Default Memory Management, Memory Management
@section Overriding The Default Memory Management

@cindex yyalloc, overriding
@cindex yyrealloc, overriding
@cindex yyfree, overriding

Flex calls the functions @code{yyalloc}, @code{yyrealloc}, and @code{yyfree}
when it needs to allocate or free memory. By default, these functions are
wrappers around the standard C functions, @code{malloc}, @code{realloc}, and
@code{free}, respectively. You can override the default implementations by telling
flex that you will provide your own implementations.

To override the default implementations, you must do two things:

@enumerate

@item Suppress the default implementations by specifying one or more of the
following options:

@itemize
@opindex noyyalloc
@item @code{%option noyyalloc}
@item @code{%option noyyrealloc}
@item @code{%option noyyfree}
@end itemize

@item Provide your own implementation of the following functions: @footnote{It
is not necessary to override all (or any) of the memory management routines.
You may, for example, override @code{yyrealloc}, but not @code{yyfree} or
@code{yyalloc}.}

@example
@verbatim
// For a non-reentrant scanner
void * yyalloc (size_t bytes);
void * yyrealloc (void * ptr, size_t bytes);
void   yyfree (void * ptr);

// For a reentrant scanner
void * yyalloc (size_t bytes, void * yyscanner);
void * yyrealloc (void * ptr, size_t bytes, void * yyscanner);
void   yyfree (void * ptr, void * yyscanner);
@end verbatim
@end example

@end enumerate

In the following example, we will override all three memory routines. We assume
that there is a custom allocator with garbage collection. In order to make this
example interesting, we will use a reentrant scanner, passing a pointer to the
custom allocator through @code{yyextra}.

@cindex overriding the memory routines
@example
@verbatim
%{
#include "some_allocator.h"
%}

/* Suppress the default implementations. */
%option noyyalloc noyyrealloc noyyfree
%option reentrant

/* Initialize the allocator. */
%{
#define YY_EXTRA_TYPE  struct allocator*
#define YY_USER_INIT  yyextra = allocator_create();
%}

%%
.|\n   ;
%%

/* Provide our own implementations. */
void * yyalloc (size_t bytes, void* yyscanner) {
    return allocator_alloc (yyextra, bytes);
}

void * yyrealloc (void * ptr, size_t bytes, void* yyscanner) {
    /* Pass the old block along so the allocator can resize it. */
    return allocator_realloc (yyextra, ptr, bytes);
}

void yyfree (void * ptr, void * yyscanner) {
    /* Do nothing -- we leave it to the garbage collector.
     */
}

@end verbatim
@end example


@node A Note About yytext And Memory,  , Overriding The Default Memory Management, Memory Management
@section A Note About yytext And Memory

@cindex yytext, memory considerations

When flex finds a match, @code{yytext} points to the first character of the
match in the input buffer. The string itself is part of the input buffer, and
is @emph{NOT} allocated separately. The value of yytext will be overwritten the next
time yylex() is called. In short, the value of yytext is only valid from within
the matched rule's action.

Often, you want the value of yytext to persist for later processing, e.g., by a
parser with non-zero lookahead. In order to preserve yytext, you will have to
copy it with strdup() or a similar function. But this introduces some headache
because your parser is now responsible for freeing the copy of yytext. If you
use a yacc or bison parser (commonly used with flex), you will discover that
the error recovery mechanisms can cause memory to be leaked.

To prevent memory leaks from strdup'd yytext, you will have to track the memory
somehow. Our experience has shown that a garbage collection mechanism or a
pooled memory mechanism will save you a lot of grief when writing parsers.

@node Serialized Tables, Diagnostics, Memory Management, Top
@chapter Serialized Tables
@cindex serialization
@cindex memory, serialized tables

@anchor{serialization}
A @code{flex} scanner has the ability to save the DFA tables to a file, and
load them at runtime when needed.  The motivation for this feature is to reduce
the runtime memory footprint.  Traditionally, these tables have been compiled into
the scanner as C arrays, and are sometimes quite large.  Since the tables are
compiled into the scanner, the memory used by the tables can never be freed.
This is a waste of memory, especially if an application uses several scanners,
but none of them at the same time.

The serialization feature allows the tables to be loaded at runtime, before
scanning begins. The tables may be discarded when scanning is finished.

@menu
* Creating Serialized Tables::
* Loading and Unloading Serialized Tables::
* Tables File Format::
@end menu

@node Creating Serialized Tables, Loading and Unloading Serialized Tables, Serialized Tables, Serialized Tables
@section Creating Serialized Tables
@cindex tables, creating serialized
@cindex serialization of tables

You may create a scanner with serialized tables by specifying:

@example
@verbatim
    %option tables-file=FILE
or
    --tables-file=FILE
@end verbatim
@end example

These options instruct flex to save the DFA tables to the file @var{FILE}. The tables
will @emph{not} be embedded in the generated scanner. The scanner will not
function on its own; it depends on the serialized tables. You must
load the tables from this file at runtime before you can scan anything.

If you do not specify a filename to @code{--tables-file}, the tables will be
saved to @file{lex.yy.tables}, where @samp{yy} is the appropriate prefix.

If your project uses several different scanners, you can concatenate the
serialized tables into one file, and flex will find the correct set of tables,
using the scanner prefix as part of the lookup key. An example follows:

@cindex serialized tables, multiple scanners
@example
@verbatim
$ flex --tables-file --prefix=cpp cpp.l
$ flex --tables-file --prefix=c   c.l
$ cat lex.cpp.tables lex.c.tables  >  all.tables
@end verbatim
@end example

The above example created two scanners, @samp{cpp}, and @samp{c}.
Since we did
not specify a filename, the tables were serialized to @file{lex.cpp.tables} and
@file{lex.c.tables}, respectively. Then, we concatenated the two files
together into @file{all.tables}, which we will distribute with our project. At
runtime, we will open the file and tell flex to load the tables from it.  Flex
will find the correct tables automatically. (See the next section.)

@node Loading and Unloading Serialized Tables, Tables File Format, Creating Serialized Tables, Serialized Tables
@section Loading and Unloading Serialized Tables
@cindex tables, loading and unloading
@cindex loading tables at runtime
@cindex tables, freeing
@cindex freeing tables
@cindex memory, serialized tables

If you've built your scanner with @code{%option tables-file}, then you must
load the scanner tables at runtime. This can be accomplished with the following
function:

@deftypefun int yytables_fload (FILE* @var{fp} [, yyscan_t @var{scanner}])
Locates scanner tables in the stream pointed to by @var{fp} and loads them.
Memory for the tables is allocated via @code{yyalloc}.  You must call this
function before the first call to @code{yylex}. The argument @var{scanner}
only appears in the reentrant scanner.
This function returns @samp{0} (zero) on success, or non-zero on error.
@end deftypefun

The loaded tables are @strong{not} automatically destroyed (unloaded) when you
call @code{yylex_destroy}. The reason is that you may create several scanners
of the same type (in a reentrant scanner), each of which needs access to these
tables.  To avoid a nasty memory leak, you must call the following function:

@deftypefun int yytables_destroy ([yyscan_t @var{scanner}])
Unloads the scanner tables. The tables must be loaded again before you can scan
any more data.  The argument @var{scanner} only appears in the reentrant
scanner.
This function returns @samp{0} (zero) on success, or non-zero on
error.
@end deftypefun

@strong{The functions @code{yytables_fload} and @code{yytables_destroy} are not
thread-safe.} You must ensure that these functions are called exactly once (for
each scanner type) in a threaded program, before any thread calls @code{yylex}.
After the tables are loaded, they are never written to, and no thread
protection is required thereafter -- until you destroy them.

@node Tables File Format,  , Loading and Unloading Serialized Tables, Serialized Tables
@section Tables File Format
@cindex tables, file format
@cindex file format, serialized tables

This section defines the file format of serialized @code{flex} tables.

The tables format allows for one or more sets of tables to be
specified, where each set corresponds to a given scanner. Scanners are
indexed by name, as described below. The file format is as follows:

@example
@verbatim
                 TABLE SET 1
                +-------------------------------+
        Header  | uint32          th_magic;     |
                | uint32          th_hsize;     |
                | uint32          th_ssize;     |
                | uint16          th_flags;     |
                | char            th_version[]; |
                | char            th_name[];    |
                | uint8           th_pad64[];   |
                +-------------------------------+
        Table 1 | uint16          td_id;        |
                | uint16          td_flags;     |
                | uint32          td_hilen;     |
                | uint32          td_lolen;     |
                | void            td_data[];    |
                | uint8           td_pad64[];   |
                +-------------------------------+
        Table 2 |                               |
           .    .                               .
           .    .                               .
           .    .                               .
           .    .                               .
        Table n |                               |
                +-------------------------------+
                 TABLE SET 2
                      .
                      .
                      .
                 TABLE SET N
@end verbatim
@end example

The above diagram shows that a complete set of tables consists of a header
followed by multiple individual tables. Furthermore, multiple complete sets may
be present in the same file, each set with its own header and tables.
The sets
are contiguous in the file. The only way to know if another set follows is to
check the next four bytes for the magic number (or check for EOF). The header
and tables sections are padded to 64-bit boundaries. Below we describe each
field in detail. This format does not specify how the scanner will expand the
given data, e.g., data may be serialized as int8, but expanded to an int32
array at runtime. This is to reduce the size of the serialized data where
possible.  Remember, @emph{all integer values are in network byte order}.

@noindent
Fields of a table header:

@table @code
@item th_magic
Magic number, always 0xF13C57B1.

@item th_hsize
Size of this entire header, in bytes, including all fields plus any padding.

@item th_ssize
Size of this entire set, in bytes, including the header, all tables, plus
any padding.

@item th_flags
Bit flags for this table set. Currently unused.

@item th_version[]
Flex version in NUL-terminated string format, e.g., @samp{2.5.13a}. This is
the version of flex that was used to create the serialized tables.

@item th_name[]
Contains the name of this table set. The default is @samp{yytables},
and is prefixed accordingly, e.g., @samp{footables}. Must be NUL-terminated.

@item th_pad64[]
Zero or more NUL bytes, padding the entire header to the next 64-bit boundary
as calculated from the beginning of the header.
@end table

@noindent
Fields of a table:

@table @code
@item td_id
Specifies the table identifier.
Possible values are:
@table @code
@item YYTD_ID_ACCEPT (0x01)
@code{yy_accept}
@item YYTD_ID_BASE (0x02)
@code{yy_base}
@item YYTD_ID_CHK (0x03)
@code{yy_chk}
@item YYTD_ID_DEF (0x04)
@code{yy_def}
@item YYTD_ID_EC (0x05)
@code{yy_ec}
@item YYTD_ID_META (0x06)
@code{yy_meta}
@item YYTD_ID_NUL_TRANS (0x07)
@code{yy_NUL_trans}
@item YYTD_ID_NXT (0x08)
@code{yy_nxt}. This array may be two dimensional. See the @code{td_hilen}
field below.
@item YYTD_ID_RULE_CAN_MATCH_EOL (0x09)
@code{yy_rule_can_match_eol}
@item YYTD_ID_START_STATE_LIST (0x0A)
@code{yy_start_state_list}. This array is handled specially because it is an
array of pointers to structs. See the @code{td_flags} field below.
@item YYTD_ID_TRANSITION (0x0B)
@code{yy_transition}. This array is handled specially because it is an array of
structs. See the @code{td_lolen} field below.
@item YYTD_ID_ACCLIST (0x0C)
@code{yy_acclist}
@end table

@item td_flags
Bit flags describing how to interpret the data in @code{td_data}.
The data arrays are one-dimensional by default, but may be
two dimensional as specified in the @code{td_hilen} field.

@table @code
@item YYTD_DATA8 (0x01)
The data is serialized as an array of type int8.
@item YYTD_DATA16 (0x02)
The data is serialized as an array of type int16.
@item YYTD_DATA32 (0x04)
The data is serialized as an array of type int32.
@item YYTD_PTRANS (0x08)
The data is a list of indexes of entries in the expanded @code{yy_transition}
array.  Each index should be expanded to a pointer to the corresponding entry
in the @code{yy_transition} array. We count on the fact that the
@code{yy_transition} array has already been seen.
@item YYTD_STRUCT (0x10)
The data is a list of yy_trans_info structs, each of which consists of
two integers.
There is no padding between struct elements or between structs.
The type of each member is determined by the @code{YYTD_DATA*} bits.
@end table

@item td_hilen
If @code{td_hilen} is non-zero, then the data is a two-dimensional array.
Otherwise, the data is a one-dimensional array. @code{td_hilen} contains the
number of elements in the higher dimensional array, and @code{td_lolen} contains
the number of elements in the lowest dimension.

Conceptually, @code{td_data} is either @code{sometype td_data[td_lolen]}, or
@code{sometype td_data[td_hilen][td_lolen]}, where @code{sometype} is specified
by the @code{td_flags} field.  It is possible for both @code{td_lolen} and
@code{td_hilen} to be zero, in which case @code{td_data} is a zero length
array, and no data is loaded, i.e., this table is simply skipped. Flex does not
currently generate tables of zero length.

@item td_lolen
Specifies the number of elements in the lowest dimension array. If this is
a one-dimensional array, then it is simply the number of elements in this array.
The element size is determined by the @code{td_flags} field.

@item td_data[]
The table data. This array may be a one- or two-dimensional array, of type
@code{int8}, @code{int16}, @code{int32}, @code{struct yy_trans_info}, or
@code{struct yy_trans_info*},  depending upon the values in the
@code{td_flags}, @code{td_hilen}, and @code{td_lolen} fields.

@item td_pad64[]
Zero or more NUL bytes, padding the entire table to the next 64-bit boundary as
calculated from the beginning of this table.
@end table

@node Diagnostics, Limitations, Serialized Tables, Top
@chapter Diagnostics

@cindex error reporting, diagnostic messages
@cindex warnings, diagnostic messages

The following is a list of @code{flex} diagnostic messages:

@itemize
@item
@samp{warning, rule cannot be matched} indicates that the given rule
cannot be matched because it follows other rules that will always match
the same text as it.  For example, in the following @samp{foo} cannot be
matched because it comes after an identifier ``catch-all'' rule:

@cindex warning, rule cannot be matched
@example
@verbatim
    [a-z]+    got_identifier();
    foo       got_foo();
@end verbatim
@end example

Using @code{REJECT} in a scanner suppresses this warning.

@item
@samp{warning, -s option given but default rule can be matched} means
that it is possible (perhaps only in a particular start condition) that
the default rule (match any single character) is the only one that will
match a particular input.  Since @samp{-s} was given, presumably this is
not intended.

@item
@code{reject_used_but_not_detected undefined} or
@code{yymore_used_but_not_detected undefined}. These errors can occur
at compile time.  They indicate that the scanner uses @code{REJECT} or
@code{yymore()} but that @code{flex} failed to notice the fact, meaning
that @code{flex} scanned the first two sections looking for occurrences
of these actions and failed to find any, but somehow you snuck some in
(via a @code{#include} file, for example).  Use @code{%option reject} or
@code{%option yymore} to indicate to @code{flex} that you really do use
these features.

@item
@samp{flex scanner jammed}.  A scanner compiled with
@samp{-s} has encountered an input string which wasn't matched by any of
its rules.  This error can also occur due to internal problems.

@item
@samp{token too large, exceeds YYLMAX}.  Your scanner uses @code{%array}
and one of its rules matched a string longer than the @code{YYLMAX}
constant (8K bytes by default).  You can increase the value by
#define'ing @code{YYLMAX} in the definitions section of your @code{flex}
input.

@item
@samp{scanner requires -8 flag to use the character 'x'}.  Your scanner
specification includes recognizing the 8-bit character @samp{'x'} and
you did not specify the -8 flag, and your scanner defaulted to 7-bit
because you used the @samp{-Cf} or @samp{-CF} table compression options.
See the discussion of the @samp{-7} flag, @ref{Scanner Options}, for
details.

@item
@samp{flex scanner push-back overflow}.  You used @code{unput()} to push
back so much text that the scanner's buffer could not hold both the
pushed-back text and the current token in @code{yytext}.  Ideally the
scanner should dynamically resize the buffer in this case, but at
present it does not.

@item
@samp{input buffer overflow, can't enlarge buffer because scanner uses
REJECT}.  The scanner was working on matching an extremely large token
and needed to expand the input buffer.  This doesn't work with scanners
that use @code{REJECT}.

@item
@samp{fatal flex scanner internal error--end of buffer missed}. This can
occur in a scanner which is reentered after a long-jump has jumped out
(or over) the scanner's activation frame.  Before reentering the
scanner, use:
@example
@verbatim
    yyrestart( yyin );
@end verbatim
@end example
or, as noted above, switch to using the C++ scanner class.

@item
@samp{too many start conditions in <> construct!}  You listed more start
conditions in a <> construct than exist (so you must have listed at
least one of them twice).
@end itemize

@node Limitations, Bibliography, Diagnostics, Top
@chapter Limitations

@cindex limitations of flex

Some trailing context patterns cannot be properly matched and generate
warning messages (@samp{dangerous trailing context}).  These are
patterns where the ending of the first part of the rule matches the
beginning of the second part, such as @samp{zx*/xy*}, where the @samp{x*}
matches the @samp{x} at the beginning of the trailing context.  (Note that
the POSIX draft states that the text matched by such patterns is
undefined.)

For some trailing context rules, parts which are actually
fixed-length are not recognized as such, leading to the abovementioned
performance loss.  In particular, parts using @samp{|} or @samp{@{n@}}
(such as @samp{foo@{3@}}) are always considered variable-length.

Combining trailing context with the special @samp{|} action can result
in @emph{fixed} trailing context being turned into the more expensive
@emph{variable} trailing context.  For example, in the following:

@cindex warning, dangerous trailing context
@example
@verbatim
    %%
    abc      |
    xyz/def
@end verbatim
@end example

Use of @code{unput()} invalidates yytext and yyleng, unless the
@code{%array} directive or the @samp{-l} option has been used.

Pattern-matching of @code{NUL}s is substantially slower than matching
other characters.

Dynamic resizing of the input buffer is slow, as it
entails rescanning all the text matched so far by the current (generally
huge) token.

Due to both buffering of input and read-ahead, you cannot
intermix calls to @file{<stdio.h>} routines, such as @code{getchar()},
with @code{flex} rules and expect it to work.  Call @code{input()}
instead.

The total table entries listed by the @samp{-v} flag excludes
the number of table entries needed to determine what rule has been
matched.
The number of entries is equal to the number of DFA states if
the scanner does not use @code{REJECT}, and somewhat greater than the
number of states if it does.

@code{REJECT} cannot be used with the
@samp{-f} or @samp{-F} options.

The @code{flex} internal algorithms need documentation.

@node Bibliography, FAQ, Limitations, Top
@chapter Additional Reading

You may wish to read more about the following programs:
@itemize
@item lex
@item yacc
@item sed
@item awk
@end itemize

The following books may contain material of interest:

John Levine, Tony Mason, and Doug Brown,
@emph{Lex & Yacc},
O'Reilly and Associates.  Be sure to get the 2nd edition.

M. E. Lesk and E. Schmidt,
@emph{LEX -- Lexical Analyzer Generator}

Alfred Aho, Ravi Sethi and Jeffrey Ullman, @emph{Compilers: Principles,
Techniques and Tools}, Addison-Wesley (1986).  Describes the
pattern-matching techniques used by @code{flex} (deterministic finite
automata).

@node FAQ, Appendices, Bibliography, Top
@unnumbered FAQ

From time to time, the @code{flex} maintainer receives certain
questions. Rather than repeat answers to well-understood problems, we
publish them here.

@menu
* When was flex born?::
* How do I expand backslash-escape sequences in C-style quoted strings?::
* Why do flex scanners call fileno if it is not ANSI compatible?::
* Does flex support recursive pattern definitions?::
* How do I skip huge chunks of input (tens of megabytes) while using flex?::
* Flex is not matching my patterns in the same order that I defined them.::
* My actions are executing out of order or sometimes not at all.::
* How can I have multiple input sources feed into the same scanner at the same time?::
* Can I build nested parsers that work with the same input file?::
* How can I match text only at the end of a file?::
* How can I make REJECT cascade across start condition boundaries?::
* Why cant I use fast or full tables with interactive mode?::
* How much faster is -F or -f than -C?::
* If I have a simple grammar cant I just parse it with flex?::
* Why doesn't yyrestart() set the start state back to INITIAL?::
* How can I match C-style comments?::
* The period isn't working the way I expected.::
* Can I get the flex manual in another format?::
* Does there exist a "faster" NDFA->DFA algorithm?::
* How does flex compile the DFA so quickly?::
* How can I use more than 8192 rules?::
* How do I abandon a file in the middle of a scan and switch to a new file?::
* How do I execute code only during initialization (only before the first scan)?::
* How do I execute code at termination?::
* Where else can I find help?::
* Can I include comments in the "rules" section of the file?::
* I get an error about undefined yywrap().::
* How can I change the matching pattern at run time?::
* How can I expand macros in the input?::
* How can I build a two-pass scanner?::
* How do I match any string not matched in the preceding rules?::
* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
* Is there a way to make flex treat NULL like a regular character?::
* Whenever flex can not match the input it says "flex scanner jammed".::
* Why doesn't flex have non-greedy operators like perl does?::
* Memory leak - 16386 bytes allocated by malloc.::
* How do I track the byte offset for lseek()?::
* How do I use my own I/O classes in a C++ scanner?::
* How do I skip as many chars as possible?::
* deleteme00::
* Are certain equivalent patterns faster than others?::
* Is backing up a big deal?::
* Can I fake multi-byte character support?::
* deleteme01::
* Can you discuss some flex internals?::
* unput() messes up yy_at_bol::
* The | operator is not doing what I want::
* Why can't flex understand this variable trailing context pattern?::
* The ^ operator isn't working::
* Trailing context is getting confused with trailing optional patterns::
* Is flex GNU or not?::
* ERASEME53::
* I need to scan if-then-else blocks and while loops::
* ERASEME55::
* ERASEME56::
* ERASEME57::
* Is there a repository for flex scanners?::
* How can I conditionally compile or preprocess my flex input file?::
* Where can I find grammars for lex and yacc?::
* I get an end-of-buffer message for each character scanned.::
* unnamed-faq-62::
* unnamed-faq-63::
* unnamed-faq-64::
* unnamed-faq-65::
* unnamed-faq-66::
* unnamed-faq-67::
* unnamed-faq-68::
* unnamed-faq-69::
* unnamed-faq-70::
* unnamed-faq-71::
* unnamed-faq-72::
* unnamed-faq-73::
* unnamed-faq-74::
* unnamed-faq-75::
* unnamed-faq-76::
* unnamed-faq-77::
* unnamed-faq-78::
* unnamed-faq-79::
* unnamed-faq-80::
* unnamed-faq-81::
* unnamed-faq-82::
* unnamed-faq-83::
* unnamed-faq-84::
* unnamed-faq-85::
* unnamed-faq-86::
* unnamed-faq-87::
* unnamed-faq-88::
* unnamed-faq-90::
* unnamed-faq-91::
* unnamed-faq-92::
* unnamed-faq-93::
* unnamed-faq-94::
* unnamed-faq-95::
* unnamed-faq-96::
* unnamed-faq-97::
* unnamed-faq-98::
* unnamed-faq-99::
* unnamed-faq-100::
* unnamed-faq-101::
* What is the difference between YYLEX_PARAM and YY_DECL?::
* Why do I get "conflicting types for yylex" error?::
* How do I access the values set in a Flex action from within a Bison action?::
@end menu

@node When was flex born?
@unnumberedsec When was flex born?

Vern Paxson took over
the @cite{Software Tools} lex project from Jef Poskanzer in 1982. At that point it
was written in Ratfor. Around 1987 or so, Paxson translated it into C, and
a legend was born :-).

@node How do I expand backslash-escape sequences in C-style quoted strings?
@unnumberedsec How do I expand backslash-escape sequences in C-style quoted strings?

A key point when scanning quoted strings is that you cannot (easily) write
a single rule that will precisely match the string if you allow things
like embedded escape sequences and newlines. If you try to match strings
with a single rule then you'll wind up having to rescan the string anyway
to find any escape sequences.

Instead you can use exclusive start conditions and a set of rules, one for
matching non-escaped text, one for matching a single escape, one for
matching an embedded newline, and one for recognizing the end of the
string. Each of these rules is then faced with the question of where to
put its intermediary results. The best solution is for the rules to
append their local value of @code{yytext} to the end of a ``string literal''
buffer. A rule like the escape-matcher will append to the buffer the
meaning of the escape sequence rather than the literal text in @code{yytext}.
In this way, @code{yytext} does not need to be modified at all.
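
A minimal sketch of this approach follows. It is an illustration under
stated assumptions rather than a complete scanner: the start condition
name @code{str}, the buffer names, and the fixed @code{MAX_STR_CONST}
limit are made up for the example, only the @samp{\n} and @samp{\t}
escapes are translated, and a real scanner would also need to guard
against overflowing the buffer.

@example
@verbatim
%{
#include <string.h>
#define MAX_STR_CONST 1024   /* assumed limit; no overflow check below */
static char string_buf[MAX_STR_CONST];
static char *string_buf_ptr;
%}
%option noyywrap
%x str
%%
\"              { string_buf_ptr = string_buf; BEGIN(str); }
<str>\"         { /* closing quote: terminate the buffer */
                  *string_buf_ptr = '\0';
                  BEGIN(INITIAL);
                  printf("string: %s\n", string_buf);
                }
<str>\n         { /* embedded newline: treat as unterminated */
                  fprintf(stderr, "unterminated string\n");
                  BEGIN(INITIAL);
                }
<str>\\n        { *string_buf_ptr++ = '\n'; /* meaning, not text */ }
<str>\\t        { *string_buf_ptr++ = '\t'; }
<str>\\(.|\n)   { *string_buf_ptr++ = yytext[1]; /* any other escape */ }
<str>[^\\\n\"]+ { /* non-escaped text: append yytext unchanged */
                  memcpy(string_buf_ptr, yytext, yyleng);
                  string_buf_ptr += yyleng;
                }
%%
int main(void) { return yylex(); }
@end verbatim
@end example

Each rule appends to @code{string_buf} and none of them modifies
@code{yytext}; the escape rules store the character the escape stands
for instead of the two characters that were matched.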

@node Why do flex scanners call fileno if it is not ANSI compatible?
@unnumberedsec Why do flex scanners call fileno if it is not ANSI compatible?

Flex scanners call @code{fileno()} in order to get the file descriptor
corresponding to @code{yyin}. The file descriptor may be passed to
@code{isatty()} or @code{read()}, depending upon which @code{%option}s you specified.
If your system does not have @code{fileno()} support, to get rid of the
@code{read()} call, do not specify @code{%option read}. To get rid of the @code{isatty()}
call, you must specify one of @code{%option always-interactive} or
@code{%option never-interactive}.

@node Does flex support recursive pattern definitions?
@unnumberedsec Does flex support recursive pattern definitions?

For example, a definition such as:

@example
@verbatim
%%
block   "{"({block}|{statement})*"}"
@end verbatim
@end example

No. You cannot have recursive definitions. The pattern-matching power of
regular expressions in general (and therefore flex scanners, too) is
limited. In particular, regular expressions cannot ``balance'' parentheses
to an arbitrary degree. For example, it's impossible to write a regular
expression that matches all strings containing the same number of @samp{@{}s
as @samp{@}}s. For more powerful pattern matching, you need a parser, such
as @cite{GNU bison}.

@node How do I skip huge chunks of input (tens of megabytes) while using flex?
@unnumberedsec How do I skip huge chunks of input (tens of megabytes) while using flex?

Use @code{fseek()} (or @code{lseek()}) to position @code{yyin}, then call @code{yyrestart()}.

@node Flex is not matching my patterns in the same order that I defined them.
@unnumberedsec Flex is not matching my patterns in the same order that I defined them.

@code{flex} picks the
rule that matches the most text (i.e., the longest possible input string).
This is because @code{flex} does not try the rules one at a time in the order
listed. It uses a matching technique
(``deterministic finite automata'') that actually does all of the matching
simultaneously, in parallel. (Seems impossible, but it's actually a fairly
simple technique once you understand the principles.)

A side-effect of this parallel matching is that when the input matches more
than one rule, @code{flex} scanners pick the rule that matched the @emph{most} text. This
is explained further in the manual, in the section @ref{Matching}.

If you want @code{flex} to choose a shorter match, then you can work around this
behavior by expanding your short
rule to match more text, then put back the extra:

@example
@verbatim
data_.*        yyless( 5 ); BEGIN BLOCKIDSTATE;
@end verbatim
@end example

Another fix would be to make the second rule active only during the
@code{<BLOCKIDSTATE>} start condition, and make that start condition exclusive
by declaring it with @code{%x} instead of @code{%s}.

A final fix is to change the input language so that the ambiguity for
@samp{data_} is removed, by adding characters to it that don't match the
identifier rule, or by removing characters (such as @samp{_}) from the
identifier rule so it no longer matches @samp{data_}. (Of course, you might
also not have the option of changing the input language.)

@node My actions are executing out of order or sometimes not at all.
@unnumberedsec My actions are executing out of order or sometimes not at all.

Most likely, you have (in error) placed the opening @samp{@{} of the action
block on a different line than the rule, e.g.,

@example
@verbatim
^(foo|bar)
{  <<<--- WRONG!

}
@end verbatim
@end example

@code{flex} requires that the opening @samp{@{} of an action associated with a rule
begin on the same line as does the rule. You need instead to write your rules
as follows:

@example
@verbatim
^(foo|bar)  {  // CORRECT!

}
@end verbatim
@end example

@node How can I have multiple input sources feed into the same scanner at the same time?
@unnumberedsec How can I have multiple input sources feed into the same scanner at the same time?

If @dots{}
@itemize
@item
your scanner is free of backtracking (verified using @code{flex}'s @samp{-b} flag),
@item
AND you run your scanner interactively (@samp{-I} option; default unless using special table
compression options),
@item
AND you feed it one character at a time by redefining @code{YY_INPUT} to do so,
@end itemize

then every time it matches a token, it will have exhausted its input
buffer (because the scanner is free of backtracking). This means you
can safely use @code{select()} at that point and only call @code{yylex()} for another
token if @code{select()} indicates there's data available.

That is, move the @code{select()} out from the input function to a point where
it determines whether @code{yylex()} gets called for the next token.

With this approach, you will still have problems if your input can arrive
piecemeal; @code{select()} could inform you that the beginning of a token is
available, you call @code{yylex()} to get it, but it winds up blocking waiting
for the later characters in the token.

Here's another way: Move your input multiplexing inside of @code{YY_INPUT}. That
is, whenever @code{YY_INPUT} is called, it @code{select()}'s to see where input is
available. If input is available for the scanner, it reads and returns the
next byte. If input is available from another source, it calls whatever
function is responsible for reading from that source. (If no input is
available, it blocks until some input is available.) I've used this technique in an
interpreter I wrote that both reads keyboard input using a @code{flex} scanner and
IPC traffic from sockets, and it works fine.

@node Can I build nested parsers that work with the same input file?
@unnumberedsec Can I build nested parsers that work with the same input file?

This is not going to work without some additional effort. The reason is
that @code{flex} block-buffers the input it reads from @code{yyin}. This means that the
``outermost'' @code{yylex()}, when called, will automatically slurp up the first 8K
of input available on @code{yyin}, and subsequent calls to other @code{yylex()}'s won't
see that input. You might be tempted to work around this problem by
redefining @code{YY_INPUT} to only return a small amount of text, but it turns out
that that approach is quite difficult. Instead, the best solution is to
combine all of your scanners into one large scanner, using a different
exclusive start condition for each.

@node How can I match text only at the end of a file?
@unnumberedsec How can I match text only at the end of a file?

There is no way to write a rule which is ``match this text, but only if
it comes at the end of the file''. You can fake it, though, if you happen
to have a character lying around that you don't allow in your input.
Then you redefine @code{YY_INPUT} to call your own routine which, if it sees
an @samp{EOF}, returns the magic character first (and remembers to return a
real @code{EOF} next time it's called). Then you could write:

@example
@verbatim
<COMMENT>(.|\n)*{EOF_CHAR}    /* saw comment at EOF */
@end verbatim
@end example

@node How can I make REJECT cascade across start condition boundaries?
@unnumberedsec How can I make REJECT cascade across start condition boundaries?

You can do this as follows. Suppose you have a start condition @samp{A}, and
after exhausting all of the possible matches in @samp{<A>}, you want to try
matches in @samp{<INITIAL>}. Then you could use the following:

@example
@verbatim
%x A
%%
<A>rule_that_is_long    ...; REJECT;
<A>rule                 ...; REJECT; /* shorter rule */
<A>etc.
...
<A>.|\n  {
        /* Shortest and last rule in <A>, so
         * cascaded REJECTs will eventually
         * wind up matching this rule. We want
         * to now switch to the initial state
         * and try matching from there instead.
         */
        yyless(0);    /* put back matched text */
        BEGIN(INITIAL);
        }
@end verbatim
@end example

@node Why cant I use fast or full tables with interactive mode?
@unnumberedsec Why can't I use fast or full tables with interactive mode?

One of the assumptions
flex makes is that interactive applications are inherently slow (they're
waiting on a human after all).
The restriction has to do with how the scanner detects that it must be finished scanning
a token. For interactive scanners, after scanning each character the current
state is looked up in a table (essentially) to see whether there's a chance
of another input character possibly extending the length of the match. If
not, the scanner halts. For non-interactive scanners, the end-of-token test
is much simpler, basically a compare with 0, so no memory bus cycles. Since
the test occurs in the innermost scanning loop, one would like to make it go
as fast as possible.

Still, it seems reasonable to allow the user to choose to trade off a bit
of performance in this area to gain the corresponding flexibility. There
might be another reason, though, why fast scanners don't support the
interactive option.

@node How much faster is -F or -f than -C?
@unnumberedsec How much faster is -F or -f than -C?

Much faster (factor of 2-3).

@node If I have a simple grammar cant I just parse it with flex?
@unnumberedsec If I have a simple grammar can't I just parse it with flex?

Is your grammar recursive? That's almost always a sign that you're
better off using a parser/scanner rather than just trying to use a scanner
alone.

@node Why doesn't yyrestart() set the start state back to INITIAL?
@unnumberedsec Why doesn't yyrestart() set the start state back to INITIAL?

There are two reasons. The first is that there might
be programs that rely on the start state not changing across file changes.
The second is that beginning with @code{flex} version 2.4, use of @code{yyrestart()} is no longer required,
so fixing the problem there doesn't solve the more general problem.

@node How can I match C-style comments?
@unnumberedsec How can I match C-style comments?

You might be tempted to try something like this:

@example
@verbatim
"/*".*"*/"       // WRONG!
@end verbatim
@end example

or, worse, this:

@example
@verbatim
"/*"(.|\n)*"*/"  // WRONG!
@end verbatim
@end example

The above rules will eat too much input, and blow up on things like:

@example
@verbatim
/* a comment */ do_my_thing( "oops */" );
@end verbatim
@end example

Here is one way which allows you to track line information:

@example
@verbatim
%x IN_COMMENT
%%
<INITIAL>{
"/*"      BEGIN(IN_COMMENT);
}
<IN_COMMENT>{
"*/"      BEGIN(INITIAL);
[^*\n]+   // eat comment in chunks
"*"       // eat the lone star
\n        yylineno++;
}
@end verbatim
@end example

@node The period isn't working the way I expected.
@unnumberedsec The '.' isn't working the way I expected.

Here are some tips for using @samp{.}:

@itemize
@item
A common mistake is to place the grouping parenthesis AFTER an operator, when
you really meant to place the parenthesis BEFORE the operator, e.g., you
probably want this @code{(foo|bar)+} and NOT this @code{(foo|bar+)}.

The first pattern matches the words @samp{foo} or @samp{bar} one or more
times, e.g., it matches the text @samp{barfoofoobarfoo}. The
second pattern matches a single instance of @code{foo}, or @samp{ba}
followed by one or more @samp{r}s, e.g., it matches the text @code{barrrr}.
@item
A @samp{.} inside @samp{[]}'s just means a literal @samp{.} (period),
and NOT ``any character except newline''.
@item
Remember that @samp{.} matches any character EXCEPT @samp{\n} (and @samp{EOF}).
If you really want to match ANY character, including newlines, then use @code{(.|\n)}.
Beware that the regex @code{(.|\n)+} will match your entire input!
@item
Finally, if you want to match a literal @samp{.} (a period), then use @samp{[.]} or @samp{"."}.
@end itemize

@node Can I get the flex manual in another format?
@unnumberedsec Can I get the flex manual in another format?

The @code{flex} source distribution includes a texinfo manual. You are
free to convert that texinfo into whatever format you desire. The
@code{texinfo} package includes tools for conversion to a number of formats.

@node Does there exist a "faster" NDFA->DFA algorithm?
@unnumberedsec Does there exist a "faster" NDFA->DFA algorithm?

There's no way around the potential exponential running time - it
can take you exponential time just to enumerate all of the DFA states.
In practice, though, the running time is closer to linear, or sometimes
quadratic.

@node How does flex compile the DFA so quickly?
@unnumberedsec How does flex compile the DFA so quickly?

There are two big speed wins that @code{flex} uses:

@enumerate
@item
It analyzes the input rules to construct equivalence classes for those
characters that always make the same transitions. It then rewrites the NFA
using equivalence classes for transitions instead of characters. This cuts
down the NFA->DFA computation time dramatically, to the point where, for
uncompressed DFA tables, the DFA generation is often I/O bound in writing out
the tables.
@item
It maintains hash values for previously computed DFA states, so testing
whether a newly constructed DFA state is equivalent to a previously constructed
state can be done very quickly, by first comparing hash values.
@end enumerate

@node How can I use more than 8192 rules?
@unnumberedsec How can I use more than 8192 rules?

@code{Flex} is compiled with an upper limit of 8192 rules per scanner.
If you need more than 8192 rules in your scanner, you'll have to recompile @code{flex}
with the following changes in @file{flexdef.h}:

@example
@verbatim
<    #define YY_TRAILING_MASK 0x2000
<    #define YY_TRAILING_HEAD_MASK 0x4000
--
>    #define YY_TRAILING_MASK 0x20000000
>    #define YY_TRAILING_HEAD_MASK 0x40000000
@end verbatim
@end example

This should work okay as long as your C compiler uses 32-bit integers.
But you might want to think about whether using such a huge number of rules
is the best way to solve your problem.

The following may also be relevant:

With luck, you should be able to increase the definitions in @file{flexdef.h} for:

@example
@verbatim
#define JAMSTATE -32766 /* marks a reference to the state that always jams */
#define MAXIMUM_MNS 31999
#define BAD_SUBSCRIPT -32767
@end verbatim
@end example

recompile everything, and it'll all work. Flex only has these 16-bit-like
values built into it because a long time ago it was developed on a machine
with 16-bit ints. I've given this advice to others in the past but haven't
heard back from them whether it worked okay or not...

@node How do I abandon a file in the middle of a scan and switch to a new file?
@unnumberedsec How do I abandon a file in the middle of a scan and switch to a new file?

Just call @code{yyrestart(newfile)}. Be sure to reset the start state if you want a
``fresh start'', since @code{yyrestart} does NOT reset the start state back to @code{INITIAL}.

@node How do I execute code only during initialization (only before the first scan)?
@unnumberedsec How do I execute code only during initialization (only before the first scan)?

You can specify an initial action by defining the macro @code{YY_USER_INIT} (though
note that @code{yyout} may not be available at the time this macro is executed). Or you
can add to the beginning of your rules section:

@example
@verbatim
%%
    /* Must be indented! */
    static int did_init = 0;

    if ( ! did_init ){
        do_my_init();
        did_init = 1;
    }
@end verbatim
@end example

@node How do I execute code at termination?
@unnumberedsec How do I execute code at termination?

You can specify an action for the @code{<<EOF>>} rule.

@node Where else can I find help?
@unnumberedsec Where else can I find help?

You can find the flex homepage on the web at
@uref{http://flex.sourceforge.net/}. See that page for details about flex
mailing lists as well.

@node Can I include comments in the "rules" section of the file?
@unnumberedsec Can I include comments in the "rules" section of the file?

Yes, just about anywhere you want to. @xref{Comments in the Input}, for the specific syntax.

@node I get an error about undefined yywrap().
@unnumberedsec I get an error about undefined yywrap().

You must supply a @code{yywrap()} function of your own, or link to @file{libfl.a}
(which provides one), or use

@example
@verbatim
%option noyywrap
@end verbatim
@end example

in your source to say you don't want a @code{yywrap()} function.

@node How can I change the matching pattern at run time?
@unnumberedsec How can I change the matching pattern at run time?

You can't; the patterns are compiled into a static table when flex builds the scanner.

@node How can I expand macros in the input?
@unnumberedsec How can I expand macros in the input?

The best way to approach this problem is at a higher level, e.g., in the parser.

However, you can do this using multiple input buffers. In the following
fragment, @code{main_buffer} and @code{expansion_buffer} are
@code{YY_BUFFER_STATE} variables and @code{expand()} is a user-supplied
function, all assumed to be declared elsewhere.

@example
@verbatim
%%
macro/[a-z]+    {
        /* Saw the macro "macro" followed by extra stuff. */
        main_buffer = YY_CURRENT_BUFFER;
        expansion_buffer = yy_scan_string(expand(yytext));
        yy_switch_to_buffer(expansion_buffer);
        }

<<EOF>> {
        if ( expansion_buffer )
            {
            // We were doing an expansion, return to where
            // we were.
            yy_switch_to_buffer(main_buffer);
            yy_delete_buffer(expansion_buffer);
            expansion_buffer = 0;
            }
        else
            yyterminate();
        }
@end verbatim
@end example

You probably will want a stack of expansion buffers to allow nested macros.
Hopefully the idea is clear from the above, though.

@node How can I build a two-pass scanner?
@unnumberedsec How can I build a two-pass scanner?

One way to do it is to filter the first pass to a temporary file,
then process the temporary file on the second pass. You will probably see a
performance hit, due to all the disk I/O.

When you need to look far ahead like this, it almost always means
that the right solution is to build a parse tree of the entire input, then
walk it after the parse in order to generate the output. In a sense, this
is a two-pass approach, once through the text and once through the parse
tree, but the performance hit for the latter is usually an order of magnitude
smaller, since everything is already classified, in binary format, and
residing in memory.

@node How do I match any string not matched in the preceding rules?
@unnumberedsec How do I match any string not matched in the preceding rules?

One way to assign precedence is to place the more specific rules first. If
two rules would match the same input (same sequence of characters) then the
first rule listed in the @code{flex} input wins, e.g.,

@example
@verbatim
%%
foo[a-zA-Z_]+    return FOO_ID;
bar[a-zA-Z_]+    return BAR_ID;
[a-zA-Z_]+       return GENERIC_ID;
@end verbatim
@end example

Note that the rule @code{[a-zA-Z_]+} must come @emph{after} the others. It will match the
same amount of text as the more specific rules, and in that case the
@code{flex} scanner will pick the first rule listed in your scanner as the
one to match.

@node I am trying to port code from AT&T lex that uses yysptr and yysbuf.
@unnumberedsec I am trying to port code from AT&T lex that uses yysptr and yysbuf.

Those are internal variables pointing into the AT&T scanner's input buffer. I
imagine they're being manipulated in user versions of the @code{input()} and @code{unput()}
functions. If so, what you need to do is analyze those functions to figure out
what they're doing, and then replace @code{input()} with an appropriate definition of
@code{YY_INPUT}. You shouldn't need to (and must not) replace
@code{flex}'s @code{unput()} function.
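
As a rough sketch of what such a replacement can look like, suppose the
old @code{input()} simply handed out characters from a string held in
memory. The names @code{src_text} and @code{src_pos} below are
hypothetical stand-ins for whatever your old code actually read from;
only the @code{YY_INPUT(buf, result, max_size)} interface itself comes
from @code{flex}.

@example
@verbatim
%{
#include <string.h>

/* Hypothetical data source that the old input() used to read. */
static const char *src_text = "text to scan\n";
static size_t src_pos = 0;

/* Copy at most max_size bytes into buf and store the count in
 * result; a result of 0 (YY_NULL) tells the scanner it hit EOF. */
#define YY_INPUT(buf, result, max_size) \
    do { \
        size_t n_ = strlen(src_text + src_pos); \
        if (n_ > (size_t) (max_size)) \
            n_ = (size_t) (max_size); \
        memcpy((buf), src_text + src_pos, n_); \
        src_pos += n_; \
        (result) = (int) n_; \
    } while (0)
%}
@end verbatim
@end example

Because the scanner does its own buffering on top of @code{YY_INPUT},
nothing here needs to track pushed-back characters, which is why
@code{unput()} can be left alone.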

@node Is there a way to make flex treat NULL like a regular character?
@unnumberedsec Is there a way to make flex treat NULL like a regular character?

Yes, @samp{\0} and @samp{\x00} should both do the trick. Perhaps you have an ancient
version of @code{flex}. The latest release is version @value{VERSION}.

@node Whenever flex can not match the input it says "flex scanner jammed".
@unnumberedsec Whenever flex can not match the input it says "flex scanner jammed".

You need to add a rule that matches the otherwise-unmatched text,
e.g.,

@example
@verbatim
%option yylineno
%%
[[a bunch of rules here]]

.       printf("bad input character '%s' at line %d\n", yytext, yylineno);
@end verbatim
@end example

See @code{%option default} for more information.

@node Why doesn't flex have non-greedy operators like perl does?
@unnumberedsec Why doesn't flex have non-greedy operators like perl does?

A DFA can do a non-greedy match by stopping
the first time it enters an accepting state, instead of consuming input until
it determines that no further matching is possible (a ``jam'' state). This
is actually easier to implement than longest leftmost match (which flex does).

But it's also much less useful than longest leftmost match. In general,
when you find yourself wishing for non-greedy matching, that's usually a
sign that you're trying to make the scanner do some parsing. That's
generally the wrong approach, since it lacks the power to do a decent job.
Better is to either introduce a separate parser, or to split the scanner
into multiple scanners using (exclusive) start conditions.

For example, to scan a chunk delimited by @samp{BEGIN} and @samp{END}, you might have
a separate start state once you've seen the @samp{BEGIN}. In that state, you
might then have a regex that will match @samp{END} (to kick you out of the
state), and perhaps @samp{(.|\n)} to get a single character within the chunk ...

This approach also has much better error-reporting properties.

@node Memory leak - 16386 bytes allocated by malloc.
@unnumberedsec Memory leak - 16386 bytes allocated by malloc.
@anchor{faq-memory-leak}

UPDATED 2002-07-10: As of @code{flex} version 2.5.9, this leak means that you did not
call @code{yylex_destroy()}. If you are using an earlier version of @code{flex}, then read
on.

The leak is about 16426 bytes. That is, (8192 * 2 + 2) for the read-buffer, and
about 40 for @code{struct yy_buffer_state} (depending upon alignment). The leak is in
the non-reentrant C scanner only (NOT in the reentrant scanner, NOT in the C++
scanner). Since @code{flex} doesn't know when you are done, the buffer is never freed.

However, the leak won't multiply since the buffer is reused no matter how many
times you call @code{yylex()}.

If you want to reclaim the memory when you are completely done scanning, then
you might try this:

@example
@verbatim
/* For non-reentrant C scanner only. */
yy_delete_buffer(YY_CURRENT_BUFFER);
yy_init = 1;
@end verbatim
@end example

Note: @code{yy_init} is an ``internal variable'', and hasn't been tested in this
situation. It is possible that some other globals may need resetting as well.

@node How do I track the byte offset for lseek()?
@unnumberedsec How do I track the byte offset for lseek()?

@example
@verbatim
>   We thought that it would be possible to have this number through the
>   evaluation of the following expression:
>
>   seek_position = (no_buffers)*YY_READ_BUF_SIZE + yy_c_buf_p - YY_CURRENT_BUFFER->yy_ch_buf
@end verbatim
@end example

While this is the right idea, it has two problems. The first is that
it's possible that @code{flex} will request less than @code{YY_READ_BUF_SIZE} during
an invocation of @code{YY_INPUT} (or that your input source will return less
even though @code{YY_READ_BUF_SIZE} bytes were requested). The second problem
is that when refilling its internal buffer, @code{flex} keeps some characters
from the previous buffer (because usually it's in the middle of a match,
and needs those characters to construct @code{yytext} for the match once it's
done). Because of this, @code{yy_c_buf_p - YY_CURRENT_BUFFER->yy_ch_buf} won't
be exactly the number of characters already read from the current buffer.

An alternative solution is to count the number of characters you've matched
since starting to scan. This can be done by using @code{YY_USER_ACTION}. For
example,

@example
@verbatim
#define YY_USER_ACTION num_chars += yyleng;
@end verbatim
@end example

(You need to be careful to update your bookkeeping if you use @code{yymore()},
@code{yyless()}, @code{unput()}, or @code{input()}.)

@node How do I use my own I/O classes in a C++ scanner?
@unnumberedsec How do I use my own I/O classes in a C++ scanner?

When the flex C++ scanning class rewrite finally happens, then this sort of thing should become much easier.
6170 6171@cindex LexerOutput, overriding 6172@cindex LexerInput, overriding 6173@cindex overriding LexerOutput 6174@cindex overriding LexerInput 6175@cindex customizing I/O in C++ scanners 6176@cindex C++ I/O, customizing 6177You can do this by passing the various functions (such as @code{LexerInput()} 6178and @code{LexerOutput()}) NULL @code{iostream*}'s, and then 6179dealing with your own I/O classes surreptitiously (i.e., stashing them in 6180special member variables). This works because the only assumption about 6181the lexer regarding what's done with the iostream's is that they're 6182ultimately passed to @code{LexerInput()} and @code{LexerOutput}, which then do whatever 6183is necessary with them. 6184 6185@c faq edit stopped here 6186@node How do I skip as many chars as possible? 6187@unnumberedsec How do I skip as many chars as possible? 6188 6189How do I skip as many chars as possible -- without interfering with the other 6190patterns? 6191 6192In the example below, we want to skip over characters until we see the phrase 6193"endskip". The following will @emph{NOT} work correctly (do you see why not?) 6194 6195@example 6196@verbatim 6197/* INCORRECT SCANNER */ 6198%x SKIP 6199%% 6200<INITIAL>startskip BEGIN(SKIP); 6201... 6202<SKIP>"endskip" BEGIN(INITIAL); 6203<SKIP>.* ; 6204@end verbatim 6205@end example 6206 6207The problem is that the pattern .* will eat up the word "endskip." 6208The simplest (but slow) fix is: 6209 6210@example 6211@verbatim 6212<SKIP>"endskip" BEGIN(INITIAL); 6213<SKIP>. ; 6214@end verbatim 6215@end example 6216 6217The fix involves making the second rule match more, without 6218making it match "endskip" plus something else. So for example: 6219 6220@example 6221@verbatim 6222<SKIP>"endskip" BEGIN(INITIAL); 6223<SKIP>[^e]+ ; 6224<SKIP>. ;/* so you eat up e's, too */ 6225@end verbatim 6226@end example 6227 6228@c TODO: Evaluate this faq. 
@node deleteme00
@unnumberedsec deleteme00
@example
@verbatim
QUESTION:
When was flex born?

Vern Paxson took over
the Software Tools lex project from Jef Poskanzer in 1982. At that point it
was written in Ratfor. Around 1987 or so, Paxson translated it into C, and
a legend was born :-).
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Are certain equivalent patterns faster than others?
@unnumberedsec Are certain equivalent patterns faster than others?
@example
@verbatim
To: Adoram Rogel <adoram@orna.hybridge.com>
Subject: Re: Flex 2.5.2 performance questions
In-reply-to: Your message of Wed, 18 Sep 96 11:12:17 EDT.
Date: Wed, 18 Sep 96 10:51:02 PDT
From: Vern Paxson <vern>

[Note, the most recent flex release is 2.5.4, which you can get from
ftp.ee.lbl.gov. It has bug fixes over 2.5.2 and 2.5.3.]

> 1. Using the pattern
>    ([Ff](oot)?)?[Nn](ote)?(\.)?
>    instead of
>    (((F|f)oot(N|n)ote)|((N|n)ote)|((N|n)\.)|((F|f)(N|n)(\.)))
>    (in a very complicated flex program) caused the program to slow from
>    300K+/min to 100K/min (no other changes were done).

These two are not equivalent. For example, the first can match "footnote."
but the second can only match "footnote". This is almost certainly the
cause of the discrepancy - the slower scanner run is matching more tokens,
and/or having to do more backing up.

> 2. Which of these two are better: [Ff]oot or (F|f)oot ?

From a performance point of view, they're equivalent (modulo presumably
minor effects such as memory cache hit rates; and the presence of trailing
context, see below). From a space point of view, the first is slightly
preferable.

> 3. I have a pattern that look like this:
>    pats {p1}|{p2}|{p3}|...|{p50}     (50 patterns ORd)
>
>    running yet another complicated program that includes the following rule:
>    <snext>{and}/{no4}{bb}{pats}
>
>    gets me to "too complicated - over 32,000 states"...

I can't tell from this example whether the trailing context is variable-length
or fixed-length (it could be the latter if {and} is fixed-length). If it's
variable length, which flex -p will tell you, then this reflects a basic
performance problem, and if you can eliminate it by restructuring your
scanner, you will see significant improvement.

>    so I divided {pats} to {pats1}, {pats2},..., {pats5} each consists of about
>    10 patterns and changed the rule to be 5 rules.
>    This did compile, but what is the rule of thumb here ?

The rule is to avoid trailing context other than fixed-length, in which for
a/b, either the 'a' pattern or the 'b' pattern has a fixed length. Use
of the '|' operator automatically makes the pattern variable length, so in
this case '[Ff]oot' is preferred to '(F|f)oot'.

> 4. I changed a rule that looked like this:
>    <snext8>{and}{bb}/{ROMAN}[^A-Za-z] { BEGIN...
>
>    to the next 2 rules:
>    <snext8>{and}{bb}/{ROMAN}[A-Za-z] { ECHO;}
>    <snext8>{and}{bb}/{ROMAN}         { BEGIN...
>
>    Again, I understand the using [^...] will cause a great performance loss

Actually, it doesn't cause any sort of performance loss. It's a surprising
fact about regular expressions that they always match in linear time
regardless of how complex they are.

>    but are there any specific rules about it ?

See the "Performance Considerations" section of the man page, and also
the example in MISC/fastwc/.

		Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Is backing up a big deal?
@unnumberedsec Is backing up a big deal?
@example
@verbatim
To: Adoram Rogel <adoram@hybridge.com>
Subject: Re: Flex 2.5.2 performance questions
In-reply-to: Your message of Thu, 19 Sep 96 10:16:04 EDT.
Date: Thu, 19 Sep 96 09:58:00 PDT
From: Vern Paxson <vern>

> a lot about the backing up problem.
> I believe that there lies my biggest problem, and I'll try to improve
> it.

Since you have variable trailing context, this is a bigger performance
problem. Fixing it is usually easier than fixing backing up, which in a
complicated scanner (yours seems to fit the bill) can be extremely
difficult to do correctly.

You also don't mention what flags you are using for your scanner.
-f makes a large speed difference, and -Cfe buys you nearly as much
speed but the resulting scanner is considerably smaller.

> I have an | operator in {and} and in {pats} so both of them are variable
> length.

-p should have reported this.

> Is changing one of them to fixed-length is enough ?

Yes.

> Is it possible to change the 32,000 states limit ?

Yes. I've appended instructions on how. Before you make this change,
though, you should think about whether there are ways to fundamentally
simplify your scanner - those are certainly preferable!

		Vern

To increase the 32K limit (on a machine with 32 bit integers), you increase
the magnitude of the following in flexdef.h:

#define JAMSTATE -32766 /* marks a reference to the state that always jams */
#define MAXIMUM_MNS 31999
#define BAD_SUBSCRIPT -32767
#define MAX_SHORT 32700

Adding a 0 or two after each should do the trick.
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Can I fake multi-byte character support?
@unnumberedsec Can I fake multi-byte character support?
@example
@verbatim
To: Heeman_Lee@hp.com
Subject: Re: flex - multi-byte support?
In-reply-to: Your message of Thu, 03 Oct 1996 17:24:04 PDT.
Date: Fri, 04 Oct 1996 11:42:18 PDT
From: Vern Paxson <vern>

> I assume as long as my *.l file defines the
> range of expected character code values (in octal format), flex will
> scan the file and read multi-byte characters correctly. But I have no
> confidence in this assumption.

Your lack of confidence is justified - this won't work.

Flex has in it a widespread assumption that the input is processed
one byte at a time. Fixing this is on the to-do list, but is involved,
so it won't happen any time soon. In the interim, the best I can suggest
(unless you want to try fixing it yourself) is to write your rules in
terms of pairs of bytes, using definitions in the first section:

	X	\xfe\xc2
	...
	%%
	foo{X}bar	found_foo_fe_c2_bar();

etc. Definitely a pain - sorry about that.

By the way, the email address you used for me is ancient, indicating you
have a very old version of flex. You can get the most recent, 2.5.4, from
ftp.ee.lbl.gov.

		Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node deleteme01
@unnumberedsec deleteme01
@example
@verbatim
To: moleary@primus.com
Subject: Re: Flex / Unicode compatibility question
In-reply-to: Your message of Tue, 22 Oct 1996 10:15:42 PDT.
Date: Tue, 22 Oct 1996 11:06:13 PDT
From: Vern Paxson <vern>

Unfortunately flex at the moment has a widespread assumption within it
that characters are processed 8 bits at a time. I don't see any easy
fix for this (other than writing your rules in terms of double characters -
a pain). I also don't know of a wider lex, though you might try surfing
the Plan 9 stuff because I know it's a Unicode system, and also the PCCT
toolkit (try searching say Alta Vista for "Purdue Compiler Construction
Toolkit").

Fixing flex to handle wider characters is on the long-term to-do list.
But since flex is a strictly spare-time project these days, this probably
won't happen for quite a while, unless someone else does it first.

		Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Can you discuss some flex internals?
@unnumberedsec Can you discuss some flex internals?
@example
@verbatim
To: Johan Linde <jl@theophys.kth.se>
Subject: Re: translation of flex
In-reply-to: Your message of Sun, 10 Nov 1996 09:16:36 PST.
Date: Mon, 11 Nov 1996 10:33:50 PST
From: Vern Paxson <vern>

> I'm working for the Swedish team translating GNU program, and I'm currently
> working with flex. I have a few questions about some of the messages which
> I hope you can answer.

All of the things you're wondering about, by the way, concerning flex
internals - probably the only person who understands what they mean in
English is me! So I wouldn't worry too much about getting them right.
That said ...

> #: main.c:545
> msgid "  %d protos created\n"
>
> Does proto mean prototype?

Yes - prototypes of state compression tables.

> #: main.c:539
> msgid "  %d/%d (peak %d) template nxt-chk entries created\n"
>
> Here I'm mainly puzzled by 'nxt-chk'. I guess it means 'next-check'. (?)
> However, 'template next-check entries' doesn't make much sense to me. To be
> able to find a good translation I need to know a little bit more about it.

There is a scheme in the Aho/Sethi/Ullman compiler book for compressing
scanner tables. It involves creating two pairs of tables. The first has
"base" and "default" entries, the second has "next" and "check" entries.
The "base" entry is indexed by the current state and yields an index into
the next/check table. The "default" entry gives what to do if the state
transition isn't found in next/check. The "next" entry gives the next
state to enter, but only if the "check" entry verifies that this entry is
correct for the current state. Flex creates templates of series of
next/check entries and then encodes differences from these templates as a
way to compress the tables.

> #: main.c:533
> msgid "  %d/%d base-def entries created\n"
>
> The same problem here for 'base-def'.

See above.

		Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unput() messes up yy_at_bol
@unnumberedsec unput() messes up yy_at_bol
@example
@verbatim
To: Xinying Li <xli@npac.syr.edu>
Subject: Re: FLEX ?
In-reply-to: Your message of Wed, 13 Nov 1996 17:28:38 PST.
Date: Wed, 13 Nov 1996 19:51:54 PST
From: Vern Paxson <vern>

> "unput()" them to input flow, question occurs. If I do this after I scan
> a carriage, the variable "YY_CURRENT_BUFFER->yy_at_bol" is changed. That
> means the carriage flag has gone.

You can control this by calling yy_set_bol(). It's described in the manual.

> And if in pre-reading it goes to the end of file, is anything done
> to control the end of curren buffer and end of file?

No, there's no way to put back an end-of-file.

> By the way I am using flex 2.5.2 and using the "-l".

The latest release is 2.5.4, by the way. It fixes some bugs in 2.5.2 and
2.5.3. You can get it from ftp.ee.lbl.gov.

		Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node The | operator is not doing what I want
@unnumberedsec The | operator is not doing what I want
@example
@verbatim
To: Alain.ISSARD@st.com
Subject: Re: Start condition with FLEX
In-reply-to: Your message of Mon, 18 Nov 1996 09:45:02 PST.
Date: Mon, 18 Nov 1996 10:41:34 PST
From: Vern Paxson <vern>

> I am not able to use the start condition scope and to use the | (OR) with
> rules having start conditions.

The problem is that if you use '|' as a regular expression operator, for
example "a|b" meaning "match either 'a' or 'b'", then it must *not* have
any blanks around it. If you instead want the special '|' *action* (which
from your scanner appears to be the case), which is a way of giving two
different rules the same action:

	foo	|
	bar	matched_foo_or_bar();

then '|' *must* be separated from the first rule by whitespace and *must*
be followed by a new line. You *cannot* write it as:

	foo | bar	matched_foo_or_bar();

even though you might think you could because yacc supports this syntax.
The reason for this unfortunate incompatibility is historical, but it's
unlikely to be changed.

Your problems with start condition scope are simply due to syntax errors
from your use of '|' later confusing flex.

Let me know if you still have problems.

		Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Why can't flex understand this variable trailing context pattern?
@unnumberedsec Why can't flex understand this variable trailing context pattern?
@example
@verbatim
To: Gregory Margo <gmargo@newton.vip.best.com>
Subject: Re: flex-2.5.3 bug report
In-reply-to: Your message of Sat, 23 Nov 1996 16:50:09 PST.
Date: Sat, 23 Nov 1996 17:07:32 PST
From: Vern Paxson <vern>

> Enclosed is a lex file that "real" lex will process, but I cannot get
> flex to process it. Could you try it and maybe point me in the right direction?

Your problem is that some of the definitions in the scanner use the '/'
trailing context operator, and have it enclosed in ()'s. Flex does not
allow this operator to be enclosed in ()'s because doing so allows undefined
regular expressions such as "(a/b)+". So the solution is to remove the
parentheses. Note that you must also be building the scanner with the -l
option for AT&T lex compatibility. Without this option, flex automatically
encloses the definitions in parentheses.

		Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node The ^ operator isn't working
@unnumberedsec The ^ operator isn't working
@example
@verbatim
To: Thomas Hadig <hadig@toots.physik.rwth-aachen.de>
Subject: Re: Flex Bug ?
In-reply-to: Your message of Tue, 26 Nov 1996 14:35:01 PST.
Date: Tue, 26 Nov 1996 11:15:05 PST
From: Vern Paxson <vern>

> In my lexer code, i have the line :
> ^\*.*          { }
>
> Thus all lines starting with an astrix (*) are comment lines.
> This does not work !

I can't get this problem to reproduce - it works fine for me. Note
though that if what you have is slightly different:

	COMMENT	^\*.*
	%%
	{COMMENT}	{ }

then it won't work, because flex pushes back macro definitions enclosed
in ()'s, so the rule becomes

	(^\*.*)		{ }

and now that the '^' operator is not at the immediate beginning of the
line, it's interpreted as just a regular character. You can avoid this
behavior by using the "-l" lex-compatibility flag, or "%option lex-compat".

		Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Trailing context is getting confused with trailing optional patterns
@unnumberedsec Trailing context is getting confused with trailing optional patterns
@example
@verbatim
To: Adoram Rogel <adoram@hybridge.com>
Subject: Re: Flex 2.5.4 BOF ???
In-reply-to: Your message of Tue, 26 Nov 1996 16:10:41 PST.
Date: Wed, 27 Nov 1996 10:56:25 PST
From: Vern Paxson <vern>

>     Organization(s)?/[a-z]
>
> This matched "Organizations" (looking in debug mode, the trailing s
> was matched with trailing context instead of the optional (s) in the
> end of the word.

That should only happen with lex. Flex can properly match this pattern.
(That might be what you're saying, I'm just not sure.)

> Is there a way to avoid this dangerous trailing context problem ?

Unfortunately, there's no easy way. On the other hand, I don't see why
it should be a problem. Lex's matching is clearly wrong, and I'd hope
that usually the intent remains the same as expressed with the pattern,
so flex's matching will be correct.

		Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Is flex GNU or not?
@unnumberedsec Is flex GNU or not?
@example
@verbatim
To: Cameron MacKinnon <mackin@interlog.com>
Subject: Re: Flex documentation bug
In-reply-to: Your message of Mon, 02 Dec 1996 00:07:08 PST.
Date: Sun, 01 Dec 1996 22:29:39 PST
From: Vern Paxson <vern>

> I'm not sure how or where to submit bug reports (documentation or
> otherwise) for the GNU project stuff ...

Well, strictly speaking flex isn't part of the GNU project. They just
distribute it because no one's written a decent GPL'd lex replacement.
So you should send bugs directly to me. Those sent to the GNU folks
sometimes find their way to me, but some may drop between the cracks.

> In GNU Info, under the section 'Start Conditions', and also in the man
> page (mine's dated April '95) is a nice little snippet showing how to
> parse C quoted strings into a buffer, defined to be MAX_STR_CONST in
> size. Unfortunately, no overflow checking is ever done ...

This is already mentioned in the manual:

    Finally, here's an example of how to match C-style quoted
    strings using exclusive start conditions, including expanded
    escape sequences (but not including checking for a string
    that's too long):

The reason for not doing the overflow checking is that it will needlessly
clutter up an example whose main purpose is just to demonstrate how to
use flex.

The latest release is 2.5.4, by the way, available from ftp.ee.lbl.gov.

		Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node ERASEME53
@unnumberedsec ERASEME53
@example
@verbatim
To: tsv@cs.UManitoba.CA
Subject: Re: Flex (reg)..
In-reply-to: Your message of Thu, 06 Mar 1997 23:50:16 PST.
Date: Thu, 06 Mar 1997 15:54:19 PST
From: Vern Paxson <vern>

> [:alpha:] ([:alnum:] | \\_)*

If your rule really has embedded blanks as shown above, then it won't
work, as the first blank delimits the rule from the action. (It wouldn't
even compile ...) You need instead:

[:alpha:]([:alnum:]|\\_)*

and that should work fine - there's no restriction on what can go inside
of ()'s except for the trailing context operator, '/'.

		Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node I need to scan if-then-else blocks and while loops
@unnumberedsec I need to scan if-then-else blocks and while loops
@example
@verbatim
To: "Mike Stolnicki" <mstolnic@ford.com>
Subject: Re: FLEX help
In-reply-to: Your message of Fri, 30 May 1997 13:33:27 PDT.
Date: Fri, 30 May 1997 10:46:35 PDT
From: Vern Paxson <vern>

> We'd like to add "if-then-else", "while", and "for" statements to our
> language ...
> We've investigated many possible solutions. The one solution that seems
> the most reasonable involves knowing the position of a TOKEN in yyin.

I strongly advise you to instead build a parse tree (abstract syntax tree)
and loop over that instead. You'll find this has major benefits in keeping
your interpreter simple and extensible.

That said, the functionality you mention for get_position and set_position
have been on the to-do list for a while. As flex is a purely spare-time
project for me, no guarantees when this will be added (in particular, it
for sure won't be for many months to come).

		Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node ERASEME55
@unnumberedsec ERASEME55
@example
@verbatim
To: Colin Paul Adams <colin@colina.demon.co.uk>
Subject: Re: Flex C++ classes and Bison
In-reply-to: Your message of 09 Aug 1997 17:11:41 PDT.
Date: Fri, 15 Aug 1997 10:48:19 PDT
From: Vern Paxson <vern>

> #define YY_DECL   int yylex (YYSTYPE *lvalp, struct parser_control
> *parm)
>
> I have been trying to get this to work as a C++ scanner, but it does
> not appear to be possible (warning that it matches no declarations in
> yyFlexLexer, or something like that).
>
> Is this supposed to be possible, or is it being worked on (I DID
> notice the comment that scanner classes are still experimental, so I'm
> not too hopeful)?

What you need to do is derive a subclass from yyFlexLexer that provides
the above yylex() method, squirrels away lvalp and parm into member
variables, and then invokes yyFlexLexer::yylex() to do the regular scanning.

		Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node ERASEME56
@unnumberedsec ERASEME56
@example
@verbatim
To: Mikael.Latvala@lmf.ericsson.se
Subject: Re: Possible mistake in Flex v2.5 document
In-reply-to: Your message of Fri, 05 Sep 1997 16:07:24 PDT.
Date: Fri, 05 Sep 1997 10:01:54 PDT
From: Vern Paxson <vern>

> In that example you show how to count comment lines when using
> C style /* ... */ comments. My question is, shouldn't you take into
> account a scenario where end of a comment marker occurs inside
> character or string literals?

The scanner certainly needs to also scan character and string literals.
However it does that (there's an example in the man page for strings), the
lexer will recognize the beginning of the literal before it runs across the
embedded "/*". Consequently, it will finish scanning the literal before it
even considers the possibility of matching "/*".

Example:

	'([^']*|{ESCAPE_SEQUENCE})'

will match all the text between the ''s (inclusive). So the lexer
considers this as a token beginning at the first ', and doesn't even
attempt to match other tokens inside it.

I think this subtlety is not worth putting in the manual, as I suspect
it would confuse more people than it would enlighten.

		Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node ERASEME57
@unnumberedsec ERASEME57
@example
@verbatim
To: "Marty Leisner" <leisner@sdsp.mc.xerox.com>
Subject: Re: flex limitations
In-reply-to: Your message of Sat, 06 Sep 1997 11:27:21 PDT.
Date: Mon, 08 Sep 1997 11:38:08 PDT
From: Vern Paxson <vern>

> %%
> [a-zA-Z]+       /* skip a line */
>                {  printf("got %s\n", yytext); }
> %%

What version of flex are you using? If I feed this to 2.5.4, it complains:

	"bug.l", line 5: EOF encountered inside an action
	"bug.l", line 5: unrecognized rule
	"bug.l", line 5: fatal parse error

Not the world's greatest error message, but it manages to flag the problem.

(With the introduction of start condition scopes, flex can't accommodate
an action on a separate line, since it's ambiguous with an indented rule.)

You can get 2.5.4 from ftp.ee.lbl.gov.

		Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Is there a repository for flex scanners?
@unnumberedsec Is there a repository for flex scanners?

Not that we know of. You might try asking on comp.compilers.

@c TODO: Evaluate this faq.
@node How can I conditionally compile or preprocess my flex input file?
@unnumberedsec How can I conditionally compile or preprocess my flex input file?


Flex doesn't have a preprocessor like C does. You might try using m4, or the C
preprocessor plus a sed script to clean up the result.


@c TODO: Evaluate this faq.
@node Where can I find grammars for lex and yacc?
@unnumberedsec Where can I find grammars for lex and yacc?

In the sources for flex and bison.

@c TODO: Evaluate this faq.
@node I get an end-of-buffer message for each character scanned.
@unnumberedsec I get an end-of-buffer message for each character scanned.

This will happen if your LexerInput() function returns only one character
at a time, which can happen either if your scanner is "interactive", or
if the streams library on your platform always returns 1 for yyin->gcount().

Solution: override LexerInput() with a version that returns whole buffers.

@c TODO: Evaluate this faq.
@node unnamed-faq-62
@unnumberedsec unnamed-faq-62
@example
@verbatim
To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE
Subject: Re: Flex maximums
In-reply-to: Your message of Mon, 17 Nov 1997 17:16:06 PST.
Date: Mon, 17 Nov 1997 17:16:15 PST
From: Vern Paxson <vern>

> I took a quick look into the flex-sources and altered some #defines in
> flexdefs.h:
>
> 	#define INITIAL_MNS 64000
> 	#define MNS_INCREMENT 1024000
> 	#define MAXIMUM_MNS 64000

The things to fix are to add a couple of zeroes to:

#define JAMSTATE -32766 /* marks a reference to the state that always jams */
#define MAXIMUM_MNS 31999
#define BAD_SUBSCRIPT -32767
#define MAX_SHORT 32700

and, if you get complaints about too many rules, make the following change too:

	#define YY_TRAILING_MASK 0x200000
	#define YY_TRAILING_HEAD_MASK 0x400000

- Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-63
@unnumberedsec unnamed-faq-63
@example
@verbatim
To: jimmey@lexis-nexis.com (Jimmey Todd)
Subject: Re: FLEX question regarding istream vs ifstream
In-reply-to: Your message of Mon, 08 Dec 1997 15:54:15 PST.
Date: Mon, 15 Dec 1997 13:21:35 PST
From: Vern Paxson <vern>

>         stdin_handle = YY_CURRENT_BUFFER;
>         ifstream fin( "aFile" );
>         yy_switch_to_buffer( yy_create_buffer( fin, YY_BUF_SIZE ) );
>
> What I'm wanting to do, is pass the contents of a file thru one set
> of rules and then pass stdin thru another set... It works great if, I
> don't use the C++ classes. But since everything else that I'm doing is
> in C++, I thought I'd be consistent.
>
> The problem is that 'yy_create_buffer' is expecting an istream* as it's
> first argument (as stated in the man page). However, fin is a ifstream
> object. Any ideas on what I might be doing wrong? Any help would be
> appreciated. Thanks!!

You need to pass &fin, to turn it into an ifstream* instead of an ifstream.
Then its type will be compatible with the expected istream*, because ifstream
is derived from istream.

		Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-64
@unnumberedsec unnamed-faq-64
@example
@verbatim
To: Enda Fadian <fadiane@piercom.ie>
Subject: Re: Question related to Flex man page?
In-reply-to: Your message of Tue, 16 Dec 1997 15:17:34 PST.
Date: Tue, 16 Dec 1997 14:17:09 PST
From: Vern Paxson <vern>

> Can you explain to me what is ment by a long-jump in relation to flex?

Using the longjmp() function while inside yylex() or a routine called by it.

> what is the flex activation frame.

Just yylex()'s stack frame.

> As far as I can see yyrestart will bring me back to the sart of the input
> file and using flex++ isnot really an option!

No, yyrestart() doesn't imply a rewind, even though its name might sound
like it does. It tells the scanner to flush its internal buffers and
start reading from the given file at its present location.

		Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-65
@unnumberedsec unnamed-faq-65
@example
@verbatim
To: hassan@larc.info.uqam.ca (Hassan Alaoui)
Subject: Re: Need urgent Help
In-reply-to: Your message of Sat, 20 Dec 1997 19:38:19 PST.
Date: Sun, 21 Dec 1997 21:30:46 PST
From: Vern Paxson <vern>

> /usr/lib/yaccpar: In function `int yyparse()':
> /usr/lib/yaccpar:184: warning: implicit declaration of function `int yylex(...)'
>
> ld: Undefined symbol
>    _yylex
>    _yyparse
>    _yyin

This is a known problem with Solaris C++ (and/or Solaris yacc). I believe
the fix is to explicitly insert some 'extern "C"' statements for the
corresponding routines/symbols.

		Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-66
@unnumberedsec unnamed-faq-66
@example
@verbatim
To: mc0307@mclink.it
Cc: gnu@prep.ai.mit.edu
Subject: Re: [mc0307@mclink.it: Help request]
In-reply-to: Your message of Fri, 12 Dec 1997 17:57:29 PST.
Date: Sun, 21 Dec 1997 22:33:37 PST
From: Vern Paxson <vern>

> This is my definition for float and integer types:
> . . .
> NZD          [1-9]
> ...
> I've tested my program on other lex version (on UNIX Sun Solaris an HP
> UNIX) and it work well, so I think that my definitions are correct.
> There are any differences between Lex and Flex?

There are indeed differences, as discussed in the man page. The one
you are probably running into is that when flex expands a name definition,
it puts parentheses around the expansion, while lex does not. There's
an example in the man page of how this can lead to different matching.
Flex's behavior complies with the POSIX standard (or at least with the
last POSIX draft I saw).

		Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-67
@unnumberedsec unnamed-faq-67
@example
@verbatim
To: hassan@larc.info.uqam.ca (Hassan Alaoui)
Subject: Re: Thanks
In-reply-to: Your message of Mon, 22 Dec 1997 16:06:35 PST.
Date: Mon, 22 Dec 1997 14:35:05 PST
From: Vern Paxson <vern>

> Thank you very much for your help. I compile and link well with C++ while
> declaring 'yylex ...' extern, But a little problem remains. I get a
> segmentation default when executing ( I linked with lfl library) while it
> works well when using LEX instead of flex. Do you have some ideas about the
> reason for this ?

The one possible reason for this that comes to mind is if you've defined
yytext as "extern char yytext[]" (which is what lex uses) instead of
"extern char *yytext" (which is what flex uses). If it's not that, then
I'm afraid I don't know what the problem might be.

		Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-68
@unnumberedsec unnamed-faq-68
@example
@verbatim
To: "Bart Niswonger" <NISWONGR@almaden.ibm.com>
Subject: Re: flex 2.5: c++ scanners & start conditions
In-reply-to: Your message of Tue, 06 Jan 1998 10:34:21 PST.
Date: Tue, 06 Jan 1998 19:19:30 PST
From: Vern Paxson <vern>

> The problem is that when I do this (using %option c++) start
> conditions seem to not apply.

The BEGIN macro modifies the yy_start variable. For C scanners, this
is a static with scope visible through the whole file. For C++ scanners,
it's a member variable, so it only has visible scope within a member
function. Your lexbegin() routine is not a member function when you
build a C++ scanner, so it's not modifying the correct yy_start. The
diagnostic that indicates this is that you found you needed to add
a declaration of yy_start in order to get your scanner to compile when
using C++; instead, the correct fix is to make lexbegin() a member
function (by deriving from yyFlexLexer).

		Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-69
@unnumberedsec unnamed-faq-69
@example
@verbatim
To: "Boris Zinin" <boris@ippe.rssi.ru>
Subject: Re: current position in flex buffer
In-reply-to: Your message of Mon, 12 Jan 1998 18:58:23 PST.
Date: Mon, 12 Jan 1998 12:03:15 PST
From: Vern Paxson <vern>

> The problem is how to determine the current position in flex active
> buffer when a rule is matched....

You will need to keep track of this explicitly, such as by redefining
YY_USER_ACTION to count the number of characters matched.

The latest flex release, by the way, is 2.5.4, available from ftp.ee.lbl.gov.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-70
@unnumberedsec unnamed-faq-70
@example
@verbatim
To: Bik.Dhaliwal@bis.org
Subject: Re: Flex question
In-reply-to: Your message of Mon, 26 Jan 1998 13:05:35 PST.
Date: Tue, 27 Jan 1998 22:41:52 PST
From: Vern Paxson <vern>

> That requirement involves knowing
> the character position at which a particular token was matched
> in the lexer.

The way you have to do this is by explicitly keeping track of where
you are in the file, by counting the number of characters scanned
for each token (available in yyleng). It may prove convenient to
do this by redefining YY_USER_ACTION, as described in the manual.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-71
@unnumberedsec unnamed-faq-71
@example
@verbatim
To: Vladimir Alexiev <vladimir@cs.ualberta.ca>
Subject: Re: flex: how to control start condition from parser?
In-reply-to: Your message of Mon, 26 Jan 1998 05:50:16 PST.
Date: Tue, 27 Jan 1998 22:45:37 PST
From: Vern Paxson <vern>

> It seems useful for the parser to be able to tell the lexer about such
> context dependencies, because then they don't have to be limited to
> local or sequential context.

One way to do this is to have the parser call a stub routine that's
included in the scanner's .l file, and consequently that has access to
BEGIN.
The only ugliness is that the parser can't pass in the state
it wants, because those aren't visible - but if you don't have many
such states, then using a different set of names doesn't seem like
too much of a burden.

While generating a .h file like you suggest is certainly cleaner,
flex development has come to a virtual stand-still :-(, so a workaround
like the above is much more pragmatic than waiting for a new feature.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-72
@unnumberedsec unnamed-faq-72
@example
@verbatim
To: Barbara Denny <denny@3com.com>
Subject: Re: freebsd flex bug?
In-reply-to: Your message of Fri, 30 Jan 1998 12:00:43 PST.
Date: Fri, 30 Jan 1998 12:42:32 PST
From: Vern Paxson <vern>

> lex.yy.c:1996: parse error before `='

This is the key, identifying this error. (It may help to pinpoint
it by using flex -L, so it doesn't generate #line directives in its
output.) I will bet you heavy money that you have a start condition
name that is also a variable name, or something like that; flex spits
out #define's for each start condition name, mapping them to a number,
so you can wind up with:

        %x foo
        %%
                ...
        %%
        void bar()
                {
                int foo = 3;
                }

and the penultimate will turn into "int 1 = 3" after C preprocessing,
since flex will put "#define foo 1" in the generated scanner.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-73
@unnumberedsec unnamed-faq-73
@example
@verbatim
To: Maurice Petrie <mpetrie@infoscigroup.com>
Subject: Re: Lost flex .l file
In-reply-to: Your message of Mon, 02 Feb 1998 14:10:01 PST.
Date: Mon, 02 Feb 1998 11:15:12 PST
From: Vern Paxson <vern>

> I am curious as to
> whether there is a simple way to backtrack from the generated source to
> reproduce the lost list of tokens we are searching on.

In theory, it's straight-forward to go from the DFA representation
back to a regular-expression representation - the two are isomorphic.
In practice, a huge headache, because you have to unpack all the tables
back into a single DFA representation, and then write a program to munch
on that and translate it into an RE.

Sorry for the less-than-happy news ...

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-74
@unnumberedsec unnamed-faq-74
@example
@verbatim
To: jimmey@lexis-nexis.com (Jimmey Todd)
Subject: Re: Flex performance question
In-reply-to: Your message of Thu, 19 Feb 1998 11:01:17 PST.
Date: Thu, 19 Feb 1998 08:48:51 PST
From: Vern Paxson <vern>

> What I have found, is that the smaller the data chunk, the faster the
> program executes. This is the opposite of what I expected. Should this be
> happening this way?

This is exactly what will happen if your input file has embedded NULs.
From the man page:

        A final note: flex is slow when matching NUL's, particularly
        when a token contains multiple NUL's. It's best to write
        rules which match short amounts of text if it's anticipated
        that the text will often include NUL's.

So that's the first thing to look for.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-75
@unnumberedsec unnamed-faq-75
@example
@verbatim
To: jimmey@lexis-nexis.com (Jimmey Todd)
Subject: Re: Flex performance question
In-reply-to: Your message of Thu, 19 Feb 1998 11:01:17 PST.
Date: Thu, 19 Feb 1998 15:42:25 PST
From: Vern Paxson <vern>

So there are several problems.

First, to go fast, you want to match as much text as possible, which
your scanners don't in the case that what they're scanning is *not*
a <RN> tag. So you want a rule like:

        [^<]+

Second, C++ scanners are particularly slow if they're interactive,
which they are by default. Using -B speeds it up by a factor of 3-4
on my workstation.

Third, C++ scanners that use the istream interface are slow, because
of how poorly implemented istream's are. I built two versions of
the following scanner:

        %%
        .*\n
        .*
        %%

and the C version inhales a 2.5MB file on my workstation in 0.8 seconds.
The C++ istream version, using -B, takes 3.8 seconds.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-76
@unnumberedsec unnamed-faq-76
@example
@verbatim
To: "Frescatore, David (CRD, TAD)" <frescatore@exc01crdge.crd.ge.com>
Subject: Re: FLEX 2.5 & THE YEAR 2000
In-reply-to: Your message of Wed, 03 Jun 1998 11:26:22 PDT.
Date: Wed, 03 Jun 1998 10:22:26 PDT
From: Vern Paxson <vern>

> I am researching the Y2K problem with General Electric R&D
> and need to know if there are any known issues concerning
> the above mentioned software and Y2K regardless of version.

There shouldn't be, all it ever does with the date is ask the system
for it and then print it out.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-77
@unnumberedsec unnamed-faq-77
@example
@verbatim
To: "Hans Dermot Doran" <htd@ibhdoran.com>
Subject: Re: flex problem
In-reply-to: Your message of Wed, 15 Jul 1998 21:30:13 PDT.
Date: Tue, 21 Jul 1998 14:23:34 PDT
From: Vern Paxson <vern>

> To overcome this, I gets() the stdin into a string and lex the string. The
> string is lexed OK except that the end of string isn't lexed properly
> (yy_scan_string()), that is the lexer dosn't recognise the end of string.

Flex doesn't contain mechanisms for recognizing buffer endpoints. But if
you use fgets instead (which you should anyway, to protect against buffer
overflows), then the final \n will be preserved in the string, and you can
scan that in order to find the end of the string.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-78
@unnumberedsec unnamed-faq-78
@example
@verbatim
To: soumen@almaden.ibm.com
Subject: Re: Flex++ 2.5.3 instance member vs. static member
In-reply-to: Your message of Mon, 27 Jul 1998 02:10:04 PDT.
Date: Tue, 28 Jul 1998 01:10:34 PDT
From: Vern Paxson <vern>

> %{
> int mylineno = 0;
> %}
> ws [ \t]+
> alpha [A-Za-z]
> dig [0-9]
> %%
>
> Now you'd expect mylineno to be a member of each instance of class
> yyFlexLexer, but is this the case? A look at the lex.yy.cc file seems to
> indicate otherwise; unless I am missing something the declaration of
> mylineno seems to be outside any class scope.
>
> How will this work if I want to run a multi-threaded application with each
> thread creating a FlexLexer instance?

Derive your own subclass and make mylineno a member variable of it.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-79
@unnumberedsec unnamed-faq-79
@example
@verbatim
To: Adoram Rogel <adoram@hybridge.com>
Subject: Re: More than 32K states change hangs
In-reply-to: Your message of Tue, 04 Aug 1998 16:55:39 PDT.
Date: Tue, 04 Aug 1998 22:28:45 PDT
From: Vern Paxson <vern>

> Vern Paxson,
>
> I followed your advice, posted on Usenet bu you, and emailed to me
> personally by you, on how to overcome the 32K states limit. I'm running
> on Linux machines.
> I took the full source of version 2.5.4 and did the following changes in
> flexdef.h:
> #define JAMSTATE -327660
> #define MAXIMUM_MNS 319990
> #define BAD_SUBSCRIPT -327670
> #define MAX_SHORT 327000
>
> and compiled.
> All looked fine, including check and bigcheck, so I installed.

Hmmm, you shouldn't increase MAX_SHORT, though looking through my email
archives I see that I did indeed recommend doing so. Try setting it back
to 32700; that should suffice that you no longer need -Ca. If it still
hangs, then the interesting question is - where?

> Compiling the same hanged program with a out-of-the-box (RedHat 4.2
> distribution of Linux)
> flex 2.5.4 binary works.

Since Linux comes with source code, you should diff it against what
you have to see what problems they missed.

> Should I always compile with the -Ca option now ? even short and simple
> filters ?

No, definitely not. It's meant to be for those situations where you
absolutely must squeeze every last cycle out of your scanner.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-80
@unnumberedsec unnamed-faq-80
@example
@verbatim
To: "Schmackpfeffer, Craig" <Craig.Schmackpfeffer@usa.xerox.com>
Subject: Re: flex output for static code portion
In-reply-to: Your message of Tue, 11 Aug 1998 11:55:30 PDT.
Date: Mon, 17 Aug 1998 23:57:42 PDT
From: Vern Paxson <vern>

> I would like to use flex under the hood to generate a binary file
> containing the data structures that control the parse.

This has been on the wish-list for a long time. In principle it's
straight-forward - you redirect mkdata() et al's I/O to another file,
and modify the skeleton to have a start-up function that slurps these
into dynamic arrays. The concerns are (1) the scanner generation code
is hairy and full of corner cases, so it's easy to get surprised when
going down this path :-( ; and (2) being careful about buffering so
that when the tables change you make sure the scanner starts in the
correct state and reading at the right point in the input file.

> I was wondering if you know of anyone who has used flex in this way.

I don't - but it seems like a reasonable project to undertake (unlike
numerous other flex tweaks :-).

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-81
@unnumberedsec unnamed-faq-81
@example
@verbatim
Received: from 131.173.17.11 (131.173.17.11 [131.173.17.11])
        by ee.lbl.gov (8.9.1/8.9.1) with ESMTP id AAA03838
        for <vern@ee.lbl.gov>; Thu, 20 Aug 1998 00:47:57 -0700 (PDT)
Received: from hal.cl-ki.uni-osnabrueck.de (hal.cl-ki.Uni-Osnabrueck.DE [131.173.141.2])
        by deimos.rz.uni-osnabrueck.de (8.8.7/8.8.8) with ESMTP id JAA34694
        for <vern@ee.lbl.gov>; Thu, 20 Aug 1998 09:47:55 +0200
Received: (from georg@localhost) by hal.cl-ki.uni-osnabrueck.de (8.6.12/8.6.12) id JAA34834 for vern@ee.lbl.gov; Thu, 20 Aug 1998 09:47:54 +0200
From: Georg Rehm <georg@hal.cl-ki.uni-osnabrueck.de>
Message-Id: <199808200747.JAA34834@hal.cl-ki.uni-osnabrueck.de>
Subject: "flex scanner push-back overflow"
To: vern@ee.lbl.gov
Date: Thu, 20 Aug 1998 09:47:54 +0200 (MEST)
Reply-To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE
X-NoJunk: Do NOT send commercial mail, spam or ads to this address!
X-URL: http://www.cl-ki.uni-osnabrueck.de/~georg/
X-Mailer: ELM [version 2.4ME+ PL28 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Hi Vern,

Yesterday, I encountered a strange problem: I use the macro processor m4
to include some lengthy lists into a .l file. Following is a flex macro
definition that causes some serious pain in my neck:

AUTHOR ("A. Boucard / L. Boucard"|"A. Dastarac / M. Levent"|"A.Boucaud / L.Boucaud"|"Abderrahim Lamchichi"|"Achmat Dangor"|"Adeline Toullier"|"Adewale Maja-Pearce"|"Ahmed Ziri"|"Akram Ellyas"|"Alain Bihr"|"Alain Gresh"|"Alain Guillemoles"|"Alain Joxe"|"Alain Morice"|"Alain Renon"|"Alain Zecchini"|"Albert Memmi"|"Alberto Manguel"|"Alex De Waal"|"Alfonso Artico"| [...])

The complete list contains about 10kB. When I try to "flex" this file
(on a Solaris 2.6 machine, using a modified flex 2.5.4 (I only increased
some of the predefined values in flexdefs.h) I get the error:

myflex/flex -8 sentag.tmp.l
flex scanner push-back overflow

When I remove the slashes in the macro definition everything works fine.
As I understand it, the double quotes escape the slash-character so it
really means "/" and not "trailing context". Furthermore, I tried to
escape the slashes with backslashes, but with no use, the same error message
appeared when flexing the code.

Do you have an idea what's going on here?

Greetings from Germany,
        Georg
--
Georg Rehm georg@cl-ki.uni-osnabrueck.de
Institute for Semantic Information Processing, University of Osnabrueck, FRG
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-82
@unnumberedsec unnamed-faq-82
@example
@verbatim
To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE
Subject: Re: "flex scanner push-back overflow"
In-reply-to: Your message of Thu, 20 Aug 1998 09:47:54 PDT.
Date: Thu, 20 Aug 1998 07:05:35 PDT
From: Vern Paxson <vern>

> myflex/flex -8 sentag.tmp.l
> flex scanner push-back overflow

Flex itself uses a flex scanner. That scanner is running out of buffer
space when it tries to unput() the humongous macro you've defined. When
you remove the '/'s, you make it small enough so that it fits in the buffer;
removing spaces would do the same thing.

The fix is to either rethink how come you're using such a big macro and
perhaps there's another/better way to do it; or to rebuild flex's own
scan.c with a larger value for

        #define YY_BUF_SIZE 16384

- Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-83
@unnumberedsec unnamed-faq-83
@example
@verbatim
To: Jan Kort <jan@research.techforce.nl>
Subject: Re: Flex
In-reply-to: Your message of Fri, 04 Sep 1998 12:18:43 +0200.
Date: Sat, 05 Sep 1998 00:59:49 PDT
From: Vern Paxson <vern>

> %%
>
> "TEST1\n" { fprintf(stderr, "TEST1\n"); yyless(5); }
> ^\n { fprintf(stderr, "empty line\n"); }
> . { }
> \n { fprintf(stderr, "new line\n"); }
>
> %%
> -- input ---------------------------------------
> TEST1
> -- output --------------------------------------
> TEST1
> empty line
> ------------------------------------------------

IMHO, it's not clear whether or not this is in fact a bug. It depends
on whether you view yyless() as backing up in the input stream, or as
pushing new characters onto the beginning of the input stream.
Flex
interprets it as the latter (for implementation convenience, I'll admit),
and so considers the newline as in fact matching at the beginning of a
line, as after all the last token scanned an entire line and so the
scanner is now at the beginning of a new line.

I agree that this is counter-intuitive for yyless(), given its
functional description (it's less so for unput(), depending on whether
you're unput()'ing new text or scanned text). But I don't plan to
change it any time soon, as it's a pain to do so. Consequently,
you do indeed need to use yy_set_bol() and YY_AT_BOL() to tweak
your scanner into the behavior you desire.

Sorry for the less-than-completely-satisfactory answer.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-84
@unnumberedsec unnamed-faq-84
@example
@verbatim
To: Patrick Krusenotto <krusenot@mac-info-link.de>
Subject: Re: Problems with restarting flex-2.5.2-generated scanner
In-reply-to: Your message of Thu, 24 Sep 1998 10:14:07 PDT.
Date: Thu, 24 Sep 1998 23:28:43 PDT
From: Vern Paxson <vern>

> I am using flex-2.5.2 and bison 1.25 for Solaris and I am desperately
> trying to make my scanner restart with a new file after my parser stops
> with a parse error. When my compiler restarts, the parser always
> receives the token after the token (in the old file!) that caused the
> parser error.

I suspect the problem is that your parser has read ahead in order
to attempt to resolve an ambiguity, and when it's restarted it picks
up with that token rather than reading a fresh one. If you're using
yacc, then the special "error" production can sometimes be used to
consume tokens in an attempt to get the parser into a consistent state.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-85
@unnumberedsec unnamed-faq-85
@example
@verbatim
To: Henric Jungheim <junghelh@pe-nelson.com>
Subject: Re: flex 2.5.4a
In-reply-to: Your message of Tue, 27 Oct 1998 16:41:42 PST.
Date: Tue, 27 Oct 1998 16:50:14 PST
From: Vern Paxson <vern>

> This brings up a feature request: How about a command line
> option to specify the filename when reading from stdin? That way one
> doesn't need to create a temporary file in order to get the "#line"
> directives to make sense.

Use -o combined with -t (per the man page description of -o).

> P.S., Is there any simple way to use non-blocking IO to parse multiple
> streams?

Simple, no.

One approach might be to return a magic character on EWOULDBLOCK and
have a rule

        .*<magic-character>     // put back .*, eat magic character

This is off the top of my head, not sure it'll work.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-86
@unnumberedsec unnamed-faq-86
@example
@verbatim
To: "Repko, Billy D" <billy.d.repko@intel.com>
Subject: Re: Compiling scanners
In-reply-to: Your message of Wed, 13 Jan 1999 10:52:47 PST.
Date: Thu, 14 Jan 1999 00:25:30 PST
From: Vern Paxson <vern>

> It appears that maybe it cannot find the lfl library.

The Makefile in the distribution builds it, so you should have it.
It's exceedingly trivial, just a main() that calls yylex() and
a yywrap() that always returns 1.

> %%
>         \n      ++num_lines; ++num_chars;
>         .       ++num_chars;

You can't indent your rules like this - that's where the errors are coming
from. Flex copies indented text to the output file, it's how you do things
like

        int num_lines_seen = 0;

to declare local variables.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
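
The description of @code{libfl} in unnamed-faq-86 can be sketched in
plain C.  This is an illustration under stated assumptions, not the
library source: @code{run_scanner} is a hypothetical name standing in
for the library's @code{main()}, it counts tokens only so that the loop
has an observable result, and the @code{yylex()} shown here is a
stand-in that hands out three dummy tokens so that the example links
without a generated scanner.

@example
@verbatim
/* Hypothetical stand-in for the flex-generated yylex(): returns three
   dummy nonzero tokens, then 0 to signal end of input. */
static int tokens_left = 3;

int yylex(void)
{
    return tokens_left > 0 ? tokens_left-- : 0;
}

/* The library version of yywrap(): a generated scanner calls it at end
   of input, and a nonzero result means there is no further input. */
int yywrap(void)
{
    return 1;
}

/* The library main() is essentially this loop; it is renamed here
   (an assumption of this sketch) so it can live in a larger program. */
int run_scanner(void)
{
    int count = 0;

    while (yylex() != 0)
        count++;        /* the real main() just discards the tokens */
    return count;
}
@end verbatim
@end example

A program that supplies its own @code{main()} and @code{yywrap()} along
these lines does not need to link against the library at all.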
@node unnamed-faq-87
@unnumberedsec unnamed-faq-87
@example
@verbatim
To: Erick Branderhorst <Erick.Branderhorst@asml.nl>
Subject: Re: flex input buffer
In-reply-to: Your message of Tue, 09 Feb 1999 13:53:46 PST.
Date: Tue, 09 Feb 1999 21:03:37 PST
From: Vern Paxson <vern>

> In the flex.skl file the size of the default input buffers is set. Can you
> explain why this size is set and why it is such a high number.

It's large to optimize performance when scanning large files. You can
safely make it a lot lower if needed.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-88
@unnumberedsec unnamed-faq-88
@example
@verbatim
To: "Guido Minnen" <guidomi@cogs.susx.ac.uk>
Subject: Re: Flex error message
In-reply-to: Your message of Wed, 24 Feb 1999 15:31:46 PST.
Date: Thu, 25 Feb 1999 00:11:31 PST
From: Vern Paxson <vern>

> I'm extending a larger scanner written in Flex and I keep running into
> problems. More specifically, I get the error message:
> "flex: input rules are too complicated (>= 32000 NFA states)"

Increase the definitions in flexdef.h for:

#define JAMSTATE -32766 /* marks a reference to the state that always jams */
#define MAXIMUM_MNS 31999
#define BAD_SUBSCRIPT -32767

recompile everything, and it should all work.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-90
@unnumberedsec unnamed-faq-90
@example
@verbatim
To: "Dmitriy Goldobin" <gold@ems.chel.su>
Subject: Re: FLEX trouble
In-reply-to: Your message of Mon, 31 May 1999 18:44:49 PDT.
Date: Tue, 01 Jun 1999 00:15:07 PDT
From: Vern Paxson <vern>

> I have a trouble with FLEX. Why rule "/*".*"*/" work properly,
> but rule "/*"(.|\n)*"*/" don't work ?

The second of these will have to scan the entire input stream (because
"(.|\n)*" matches an arbitrary amount of any text) in order to see if
it ends with "*/", terminating the comment. That potentially will overflow
the input buffer.

> More complex rule "/*"([^*]|(\*/[^/]))*"*/ give an error
> 'unrecognized rule'.

You can't use the '/' operator inside parentheses. It's not clear
what "(a/b)*" actually means.

> I now use workaround with state <comment>, but single-rule is
> better, i think.

Single-rule is nice but will always have the problem of either setting
restrictions on comments (like not allowing multi-line comments) and/or
running the risk of consuming the entire input stream, as noted above.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-91
@unnumberedsec unnamed-faq-91
@example
@verbatim
Received: from mc-qout4.whowhere.com (mc-qout4.whowhere.com [209.185.123.18])
        by ee.lbl.gov (8.9.3/8.9.3) with SMTP id IAA05100
        for <vern@ee.lbl.gov>; Tue, 15 Jun 1999 08:56:06 -0700 (PDT)
Received: from Unknown/Local ([?.?.?.?]) by my-deja.com; Tue Jun 15 08:55:43 1999
To: vern@ee.lbl.gov
Date: Tue, 15 Jun 1999 08:55:43 -0700
From: "Aki Niimura" <neko@my-deja.com>
Message-ID: <KNONDOHDOBGAEAAA@my-deja.com>
Mime-Version: 1.0
Cc:
X-Sent-Mail: on
Reply-To:
X-Mailer: MailCity Service
Subject: A question on flex C++ scanner
X-Sender-Ip: 12.72.207.61
Organization: My Deja Email (http://www.my-deja.com:80)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Dear Dr. Paxon,

I have been using flex for years.
It works very well on many projects.
Most case, I used it to generate a scanner on C language.
However, one project I needed to generate a scanner
on C++ lanuage.
Thanks to your enhancement, flex did
the job.

Currently, I'm working on enhancing my previous project.
I need to deal with multiple input streams (recursive
inclusion) in this scanner (C++).
I did similar thing for another scanner (C) as you
explained in your documentation.

The generated scanner (C++) has necessary methods:
- switch_to_buffer(struct yy_buffer_state *b)
- yy_create_buffer(istream *is, int sz)
- yy_delete_buffer(struct yy_buffer_state *b)

However, I couldn't figure out how to access current
buffer (yy_current_buffer).

yy_current_buffer is a protected member of yyFlexLexer.
I can't access it directly.
Then, I thought yy_create_buffer() with is = 0 might
return current stream buffer. But it seems not as far
as I checked the source. (flex 2.5.4)

I went through the Web in addition to Flex documentation.
However, it hasn't been successful, so far.

It is not my intention to bother you, but, can you
comment about how to obtain the current stream buffer?

Your response would be highly appreciated.

Best regards,
Aki Niimura

--== Sent via Deja.com http://www.deja.com/ ==--
Share what you know. Learn what you don't.
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-92
@unnumberedsec unnamed-faq-92
@example
@verbatim
To: neko@my-deja.com
Subject: Re: A question on flex C++ scanner
In-reply-to: Your message of Tue, 15 Jun 1999 08:55:43 PDT.
Date: Tue, 15 Jun 1999 09:04:24 PDT
From: Vern Paxson <vern>

> However, I couldn't figure out how to access current
> buffer (yy_current_buffer).

Derive your own subclass from yyFlexLexer.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-93
@unnumberedsec unnamed-faq-93
@example
@verbatim
To: "Stones, Darren" <Darren.Stones@nectech.co.uk>
Subject: Re: You're the man to see?
In-reply-to: Your message of Wed, 23 Jun 1999 11:10:29 PDT.
Date: Wed, 23 Jun 1999 09:01:40 PDT
From: Vern Paxson <vern>

> I hope you can help me. I am using Flex and Bison to produce an interpreted
> language. However all goes well until I try to implement an IF statement or
> a WHILE. I cannot get this to work as the parser parses all the conditions
> eg. the TRUE and FALSE conditons to check for a rule match. So I cannot
> make a decision!!

You need to use the parser to build a parse tree (= abstract syntax tree),
and when that's all done you recursively evaluate the tree, binding variables
to values at that time.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-94
@unnumberedsec unnamed-faq-94
@example
@verbatim
To: Petr Danecek <petr@ics.cas.cz>
Subject: Re: flex - question
In-reply-to: Your message of Mon, 28 Jun 1999 19:21:41 PDT.
Date: Fri, 02 Jul 1999 16:52:13 PDT
From: Vern Paxson <vern>

> file, it takes an enormous amount of time. It is funny, because the
> source code has only 12 rules!!! I think it looks like an exponencial
> growth.

Right, that's the problem - some patterns (those with a lot of
ambiguity, which yours have, because at any given time the scanner can
be in the middle of all sorts of combinations of the different
rules) blow up exponentially.

For your rules, there is an easy fix. Change the ".*" that comes after
the directory name to "[^ ]*". With that in place, the rules are no
longer nearly so ambiguous, because then once one of the directories
has been matched, no other can be matched (since they all require a
leading blank).

If that's not an acceptable solution, then you can enter a start state
to pick up the .*\n after each directory is matched.

Also note that for speed, you'll want to add a ".*" rule at the end,
otherwise input that doesn't match any of the patterns will be matched
very slowly, a character at a time.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-95
@unnumberedsec unnamed-faq-95
@example
@verbatim
To: Tielman Koekemoer <tielman@spi.co.za>
Subject: Re: Please help.
In-reply-to: Your message of Thu, 08 Jul 1999 13:20:37 PDT.
Date: Thu, 08 Jul 1999 08:20:39 PDT
From: Vern Paxson <vern>

> I was hoping you could help me with my problem.
>
> I tried compiling (gnu)flex on a Solaris 2.4 machine
> but when I ran make (after configure) I got an error.
>
> --------------------------------------------------------------
> gcc -c -I. -I. -g -O parse.c
> ./flex -t -p ./scan.l >scan.c
> sh: ./flex: not found
> *** Error code 1
> make: Fatal error: Command failed for target `scan.c'
> -------------------------------------------------------------
>
> What's strange to me is that I'm only
> trying to install flex now. I then edited the Makefile to
> and changed where it says "FLEX = flex" to "FLEX = lex"
> ( lex: the native Solaris one ) but then it complains about
> the "-p" option. Is there any way I can compile flex without
> using flex or lex?
>
> Thanks so much for your time.

You managed to step on the bootstrap sequence, which first copies
initscan.c to scan.c in order to build flex. Try fetching a fresh
distribution from ftp.ee.lbl.gov. (Or you can first try removing
".bootstrap" and doing a make again.)

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-96
@unnumberedsec unnamed-faq-96
@example
@verbatim
To: Tielman Koekemoer <tielman@spi.co.za>
Subject: Re: Please help.
In-reply-to: Your message of Fri, 09 Jul 1999 09:16:14 PDT.
Date: Fri, 09 Jul 1999 00:27:20 PDT
From: Vern Paxson <vern>

> First I removed .bootstrap (and ran make) - no luck. I downloaded the
> software but I still have the same problem. Is there anything else I
> could try.

Try:

        cp initscan.c scan.c
        touch scan.c
        make scan.o

If this last tries to first build scan.c from scan.l using ./flex, then
your "make" is broken, in which case compile scan.c to scan.o by hand.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-97
@unnumberedsec unnamed-faq-97
@example
@verbatim
To: Sumanth Kamenani <skamenan@crl.nmsu.edu>
Subject: Re: Error
In-reply-to: Your message of Mon, 19 Jul 1999 23:08:41 PDT.
Date: Tue, 20 Jul 1999 00:18:26 PDT
From: Vern Paxson <vern>

> I am getting a compilation error. The error is given as "unknown symbol- yylex".

The parser relies on calling yylex(), but you're instead using the C++ scanning
class, so you need to supply a yylex() "glue" function that calls yylex() on an
instance of the scanner class (e.g., "scanner->yylex()").

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-98
@unnumberedsec unnamed-faq-98
@example
@verbatim
To: daniel@synchrods.synchrods.COM (Daniel Senderowicz)
Subject: Re: lex
In-reply-to: Your message of Mon, 22 Nov 1999 11:19:04 PST.
Date: Tue, 23 Nov 1999 15:54:30 PST
From: Vern Paxson <vern>

Well, your problem is the

switch (yybgin-yysvec-1) { /* witchcraft */

at the beginning of lex rules. "witchcraft" == "non-portable".
It's 8034assuming knowledge of the AT&T lex's internal variables. 8035 8036For flex, you can probably do the equivalent using a switch on YYSTATE. 8037 8038 Vern 8039@end verbatim 8040@end example 8041 8042@c TODO: Evaluate this faq. 8043@node unnamed-faq-99 8044@unnumberedsec unnamed-faq-99 8045@example 8046@verbatim 8047To: archow@hss.hns.com 8048Subject: Re: Regarding distribution of flex and yacc based grammars 8049In-reply-to: Your message of Sun, 19 Dec 1999 17:50:24 +0530. 8050Date: Wed, 22 Dec 1999 01:56:24 PST 8051From: Vern Paxson <vern> 8052 8053> When we provide the customer with an object code distribution, is it 8054> necessary for us to provide source 8055> for the generated C files from flex and bison since they are generated by 8056> flex and bison ? 8057 8058For flex, no. I don't know what the current state of this is for bison. 8059 8060> Also, is there any requrirement for us to neccessarily provide source for 8061> the grammar files which are fed into flex and bison ? 8062 8063Again, for flex, no. 8064 8065See the file "COPYING" in the flex distribution for the legalese. 8066 8067 Vern 8068@end verbatim 8069@end example 8070 8071@c TODO: Evaluate this faq. 8072@node unnamed-faq-100 8073@unnumberedsec unnamed-faq-100 8074@example 8075@verbatim 8076To: Martin Gallwey <gallweym@hyperion.moe.ul.ie> 8077Subject: Re: Flex, and self referencing rules 8078In-reply-to: Your message of Sun, 20 Feb 2000 01:01:21 PST. 8079Date: Sat, 19 Feb 2000 18:33:16 PST 8080From: Vern Paxson <vern> 8081 8082> However, I do not use unput anywhere. I do use self-referencing 8083> rules like this: 8084> 8085> UnaryExpr ({UnionExpr})|("-"{UnaryExpr}) 8086 8087You can't do this - flex is *not* a parser like yacc (which does indeed 8088allow recursion), it is a scanner that's confined to regular expressions. 8089 8090 Vern 8091@end verbatim 8092@end example 8093 8094@c TODO: Evaluate this faq. 
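The recursion asked about in the preceding answer belongs in the parser:
the scanner returns one flat token per call, and the parser applies the
recursive rule to the token stream. The following C sketch illustrates
that division of labor. It is not flex or yacc output, and the token
codes and the function name @code{parse_unary_expr} are hypothetical
names chosen for this illustration. (This particular definition is only
right-recursive, so in a scanner it could also be written without
recursion as zero or more @samp{"-"} followed by @samp{@{UnionExpr@}}.)

```c
#include <stddef.h>

/* Hypothetical token codes; a flex scanner would return values like
   these from yylex(), one token per call. */
enum token { TOK_MINUS, TOK_OPERAND, TOK_END };

/*
 * Recursive rule, handled by the parser rather than the scanner:
 *     unary_expr := TOK_MINUS unary_expr | TOK_OPERAND
 * Returns +1 or -1 (the accumulated sign) on success and 0 on a
 * syntax error.  *pos is advanced past the tokens consumed.
 * The token array must be terminated by TOK_END.
 */
static int parse_unary_expr(const enum token *toks, size_t *pos)
{
    if (toks[*pos] == TOK_MINUS) {
        ++*pos;
        return -parse_unary_expr(toks, pos);   /* the recursion lives here */
    }
    if (toks[*pos] == TOK_OPERAND) {
        ++*pos;
        return 1;
    }
    return 0;                                   /* unexpected token */
}
```

In a real program the array would be replaced by successive calls to
@code{yylex()}, and the rule would normally be written in a yacc or
bison grammar rather than by hand.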
@node unnamed-faq-101
@unnumberedsec unnamed-faq-101
@example
@verbatim
To: slg3@lehigh.edu (SAMUEL L. GULDEN)
Subject: Re: Flex problem
In-reply-to: Your message of Thu, 02 Mar 2000 12:29:04 PST.
Date: Thu, 02 Mar 2000 23:00:46 PST
From: Vern Paxson <vern>

If this is exactly your program:

> digit [0-9]
> digits {digit}+
> whitespace [ \t\n]+
>
> %%
> "[" { printf("open_brac\n");}
> "]" { printf("close_brac\n");}
> "+" { printf("addop\n");}
> "*" { printf("multop\n");}
> {digits} { printf("NUMBER = %s\n", yytext);}
> whitespace ;

then the problem is that the last rule needs to be "{whitespace}" !

                Vern
@end verbatim
@end example

@node What is the difference between YYLEX_PARAM and YY_DECL?
@unnumberedsec What is the difference between YYLEX_PARAM and YY_DECL?

YYLEX_PARAM is not a flex symbol. It is for Bison. It tells Bison to pass extra
parameters when it calls yylex() from the parser.

YY_DECL is the Flex declaration of yylex. The default is similar to this:

@example
@verbatim
#define YY_DECL int yylex (void)
@end verbatim
@end example


@node Why do I get "conflicting types for yylex" error?
@unnumberedsec Why do I get "conflicting types for yylex" error?

This is a compiler error regarding a generated Bison parser, not a Flex scanner.
It means you need a prototype of yylex() at the top of the Bison file.
Be sure the prototype matches YY_DECL.

@node How do I access the values set in a Flex action from within a Bison action?
@unnumberedsec How do I access the values set in a Flex action from within a Bison action?

With $1, $2, $3, etc. These are called "Semantic Values" in the Bison manual.
See @ref{Top, , , bison, the GNU Bison Manual}.
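The path a value takes from a scanner action to @samp{$1} can be modeled
in a few lines of plain C. This is a simplified sketch, not generated
code: the union members, the token code @code{NUMBER = 258}, and the
helpers @code{scan_number} and @code{shift_token} are assumptions made
for illustration. It relies only on the fact that a yacc-style parser
copies the current @code{yylval} onto its value stack when it shifts a
token, and that @samp{$N} in an action refers to a slot on that stack.

```c
#include <stdlib.h>

/* Stand-in for the type bison generates from a %union declaration. */
typedef union {
    int   num;
    char *str;
} YYSTYPE;

/* Hypothetical token code; bison normally assigns these in y.tab.h. */
enum { NUMBER = 258 };

YYSTYPE yylval;                 /* written by the scanner, read by the parser */

/* What a flex action such as
 *     [[:digit:]]+  { yylval.num = atoi(yytext); return NUMBER; }
 * does with the matched text. */
static int scan_number(const char *yytext)
{
    yylval.num = atoi(yytext);
    return NUMBER;
}

/* Simplified model of the parser side: on a shift, the current yylval
 * is copied onto a value stack.  $N in a grammar action names one of
 * these slots. */
#define VALUE_STACK_SIZE 16
static YYSTYPE value_stack[VALUE_STACK_SIZE];
static int     value_top = 0;

static void shift_token(void)
{
    if (value_top < VALUE_STACK_SIZE)
        value_stack[value_top++] = yylval;
}
```

The typed form @samp{%token <num> NUMBER} in the grammar is what lets
bison select the @code{num} member automatically when an action refers
to the value of a @code{NUMBER} token.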

@node Appendices, Indices, FAQ, Top
@appendix Appendices

@menu
* Makefiles and Flex::
* Bison Bridge::
* M4 Dependency::
* Common Patterns::
@end menu

@node Makefiles and Flex, Bison Bridge, Appendices, Appendices
@appendixsec Makefiles and Flex

@cindex Makefile, syntax

In this appendix, we provide tips for writing Makefiles to build your scanners.

In a traditional build environment, we say that the @file{.c} files are the
sources, and the @file{.o} files are the intermediate files. When using
@code{flex}, however, the @file{.l} files are the sources, and the generated
@file{.c} files (along with the @file{.o} files) are the intermediate files.
This requires you to carefully plan your Makefile.

Modern @command{make} programs understand that @file{foo.l} is intended to
generate @file{lex.yy.c} or @file{foo.c}, and will behave
accordingly@footnote{GNU @command{make} and GNU @command{automake} are two such
programs that provide implicit rules for flex-generated scanners.}@footnote{GNU @command{automake}
may generate code to execute flex in lex-compatible mode, or to stdout. If this is not what you want,
then you should provide an explicit rule in your Makefile.am}. The
following Makefile does not explicitly instruct @command{make} how to build
@file{scan.c} from @file{scan.l}. Instead, it relies on the implicit rules of the
@command{make} program to build the intermediate file, @file{scan.c}:

@cindex Makefile, example of implicit rules
@example
@verbatim
    # Basic Makefile -- relies on implicit rules
    # Creates "myprogram" from "scan.l" and "myprogram.c"
    #
    LEX=flex
    myprogram: scan.o myprogram.o
    scan.o: scan.l

@end verbatim
@end example


For simple cases, the above may be sufficient.
For other cases,
you may have to explicitly instruct @command{make} how to build your scanner.
The following is an example of a Makefile containing explicit rules:

@cindex Makefile, explicit example
@example
@verbatim
    # Basic Makefile -- provides explicit rules
    # Creates "myprogram" from "scan.l" and "myprogram.c"
    #
    LEX=flex
    myprogram: scan.o myprogram.o
            $(CC) -o $@ $(LDFLAGS) $^

    myprogram.o: myprogram.c
            $(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^

    scan.o: scan.c
            $(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^

    scan.c: scan.l
            $(LEX) $(LFLAGS) -o $@ $^

    clean:
            $(RM) *.o scan.c

@end verbatim
@end example

Notice in the above example that @file{scan.c} is in the @code{clean} target.
This is because we consider the file @file{scan.c} to be an intermediate file.

Finally, we provide a realistic example of a @code{flex} scanner used with a
@code{bison} parser@footnote{This example also applies to yacc parsers.}.
There is a tricky problem we have to deal with. Since a @code{flex} scanner
will typically include a header file (e.g., @file{y.tab.h}) generated by the
parser, we need to be sure that the header file is generated BEFORE the scanner
is compiled. We handle this case in the following example:

@example
@verbatim
    # Makefile example -- scanner and parser.
    # Creates "myprogram" from "scan.l", "parse.y", and "myprogram.c"
    #
    LEX     = flex
    YACC    = bison -y
    YFLAGS  = -d
    objects = scan.o parse.o myprogram.o

    myprogram: $(objects)
    scan.o: scan.l parse.c
    parse.o: parse.y
    myprogram.o: myprogram.c

@end verbatim
@end example

In the above example, notice the following line:

@example
@verbatim
    scan.o: scan.l parse.c
@end verbatim
@end example

@noindent
It lists the file @file{parse.c} (the generated parser) as a dependency of
@file{scan.o}. We want to ensure that the parser is created before the scanner
is compiled, and the above line seems to do the trick. Feel free to experiment
with your specific implementation of @command{make}.


For more details on writing Makefiles, see @ref{Top, , , make, The
GNU Make Manual}.

@node Bison Bridge, M4 Dependency, Makefiles and Flex, Appendices
@section C Scanners with Bison Parsers

@cindex bison, bridging with flex
@vindex yylval
@vindex yylloc
@tindex YYLTYPE
@tindex YYSTYPE

This section describes the @code{flex} features useful when integrating
@code{flex} with @code{GNU bison}@footnote{The features described here are
purely optional, and are by no means the only way to use flex with bison.
We merely provide some glue to ease development of your parser-scanner pair.}.
Skip this section if you are not using
@code{bison} with your scanner. Here we discuss only the @code{flex}
half of the @code{flex} and @code{bison} pair. We do not discuss
@code{bison} in any detail. For more information about generating
@code{bison} parsers, see @ref{Top, , , bison, the GNU Bison Manual}.

A @code{bison}-compatible scanner is generated by declaring @samp{%option
bison-bridge} or by supplying @samp{--bison-bridge} when invoking @code{flex}
from the command line.
This instructs @code{flex} that the macro
@code{yylval} may be used. The data type for
@code{yylval}, @code{YYSTYPE},
is typically defined in a header file, included in section 1 of the
@code{flex} input file. For a list of the functions and macros
available, see @ref{bison-functions}.

The declaration of yylex becomes:

@findex yylex (reentrant version)
@example
@verbatim
      int yylex ( YYSTYPE * lvalp, yyscan_t scanner );
@end verbatim
@end example

If @code{%option bison-locations} is specified, then the declaration
becomes:

@findex yylex (reentrant version)
@example
@verbatim
      int yylex ( YYSTYPE * lvalp, YYLTYPE * llocp, yyscan_t scanner );
@end verbatim
@end example

Note that the macros @code{yylval} and @code{yylloc} evaluate to pointers.
Support for @code{yylloc} is optional in @code{bison}, so it is optional in
@code{flex} as well. The following is an example of a @code{flex} scanner that
is compatible with @code{bison}.

@cindex bison, scanner to be called from bison
@example
@verbatim
    /* Scanner for "C" assignment statements... sort of. */
    %{
    #include <stdlib.h>  /* atoi */
    #include <string.h>  /* strdup */
    #include "y.tab.h"   /* Generated by bison. */
    %}

    %option bison-bridge bison-locations
    %%

    [[:digit:]]+  { yylval->num = atoi(yytext);   return NUMBER;}
    [[:alnum:]]+  { yylval->str = strdup(yytext); return STRING;}
    "="|";"       { return yytext[0];}
    .  {}
    %%
@end verbatim
@end example

As you can see, there really is no magic here. We just use
@code{yylval} as we would any other variable. The data type of
@code{yylval} is generated by @code{bison}, and included in the file
@file{y.tab.h}. Here is the corresponding @code{bison} parser:

@cindex bison, parser
@example
@verbatim
    /* Parser to convert "C" assignments to lisp. */
    %{
    /* Pass the argument to yyparse through to yylex.
    */
    #define YYPARSE_PARAM scanner
    #define YYLEX_PARAM   scanner
    #include <stdio.h>    /* printf */
    %}
    %locations
    %pure_parser
    %union {
        int num;
        char* str;
    }
    %token <str> STRING
    %token <num> NUMBER
    %%
    assignment:
        STRING '=' NUMBER ';' {
            printf( "(setf %s %d)", $1, $3 );
       }
    ;
@end verbatim
@end example

@node M4 Dependency, Common Patterns, Bison Bridge, Appendices
@section M4 Dependency
@cindex m4
The macro processor @code{m4}@footnote{The use of m4 is subject to change in
future revisions of flex. It is not part of the public API of flex. Do not depend on it.}
must be installed wherever flex is installed.
@code{flex} invokes @samp{m4}, found by searching the directories in the
@code{PATH} environment variable. Any code you place in section 1 or in the
actions will be sent through m4. Please follow these rules to protect your
code from unwanted @code{m4} processing.

@itemize

@item Do not use symbols that begin with @samp{m4_}, such as @samp{m4_define}
or @samp{m4_include}, since those are reserved for @code{m4} macro names. If for
some reason you need @samp{m4_} as a prefix, use a preprocessor @code{#define} to get your
symbol past @code{m4} unmangled.

@item Do not use the strings @samp{[[} or @samp{]]} anywhere in your code. The
former is not valid in C, except within comments and strings, but the latter is valid in
code such as @code{x[y[z]]}. The solution is simple. To get the literal string
@code{"]]"}, use @code{"]""]"}. To get the array notation @code{x[y[z]]},
use @code{x[y[z] ]}. Flex will attempt to detect these sequences in user code, and
escape them. However, it's best to avoid this complexity where possible, by
removing such sequences from your code.

@end itemize

@code{m4} is only required at the time you run @code{flex}.
The generated
scanner is ordinary C or C++, and does @emph{not} require @code{m4}.

@node Common Patterns, ,M4 Dependency, Appendices
@section Common Patterns
@cindex patterns, common

This appendix provides examples of common regular expressions you might use
in your scanner.

@menu
* Numbers::
* Identifiers::
* Quoted Constructs::
* Addresses::
@end menu


@node Numbers, Identifiers, ,Common Patterns
@subsection Numbers

@table @asis

@item C99 decimal constant
@code{([[:digit:]]@{-@}[0])[[:digit:]]*}

@item C99 hexadecimal constant
@code{0[xX][[:xdigit:]]+}

@item C99 octal constant
@code{0[01234567]*}

@item C99 floating point constant
@verbatim
 dseq      ([[:digit:]]+)
 dseq_opt  ([[:digit:]]*)
 frac      (({dseq_opt}"."{dseq})|{dseq}".")
 exp       ([eE][+-]?{dseq})
 exp_opt   ({exp}?)
 fsuff     [flFL]
 fsuff_opt ({fsuff}?)
 hpref     (0[xX])
 hdseq     ([[:xdigit:]]+)
 hdseq_opt ([[:xdigit:]]*)
 hfrac     (({hdseq_opt}"."{hdseq})|({hdseq}"."))
 bexp      ([pP][+-]?{dseq})
 dfc       (({frac}{exp_opt}{fsuff_opt})|({dseq}{exp}{fsuff_opt}))
 hfc       (({hpref}{hfrac}{bexp}{fsuff_opt})|({hpref}{hdseq}{bexp}{fsuff_opt}))

 c99_floating_point_constant  ({dfc}|{hfc})
@end verbatim

See C99 section 6.4.4.2 for the gory details.

@end table

@node Identifiers, Quoted Constructs, Numbers, Common Patterns
@subsection Identifiers

@table @asis

@item C99 Identifier
@verbatim
ucn        ((\\u([[:xdigit:]]{4}))|(\\U([[:xdigit:]]{8})))
nondigit   [_[:alpha:]]
c99_id     ([_[:alpha:]]|{ucn})([_[:alnum:]]|{ucn})*
@end verbatim

Technically, the above pattern does not encompass all possible C99 identifiers, since C99 allows for
"implementation-defined" characters.
In practice, C compilers follow the above pattern, with the
addition of the @samp{$} character.

@item UTF-8 Encoded Unicode Code Point
@verbatim
[\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF]([\x80-\xBF]{2})|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF]([\x80-\xBF]{2})|[\xF1-\xF3]([\x80-\xBF]{3})|\xF4[\x80-\x8F]([\x80-\xBF]{2})
@end verbatim

@end table

@node Quoted Constructs, Addresses, Identifiers, Common Patterns
@subsection Quoted Constructs

@table @asis
@item C99 String Literal
@code{L?\"([^\"\\\n]|(\\['\"?\\abfnrtv])|(\\([0-7]@{1,3@}))|(\\x[[:xdigit:]]+)|(\\u([[:xdigit:]]@{4@}))|(\\U([[:xdigit:]]@{8@})))*\"}

@item C99 Comment
@code{("/*"([^*]|"*"+[^*/])*"*"+"/")|("/"(\\\n)*"/"[^\n]*)}

Note that in C99, a @samp{//}-style comment may be split across lines, and, contrary to popular belief,
does not include the trailing @samp{\n} character.

A better way to scan @samp{/* */} comments is by line, rather than matching
possibly huge comments all at once. This will allow you to scan comments of
unlimited length, as long as line breaks appear at sane intervals. This is also
more efficient when used with automatic line number processing. @xref{option-yylineno}.

@verbatim
<INITIAL>{
"/*"              BEGIN(COMMENT);
}
<COMMENT>{
"*"+"/"           BEGIN(INITIAL);
[^*\n]+           ;
"*"+[^*/\n]*      ;
\n                ;
}
@end verbatim

These rules assume that @code{COMMENT} has been declared as an exclusive
start condition with @samp{%x COMMENT} in the definitions section.

@end table

@node Addresses, ,Quoted Constructs, Common Patterns
@subsection Addresses

@table @asis

@item IPv4 Address
@verbatim
dec-octet     [0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]
IPv4address   {dec-octet}\.{dec-octet}\.{dec-octet}\.{dec-octet}
@end verbatim

@item IPv6 Address
@verbatim
h16           [0-9A-Fa-f]{1,4}
ls32          {h16}:{h16}|{IPv4address}
IPv6address   ({h16}:){6}{ls32}|
              ::({h16}:){5}{ls32}|
              ({h16})?::({h16}:){4}{ls32}|
              (({h16}:){0,1}{h16})?::({h16}:){3}{ls32}|
              (({h16}:){0,2}{h16})?::({h16}:){2}{ls32}|
              (({h16}:){0,3}{h16})?::{h16}:{ls32}|
              (({h16}:){0,4}{h16})?::{ls32}|
              (({h16}:){0,5}{h16})?::{h16}|
              (({h16}:){0,6}{h16})?::
@end verbatim

See @uref{http://www.ietf.org/rfc/rfc2373.txt, RFC 2373} for details.
Note that you have to fold the definition of @code{IPv6address} into one
line and that it also matches the ``unspecified address'' ``::''.

@item URI
@code{(([^:/?#]+):)?("//"([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?}

This pattern is nearly useless, since it allows just about any character
to appear in a URI, including spaces and control characters. See
@uref{http://www.ietf.org/rfc/rfc2396.txt, RFC 2396} for details.
8551 8552@end table 8553 8554 8555@node Indices, , Appendices, Top 8556@unnumbered Indices 8557 8558@menu 8559* Concept Index:: 8560* Index of Functions and Macros:: 8561* Index of Variables:: 8562* Index of Data Types:: 8563* Index of Hooks:: 8564* Index of Scanner Options:: 8565@end menu 8566 8567@node Concept Index, Index of Functions and Macros, Indices, Indices 8568@unnumberedsec Concept Index 8569 8570@printindex cp 8571 8572@node Index of Functions and Macros, Index of Variables, Concept Index, Indices 8573@unnumberedsec Index of Functions and Macros 8574 8575This is an index of functions and preprocessor macros that look like functions. 8576For macros that expand to variables or constants, see @ref{Index of Variables}. 8577 8578@printindex fn 8579 8580@node Index of Variables, Index of Data Types, Index of Functions and Macros, Indices 8581@unnumberedsec Index of Variables 8582 8583This is an index of variables, constants, and preprocessor macros 8584that expand to variables or constants. 8585 8586@printindex vr 8587 8588@node Index of Data Types, Index of Hooks, Index of Variables, Indices 8589@unnumberedsec Index of Data Types 8590@printindex tp 8591 8592@node Index of Hooks, Index of Scanner Options, Index of Data Types, Indices 8593@unnumberedsec Index of Hooks 8594 8595This is an index of "hooks" that the user may define. These hooks typically correspond 8596to specific locations in the generated scanner, and may be used to insert arbitrary code. 8597 8598@printindex hk 8599 8600@node Index of Scanner Options, , Index of Hooks, Indices 8601@unnumberedsec Index of Scanner Options 8602 8603@printindex op 8604 8605@c A vim script to name the faq entries. delete this when faqs are no longer 8606@c named "unnamed-faq-XXX". 8607@c 8608@c fu! Faq2 () range abort 8609@c let @r=input("Rename to: ") 8610@c exe "%s/" . @w . "/" . @r . 
"/g" 8611@c normal 'f 8612@c endf 8613@c nnoremap <F5> 1G/@node\s\+unnamed-faq-\d\+<cr>mfww"wy5ezt:call Faq2()<cr> 8614 8615@bye 8616