flex.texi revision 1.1.1.4
\input texinfo.tex @c -*-texinfo-*-
@c %**start of header
@setfilename flex.info
@include version.texi
@settitle Lexical Analysis With Flex, for Flex @value{VERSION}
@set authors Vern Paxson, Will Estes and John Millaway
@c "Macro Hooks" index
@defindex hk
@c "Options" index
@defindex op
@dircategory Programming
@direntry
* flex: (flex).      Fast lexical analyzer generator (lex replacement).
@end direntry
@c %**end of header

@copying

The flex manual is placed under the same licensing conditions as the
rest of flex:

Copyright @copyright{} 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2012
The Flex Project.

Copyright @copyright{} 1990, 1997 The Regents of the University of California.
All rights reserved.

This code is derived from software contributed to Berkeley by
Vern Paxson.

The United States Government has rights in this work pursuant
to contract no. DE-AC03-76SF00098 between the United States
Department of Energy and the University of California.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

@enumerate
@item
 Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

@item
Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
@end enumerate

Neither the name of the University nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.

THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE.
@end copying

@titlepage
@title Lexical Analysis with Flex
@subtitle Edition @value{EDITION}, @value{UPDATED}
@author @value{authors}
@page
@vskip 0pt plus 1filll
@insertcopying
@end titlepage
@contents
@ifnottex
@node Top, Copyright, (dir), (dir)
@top flex

This manual describes @code{flex}, a tool for generating programs that
perform pattern-matching on text.  The manual includes both tutorial and
reference sections.

This edition of @cite{The flex Manual} documents @code{flex} version
@value{VERSION}. It was last updated on @value{UPDATED}.

This manual was written by @value{authors}.

@menu
* Copyright::
* Reporting Bugs::
* Introduction::
* Simple Examples::
* Format::
* Patterns::
* Matching::
* Actions::
* Generated Scanner::
* Start Conditions::
* Multiple Input Buffers::
* EOF::
* Misc Macros::
* User Values::
* Yacc::
* Scanner Options::
* Performance::
* Cxx::
* Reentrant::
* Lex and Posix::
* Memory Management::
* Serialized Tables::
* Diagnostics::
* Limitations::
* Bibliography::
* FAQ::
* Appendices::
* Indices::

@detailmenu
 --- The Detailed Node Listing ---

Format of the Input File

* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::

Scanner Options

* Options for Specifying Filenames::
* Options Affecting Scanner Behavior::
* Code-Level And API Options::
* Options for Scanner Speed and Size::
* Debugging Options::
* Miscellaneous Options::

Reentrant C Scanners

* Reentrant Uses::
* Reentrant Overview::
* Reentrant Example::
* Reentrant Detail::
* Reentrant Functions::

The Reentrant API in Detail

* Specify Reentrant::
* Extra Reentrant Argument::
* Global Replacement::
* Init and Destroy Functions::
* Accessor Methods::
* Extra Data::
* About yyscan_t::

Memory Management

* The Default Memory Management::
* Overriding The Default Memory Management::
* A Note About yytext And Memory::

Serialized Tables

* Creating Serialized Tables::
* Loading and Unloading Serialized Tables::
* Tables File Format::

FAQ

* When was flex born?::
* How do I expand backslash-escape sequences in C-style quoted strings?::
* Why do flex scanners call fileno if it is not ANSI compatible?::
* Does flex support recursive pattern definitions?::
* How do I skip huge chunks of input (tens of megabytes) while using flex?::
* Flex is not matching my patterns in the same order that I defined them.::
* My actions are executing out of order or sometimes not at all.::
* How can I have multiple input sources feed into the same scanner at the same time?::
* Can I build nested parsers that work with the same input file?::
* How can I match text only at the end of a file?::
* How can I make REJECT cascade across start condition boundaries?::
* Why cant I use fast or full tables with interactive mode?::
* How much faster is -F or -f than -C?::
* If I have a simple grammar cant I just parse it with flex?::
* Why doesn't yyrestart() set the start state back to INITIAL?::
* How can I match C-style comments?::
* The period isn't working the way I expected.::
* Can I get the flex manual in another format?::
* Does there exist a "faster" NDFA->DFA algorithm?::
* How does flex compile the DFA so quickly?::
* How can I use more than 8192 rules?::
* How do I abandon a file in the middle of a scan and switch to a new file?::
* How do I execute code only during initialization (only before the first scan)?::
* How do I execute code at termination?::
* Where else can I find help?::
* Can I include comments in the "rules" section of the file?::
* I get an error about undefined yywrap().::
* How can I change the matching pattern at run time?::
* How can I expand macros in the input?::
* How can I build a two-pass scanner?::
* How do I match any string not matched in the preceding rules?::
* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
* Is there a way to make flex treat NULL like a regular character?::
* Whenever flex can not match the input it says "flex scanner jammed".::
* Why doesn't flex have non-greedy operators like perl does?::
* Memory leak - 16386 bytes allocated by malloc.::
* How do I track the byte offset for lseek()?::
* How do I use my own I/O classes in a C++ scanner?::
* How do I skip as many chars as possible?::
* deleteme00::
* Are certain equivalent patterns faster than others?::
* Is backing up a big deal?::
* Can I fake multi-byte character support?::
* deleteme01::
* Can you discuss some flex internals?::
* unput() messes up yy_at_bol::
* The | operator is not doing what I want::
* Why can't flex understand this variable trailing context pattern?::
* The ^ operator isn't working::
* Trailing context is getting confused with trailing optional patterns::
* Is flex GNU or not?::
* ERASEME53::
* I need to scan if-then-else blocks and while loops::
* ERASEME55::
* ERASEME56::
* ERASEME57::
* Is there a repository for flex scanners?::
* How can I conditionally compile or preprocess my flex input file?::
* Where can I find grammars for lex and yacc?::
* I get an end-of-buffer message for each character scanned.::
* unnamed-faq-62::
* unnamed-faq-63::
* unnamed-faq-64::
* unnamed-faq-65::
* unnamed-faq-66::
* unnamed-faq-67::
* unnamed-faq-68::
* unnamed-faq-69::
* unnamed-faq-70::
* unnamed-faq-71::
* unnamed-faq-72::
* unnamed-faq-73::
* unnamed-faq-74::
* unnamed-faq-75::
* unnamed-faq-76::
* unnamed-faq-77::
* unnamed-faq-78::
* unnamed-faq-79::
* unnamed-faq-80::
* unnamed-faq-81::
* unnamed-faq-82::
* unnamed-faq-83::
* unnamed-faq-84::
* unnamed-faq-85::
* unnamed-faq-86::
* unnamed-faq-87::
* unnamed-faq-88::
* unnamed-faq-90::
* unnamed-faq-91::
* unnamed-faq-92::
* unnamed-faq-93::
* unnamed-faq-94::
* unnamed-faq-95::
* unnamed-faq-96::
* unnamed-faq-97::
* unnamed-faq-98::
* unnamed-faq-99::
* unnamed-faq-100::
* unnamed-faq-101::
* What is the difference between YYLEX_PARAM and YY_DECL?::
* Why do I get "conflicting types for yylex" error?::
* How do I access the values set in a Flex action from within a Bison action?::

Appendices

* Makefiles and Flex::
* Bison Bridge::
* M4 Dependency::
* Common Patterns::

Indices

* Concept Index::
* Index of Functions and Macros::
* Index of Variables::
* Index of Data Types::
* Index of Hooks::
* Index of Scanner Options::

@end detailmenu
@end menu
@end ifnottex
@node Copyright, Reporting Bugs, Top, Top
@chapter Copyright

@cindex copyright of flex
@cindex distributing flex
@insertcopying

@node Reporting Bugs, Introduction, Copyright, Top
@chapter Reporting Bugs

@cindex bugs, reporting
@cindex reporting bugs

If you find a bug in @code{flex}, please report it using
the SourceForge Bug Tracking facilities which can be found on
@url{http://sourceforge.net/projects/flex,flex's SourceForge Page}.

@node Introduction, Simple Examples, Reporting Bugs, Top
@chapter Introduction

@cindex scanner, definition of
@code{flex} is a tool for generating @dfn{scanners}.  A scanner is a
program which recognizes lexical patterns in text.  The @code{flex}
program reads the given input files, or its standard input if no file
names are given, for a description of a scanner to generate.  The
description is in the form of pairs of regular expressions and C code,
called @dfn{rules}.
@code{flex} generates as output a C source file,
@file{lex.yy.c} by default, which defines a routine @code{yylex()}.
This file can be compiled and linked with the flex runtime library to
produce an executable.  When the executable is run, it analyzes its
input for occurrences of the regular expressions.  Whenever it finds
one, it executes the corresponding C code.

@node Simple Examples, Format, Introduction, Top
@chapter Some Simple Examples

First some simple examples to get the flavor of how one uses
@code{flex}.

@cindex username expansion
The following @code{flex} input specifies a scanner which, when it
encounters the string @samp{username} will replace it with the user's
login name:

@example
@verbatim
    %%
    username    printf( "%s", getlogin() );
@end verbatim
@end example

@cindex default rule
@cindex rules, default
By default, any text not matched by a @code{flex} scanner is copied to
the output, so the net effect of this scanner is to copy its input file
to its output with each occurrence of @samp{username} expanded.  In this
input, there is just one rule.  @samp{username} is the @dfn{pattern} and
the @samp{printf} is the @dfn{action}.  The @samp{%%} symbol marks the
beginning of the rules.

Here's another simple example:

@cindex counting characters and lines
@example
@verbatim
            int num_lines = 0, num_chars = 0;

    %%
    \n      ++num_lines; ++num_chars;
    .       ++num_chars;

    %%

    int main()
            {
            yylex();
            printf( "# of lines = %d, # of chars = %d\n",
                    num_lines, num_chars );
            }
@end verbatim
@end example

This scanner counts the number of characters and the number of lines in
its input. It produces no output other than the final report on the
character and line counts.
The first line declares two globals,
@code{num_lines} and @code{num_chars}, which are accessible both inside
@code{yylex()} and in the @code{main()} routine declared after the
second @samp{%%}.  There are two rules, one which matches a newline
(@samp{\n}) and increments both the line count and the character count,
and one which matches any character other than a newline (indicated by
the @samp{.} regular expression).

A somewhat more complicated example:

@cindex Pascal-like language
@example
@verbatim
    /* scanner for a toy Pascal-like language */

    %{
    /* need this for the calls to atoi() and atof() below */
    #include <stdlib.h>
    %}

    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*

    %%

    {DIGIT}+    {
                printf( "An integer: %s (%d)\n", yytext,
                        atoi( yytext ) );
                }

    {DIGIT}+"."{DIGIT}*        {
                printf( "A float: %s (%g)\n", yytext,
                        atof( yytext ) );
                }

    if|then|begin|end|procedure|function        {
                printf( "A keyword: %s\n", yytext );
                }

    {ID}        printf( "An identifier: %s\n", yytext );

    "+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );

    "{"[^}\n]*"}"     /* eat up one-line comments */

    [ \t\n]+          /* eat up whitespace */

    .           printf( "Unrecognized character: %s\n", yytext );

    %%

    int main( int argc, char **argv )
        {
        ++argv, --argc;  /* skip over program name */
        if ( argc > 0 )
                yyin = fopen( argv[0], "r" );
        else
                yyin = stdin;

        yylex();
        }
@end verbatim
@end example

This is the beginning of a simple scanner for a language like Pascal.
It identifies different types of @dfn{tokens} and reports on what it has
seen.

The details of this example will be explained in the following
sections.
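
As one more illustration, here is a minimal sketch of a self-contained
variant of the counting example.  It is not one of the original
examples: the word count, the @code{num_words} variable, and the use of
@code{%option noyywrap} (@pxref{Scanner Options}) are choices made for
this sketch.  Because it sets @code{noyywrap} and supplies its own
@code{main()}, the generated @file{lex.yy.c} should not need the flex
runtime library at link time.

@example
@verbatim
    %option noyywrap

    %{
    #include <stdio.h>

    /* num_words is an addition made for this sketch */
    static int num_lines = 0, num_words = 0, num_chars = 0;
    %}

    %%
    [^ \t\n]+   { ++num_words; num_chars += yyleng; }
    \n          { ++num_lines; ++num_chars; }
    .           ++num_chars;   /* remaining blanks and tabs */
    %%

    int main( void )
        {
        yylex();
        printf( "%d lines, %d words, %d chars\n",
                num_lines, num_words, num_chars );
        return 0;
        }
@end verbatim
@end example

As in the other examples of this chapter, the text is indented here
only for display.  In a real input file the patterns and the @samp{%%},
@samp{%@{}, @samp{%@}} and @samp{%option} lines must start in the first
column.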

@node Format, Patterns, Simple Examples, Top
@chapter Format of the Input File

@cindex format of flex input
@cindex input, format of
@cindex file format
@cindex sections of flex input

The @code{flex} input file consists of three sections, separated by a
line containing only @samp{%%}.

@cindex format of input file
@example
@verbatim
    definitions
    %%
    rules
    %%
    user code
@end verbatim
@end example

@menu
* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::
@end menu

@node Definitions Section, Rules Section, Format, Format
@section Format of the Definitions Section

@cindex input file, Definitions section
@cindex Definitions, in flex input
The @dfn{definitions section} contains declarations of simple @dfn{name}
definitions to simplify the scanner specification, and declarations of
@dfn{start conditions}, which are explained in a later section.

@cindex aliases, how to define
@cindex pattern aliases, how to define
Name definitions have the form:

@example
@verbatim
    name definition
@end verbatim
@end example

The @samp{name} is a word beginning with a letter or an underscore
(@samp{_}) followed by zero or more letters, digits, @samp{_}, or
@samp{-} (dash).  The definition is taken to begin at the first
non-whitespace character following the name and continuing to the end of
the line.  The definition can subsequently be referred to using
@samp{@{name@}}, which will expand to @samp{(definition)}.
For example,

@cindex pattern aliases, defining
@cindex defining pattern aliases
@example
@verbatim
    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*
@end verbatim
@end example

defines @samp{DIGIT} to be a regular expression which matches a single
digit, and @samp{ID} to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits.  A subsequent reference to

@cindex pattern aliases, use of
@example
@verbatim
    {DIGIT}+"."{DIGIT}*
@end verbatim
@end example

is identical to

@example
@verbatim
    ([0-9])+"."([0-9])*
@end verbatim
@end example

and matches one-or-more digits followed by a @samp{.} followed by
zero-or-more digits.

@cindex comments in flex input
An unindented comment (i.e., a line
beginning with @samp{/*}) is copied verbatim to the output up
to the next @samp{*/}.

@cindex %@{ and %@}, in Definitions Section
@cindex embedding C code in flex input
@cindex C code in flex input
Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}}
is also copied verbatim to the output (with the %@{ and %@} symbols
removed).  The %@{ and %@} symbols must appear unindented on lines by
themselves.

@cindex %top

A @code{%top} block is similar to a @samp{%@{} ... @samp{%@}} block, except
that the code in a @code{%top} block is relocated to the @emph{top} of the
generated file, before any flex definitions @footnote{Actually,
@code{yyIN_HEADER} is defined before the @samp{%top} block.}.
The @code{%top} block is useful when you want certain preprocessor macros to be
defined or certain files to be included before the generated code.
The single characters, @samp{@{}  and @samp{@}} are used to delimit the
@code{%top} block, as shown in the example below:

@example
@verbatim
    %top{
        /* This code goes at the "top" of the generated file. */
        #include <stdint.h>
        #include <inttypes.h>
    }
@end verbatim
@end example

Multiple @code{%top} blocks are allowed, and their order is preserved.

@node Rules Section, User Code Section, Definitions Section, Format
@section Format of the Rules Section

@cindex input file, Rules Section
@cindex rules, in flex input
The @dfn{rules} section of the @code{flex} input contains a series of
rules of the form:

@example
@verbatim
    pattern   action
@end verbatim
@end example

where the pattern must be unindented and the action must begin
on the same line.
@xref{Patterns}, for a further description of patterns and actions.

In the rules section, any indented or %@{ %@} enclosed text appearing
before the first rule may be used to declare variables which are local
to the scanning routine and (after the declarations) code which is to be
executed whenever the scanning routine is entered.  Other indented or
%@{ %@} text in the rule section is still copied to the output, but its
meaning is not well-defined and it may well cause compile-time errors
(this feature is present for @acronym{POSIX} compliance. @xref{Lex and
Posix}, for other such features).

Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}}
is copied verbatim to the output (with the %@{ and %@} symbols removed).
The %@{ and %@} symbols must appear unindented on lines by themselves.

@node User Code Section, Comments in the Input, Rules Section, Format
@section Format of the User Code Section

@cindex input file, user code Section
@cindex user code, in flex input
The user code section is simply copied to @file{lex.yy.c} verbatim.  It
is used for companion routines which call or are called by the scanner.
The presence of this section is optional; if it is missing, the second
@samp{%%} in the input file may be skipped, too.
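
To tie the three sections together, the following is a minimal sketch
of a complete input file.  It is an illustration only, not taken from
the flex sources: the @code{ndigits} variable, the rules, and the use of
@code{%option noyywrap} and of an @samp{<<EOF>>} rule
(@pxref{Patterns}) are assumptions made for this sketch.  Because
indentation is significant here, the text is shown exactly as it would
appear in the file, with no extra indentation added for display.

@example
@verbatim
%option noyywrap

%{
/* Definitions section: this block is copied verbatim to the output. */
#include <stdio.h>
%}

DIGIT   [0-9]

%%
    /* Indented text before the first rule: local to yylex(). */
    int ndigits = 0;

{DIGIT}     ++ndigits;
.|\n        /* discard everything else */
<<EOF>>     { printf( "%d digits\n", ndigits ); return 0; }

%%
/* User code section: copied verbatim after the scanner. */
int main( void )
    {
    yylex();
    return 0;
    }
@end verbatim
@end example

Since @code{ndigits} is local to the scanning routine, it is set to
zero again on every call to @code{yylex()}.  The sketch relies on
@code{yylex()} being called only once; a count that must survive
several calls would belong in the definitions section instead.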

@node Comments in the Input, , User Code Section, Format
@section Comments in the Input

@cindex comments, syntax of
Flex supports C-style comments, that is, anything between @samp{/*} and
@samp{*/} is
considered a comment. Whenever flex encounters a comment, it copies the
entire comment verbatim to the generated source code. Comments may
appear just about anywhere, but with the following exceptions:

@itemize
@cindex comments, in rules section
@item
Comments may not appear in the Rules Section wherever flex is expecting
a regular expression. This means comments may not appear at the
beginning of a line, or immediately following a list of scanner states.
@item
Comments may not appear on an @samp{%option} line in the Definitions
Section.
@end itemize

If you want to follow a simple rule, then always begin a comment on a
new line, with one or more whitespace characters before the initial
@samp{/*}. This rule will work anywhere in the input file.

All the comments in the following example are valid:

@cindex comments, valid uses of
@cindex comments in the input
@example
@verbatim
%{
/* code block */
%}

/* Definitions Section */
%x STATE_X

%%
    /* Rules Section */
ruleA   /* after regex */ { /* code block */ } /* after code block */
        /* Rules Section (indented) */
<STATE_X>{
ruleC   ECHO;
ruleD   ECHO;
%{
/* code block */
%}
}
%%
/* User Code Section */

@end verbatim
@end example

@node Patterns, Matching, Format, Top
@chapter Patterns

@cindex patterns, in rules section
@cindex regular expressions, in patterns
The patterns in the input (see @ref{Rules Section}) are written using an
extended set of regular expressions.  These are:

@cindex patterns, syntax
@table @samp
@item x
match the character 'x'

@item .
any character (byte) except newline

@cindex [] in patterns
@cindex character classes in patterns, syntax of
@cindex POSIX, character classes in patterns, syntax of
@item [xyz]
a @dfn{character class}; in this case, the pattern
matches either an 'x', a 'y', or a 'z'

@cindex ranges in patterns
@item [abj-oZ]
a "character class" with a range in it; matches
an 'a', a 'b', any letter from 'j' through 'o',
or a 'Z'

@cindex ranges in patterns, negating
@cindex negating ranges in patterns
@item [^A-Z]
a "negated character class", i.e., any character
but those in the class.  In this case, any
character EXCEPT an uppercase letter.

@item [^A-Z\n]
any character EXCEPT an uppercase letter or
a newline

@item [a-z]@{-@}[aeiou]
the lowercase consonants

@item r*
zero or more r's, where r is any regular expression

@item r+
one or more r's

@item r?
zero or one r's (that is, ``an optional r'')

@cindex braces in patterns
@item r@{2,5@}
anywhere from two to five r's

@item r@{2,@}
two or more r's

@item r@{4@}
exactly 4 r's

@cindex pattern aliases, expansion of
@item @{name@}
the expansion of the @samp{name} definition
(@pxref{Format}).

@cindex literal text in patterns, syntax of
@cindex verbatim text in patterns, syntax of
@item "[xyz]\"foo"
the literal string: @samp{[xyz]"foo}

@cindex escape sequences in patterns, syntax of
@item \X
if X is @samp{a}, @samp{b}, @samp{f}, @samp{n}, @samp{r}, @samp{t}, or
@samp{v}, then the ANSI-C interpretation of @samp{\x}.
Otherwise, a
literal @samp{X} (used to escape operators such as @samp{*})

@cindex NULL character in patterns, syntax of
@item \0
a NUL character (ASCII code 0)

@cindex octal characters in patterns
@item \123
the character with octal value 123

@item \x2a
the character with hexadecimal value 2a

@item (r)
match an @samp{r}; parentheses are used to override precedence (see below)

@item (?r-s:pattern)
apply option @samp{r} and omit option @samp{s} while interpreting pattern.
Options may be zero or more of the characters @samp{i}, @samp{s}, or @samp{x}.

@samp{i} means case-insensitive. @samp{-i} means case-sensitive.

@samp{s} alters the meaning of the @samp{.} syntax to match any single byte whatsoever.
@samp{-s} alters the meaning of @samp{.} to match any byte except @samp{\n}.

@samp{x} ignores comments and whitespace in patterns. Whitespace is ignored unless
it is backslash-escaped, contained within @samp{""}s, or appears inside a
character class.

The following are all valid:

@verbatim
(?:foo)         same as  (foo)
(?i:ab7)        same as  ([aA][bB]7)
(?-i:ab)        same as  (ab)
(?s:.)          same as  [\x00-\xFF]
(?-s:.)         same as  [^\n]
(?ix-s: a . b)  same as  ([Aa][^\n][bB])
(?x:a  b)       same as  ("ab")
(?x:a\ b)       same as  ("a b")
(?x:a" "b)      same as  ("a b")
(?x:a[ ]b)      same as  ("a b")
(?x:a
    /* comment */
    b
    c)          same as  (abc)
@end verbatim

@item (?# comment )
omit everything within @samp{()}. The first @samp{)}
character encountered ends the pattern. It is not possible for the comment
to contain a @samp{)} character. The comment may span lines.

@cindex concatenation, in patterns
@item rs
the regular expression @samp{r} followed by the regular expression @samp{s}; called
@dfn{concatenation}

@item r|s
either an @samp{r} or an @samp{s}

@cindex trailing context, in patterns
@item r/s
an @samp{r} but only if it is followed by an @samp{s}.  The text matched by @samp{s} is
included when determining whether this rule is the longest match, but is
then returned to the input before the action is executed.  So the action
only sees the text matched by @samp{r}.  This type of pattern is called
@dfn{trailing context}.  (There are some combinations of @samp{r/s} that flex
cannot match correctly. @xref{Limitations}, regarding dangerous trailing
context.)

@cindex beginning of line, in patterns
@cindex BOL, in patterns
@item ^r
an @samp{r}, but only at the beginning of a line (i.e.,
when just starting to scan, or right after a
newline has been scanned).

@cindex end of line, in patterns
@cindex EOL, in patterns
@item r$
an @samp{r}, but only at the end of a line (i.e., just before a
newline).  Equivalent to @samp{r/\n}.

@cindex newline, matching in patterns
Note that @code{flex}'s notion of ``newline'' is exactly
whatever the C compiler used to compile @code{flex}
interprets @samp{\n} as; in particular, on some DOS
systems you must either filter out @samp{\r}s in the
input yourself, or explicitly use @samp{r/\r\n} for @samp{r$}.

@cindex start conditions, in patterns
@item <s>r
an @samp{r}, but only in start condition @code{s} (see @ref{Start
Conditions} for discussion of start conditions).

@item <s1,s2,s3>r
same, but in any of start conditions @code{s1}, @code{s2}, or @code{s3}.

@item <*>r
an @samp{r} in any start condition, even an exclusive one.

@cindex end of file, in patterns
@cindex EOF in patterns, syntax of
@item <<EOF>>
an end-of-file.

@item <s1,s2><<EOF>>
an end-of-file when in start condition @code{s1} or @code{s2}
@end table

Note that inside of a character class, all regular expression operators
lose their special meaning except escape (@samp{\}) and the character class
operators, @samp{-}, @samp{]}, and, at the beginning of the class, @samp{^}.

@cindex patterns, precedence of operators
The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence (see special note on the
precedence of the repeat operator, @samp{@{@}}, under the documentation
for the @samp{--posix} POSIX compliance option).  For example,

@cindex patterns, grouping and precedence
@example
@verbatim
    foo|bar*
@end verbatim
@end example

is the same as

@example
@verbatim
    (foo)|(ba(r*))
@end verbatim
@end example

since the @samp{*} operator has higher precedence than concatenation,
and concatenation higher than alternation (@samp{|}).  This pattern
therefore matches @emph{either} the string @samp{foo} @emph{or} the
string @samp{ba} followed by zero-or-more @samp{r}'s.  To match
@samp{foo} or zero-or-more repetitions of the string @samp{bar}, use:

@example
@verbatim
    foo|(bar)*
@end verbatim
@end example

And to match a sequence of zero or more repetitions of @samp{foo} and
@samp{bar}:

@cindex patterns, repetitions with grouping
@example
@verbatim
    (foo|bar)*
@end verbatim
@end example

@cindex character classes in patterns
In addition to characters and ranges of characters, character classes
can also contain @dfn{character class expressions}.  These are
expressions enclosed inside @samp{[:} and @samp{:]} delimiters (which
themselves must appear between the @samp{[} and @samp{]} of the
character class.
Other elements may occur inside the character class,
too).  The valid expressions are:

@cindex patterns, valid character classes
@example
@verbatim
    [:alnum:] [:alpha:] [:blank:]
    [:cntrl:] [:digit:] [:graph:]
    [:lower:] [:print:] [:punct:]
    [:space:] [:upper:] [:xdigit:]
@end verbatim
@end example

These expressions all designate a set of characters equivalent to the
corresponding standard C @code{isXXX} function.  For example,
@samp{[:alnum:]} designates those characters for which @code{isalnum()}
returns true - i.e., any alphabetic or numeric character.  Some systems
don't provide @code{isblank()}, so flex defines @samp{[:blank:]} as a
blank or a tab.

For example, the following character classes are all equivalent:

@cindex character classes, equivalence of
@cindex patterns, character class equivalence
@example
@verbatim
    [[:alnum:]]
    [[:alpha:][:digit:]]
    [[:alpha:][0-9]]
    [a-zA-Z0-9]
@end verbatim
@end example

A word of caution. Character classes are expanded immediately when seen in the @code{flex} input.
This means the character classes are sensitive to the locale in which @code{flex}
is executed, and the resulting scanner will not be sensitive to the runtime locale.
This may or may not be desirable.


@itemize
@cindex case-insensitive, effect on character classes
@item If your scanner is case-insensitive (the @samp{-i} flag), then
@samp{[:upper:]} and @samp{[:lower:]} are equivalent to
@samp{[:alpha:]}.

@anchor{case and character ranges}
@item Character classes with ranges, such as @samp{[a-Z]}, should be used with
caution in a case-insensitive scanner if the range spans upper or lowercase
characters. Flex does not know if you want to fold all upper and lowercase
characters together, or if you want the literal numeric range specified (with
no case folding).
When in doubt, flex will assume that you meant the literal
numeric range, and will issue a warning. The exception to this rule is a
character range such as @samp{[a-z]} or @samp{[S-W]} where it is obvious that you
want case-folding to occur. Here are some examples with the @samp{-i} flag
enabled:

@multitable {@samp{[a-zA-Z]}} {ambiguous} {@samp{[A-Z\[\\\]_`a-t]}} {@samp{[@@A-Z\[\\\]_`abc]}}
@item Range @tab Result @tab Literal Range @tab Alternate Range
@item @samp{[a-t]} @tab ok @tab @samp{[a-tA-T]} @tab
@item @samp{[A-T]} @tab ok @tab @samp{[a-tA-T]} @tab
@item @samp{[A-t]} @tab ambiguous @tab @samp{[A-Z\[\\\]_`a-t]} @tab @samp{[a-tA-T]}
@item @samp{[_-@{]} @tab ambiguous @tab @samp{[_`a-z@{]} @tab @samp{[_`a-zA-Z@{]}
@item @samp{[@@-C]} @tab ambiguous @tab @samp{[@@ABC]} @tab @samp{[@@A-Z\[\\\]_`abc]}
@end multitable

@cindex end of line, in negated character classes
@cindex EOL, in negated character classes
@item
A negated character class such as the example @samp{[^A-Z]} above
@emph{will} match a newline unless @samp{\n} (or an equivalent escape
sequence) is one of the characters explicitly present in the negated
character class (e.g., @samp{[^A-Z\n]}).  This is unlike how many other
regular expression tools treat negated character classes, but
unfortunately the inconsistency is historically entrenched.  Matching
newlines means that a pattern like @samp{[^"]*} can match the entire
input unless there's another quote in the input.

Flex allows negation of character class expressions by prepending @samp{^} to
the POSIX character class name.

@example
@verbatim
    [:^alnum:] [:^alpha:] [:^blank:]
    [:^cntrl:] [:^digit:] [:^graph:]
    [:^lower:] [:^print:] [:^punct:]
    [:^space:] [:^upper:] [:^xdigit:]
@end verbatim
@end example

Flex will issue a warning if the expressions @samp{[:^upper:]} and
@samp{[:^lower:]} appear in a case-insensitive scanner, since their meaning is
unclear. The current behavior is to skip them entirely, but this may change
without notice in future revisions of flex.

@item

The @samp{@{-@}} operator computes the difference of two character classes. For
example, @samp{[a-c]@{-@}[b-z]} represents all the characters in the class
@samp{[a-c]} that are not in the class @samp{[b-z]} (which in this case, is
just the single character @samp{a}). The @samp{@{-@}} operator is left
associative, so @samp{[abc]@{-@}[b]@{-@}[c]} is the same as @samp{[a]}. Be careful
not to accidentally create an empty set, which will never match.

@item

The @samp{@{+@}} operator computes the union of two character classes. For
example, @samp{[a-z]@{+@}[0-9]} is the same as @samp{[a-z0-9]}. This operator
is useful when preceded by the result of a difference operation, as in,
@samp{[[:alpha:]]@{-@}[[:lower:]]@{+@}[q]}, which is equivalent to
@samp{[A-Zq]} in the "C" locale.

@cindex trailing context, limits of
@cindex ^ as non-special character in patterns
@cindex $ as normal character in patterns
@item
A rule can have at most one instance of trailing context (the @samp{/} operator
or the @samp{$} operator).  The start condition, @samp{^}, and @samp{<<EOF>>} patterns
can only occur at the beginning of a pattern, and, as well as with @samp{/} and @samp{$},
cannot be grouped inside parentheses.  A @samp{^} which does not occur at
the beginning of a rule or a @samp{$} which does not occur at the end of
a rule loses its special properties and is treated as a normal character.
1021 1022@item 1023The following are invalid: 1024 1025@cindex patterns, invalid trailing context 1026@example 1027@verbatim 1028 foo/bar$ 1029 <sc1>foo<sc2>bar 1030@end verbatim 1031@end example 1032 1033Note that the first of these can be written @samp{foo/bar\n}. 1034 1035@item 1036The following will result in @samp{$} or @samp{^} being treated as a normal character: 1037 1038@cindex patterns, special characters treated as non-special 1039@example 1040@verbatim 1041 foo|(bar$) 1042 foo|^bar 1043@end verbatim 1044@end example 1045 1046If the desired meaning is a @samp{foo} or a 1047@samp{bar}-followed-by-a-newline, the following could be used (the 1048special @code{|} action is explained below, @pxref{Actions}): 1049 1050@cindex patterns, end of line 1051@example 1052@verbatim 1053 foo | 1054 bar$ /* action goes here */ 1055@end verbatim 1056@end example 1057 1058A similar trick will work for matching a @samp{foo} or a 1059@samp{bar}-at-the-beginning-of-a-line. 1060@end itemize 1061 1062@node Matching, Actions, Patterns, Top 1063@chapter How the Input Is Matched 1064 1065@cindex patterns, matching 1066@cindex input, matching 1067@cindex trailing context, matching 1068@cindex matching, and trailing context 1069@cindex matching, length of 1070@cindex matching, multiple matches 1071When the generated scanner is run, it analyzes its input looking for 1072strings which match any of its patterns. If it finds more than one 1073match, it takes the one matching the most text (for trailing context 1074rules, this includes the length of the trailing part, even though it 1075will then be returned to the input). If it finds two or more matches of 1076the same length, the rule listed first in the @code{flex} input file is 1077chosen. 
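
The selection rule above can be modelled in a few lines of C. This is only a sketch of the rule, not the table-driven code that @code{flex} actually generates; the function name @code{pick_rule} and the convention that a length of 0 means ``this rule did not match'' are assumptions made for the illustration.

```c
#include <stddef.h>

/* match_len[i] is the number of characters rule i matched at the current
 * input position, with rules listed in the order they appear in the flex
 * input; 0 is taken to mean "no match".  Returns the index of the rule
 * that wins, or -1 if no rule matched (the default rule would then apply). */
static int pick_rule(const size_t *match_len, size_t n_rules)
{
    int best = -1;
    size_t best_len = 0;
    size_t i;

    for (i = 0; i < n_rules; ++i) {
        /* Strictly greater: on a tie the earlier rule is kept. */
        if (match_len[i] > best_len) {
            best_len = match_len[i];
            best = (int) i;
        }
    }
    return best;
}
```

With lengths 2, 5 and 5 the second rule is chosen: it matches more text than the first and is listed before the third.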

@cindex token
@cindex yytext
@cindex yyleng
Once the match is determined, the text corresponding to the match
(called the @dfn{token}) is made available in the global character
pointer @code{yytext}, and its length in the global integer
@code{yyleng}.  The @dfn{action} corresponding to the matched pattern is
then executed (@pxref{Actions}), and then the remaining input is scanned
for another match.

@cindex default rule
If no match is found, then the @dfn{default rule} is executed: the next
character in the input is considered matched and copied to the standard
output.  Thus, the simplest valid @code{flex} input is:

@cindex minimal scanner
@example
@verbatim
    %%
@end verbatim
@end example

which generates a scanner that simply copies its input (one character at
a time) to its output.

@cindex yytext, two types of
@cindex %array, use of
@cindex %pointer, use of
@vindex yytext
Note that @code{yytext} can be defined in two different ways: either as
a character @emph{pointer} or as a character @emph{array}.  You can
control which definition @code{flex} uses by including one of the
special directives @code{%pointer} or @code{%array} in the first
(definitions) section of your flex input.  The default is
@code{%pointer}, unless you use the @samp{-l} lex compatibility option,
in which case @code{yytext} will be an array.  The advantage of using
@code{%pointer} is substantially faster scanning and no buffer overflow
when matching very large tokens (unless you run out of dynamic memory).
The disadvantage is that you are restricted in how your actions can
modify @code{yytext} (@pxref{Actions}), and calls to the @code{unput()}
function destroy the present contents of @code{yytext}, which can be a
considerable porting headache when moving between different @code{lex}
versions.
1122 1123@cindex %array, advantages of 1124The advantage of @code{%array} is that you can then modify @code{yytext} 1125to your heart's content, and calls to @code{unput()} do not destroy 1126@code{yytext} (@pxref{Actions}). Furthermore, existing @code{lex} 1127programs sometimes access @code{yytext} externally using declarations of 1128the form: 1129 1130@example 1131@verbatim 1132 extern char yytext[]; 1133@end verbatim 1134@end example 1135 1136This definition is erroneous when used with @code{%pointer}, but correct 1137for @code{%array}. 1138 1139The @code{%array} declaration defines @code{yytext} to be an array of 1140@code{YYLMAX} characters, which defaults to a fairly large value. You 1141can change the size by simply #define'ing @code{YYLMAX} to a different 1142value in the first section of your @code{flex} input. As mentioned 1143above, with @code{%pointer} yytext grows dynamically to accommodate 1144large tokens. While this means your @code{%pointer} scanner can 1145accommodate very large tokens (such as matching entire blocks of 1146comments), bear in mind that each time the scanner must resize 1147@code{yytext} it also must rescan the entire token from the beginning, 1148so matching such tokens can prove slow. @code{yytext} presently does 1149@emph{not} dynamically grow if a call to @code{unput()} results in too 1150much text being pushed back; instead, a run-time error results. 1151 1152@cindex %array, with C++ 1153Also note that you cannot use @code{%array} with C++ scanner classes 1154(@pxref{Cxx}). 1155 1156@node Actions, Generated Scanner, Matching, Top 1157@chapter Actions 1158 1159@cindex actions 1160Each pattern in a rule has a corresponding @dfn{action}, which can be 1161any arbitrary C statement. The pattern ends at the first non-escaped 1162whitespace character; the remainder of the line is its action. If the 1163action is empty, then when the pattern is matched the input token is 1164simply discarded. 
For example, here is the specification for a program 1165which deletes all occurrences of @samp{zap me} from its input: 1166 1167@cindex deleting lines from input 1168@example 1169@verbatim 1170 %% 1171 "zap me" 1172@end verbatim 1173@end example 1174 1175This example will copy all other characters in the input to the output 1176since they will be matched by the default rule. 1177 1178Here is a program which compresses multiple blanks and tabs down to a 1179single blank, and throws away whitespace found at the end of a line: 1180 1181@cindex whitespace, compressing 1182@cindex compressing whitespace 1183@example 1184@verbatim 1185 %% 1186 [ \t]+ putchar( ' ' ); 1187 [ \t]+$ /* ignore this token */ 1188@end verbatim 1189@end example 1190 1191@cindex %@{ and %@}, in Rules Section 1192@cindex actions, use of @{ and @} 1193@cindex actions, embedded C strings 1194@cindex C-strings, in actions 1195@cindex comments, in actions 1196If the action contains a @samp{@{}, then the action spans till the 1197balancing @samp{@}} is found, and the action may cross multiple lines. 1198@code{flex} knows about C strings and comments and won't be fooled by 1199braces found within them, but also allows actions to begin with 1200@samp{%@{} and will consider the action to be all the text up to the 1201next @samp{%@}} (regardless of ordinary braces inside the action). 1202 1203@cindex |, in actions 1204An action consisting solely of a vertical bar (@samp{|}) means ``same as the 1205action for the next rule''. See below for an illustration. 1206 1207Actions can include arbitrary C code, including @code{return} statements 1208to return a value to whatever routine called @code{yylex()}. Each time 1209@code{yylex()} is called it continues processing tokens from where it 1210last left off until it either reaches the end of the file or executes a 1211return. 
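
The calling convention described above can be sketched from the caller's side. The @code{yylex()} below is a hand-written stand-in, not generated code; the token codes @code{TOK_NUMBER} and @code{TOK_WORD}, the @code{cursor} variable and the @code{count_tokens()} helper are all hypothetical names used only for this illustration. What it shows is that each call resumes where the previous one stopped, and that a return value of 0 signals the end of the input.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical token codes; 0 is reserved for "end of input". */
enum { TOK_EOF = 0, TOK_NUMBER = 1, TOK_WORD = 2 };

/* Stand-in for the scanner's saved position in its input. */
static const char *cursor = "";

/* Hand-written stand-in for a generated yylex(): skips blanks, returns
 * one token code per call, and returns 0 once the input is exhausted. */
static int yylex(void)
{
    while (*cursor == ' ')
        ++cursor;

    if (*cursor == '\0')
        return TOK_EOF;

    if (isdigit((unsigned char) *cursor)) {
        while (isdigit((unsigned char) *cursor))
            ++cursor;
        return TOK_NUMBER;      /* like "return NUMBER;" in an action */
    }

    while (*cursor != '\0' && *cursor != ' ')
        ++cursor;
    return TOK_WORD;
}

/* Typical caller: keep calling until the scanner reports 0. */
static int count_tokens(const char *text)
{
    int n = 0;

    cursor = text;
    while (yylex() != TOK_EOF)
        ++n;
    return n;
}
```

A parser generated by @code{yacc} drives the scanner in essentially this way, one token per call.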
1212 1213@cindex yytext, modification of 1214Actions are free to modify @code{yytext} except for lengthening it 1215(adding characters to its end--these will overwrite later characters in 1216the input stream). This however does not apply when using @code{%array} 1217(@pxref{Matching}). In that case, @code{yytext} may be freely modified 1218in any way. 1219 1220@cindex yyleng, modification of 1221@cindex yymore, and yyleng 1222Actions are free to modify @code{yyleng} except they should not do so if 1223the action also includes use of @code{yymore()} (see below). 1224 1225@cindex preprocessor macros, for use in actions 1226There are a number of special directives which can be included within an 1227action: 1228 1229@table @code 1230@item ECHO 1231@cindex ECHO 1232copies yytext to the scanner's output. 1233 1234@item BEGIN 1235@cindex BEGIN 1236followed by the name of a start condition places the scanner in the 1237corresponding start condition (see below). 1238 1239@item REJECT 1240@cindex REJECT 1241directs the scanner to proceed on to the ``second best'' rule which 1242matched the input (or a prefix of the input). The rule is chosen as 1243described above in @ref{Matching}, and @code{yytext} and @code{yyleng} 1244set up appropriately. It may either be one which matched as much text 1245as the originally chosen rule but came later in the @code{flex} input 1246file, or one which matched less text. For example, the following will 1247both count the words in the input and call the routine @code{special()} 1248whenever @samp{frob} is seen: 1249 1250@example 1251@verbatim 1252 int word_count = 0; 1253 %% 1254 1255 frob special(); REJECT; 1256 [^ \t\n]+ ++word_count; 1257@end verbatim 1258@end example 1259 1260Without the @code{REJECT}, any occurrences of @samp{frob} in the input 1261would not be counted as words, since the scanner normally executes only 1262one action per token. 
Multiple uses of @code{REJECT} are allowed, each 1263one finding the next best choice to the currently active rule. For 1264example, when the following scanner scans the token @samp{abcd}, it will 1265write @samp{abcdabcaba} to the output: 1266 1267@cindex REJECT, calling multiple times 1268@cindex |, use of 1269@example 1270@verbatim 1271 %% 1272 a | 1273 ab | 1274 abc | 1275 abcd ECHO; REJECT; 1276 .|\n /* eat up any unmatched character */ 1277@end verbatim 1278@end example 1279 1280The first three rules share the fourth's action since they use the 1281special @samp{|} action. 1282 1283@code{REJECT} is a particularly expensive feature in terms of scanner 1284performance; if it is used in @emph{any} of the scanner's actions it 1285will slow down @emph{all} of the scanner's matching. Furthermore, 1286@code{REJECT} cannot be used with the @samp{-Cf} or @samp{-CF} options 1287(@pxref{Scanner Options}). 1288 1289Note also that unlike the other special actions, @code{REJECT} is a 1290@emph{branch}. Code immediately following it in the action will 1291@emph{not} be executed. 1292 1293@item yymore() 1294@cindex yymore() 1295tells the scanner that the next time it matches a rule, the 1296corresponding token should be @emph{appended} onto the current value of 1297@code{yytext} rather than replacing it. For example, given the input 1298@samp{mega-kludge} the following will write @samp{mega-mega-kludge} to 1299the output: 1300 1301@cindex yymore(), mega-kludge 1302@cindex yymore() to append token to previous token 1303@example 1304@verbatim 1305 %% 1306 mega- ECHO; yymore(); 1307 kludge ECHO; 1308@end verbatim 1309@end example 1310 1311First @samp{mega-} is matched and echoed to the output. Then @samp{kludge} 1312is matched, but the previous @samp{mega-} is still hanging around at the 1313beginning of 1314@code{yytext} 1315so the 1316@code{ECHO} 1317for the @samp{kludge} rule will actually write @samp{mega-kludge}. 
1318@end table 1319 1320@cindex yymore, performance penalty of 1321Two notes regarding use of @code{yymore()}. First, @code{yymore()} 1322depends on the value of @code{yyleng} correctly reflecting the size of 1323the current token, so you must not modify @code{yyleng} if you are using 1324@code{yymore()}. Second, the presence of @code{yymore()} in the 1325scanner's action entails a minor performance penalty in the scanner's 1326matching speed. 1327 1328@cindex yyless() 1329@code{yyless(n)} returns all but the first @code{n} characters of the 1330current token back to the input stream, where they will be rescanned 1331when the scanner looks for the next match. @code{yytext} and 1332@code{yyleng} are adjusted appropriately (e.g., @code{yyleng} will now 1333be equal to @code{n}). For example, on the input @samp{foobar} the 1334following will write out @samp{foobarbar}: 1335 1336@cindex yyless(), pushing back characters 1337@cindex pushing back characters with yyless 1338@example 1339@verbatim 1340 %% 1341 foobar ECHO; yyless(3); 1342 [a-z]+ ECHO; 1343@end verbatim 1344@end example 1345 1346An argument of 0 to @code{yyless()} will cause the entire current input 1347string to be scanned again. Unless you've changed how the scanner will 1348subsequently process its input (using @code{BEGIN}, for example), this 1349will result in an endless loop. 1350 1351Note that @code{yyless()} is a macro and can only be used in the flex 1352input file, not from other source files. 1353 1354@cindex unput() 1355@cindex pushing back characters with unput 1356@code{unput(c)} puts the character @code{c} back onto the input stream. 1357It will be the next character scanned. The following action will take 1358the current token and cause it to be rescanned enclosed in parentheses. 
1359 1360@cindex unput(), pushing back characters 1361@cindex pushing back characters with unput() 1362@example 1363@verbatim 1364 { 1365 int i; 1366 /* Copy yytext because unput() trashes yytext */ 1367 char *yycopy = strdup( yytext ); 1368 unput( ')' ); 1369 for ( i = yyleng - 1; i >= 0; --i ) 1370 unput( yycopy[i] ); 1371 unput( '(' ); 1372 free( yycopy ); 1373 } 1374@end verbatim 1375@end example 1376 1377Note that since each @code{unput()} puts the given character back at the 1378@emph{beginning} of the input stream, pushing back strings must be done 1379back-to-front. 1380 1381@cindex %pointer, and unput() 1382@cindex unput(), and %pointer 1383An important potential problem when using @code{unput()} is that if you 1384are using @code{%pointer} (the default), a call to @code{unput()} 1385@emph{destroys} the contents of @code{yytext}, starting with its 1386rightmost character and devouring one character to the left with each 1387call. If you need the value of @code{yytext} preserved after a call to 1388@code{unput()} (as in the above example), you must either first copy it 1389elsewhere, or build your scanner using @code{%array} instead 1390(@pxref{Matching}). 1391 1392@cindex pushing back EOF 1393@cindex EOF, pushing back 1394Finally, note that you cannot put back @samp{EOF} to attempt to mark the 1395input stream with an end-of-file. 1396 1397@cindex input() 1398@code{input()} reads the next character from the input stream. 
For example, the following is one way to eat up C comments:

@cindex comments, discarding
@cindex discarding C comments
@example
@verbatim
    %%
    "/*"        {
                int c;

                for ( ; ; )
                    {
                    while ( (c = input()) != '*' &&
                            c != EOF )
                        ;    /* eat up text of comment */

                    if ( c == '*' )
                        {
                        while ( (c = input()) == '*' )
                            ;
                        if ( c == '/' )
                            break;    /* found the end */
                        }

                    if ( c == EOF )
                        {
                        error( "EOF in comment" );
                        break;
                        }
                    }
                }
@end verbatim
@end example

@cindex input(), and C++
@cindex yyinput()
(Note that if the scanner is compiled using @code{C++}, then
@code{input()} is instead referred to as @b{yyinput()}, in order to
avoid a name clash with the @code{C++} stream by the name of
@code{input}.)

@cindex flushing the internal buffer
@cindex YY_FLUSH_BUFFER
@code{YY_FLUSH_BUFFER;} flushes the scanner's internal buffer so that
the next time the scanner attempts to match a token, it will first
refill the buffer using @code{YY_INPUT()} (@pxref{Generated Scanner}).
This action is a special case of the more general
@code{yy_flush_buffer()} function, described below (@pxref{Multiple
Input Buffers}).

@cindex yyterminate()
@cindex terminating with yyterminate()
@cindex exiting with yyterminate()
@cindex halting with yyterminate()
@code{yyterminate()} can be used in lieu of a return statement in an
action.  It terminates the scanner and returns a 0 to the scanner's
caller, indicating ``all done''.  By default, @code{yyterminate()} is
also called when an end-of-file is encountered.  It is a macro and may
be redefined.
1458 1459@node Generated Scanner, Start Conditions, Actions, Top 1460@chapter The Generated Scanner 1461 1462@cindex yylex(), in generated scanner 1463The output of @code{flex} is the file @file{lex.yy.c}, which contains 1464the scanning routine @code{yylex()}, a number of tables used by it for 1465matching tokens, and a number of auxiliary routines and macros. By 1466default, @code{yylex()} is declared as follows: 1467 1468@example 1469@verbatim 1470 int yylex() 1471 { 1472 ... various definitions and the actions in here ... 1473 } 1474@end verbatim 1475@end example 1476 1477@cindex yylex(), overriding 1478(If your environment supports function prototypes, then it will be 1479@code{int yylex( void )}.) This definition may be changed by defining 1480the @code{YY_DECL} macro. For example, you could use: 1481 1482@cindex yylex, overriding the prototype of 1483@example 1484@verbatim 1485 #define YY_DECL float lexscan( a, b ) float a, b; 1486@end verbatim 1487@end example 1488 1489to give the scanning routine the name @code{lexscan}, returning a float, 1490and taking two floats as arguments. Note that if you give arguments to 1491the scanning routine using a K&R-style/non-prototyped function 1492declaration, you must terminate the definition with a semi-colon (;). 1493 1494@code{flex} generates @samp{C99} function definitions by 1495default. However flex does have the ability to generate obsolete, er, 1496@samp{traditional}, function definitions. This is to support 1497bootstrapping gcc on old systems. Unfortunately, traditional 1498definitions prevent us from using any standard data types smaller than 1499int (such as short, char, or bool) as function arguments. For this 1500reason, future versions of @code{flex} may generate standard C99 code 1501only, leaving K&R-style functions to the historians. Currently, if you 1502do @strong{not} want @samp{C99} definitions, then you must use 1503@code{%option noansi-definitions}. 
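
For comparison with the K&R-style definition shown earlier, here is a sketch of the same @code{YY_DECL} written in prototype style. In a real scanner the @code{#define} would sit in a @samp{%@{ ... %@}} block in the definitions section and @code{flex} would supply the function body; the body below is only a stand-in so that the fragment compiles on its own. The assumption is that code calling @code{lexscan()} from another file provides its own matching declaration.

```c
/* Prototype-style replacement for the default "int yylex(void)".
 * No trailing semicolon is needed here, unlike the K&R form. */
#define YY_DECL float lexscan(float a, float b)

/* The macro can also be reused to declare the function for callers. */
YY_DECL;

/* Stand-in body: in a generated scanner, flex writes the scanning loop
 * and the rule actions at this point. */
YY_DECL
{
    return a + b;
}
```

Because the parameters are declared in a prototype, types narrower than @code{int} (or @code{float}, as here) are passed without the default argument promotions that K&R definitions imply.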
1504 1505@cindex stdin, default for yyin 1506@cindex yyin 1507Whenever @code{yylex()} is called, it scans tokens from the global input 1508file @file{yyin} (which defaults to stdin). It continues until it 1509either reaches an end-of-file (at which point it returns the value 0) or 1510one of its actions executes a @code{return} statement. 1511 1512@cindex EOF and yyrestart() 1513@cindex end-of-file, and yyrestart() 1514@cindex yyrestart() 1515If the scanner reaches an end-of-file, subsequent calls are undefined 1516unless either @file{yyin} is pointed at a new input file (in which case 1517scanning continues from that file), or @code{yyrestart()} is called. 1518@code{yyrestart()} takes one argument, a @code{FILE *} pointer (which 1519can be NULL, if you've set up @code{YY_INPUT} to scan from a source other 1520than @code{yyin}), and initializes @file{yyin} for scanning from that 1521file. Essentially there is no difference between just assigning 1522@file{yyin} to a new input file or using @code{yyrestart()} to do so; 1523the latter is available for compatibility with previous versions of 1524@code{flex}, and because it can be used to switch input files in the 1525middle of scanning. It can also be used to throw away the current input 1526buffer, by calling it with an argument of @file{yyin}; but it would be 1527better to use @code{YY_FLUSH_BUFFER} (@pxref{Actions}). Note that 1528@code{yyrestart()} does @emph{not} reset the start condition to 1529@code{INITIAL} (@pxref{Start Conditions}). 1530 1531@cindex RETURN, within actions 1532If @code{yylex()} stops scanning due to executing a @code{return} 1533statement in one of the actions, the scanner may then be called again 1534and it will resume scanning where it left off. 1535 1536@cindex YY_INPUT 1537By default (and for purposes of efficiency), the scanner uses 1538block-reads rather than simple @code{getc()} calls to read characters 1539from @file{yyin}. 
How the scanner gets its input can be controlled
by defining the @code{YY_INPUT} macro.  The calling sequence for
@code{YY_INPUT()} is @code{YY_INPUT(buf,result,max_size)}.  Its action
is to place up to @code{max_size} characters in the character array
@code{buf} and return in the integer variable @code{result} either the
number of characters read or the constant @code{YY_NULL} (0 on Unix
systems) to indicate @samp{EOF}.  The default @code{YY_INPUT} reads from
the global file-pointer @file{yyin}.

@cindex YY_INPUT, overriding
Here is a sample definition of @code{YY_INPUT} (in the definitions
section of the input file):

@example
@verbatim
    %{
    #define YY_INPUT(buf,result,max_size) \
        { \
        int c = getchar(); \
        result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
        }
    %}
@end verbatim
@end example

This definition will change the input processing to occur one character
at a time.

@cindex yywrap()
When the scanner receives an end-of-file indication from
@code{YY_INPUT}, it then checks the @code{yywrap()} function.  If
@code{yywrap()} returns false (zero), then it is assumed that the
function has gone ahead and set up @file{yyin} to point to another input
file, and scanning continues.  If it returns true (non-zero), then the
scanner terminates, returning 0 to its caller.  Note that in either
case, the start condition remains unchanged; it does @emph{not} revert
to @code{INITIAL}.

@cindex yywrap, default for
@cindex noyywrap, %option
@cindex %option noyywrap
If you do not supply your own version of @code{yywrap()}, then you must
either use @code{%option noyywrap} (in which case the scanner behaves as
though @code{yywrap()} returned 1), or you must link with @samp{-lfl} to
obtain the default version of the routine, which always returns 1.
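
As an illustration of a user-supplied @code{yywrap()}, the following sketch moves on to the next file in a list each time the scanner runs out of input. The names @code{pending_files} and @code{set_pending_files()} are hypothetical, and @code{yyin} is declared locally only so that the fragment stands on its own; in a real scanner it is defined by the generated @file{lex.yy.c}. The sketch assumes @code{%option noyywrap} is @emph{not} in effect.

```c
#include <stdio.h>

/* Stand-in for the scanner's global input file. */
FILE *yyin;

/* Hypothetical NULL-terminated list of files still to be scanned. */
static const char *const *pending_files;

void set_pending_files(const char *const *names)
{
    pending_files = names;
}

/* Called by the scanner at end-of-file.  Returns 0 if yyin now points
 * at more input, 1 if there is nothing left to scan. */
int yywrap(void)
{
    while (pending_files != NULL && *pending_files != NULL) {
        FILE *next = fopen(*pending_files++, "r");

        if (next != NULL) {
            if (yyin != NULL && yyin != stdin)
                fclose(yyin);   /* done with the previous file */
            yyin = next;
            return 0;           /* keep scanning */
        }
        /* This file could not be opened; try the next one. */
    }
    return 1;                   /* no more input */
}
```

Since the start condition is not reset at a file boundary, an action that needs each file to begin in @code{INITIAL} has to arrange that itself.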
1584 1585For scanning from in-memory buffers (e.g., scanning strings), see 1586@ref{Scanning Strings}. @xref{Multiple Input Buffers}. 1587 1588@cindex ECHO, and yyout 1589@cindex yyout 1590@cindex stdout, as default for yyout 1591The scanner writes its @code{ECHO} output to the @file{yyout} global 1592(default, @file{stdout}), which may be redefined by the user simply by 1593assigning it to some other @code{FILE} pointer. 1594 1595@node Start Conditions, Multiple Input Buffers, Generated Scanner, Top 1596@chapter Start Conditions 1597 1598@cindex start conditions 1599@code{flex} provides a mechanism for conditionally activating rules. 1600Any rule whose pattern is prefixed with @samp{<sc>} will only be active 1601when the scanner is in the @dfn{start condition} named @code{sc}. For 1602example, 1603 1604@example 1605@verbatim 1606 <STRING>[^"]* { /* eat up the string body ... */ 1607 ... 1608 } 1609@end verbatim 1610@end example 1611 1612will be active only when the scanner is in the @code{STRING} start 1613condition, and 1614 1615@cindex start conditions, multiple 1616@example 1617@verbatim 1618 <INITIAL,STRING,QUOTE>\. { /* handle an escape ... */ 1619 ... 1620 } 1621@end verbatim 1622@end example 1623 1624will be active only when the current start condition is either 1625@code{INITIAL}, @code{STRING}, or @code{QUOTE}. 1626 1627@cindex start conditions, inclusive v.s.@: exclusive 1628Start conditions are declared in the definitions (first) section of the 1629input using unindented lines beginning with either @samp{%s} or 1630@samp{%x} followed by a list of names. The former declares 1631@dfn{inclusive} start conditions, the latter @dfn{exclusive} start 1632conditions. A start condition is activated using the @code{BEGIN} 1633action. Until the next @code{BEGIN} action is executed, rules with the 1634given start condition will be active and rules with other start 1635conditions will be inactive. 
If the start condition is inclusive, then 1636rules with no start conditions at all will also be active. If it is 1637exclusive, then @emph{only} rules qualified with the start condition 1638will be active. A set of rules contingent on the same exclusive start 1639condition describe a scanner which is independent of any of the other 1640rules in the @code{flex} input. Because of this, exclusive start 1641conditions make it easy to specify ``mini-scanners'' which scan portions 1642of the input that are syntactically different from the rest (e.g., 1643comments). 1644 1645If the distinction between inclusive and exclusive start conditions 1646is still a little vague, here's a simple example illustrating the 1647connection between the two. The set of rules: 1648 1649@cindex start conditions, inclusive 1650@example 1651@verbatim 1652 %s example 1653 %% 1654 1655 <example>foo do_something(); 1656 1657 bar something_else(); 1658@end verbatim 1659@end example 1660 1661is equivalent to 1662 1663@cindex start conditions, exclusive 1664@example 1665@verbatim 1666 %x example 1667 %% 1668 1669 <example>foo do_something(); 1670 1671 <INITIAL,example>bar something_else(); 1672@end verbatim 1673@end example 1674 1675Without the @code{<INITIAL,example>} qualifier, the @code{bar} pattern in 1676the second example wouldn't be active (i.e., couldn't match) when in 1677start condition @code{example}. If we just used @code{<example>} to 1678qualify @code{bar}, though, then it would only be active in 1679@code{example} and not in @code{INITIAL}, while in the first example 1680it's active in both, because in the first example the @code{example} 1681start condition is an inclusive @code{(%s)} start condition. 1682 1683@cindex start conditions, special wildcard condition 1684Also note that the special start-condition specifier 1685@code{<*>} 1686matches every start condition. 
Thus, the above example could also 1687have been written: 1688 1689@cindex start conditions, use of wildcard condition (<*>) 1690@example 1691@verbatim 1692 %x example 1693 %% 1694 1695 <example>foo do_something(); 1696 1697 <*>bar something_else(); 1698@end verbatim 1699@end example 1700 1701The default rule (to @code{ECHO} any unmatched character) remains active 1702in start conditions. It is equivalent to: 1703 1704@cindex start conditions, behavior of default rule 1705@example 1706@verbatim 1707 <*>.|\n ECHO; 1708@end verbatim 1709@end example 1710 1711@cindex BEGIN, explanation 1712@findex BEGIN 1713@vindex INITIAL 1714@code{BEGIN(0)} returns to the original state where only the rules with 1715no start conditions are active. This state can also be referred to as 1716the start-condition @code{INITIAL}, so @code{BEGIN(INITIAL)} is 1717equivalent to @code{BEGIN(0)}. (The parentheses around the start 1718condition name are not required but are considered good style.) 1719 1720@code{BEGIN} actions can also be given as indented code at the beginning 1721of the rules section. For example, the following will cause the scanner 1722to enter the @code{SPECIAL} start condition whenever @code{yylex()} is 1723called and the global variable @code{enter_special} is true: 1724 1725@cindex start conditions, using BEGIN 1726@example 1727@verbatim 1728 int enter_special; 1729 1730 %x SPECIAL 1731 %% 1732 if ( enter_special ) 1733 BEGIN(SPECIAL); 1734 1735 <SPECIAL>blahblahblah 1736 ...more rules follow... 1737@end verbatim 1738@end example 1739 1740To illustrate the uses of start conditions, here is a scanner which 1741provides two different interpretations of a string like @samp{123.456}. 1742By default it will treat it as three tokens, the integer @samp{123}, a 1743dot (@samp{.}), and the integer @samp{456}. 
But if the string is
preceded earlier in the line by the string @samp{expect-floats} it will
treat it as a single token, the floating-point number @samp{123.456}:

@cindex start conditions, for different interpretations of same input
@example
@verbatim
    %{
    #include <stdlib.h>   /* atof(), atoi() */
    %}
    %s expect

    %%
    expect-floats        BEGIN(expect);

    <expect>[0-9]+"."[0-9]+      {
                printf( "found a float, = %f\n",
                        atof( yytext ) );
                }
    <expect>\n           {
                /* that's the end of the line, so
                 * we need another "expect-floats"
                 * before we'll recognize any more
                 * floats
                 */
                BEGIN(INITIAL);
                }

    [0-9]+      {
                printf( "found an integer, = %d\n",
                        atoi( yytext ) );
                }

    "."         printf( "found a dot\n" );
@end verbatim
@end example

@cindex comments, example of scanning C comments
Here is a scanner which recognizes (and discards) C comments while
maintaining a count of the current input line.

@cindex recognizing C comments
@example
@verbatim
    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\n]*        /* eat anything that's not a '*' */
    <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
    <comment>\n             ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);
@end verbatim
@end example

This scanner goes to a bit of trouble to match as much
text as possible with each rule.  In general, when attempting to write
a high-speed scanner, try to match as much as possible in each rule, as
it's a big win.

Note that start condition names are really integer values and
can be stored as such.
Thus, the above could be extended in the 1807following fashion: 1808 1809@cindex start conditions, integer values 1810@cindex using integer values of start condition names 1811@example 1812@verbatim 1813 %x comment foo 1814 %% 1815 int line_num = 1; 1816 int comment_caller; 1817 1818 "/*" { 1819 comment_caller = INITIAL; 1820 BEGIN(comment); 1821 } 1822 1823 ... 1824 1825 <foo>"/*" { 1826 comment_caller = foo; 1827 BEGIN(comment); 1828 } 1829 1830 <comment>[^*\n]* /* eat anything that's not a '*' */ 1831 <comment>"*"+[^*/\n]* /* eat up '*'s not followed by '/'s */ 1832 <comment>\n ++line_num; 1833 <comment>"*"+"/" BEGIN(comment_caller); 1834@end verbatim 1835@end example 1836 1837@cindex YY_START, example 1838Furthermore, you can access the current start condition using the 1839integer-valued @code{YY_START} macro. For example, the above 1840assignments to @code{comment_caller} could instead be written 1841 1842@cindex getting current start state with YY_START 1843@example 1844@verbatim 1845 comment_caller = YY_START; 1846@end verbatim 1847@end example 1848 1849@vindex YY_START 1850Flex provides @code{YYSTATE} as an alias for @code{YY_START} (since that 1851is what's used by AT&T @code{lex}). 1852 1853For historical reasons, start conditions do not have their own 1854name-space within the generated scanner. The start condition names are 1855unmodified in the generated scanner and generated header. 1856@xref{option-header}. @xref{option-prefix}. 

Finally, here's an example of how to match C-style quoted strings using
exclusive start conditions, including expanded escape sequences (but
not including checking for a string that's too long):

@cindex matching C-style double-quoted strings
@example
@verbatim
    %x str

    %%
            char string_buf[MAX_STR_CONST];
            char *string_buf_ptr;


    \"      string_buf_ptr = string_buf; BEGIN(str);

    <str>\"        { /* saw closing quote - all done */
            BEGIN(INITIAL);
            *string_buf_ptr = '\0';
            /* return string constant token type and
             * value to parser
             */
            }

    <str>\n        {
            /* error - unterminated string constant */
            /* generate error message */
            }

    <str>\\[0-7]{1,3} {
            /* octal escape sequence */
            unsigned int result;

            (void) sscanf( yytext + 1, "%o", &result );

            if ( result > 0xff )
                    {
                    /* error, constant is out-of-bounds */
                    }
            else
                    *string_buf_ptr++ = (char) result;
            }

    <str>\\[0-9]+ {
            /* generate error - bad escape sequence; something
             * like '\48' or '\0777777'
             */
            }

    <str>\\n  *string_buf_ptr++ = '\n';
    <str>\\t  *string_buf_ptr++ = '\t';
    <str>\\r  *string_buf_ptr++ = '\r';
    <str>\\b  *string_buf_ptr++ = '\b';
    <str>\\f  *string_buf_ptr++ = '\f';

    <str>\\(.|\n)  *string_buf_ptr++ = yytext[1];

    <str>[^\\\n\"]+        {
            char *yptr = yytext;

            while ( *yptr )
                    *string_buf_ptr++ = *yptr++;
            }
@end verbatim
@end example

@cindex start condition, applying to multiple patterns
Often, such as in some of the examples above, you wind up writing a
whole bunch of rules all preceded by the same start condition(s).  Flex
makes this a little easier and cleaner by introducing a notion of start
condition @dfn{scope}.
A start condition scope is begun with:

@example
@verbatim
    <SCs>{
@end verbatim
@end example

where @code{<SCs>} is a list of one or more start conditions.  Inside the
start condition scope, every rule automatically has the prefix
@code{<SCs>} applied to it, until a @samp{@}} which matches the initial
@samp{@{}.  So, for example,

@cindex extended scope of start conditions
@example
@verbatim
    <ESC>{
        "\\n"   return '\n';
        "\\r"   return '\r';
        "\\f"   return '\f';
        "\\0"   return '\0';
    }
@end verbatim
@end example

is equivalent to:

@example
@verbatim
    <ESC>"\\n"  return '\n';
    <ESC>"\\r"  return '\r';
    <ESC>"\\f"  return '\f';
    <ESC>"\\0"  return '\0';
@end verbatim
@end example

Start condition scopes may be nested.

@cindex stacks, routines for manipulating
@cindex start conditions, use of a stack

The following routines are available for manipulating stacks of start conditions:

@deftypefun void yy_push_state ( int @code{new_state} )
pushes the current start condition onto the top of the start condition
stack and switches to
@code{new_state}
as though you had used
@code{BEGIN new_state}
(recall that start condition names are also integers).
@end deftypefun

@deftypefun void yy_pop_state ()
pops the top of the stack and switches to it via
@code{BEGIN}.
@end deftypefun

@deftypefun int yy_top_state ()
returns the top of the stack without altering the stack's contents.
@end deftypefun

@cindex memory, for start condition stacks
The start condition stack grows dynamically and so has no built-in size
limitation.  If memory is exhausted, program execution aborts.

To use start condition stacks, your scanner must include a @code{%option
stack} directive (@pxref{Scanner Options}).
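
As a rough sketch of how these routines fit together, the scanner below
lets a comment begin either at top level or inside a brace-delimited block,
and returns to whichever condition was active once the comment ends.  The
condition names @code{block} and @code{comment} are made up for this
illustration.  Since @code{yy_top_state()} is not called, the sketch also
unsets it with @code{noyy_top_state} so that the unused routine is not
generated:

@example
@verbatim
    %option stack noyywrap noyy_top_state

    %x block comment

    %%
    <INITIAL,block>"{"      yy_push_state(block);
    <block>"}"              yy_pop_state();

    <INITIAL,block>"/*"     yy_push_state(comment);
    <comment>"*/"           yy_pop_state();
    <comment>.|\n           /* discard the comment body */

    <INITIAL,block>.|\n     ECHO;

    %%
    int main( void )
        {
        yylex();
        return 0;
        }
@end verbatim
@end example

Compared with saving @code{YY_START} in a variable, the stack needs no
extra bookkeeping when conditions nest to an arbitrary depth.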

@node Multiple Input Buffers, EOF, Start Conditions, Top
@chapter Multiple Input Buffers

@cindex multiple input streams
Some scanners (such as those which support ``include'' files) require
reading from several input streams.  As @code{flex} scanners do a large
amount of buffering, one cannot control where the next input will be
read from by simply writing a @code{YY_INPUT()} which is sensitive to
the scanning context.  @code{YY_INPUT()} is only called when the scanner
reaches the end of its buffer, which may be a long time after scanning a
statement such as an @code{include} statement which requires switching
the input source.

To negotiate these sorts of problems, @code{flex} provides a mechanism
for creating and switching between multiple input buffers.  An input
buffer is created by using:

@cindex memory, allocating input buffers
@deftypefun YY_BUFFER_STATE yy_create_buffer ( FILE *file, int size )
@end deftypefun

which takes a @code{FILE} pointer and a size and creates a buffer
associated with the given file and large enough to hold @code{size}
characters (when in doubt, use @code{YY_BUF_SIZE} for the size).  It
returns a @code{YY_BUFFER_STATE} handle, which may then be passed to
other routines (see below).
@tindex YY_BUFFER_STATE
The @code{YY_BUFFER_STATE} type is a
pointer to an opaque @code{struct yy_buffer_state} structure, so you may
safely initialize @code{YY_BUFFER_STATE} variables to @code{((YY_BUFFER_STATE)
0)} if you wish, and also refer to the opaque structure in order to
correctly declare input buffers in source files other than that of your
scanner.  Note that the @code{FILE} pointer in the call to
@code{yy_create_buffer} is only used as the value of @file{yyin} seen by
@code{YY_INPUT}.
If you redefine @code{YY_INPUT()} so it no longer uses
@file{yyin}, then you can safely pass a NULL @code{FILE} pointer to
@code{yy_create_buffer}.  You select a particular buffer to scan from
using:

@deftypefun void yy_switch_to_buffer ( YY_BUFFER_STATE new_buffer )
@end deftypefun

The above function switches the scanner's input buffer so subsequent tokens
will come from @code{new_buffer}.  Note that @code{yy_switch_to_buffer()} may
be used by @code{yywrap()} to set things up for continued scanning, instead of
opening a new file and pointing @file{yyin} at it.  If you are looking for a
stack of input buffers, then you want to use @code{yypush_buffer_state()}
instead of this function.  Note also that switching input sources via either
@code{yy_switch_to_buffer()} or @code{yywrap()} does @emph{not} change the
start condition.

@cindex memory, deleting input buffers
@deftypefun void yy_delete_buffer ( YY_BUFFER_STATE buffer )
@end deftypefun

is used to reclaim the storage associated with a buffer.  (@code{buffer}
can be NULL, in which case the routine does nothing.)  To clear the
current contents of a buffer without deleting it, use
@code{yy_flush_buffer()}, described below.

@cindex pushing an input buffer
@cindex stack, input buffer push
@deftypefun void yypush_buffer_state ( YY_BUFFER_STATE buffer )
@end deftypefun

This function pushes the new buffer state onto an internal stack.  The pushed
state becomes the new current state.  The stack is maintained by flex and will
grow as required.  This function is intended to be used instead of
@code{yy_switch_to_buffer}, when you want to change states, but preserve the
current state for later use.
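
The remark about calling @code{yy_switch_to_buffer()} from @code{yywrap()}
can be sketched as follows.  This is only an illustration under stated
assumptions: the @code{next_file} list and the way @code{main()} fills it
from the command line are hypothetical, and error handling is kept to a
minimum.

@example
@verbatim
    %{
    #include <stdio.h>

    /* Hypothetical NULL-terminated list of file names still to scan. */
    static char **next_file;
    %}

    %%
    .|\n        ECHO;

    %%
    int yywrap( void )
        {
        FILE *fp;

        if ( ! next_file || ! *next_file )
            return 1;               /* nothing left to scan */

        fp = fopen( *next_file++, "r" );
        if ( ! fp )
            return 1;               /* give up on the first failure */

        /* Release the exhausted buffer and its file, then scan the new one. */
        yy_delete_buffer( YY_CURRENT_BUFFER );
        if ( yyin && yyin != stdin )
            fclose( yyin );
        yyin = fp;
        yy_switch_to_buffer( yy_create_buffer( yyin, YY_BUF_SIZE ) );

        return 0;                   /* tell the scanner to keep going */
        }

    int main( int argc, char **argv )
        {
        (void) argc;
        next_file = argv + 1;       /* argv is NULL-terminated */

        if ( *next_file && ! (yyin = fopen( *next_file++, "r" )) )
            {
            perror( "fopen" );
            return 1;
            }

        yylex();                    /* reads stdin when no file was named */
        return 0;
        }
@end verbatim
@end example

As noted above, the start condition in effect at the end of one file
carries over to the next.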

@cindex popping an input buffer
@cindex stack, input buffer pop
@deftypefun void yypop_buffer_state ( )
@end deftypefun

This function removes the current state from the top of the stack, and deletes
it by calling @code{yy_delete_buffer}.  The next state on the stack, if any,
becomes the new current state.

@cindex clearing an input buffer
@cindex flushing an input buffer
@deftypefun void yy_flush_buffer ( YY_BUFFER_STATE buffer )
@end deftypefun

This function discards the buffer's contents,
so the next time the scanner attempts to match a token from the
buffer, it will first fill the buffer anew using
@code{YY_INPUT()}.

@deftypefun YY_BUFFER_STATE yy_new_buffer ( FILE *file, int size )
@end deftypefun

is an alias for @code{yy_create_buffer()},
provided for compatibility with the C++ use of @code{new} and
@code{delete} for creating and destroying dynamic objects.

@cindex YY_CURRENT_BUFFER, and multiple buffers
Finally, the
@code{YY_CURRENT_BUFFER} macro returns a @code{YY_BUFFER_STATE} handle to the
current buffer.  It should not be used as an lvalue.

@cindex EOF, example using multiple input buffers
Here are two examples of using these features for writing a scanner
which expands include files (the
@code{<<EOF>>}
feature is discussed below).

This first example uses @code{yypush_buffer_state} and
@code{yypop_buffer_state}.  Flex maintains the stack internally.

@cindex handling include files with multiple input buffers
@example
@verbatim
    /* the "incl" state is used for picking up the name
     * of an include file
     */
    %x incl
    %%
    include             BEGIN(incl);

    [a-z]+              ECHO;
    [^a-z\n]*\n?        ECHO;

    <incl>[ \t]*      /* eat the whitespace */
    <incl>[^ \t\n]+   { /* got the include file name */
            yyin = fopen( yytext, "r" );

            if ( ! yyin )
                error( ...
                     );

            yypush_buffer_state(yy_create_buffer( yyin, YY_BUF_SIZE ));

            BEGIN(INITIAL);
            }

    <<EOF>> {
            yypop_buffer_state();

            if ( !YY_CURRENT_BUFFER )
                {
                yyterminate();
                }
            }
@end verbatim
@end example

The second example, below, does the same thing as the previous example did, but
manages its own input buffer stack manually (instead of letting flex do it).

@cindex handling include files with multiple input buffers
@example
@verbatim
    /* the "incl" state is used for picking up the name
     * of an include file
     */
    %x incl

    %{
    #define MAX_INCLUDE_DEPTH 10
    YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
    int include_stack_ptr = 0;
    %}

    %%
    include             BEGIN(incl);

    [a-z]+              ECHO;
    [^a-z\n]*\n?        ECHO;

    <incl>[ \t]*      /* eat the whitespace */
    <incl>[^ \t\n]+   { /* got the include file name */
            if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
                {
                fprintf( stderr, "Includes nested too deeply\n" );
                exit( 1 );
                }

            include_stack[include_stack_ptr++] =
                YY_CURRENT_BUFFER;

            yyin = fopen( yytext, "r" );

            if ( ! yyin )
                error( ... );

            yy_switch_to_buffer(
                yy_create_buffer( yyin, YY_BUF_SIZE ) );

            BEGIN(INITIAL);
            }

    <<EOF>> {
            if ( --include_stack_ptr < 0 )
                {
                yyterminate();
                }

            else
                {
                yy_delete_buffer( YY_CURRENT_BUFFER );
                yy_switch_to_buffer(
                     include_stack[include_stack_ptr] );
                }
            }
@end verbatim
@end example

@anchor{Scanning Strings}
@cindex strings, scanning strings instead of files
The following routines are available for setting up input buffers for
scanning in-memory strings instead of files.
All of them create a new
input buffer for scanning the string, and return a corresponding
@code{YY_BUFFER_STATE} handle (which you should delete with
@code{yy_delete_buffer()} when done with it).  They also switch to the
new buffer using @code{yy_switch_to_buffer()}, so the next call to
@code{yylex()} will start scanning the string.

@deftypefun YY_BUFFER_STATE yy_scan_string ( const char *str )
scans a NUL-terminated string.
@end deftypefun

@deftypefun YY_BUFFER_STATE yy_scan_bytes ( const char *bytes, int len )
scans @code{len} bytes (including possibly @code{NUL}s) starting at location
@code{bytes}.
@end deftypefun

Note that both of these functions create and scan a @emph{copy} of the
string or bytes.  (This may be desirable, since @code{yylex()} modifies
the contents of the buffer it is scanning.)  You can avoid the copy by
using:

@vindex YY_END_OF_BUFFER_CHAR
@deftypefun YY_BUFFER_STATE yy_scan_buffer (char *base, yy_size_t size)
which scans in place the buffer starting at @code{base}, consisting of
@code{size} bytes, the last two bytes of which @emph{must} be
@code{YY_END_OF_BUFFER_CHAR} (ASCII NUL).  These last two bytes are not
scanned; thus, scanning consists of @code{base[0]} through
@code{base[size-2]}, inclusive.
@end deftypefun

If you fail to set up @code{base} in this manner (i.e., forget the final
two @code{YY_END_OF_BUFFER_CHAR} bytes), then @code{yy_scan_buffer()}
returns a NULL pointer instead of creating a new input buffer.

@deftp {Data type} yy_size_t
is an integral type to which you can cast an integer expression
reflecting the size of the buffer.
@end deftp

@node EOF, Misc Macros, Multiple Input Buffers, Top
@chapter End-of-File Rules

@cindex EOF, explanation
The special rule @code{<<EOF>>} indicates
actions which are to be taken when an end-of-file is
encountered and @code{yywrap()} returns non-zero (i.e., indicates
no further files to process).  The action must finish
by doing one of the following things:

@itemize
@item
@findex YY_NEW_FILE (now obsolete)
assigning @file{yyin} to a new input file (in previous versions of
@code{flex}, after doing the assignment you had to call the special
action @code{YY_NEW_FILE}.  This is no longer necessary.)

@item
executing a @code{return} statement;

@item
executing the special @code{yyterminate()} action;

@item
or, switching to a new buffer using @code{yy_switch_to_buffer()} as
shown in the example above.
@end itemize

@code{<<EOF>>} rules may not be used with other patterns; they may only be
qualified with a list of start conditions.  If an unqualified @code{<<EOF>>}
rule is given, it applies to @emph{all} start conditions which do not
already have @code{<<EOF>>} actions.  To specify an @code{<<EOF>>} rule for
only the initial start condition, use:

@example
@verbatim
    <INITIAL><<EOF>>
@end verbatim
@end example

These rules are useful for catching things like unclosed comments.  An
example:

@cindex <<EOF>>, use of
@example
@verbatim
    %x quote
    %%

    ...other rules for dealing with quotes...

    <quote><<EOF>>   {
             error( "unterminated quote" );
             yyterminate();
             }
    <<EOF>>  {
             if ( *++filelist )
                 yyin = fopen( *filelist, "r" );
             else
                 yyterminate();
             }
@end verbatim
@end example

@node Misc Macros, User Values, EOF, Top
@chapter Miscellaneous Macros

@hkindex YY_USER_ACTION
The macro @code{YY_USER_ACTION} can be defined to provide an action
which is always executed prior to the matched rule's action.  For
example, it could be @code{#define}'d to call a routine to convert
@code{yytext} to lower-case.  When @code{YY_USER_ACTION} is invoked, the
variable @code{yy_act} gives the number of the matched rule (rules are
numbered starting with 1).  Suppose you want to profile how often each of
your rules is matched.  The following would do the trick (note the
trailing semicolon, since the macro is expanded immediately in front of
the rule's action):

@cindex YY_USER_ACTION to track each time a rule is matched
@example
@verbatim
    #define YY_USER_ACTION ++ctr[yy_act];
@end verbatim
@end example

@vindex YY_NUM_RULES
where @code{ctr} is an array to hold the counts for the different rules.
Note that the macro @code{YY_NUM_RULES} gives the total number of rules
(including the default rule), even if you use @samp{-s}.  Because rules
are numbered starting with 1, an array indexed directly by @code{yy_act}
needs one extra element, so a safe declaration for @code{ctr} is:

@example
@verbatim
    int ctr[YY_NUM_RULES + 1];
@end verbatim
@end example

@hkindex YY_USER_INIT
The macro @code{YY_USER_INIT} may be defined to provide an action which
is always executed before the first scan (and before the scanner's
internal initializations are done).  For example, it could be used to
call a routine to read in a data table or open a logging file.

@findex yy_set_interactive
The macro @code{yy_set_interactive(is_interactive)} can be used to
control whether the current buffer is considered @dfn{interactive}.
An
interactive buffer is processed more slowly, but must be used when the
scanner's input source is indeed interactive to avoid problems due to
waiting to fill buffers (see the discussion of the @samp{-I} flag in
@ref{Scanner Options}).  A non-zero value in the macro invocation marks
the buffer as interactive, a zero value as non-interactive.  Note that
use of this macro overrides @code{%option always-interactive} or
@code{%option never-interactive} (@pxref{Scanner Options}).
@code{yy_set_interactive()} must be invoked prior to beginning to scan
the buffer that is (or is not) to be considered interactive.

@cindex BOL, setting it
@findex yy_set_bol
The macro @code{yy_set_bol(at_bol)} can be used to control whether the
current buffer's scanning context for the next token match is done as
though at the beginning of a line.  A non-zero macro argument makes
rules anchored with @samp{^} active, while a zero argument makes
@samp{^} rules inactive.

@cindex BOL, checking the BOL flag
@findex YY_AT_BOL
The macro @code{YY_AT_BOL()} returns true if the next token scanned from
the current buffer will have @samp{^} rules active, false otherwise.

@cindex actions, redefining YY_BREAK
@hkindex YY_BREAK
In the generated scanner, the actions are all gathered in one large
switch statement and separated using @code{YY_BREAK}, which may be
redefined.  By default, it is simply a @code{break}, to separate each
rule's action from the following rule's.  Redefining @code{YY_BREAK}
allows, for example, C++ users to @code{#define YY_BREAK} to do nothing
(while being very careful that every rule ends with a @code{break} or a
@code{return}!).  This avoids unreachable-statement warnings for rules
whose action ends with @code{return}, which makes the @code{YY_BREAK}
that follows it unreachable.
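
To tie the @code{YY_USER_ACTION} and @code{YY_NUM_RULES} macros from this
chapter together, here is a minimal sketch of a scanner that counts how
often each rule matches and prints the totals at the end.  The three
patterns are arbitrary, and the sketch assumes that @code{YY_NUM_RULES} is
already defined at the point where the definitions-section code is copied
into the generated scanner:

@example
@verbatim
    %option noyywrap

    %{
    #include <stdio.h>

    /* Rule numbers start at 1, so slot 0 is unused and one
     * extra element is reserved for the highest rule number.
     */
    static unsigned long ctr[YY_NUM_RULES + 1];

    #define YY_USER_ACTION ++ctr[yy_act];
    %}

    %%
    [0-9]+          /* rule 1: integers */
    [a-zA-Z]+       /* rule 2: words */
    .|\n            /* rule 3: everything else */

    %%
    int main( void )
        {
        int i;

        yylex();

        for ( i = 1; i <= YY_NUM_RULES; ++i )
            printf( "rule %d matched %lu times\n", i, ctr[i] );

        return 0;
        }
@end verbatim
@end example

The last line of output refers to the default rule, which never matches
here because rule 3 already accepts every character.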

@node User Values, Yacc, Misc Macros, Top
@chapter Values Available To the User

This chapter summarizes the various values available to the user in the
rule actions.

@table @code
@vindex yytext
@item char *yytext
holds the text of the current token.  It may be modified but not
lengthened (you cannot append characters to the end).

@cindex yytext, default array size
@cindex array, default size for yytext
@vindex YYLMAX
If the special directive @code{%array} appears in the first section of
the scanner description, then @code{yytext} is instead declared
@code{char yytext[YYLMAX]}, where @code{YYLMAX} is a macro definition
that you can redefine in the first section if you don't like the default
value (generally 8KB).  Using @code{%array} results in somewhat slower
scanners, but the value of @code{yytext} becomes immune to calls to
@code{unput()}, which potentially destroy its value when @code{yytext} is
a character pointer.  The opposite of @code{%array} is @code{%pointer},
which is the default.

@cindex C++ and %array
You cannot use @code{%array} when generating C++ scanner classes (the
@samp{-+} flag).

@vindex yyleng
@item int yyleng
holds the length of the current token.

@vindex yyin
@item FILE *yyin
is the file which by default @code{flex} reads from.  It may be
redefined but doing so only makes sense before scanning begins or after
an EOF has been encountered.  Changing it in the midst of scanning will
have unexpected results since @code{flex} buffers its input; use
@code{yyrestart()} instead.  Once scanning terminates because an
end-of-file has been seen, you can point @file{yyin} at the new input
file and then call the scanner again to continue scanning.

@findex yyrestart
@item void yyrestart( FILE *new_file )
may be called to point @file{yyin} at the new input file.
The
switch-over to the new file is immediate (any previously buffered-up
input is lost).  Note that calling @code{yyrestart()} with @file{yyin}
as an argument thus throws away the current input buffer and continues
scanning the same input file.

@vindex yyout
@item FILE *yyout
is the file to which @code{ECHO} actions are done.  It can be reassigned
by the user.

@vindex YY_CURRENT_BUFFER
@item YY_CURRENT_BUFFER
returns a @code{YY_BUFFER_STATE} handle to the current buffer.

@vindex YY_START
@item YY_START
returns an integer value corresponding to the current start condition.
You can subsequently use this value with @code{BEGIN} to return to that
start condition.
@end table

@node Yacc, Scanner Options, User Values, Top
@chapter Interfacing with Yacc

@cindex yacc, interface

@vindex yylval, with yacc
One of the main uses of @code{flex} is as a companion to the @code{yacc}
parser-generator.  @code{yacc} parsers expect to call a routine named
@code{yylex()} to find the next input token.  The routine is supposed to
return the type of the next token as well as putting any associated
value in the global @code{yylval}.  To use @code{flex} with @code{yacc},
one specifies the @samp{-d} option to @code{yacc} to instruct it to
generate the file @file{y.tab.h} containing definitions of all the
@code{%tokens} appearing in the @code{yacc} input.  This file is then
included in the @code{flex} scanner.
For example, if one of the tokens
is @code{TOK_NUMBER}, part of the scanner might look like:

@cindex yacc interface
@example
@verbatim
    %{
    #include "y.tab.h"
    %}

    %%

    [0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;
@end verbatim
@end example

@node Scanner Options, Performance, Yacc, Top
@chapter Scanner Options

@cindex command-line options
@cindex options, command-line
@cindex arguments, command-line

The various @code{flex} options are categorized by function in the following
menu.  To look up a particular option by name, see @ref{Index of Scanner Options}.

@menu
* Options for Specifying Filenames::
* Options Affecting Scanner Behavior::
* Code-Level And API Options::
* Options for Scanner Speed and Size::
* Debugging Options::
* Miscellaneous Options::
@end menu

Even though there are many scanner options, a typical scanner might only
specify the following options:

@example
@verbatim
%option   8bit reentrant bison-bridge
%option   warn nodefault
%option   yylineno
%option   outfile="scanner.c" header-file="scanner.h"
@end verbatim
@end example

The first line specifies the general type of scanner we want.  The second line
specifies that we are being careful.  The third line asks flex to track line
numbers.  The last line tells flex what to name the files.  (The options can be
specified in any order.  We just divided them.)

@code{flex} also provides a mechanism for controlling options within the
scanner specification itself, rather than from the flex command-line.
This is done by including @code{%option} directives in the first section
of the scanner specification.  You can specify multiple options with a
single @code{%option} directive, and multiple directives in the first
section of your flex input file.

Most options are given simply as names, optionally preceded by the
word @samp{no} (with no intervening whitespace) to negate their meaning.
The names are the same as their long-option equivalents (but without the
leading @samp{--}).

@code{flex} scans your rule actions to determine whether you use the
@code{REJECT} or @code{yymore()} features.  The @code{reject} and
@code{yymore} options are available to override its decision as to
whether you use the features, either by setting them (e.g., @code{%option
reject}) to indicate the feature is indeed used, or unsetting them to
indicate it actually is not used (e.g., @code{%option noyymore}).


A number of options are available for lint purists who want to suppress
the appearance of unneeded routines in the generated scanner.  Each of
the following, if unset (e.g., @code{%option nounput}), results in the
corresponding routine not appearing in the generated scanner:

@example
@verbatim
    input, unput
    yy_push_state, yy_pop_state, yy_top_state
    yy_scan_buffer, yy_scan_bytes, yy_scan_string

    yyget_extra, yyset_extra, yyget_leng, yyget_text,
    yyget_lineno, yyset_lineno, yyget_in, yyset_in,
    yyget_out, yyset_out, yyget_lval, yyset_lval,
    yyget_lloc, yyset_lloc, yyget_debug, yyset_debug
@end verbatim
@end example

(though @code{yy_push_state()} and friends won't appear anyway unless
you use @code{%option stack}).

@node Options for Specifying Filenames, Options Affecting Scanner Behavior, Scanner Options, Scanner Options
@section Options for Specifying Filenames

@table @samp

@anchor{option-header}
@opindex ---header-file
@opindex header-file
@item --header-file=FILE, @code{%option header-file="FILE"}
instructs flex to write a C header to @file{FILE}.  This file contains
function prototypes, extern variables, and types used by the scanner.
Only the external API is exported by the header file.  Many macros that
are usable from within scanner actions are not exported to the header
file.  This is due to namespace problems and the goal of a clean
external API.

While in the header, the macro @code{yyIN_HEADER} is defined, where @samp{yy}
is substituted with the appropriate prefix.

The @samp{--header-file} option is not compatible with the @samp{--c++} option,
since the C++ scanner provides its own header in @file{yyFlexLexer.h}.



@anchor{option-outfile}
@opindex -o
@opindex ---outfile
@opindex outfile
@item -oFILE, --outfile=FILE, @code{%option outfile="FILE"}
directs flex to write the scanner to the file @file{FILE} instead of
@file{lex.yy.c}.  If you combine @samp{--outfile} with the @samp{--stdout} option,
then the scanner is written to @file{stdout} but its @code{#line}
directives (see the @samp{-L} option) refer to the file
@file{FILE}.



@anchor{option-stdout}
@opindex -t
@opindex ---stdout
@opindex stdout
@item -t, --stdout, @code{%option stdout}
instructs @code{flex} to write the scanner it generates to standard
output instead of @file{lex.yy.c}.



@opindex ---skel
@item -SFILE, --skel=FILE
overrides the default skeleton file from which
@code{flex}
constructs its scanners.  You'll never need this option unless you are doing
@code{flex}
maintenance or development.

@opindex ---tables-file
@opindex tables-file
@item --tables-file=FILE
Write serialized scanner DFA tables to FILE.  The generated scanner will not
contain the tables, and requires them to be loaded at runtime.
@xref{serialization}.

@opindex ---tables-verify
@opindex tables-verify
@item --tables-verify
This option is for flex development.
We document it here in case you stumble
upon it by accident or in case you suspect some inconsistency in the serialized
tables.  Flex will serialize the scanner DFA tables but will also generate the
in-code tables as it normally does.  At runtime, the scanner will verify that
the serialized tables match the in-code tables, instead of loading them.

@end table

@node Options Affecting Scanner Behavior, Code-Level And API Options, Options for Specifying Filenames, Scanner Options
@section Options Affecting Scanner Behavior

@table @samp
@anchor{option-case-insensitive}
@opindex -i
@opindex ---case-insensitive
@opindex case-insensitive
@item -i, --case-insensitive, @code{%option case-insensitive}
instructs @code{flex} to generate a @dfn{case-insensitive} scanner.  The
case of letters given in the @code{flex} input patterns will be ignored,
and tokens in the input will be matched regardless of case.  The matched
text given in @code{yytext} will have the preserved case (i.e., it will
not be folded).  For tricky behavior, see @ref{case and character ranges}.



@anchor{option-lex-compat}
@opindex -l
@opindex ---lex-compat
@opindex lex-compat
@item -l, --lex-compat, @code{%option lex-compat}
turns on maximum compatibility with the original AT&T @code{lex}
implementation.  Note that this does not mean @emph{full} compatibility.
Use of this option costs a considerable amount of performance, and it
cannot be used with the @samp{--c++}, @samp{--full}, @samp{--fast}, @samp{-Cf}, or
@samp{-CF} options.  For details on the compatibilities it provides, see
@ref{Lex and Posix}.  This option also results in the name
@code{YY_FLEX_LEX_COMPAT} being @code{#define}'d in the generated scanner.



@anchor{option-batch}
@opindex -B
@opindex ---batch
@opindex batch
@item -B, --batch, @code{%option batch}
instructs @code{flex} to generate a @dfn{batch} scanner, the opposite of
@emph{interactive} scanners generated by @samp{--interactive} (see below).  In
general, you use @samp{-B} when you are @emph{certain} that your scanner
will never be used interactively, and you want to squeeze a
@emph{little} more performance out of it.  If your goal is instead to
squeeze out a @emph{lot} more performance, you should be using the
@samp{-Cf} or @samp{-CF} options, which turn on @samp{--batch} automatically
anyway.



@anchor{option-interactive}
@opindex -I
@opindex ---interactive
@opindex interactive
@item -I, --interactive, @code{%option interactive}
instructs @code{flex} to generate an @i{interactive} scanner.  An
interactive scanner is one that only looks ahead to decide what token
has been matched if it absolutely must.  It turns out that always
looking one extra character ahead, even if the scanner has already seen
enough text to disambiguate the current token, is a bit faster than only
looking ahead when necessary.  But scanners that always look ahead give
dreadful interactive performance; for example, when a user types a
newline, it is not recognized as a newline token until they enter
@emph{another} token, which often means typing in another whole line.

@code{flex} scanners default to @code{interactive} unless you use the
@samp{-Cf} or @samp{-CF} table-compression options
(@pxref{Performance}).  That's because if you're looking for
high-performance you should be using one of these options, so if you
didn't, @code{flex} assumes you'd rather trade off a bit of run-time
performance for intuitive interactive behavior.
Note also that you
@emph{cannot} use @samp{--interactive} in conjunction with @samp{-Cf} or
@samp{-CF}.  Thus, this option is not really needed; it is on by default
for all those cases in which it is allowed.

You can force a scanner to
@emph{not}
be interactive by using
@samp{--batch}.



@anchor{option-7bit}
@opindex -7
@opindex ---7bit
@opindex 7bit
@item -7, --7bit, @code{%option 7bit}
instructs @code{flex} to generate a 7-bit scanner, i.e., one which can
only recognize 7-bit characters in its input.  The advantage of using
@samp{--7bit} is that the scanner's tables can be up to half the size of
those generated using the @samp{--8bit} option.  The disadvantage is that
such scanners often hang or crash if their input contains an 8-bit
character.

Note, however, that unless you generate your scanner using the
@samp{-Cf} or @samp{-CF} table compression options, use of @samp{--7bit}
will save only a small amount of table space, and make your scanner
considerably less portable.  @code{Flex}'s default behavior is to
generate an 8-bit scanner unless you use the @samp{-Cf} or @samp{-CF}
options, in which case @code{flex} defaults to generating 7-bit scanners
unless your site was always configured to generate 8-bit scanners (as will
often be the case with non-USA sites).  You can tell whether flex
generated a 7-bit or an 8-bit scanner by inspecting the flag summary in
the @samp{--verbose} output as described above.

Note that if you use @samp{-Cfe} or @samp{-CFe}, @code{flex} still
defaults to generating an 8-bit scanner, since usually with these
compression options full 8-bit tables are not much more expensive than
7-bit tables.



@anchor{option-8bit}
@opindex -8
@opindex ---8bit
@opindex 8bit
@item -8, --8bit, @code{%option 8bit}
instructs @code{flex} to generate an 8-bit scanner, i.e., one which can
recognize 8-bit characters.  This flag is only needed for scanners
generated using @samp{-Cf} or @samp{-CF}, as otherwise flex defaults to
generating an 8-bit scanner anyway.

See the discussion of
@samp{--7bit}
above for @code{flex}'s default behavior and the tradeoffs between 7-bit
and 8-bit scanners.



@anchor{option-default}
@opindex ---default
@opindex default
@item --default, @code{%option default}
generates the default rule.



@anchor{option-always-interactive}
@opindex ---always-interactive
@opindex always-interactive
@item --always-interactive, @code{%option always-interactive}
instructs flex to generate a scanner which always considers its input
@emph{interactive}.  Normally, on each new input file the scanner calls
@code{isatty()} in an attempt to determine whether the scanner's input
source is interactive and thus should be read a character at a time.
When this option is used, however, then no such call is made.



@opindex ---never-interactive
@item --never-interactive, @code{%option never-interactive}
instructs flex to generate a scanner which never considers its input
interactive.  This is the opposite of @code{always-interactive}.


@anchor{option-posix}
@opindex -X
@opindex ---posix
@opindex posix
@item -X, --posix, @code{%option posix}
turns on maximum compatibility with the POSIX 1003.2-1992 definition of
@code{lex}.  Since @code{flex} was originally designed to implement the
POSIX definition of @code{lex} this generally involves very few changes
in behavior.
At the time of writing, the known differences between
@code{flex} and the POSIX standard are:

@itemize
@item
In POSIX and AT&T @code{lex}, the repeat operator, @samp{@{@}}, has lower
precedence than concatenation (thus @samp{ab@{3@}} yields @samp{ababab}).
Most POSIX utilities use an Extended Regular Expression (ERE) precedence
that has the precedence of the repeat operator higher than concatenation
(which causes @samp{ab@{3@}} to yield @samp{abbb}). By default, @code{flex}
places the precedence of the repeat operator higher than concatenation,
which matches the ERE processing of other POSIX utilities. When either
@samp{--posix} or @samp{-l} is specified, @code{flex} will use the
traditional AT&T and POSIX-compliant precedence for the repeat operator
where concatenation has higher precedence than the repeat operator.
@end itemize

@anchor{option-stack}
@opindex ---stack
@opindex stack
@item --stack, @code{%option stack}
enables the use of
start condition stacks (@pxref{Start Conditions}).

@anchor{option-stdinit}
@opindex ---stdinit
@opindex stdinit
@item --stdinit, @code{%option stdinit}
if set (i.e., @b{%option stdinit}) initializes @code{yyin} and
@code{yyout} to @file{stdin} and @file{stdout}, instead of the default of
@file{NULL}. Some existing @code{lex} programs depend on this behavior,
even though it is not compliant with ANSI C, which does not require
@file{stdin} and @file{stdout} to be compile-time constant. In a
reentrant scanner, however, this is not a problem since initialization
is performed in @code{yylex_init} at runtime.

@anchor{option-yylineno}
@opindex ---yylineno
@opindex yylineno
@item --yylineno, @code{%option yylineno}
directs @code{flex} to generate a scanner
that maintains the number of the current line read from its input in the
global variable @code{yylineno}. This option is implied by @code{%option
lex-compat}. In a reentrant C scanner, the macro @code{yylineno} is
accessible regardless of the value of @code{%option yylineno}; however, its
value is not modified by @code{flex} unless @code{%option yylineno} is enabled.

@anchor{option-yywrap}
@opindex ---yywrap
@opindex yywrap
@item --yywrap, @code{%option yywrap}
if unset (i.e., @code{--noyywrap}), makes the scanner not call
@code{yywrap()} upon an end-of-file, but simply assume that there are no
more files to scan (until the user points @file{yyin} at a new file and
calls @code{yylex()} again).

@end table

@node Code-Level And API Options, Options for Scanner Speed and Size, Options Affecting Scanner Behavior, Scanner Options
@section Code-Level And API Options

@table @samp

@anchor{option-ansi-definitions}
@opindex ---option-ansi-definitions
@opindex ansi-definitions
@item --ansi-definitions, @code{%option ansi-definitions}
instructs flex to generate ANSI C99 definitions for functions.
This option is enabled by default.
If @code{%option noansi-definitions} is specified, then the obsolete style
is generated.

@anchor{option-ansi-prototypes}
@opindex ---option-ansi-prototypes
@opindex ansi-prototypes
@item --ansi-prototypes, @code{%option ansi-prototypes}
instructs flex to generate ANSI C99 prototypes for functions.
This option is enabled by default.
If @code{noansi-prototypes} is specified, then
prototypes will have empty parameter lists.

@anchor{option-bison-bridge}
@opindex ---bison-bridge
@opindex bison-bridge
@item --bison-bridge, @code{%option bison-bridge}
instructs flex to generate a C scanner that is
meant to be called by a
@code{GNU bison}
parser. The scanner has minor API changes for
@code{bison}
compatibility. In particular, the declaration of
@code{yylex}
is modified to take an additional parameter,
@code{yylval}.
@xref{Bison Bridge}.

@anchor{option-bison-locations}
@opindex ---bison-locations
@opindex bison-locations
@item --bison-locations, @code{%option bison-locations}
instructs flex that
@code{GNU bison} @code{%locations} are being used.
This means @code{yylex} will be passed
an additional parameter, @code{yylloc}. This option
implies @code{%option bison-bridge}.
@xref{Bison Bridge}.

@anchor{option-noline}
@opindex -L
@opindex ---noline
@opindex noline
@item -L, --noline, @code{%option noline}
instructs
@code{flex}
not to generate
@code{#line}
directives. Without this option,
@code{flex}
peppers the generated scanner
with @code{#line} directives so error messages in the actions will be correctly
located with respect to either the original
@code{flex}
input file (if the errors are due to code in the input file), or
@file{lex.yy.c}
(if the errors are
@code{flex}'s
fault; you should report these sorts of errors to the email address
given in @ref{Reporting Bugs}).

@anchor{option-reentrant}
@opindex -R
@opindex ---reentrant
@opindex reentrant
@item -R, --reentrant, @code{%option reentrant}
instructs flex to generate a reentrant C scanner. The generated scanner
may safely be used in a multi-threaded environment. The API for a
reentrant scanner is different from that of a non-reentrant scanner
(@pxref{Reentrant}).
Because of the API difference between
reentrant and non-reentrant @code{flex} scanners, non-reentrant flex
code must be modified before it is suitable for use with this option.
This option is not compatible with the @samp{--c++} option.

The option @samp{--reentrant} does not affect the performance of
the scanner.

@anchor{option-c++}
@opindex -+
@opindex ---c++
@opindex c++
@item -+, --c++, @code{%option c++}
specifies that you want flex to generate a C++
scanner class. @xref{Cxx}, for
details.

@anchor{option-array}
@opindex ---array
@opindex array
@item --array, @code{%option array}
specifies that you want @code{yytext} to be an array instead of a @code{char *}.

@anchor{option-pointer}
@opindex ---pointer
@opindex pointer
@item --pointer, @code{%option pointer}
specifies that @code{yytext} should be a @code{char *}, not an array.
This is the default.

@anchor{option-prefix}
@opindex -P
@opindex ---prefix
@opindex prefix
@item -PPREFIX, --prefix=PREFIX, @code{%option prefix="PREFIX"}
changes the default @samp{yy} prefix used by @code{flex} for all
globally-visible variable and function names to instead be
@samp{PREFIX}. For example, @samp{--prefix=foo} changes the name of
@code{yytext} to @code{footext}. It also changes the name of the default
output file from @file{lex.yy.c} to @file{lex.foo.c}.
Here is a partial
list of the names affected:

@example
@verbatim
    yy_create_buffer
    yy_delete_buffer
    yy_flex_debug
    yy_init_buffer
    yy_flush_buffer
    yy_load_buffer_state
    yy_switch_to_buffer
    yyin
    yyleng
    yylex
    yylineno
    yyout
    yyrestart
    yytext
    yywrap
    yyalloc
    yyrealloc
    yyfree
@end verbatim
@end example

(If you are using a C++ scanner, then only @code{yywrap} and
@code{yyFlexLexer} are affected.) Within your scanner itself, you can
still refer to the global variables and functions using either version
of their name; but externally, they have the modified name.

This option lets you easily link together multiple
@code{flex}
programs into the same executable. Note, though, that using this
option also renames
@code{yywrap()},
so you now
@emph{must}
either
provide your own (appropriately-named) version of the routine for your
scanner, or use
@code{%option noyywrap},
as linking with
@samp{-lfl}
no longer provides one for you by default.

@anchor{option-main}
@opindex ---main
@opindex main
@item --main, @code{%option main}
directs flex to provide a default @code{main()} program for the
scanner, which simply calls @code{yylex()}. This option implies
@code{noyywrap} (@pxref{option-yywrap}).

@anchor{option-nounistd}
@opindex ---nounistd
@opindex nounistd
@item --nounistd, @code{%option nounistd}
suppresses inclusion of the non-ANSI header file @file{unistd.h}. This option
is meant to target environments in which @file{unistd.h} does not exist. Be aware
that certain options may cause flex to generate code that relies on functions
normally found in @file{unistd.h} (e.g., @code{isatty()}, @code{read()}).
If you wish to use these functions, you will have to inform your compiler where
to find them.
@xref{option-always-interactive}. @xref{option-read}.

@anchor{option-yyclass}
@opindex ---yyclass
@opindex yyclass
@item --yyclass=NAME, @code{%option yyclass="NAME"}
only applies when generating a C++ scanner (the @samp{--c++} option). It
informs @code{flex} that you have derived @code{NAME} as a subclass of
@code{yyFlexLexer}, so @code{flex} will place your actions in the member
function @code{NAME::yylex()} instead of @code{yyFlexLexer::yylex()}. It
also generates a @code{yyFlexLexer::yylex()} member function that emits
a run-time error (by invoking @code{yyFlexLexer::LexerError()}) if
called. @xref{Cxx}.

@end table

@node Options for Scanner Speed and Size, Debugging Options, Code-Level And API Options, Scanner Options
@section Options for Scanner Speed and Size

@table @samp

@item -C[aefFmr]
controls the degree of table compression and, more generally, trade-offs
between small scanners and fast scanners.

@table @samp
@opindex -C
@item -C
A lone @samp{-C} specifies that the scanner tables should be compressed
but neither equivalence classes nor meta-equivalence classes should be
used.

@anchor{option-align}
@opindex -Ca
@opindex ---align
@opindex align
@item -Ca, --align, @code{%option align}
(``align'') instructs flex to trade off larger tables in the
generated scanner for faster performance because the elements of
the tables are better aligned for memory access and computation. On some
RISC architectures, fetching and manipulating longwords is more efficient
than with smaller-sized units such as shortwords. This option can
quadruple the size of the tables used by your scanner.

@anchor{option-ecs}
@opindex -Ce
@opindex ---ecs
@opindex ecs
@item -Ce, --ecs, @code{%option ecs}
directs @code{flex} to construct @dfn{equivalence classes}, i.e., sets
of characters which have identical lexical properties (for example, if
the only appearance of digits in the @code{flex} input is in the
character class ``[0-9]'' then the digits '0', '1', ..., '9' will all be
put in the same equivalence class). Equivalence classes usually give
dramatic reductions in the final table/object file sizes (typically a
factor of 2-5) and are pretty cheap performance-wise (one array look-up
per character scanned).

@opindex -Cf
@item -Cf
specifies that the @dfn{full} scanner tables should be generated;
@code{flex} should not compress the tables by taking advantage of
similar transition functions for different states.

@opindex -CF
@item -CF
specifies that the alternate fast scanner representation (described
below under the @samp{--fast} flag) should be used. This option cannot be
used with @samp{--c++}.

@anchor{option-meta-ecs}
@opindex -Cm
@opindex ---meta-ecs
@opindex meta-ecs
@item -Cm, --meta-ecs, @code{%option meta-ecs}
directs
@code{flex}
to construct
@dfn{meta-equivalence classes},
which are sets of equivalence classes (or characters, if equivalence
classes are not being used) that are commonly used together. Meta-equivalence
classes are often a big win when using compressed tables, but they
have a moderate performance impact (one or two @code{if} tests and one
array look-up per character scanned).

@anchor{option-read}
@opindex -Cr
@opindex ---read
@opindex read
@item -Cr, --read, @code{%option read}
causes the generated scanner to @emph{bypass} use of the standard I/O
library (@code{stdio}) for input.
Instead of calling @code{fread()} or
@code{getc()}, the scanner will use the @code{read()} system call,
resulting in a performance gain which varies from system to system, but
in general is probably negligible unless you are also using @samp{-Cf}
or @samp{-CF}. Using @samp{-Cr} can cause strange behavior if, for
example, you read from @file{yyin} using @code{stdio} prior to calling
the scanner (because the scanner will miss whatever text your previous
reads left in the @code{stdio} input buffer). @samp{-Cr} has no effect
if you define @code{YY_INPUT()} (@pxref{Generated Scanner}).
@end table

The options @samp{-Cf} or @samp{-CF} and @samp{-Cm} do not make sense
together: there is no opportunity for meta-equivalence classes if the
table is not being compressed. Otherwise the options may be freely
mixed, and are cumulative.

The default setting is @samp{-Cem}, which specifies that @code{flex}
should generate equivalence classes and meta-equivalence classes. This
setting provides the highest degree of table compression. You can obtain
faster-executing scanners at the cost of larger tables, with the
following generally being true:

@example
@verbatim
    slowest & smallest
          -Cem
          -Cm
          -Ce
          -C
          -C{f,F}e
          -C{f,F}
          -C{f,F}a
    fastest & largest
@end verbatim
@end example

Note that scanners with the smallest tables are usually generated and
compiled the quickest, so during development you will usually want to
use the default, maximal compression.

@samp{-Cfe} is often a good compromise between speed and size for
production scanners.

@anchor{option-full}
@opindex -f
@opindex ---full
@opindex full
@item -f, --full, @code{%option full}
specifies a
@dfn{fast scanner}.
No table compression is done and @code{stdio} is bypassed.
The result is large but fast.
This option is equivalent to
@samp{-Cfr}.

@anchor{option-fast}
@opindex -F
@opindex ---fast
@opindex fast
@item -F, --fast, @code{%option fast}
specifies that the @emph{fast} scanner table representation should be
used (and @code{stdio} bypassed). This representation is about as fast
as the full table representation @samp{--full}, and for some sets of
patterns will be considerably smaller (and for others, larger). In
general, if the pattern set contains both @emph{keywords} and a
catch-all, @emph{identifier} rule, such as in the set:

@example
@verbatim
    "case"    return TOK_CASE;
    "switch"  return TOK_SWITCH;
    ...
    "default" return TOK_DEFAULT;
    [a-z]+    return TOK_ID;
@end verbatim
@end example

then you're better off using the full table representation. If only
the @emph{identifier} rule is present and you then use a hash table or some such
to detect the keywords, you're better off using
@samp{--fast}.

This option is equivalent to @samp{-CFr}. It cannot be used
with @samp{--c++}.

@end table

@node Debugging Options, Miscellaneous Options, Options for Scanner Speed and Size, Scanner Options
@section Debugging Options

@table @samp

@anchor{option-backup}
@opindex -b
@opindex ---backup
@opindex backup
@item -b, --backup, @code{%option backup}
generates backing-up information to @file{lex.backup}. This is a list of
scanner states which require backing up and the input characters on
which they do so. By adding rules one can remove backing-up states. If
@emph{all} backing-up states are eliminated and @samp{-Cf} or @samp{-CF}
is used, the generated scanner will run faster (see the @samp{--perf-report} flag).
Only users who wish to squeeze every last cycle out of their scanners
need worry about this option (@pxref{Performance}).

@anchor{option-debug}
@opindex -d
@opindex ---debug
@opindex debug
@item -d, --debug, @code{%option debug}
makes the generated scanner run in @dfn{debug} mode. Whenever a pattern
is recognized and the global variable @code{yy_flex_debug} is non-zero
(which is the default), the scanner will write to @file{stderr} a line
of the form:

@example
@verbatim
    --accepting rule at line 53 ("the matched text")
@end verbatim
@end example

The line number refers to the location of the rule in the file defining
the scanner (i.e., the file that was fed to flex). Messages are also
generated when the scanner backs up, accepts the default rule, reaches
the end of its input buffer (or encounters a NUL; at this point, the two
look the same as far as the scanner's concerned), or reaches an
end-of-file.

@anchor{option-perf-report}
@opindex -p
@opindex ---perf-report
@opindex perf-report
@item -p, --perf-report, @code{%option perf-report}
generates a performance report to @file{stderr}. The report consists of
comments regarding features of the @code{flex} input file which will
cause a serious loss of performance in the resulting scanner. If you
give the flag twice, you will also get comments regarding features that
lead to minor performance losses.

Note that the use of @code{REJECT} and
variable trailing context (@pxref{Limitations}) entails a substantial
performance penalty; use of @code{yymore()}, the @samp{^} operator, and
the @samp{--interactive} flag entails minor performance penalties.

@anchor{option-nodefault}
@opindex -s
@opindex ---nodefault
@opindex nodefault
@item -s, --nodefault, @code{%option nodefault}
causes the @emph{default rule} (that unmatched scanner input is echoed
to @file{stdout}) to be suppressed.
If the scanner encounters input
that does not match any of its rules, it aborts with an error. This
option is useful for finding holes in a scanner's rule set.

@anchor{option-trace}
@opindex -T
@opindex ---trace
@opindex trace
@item -T, --trace, @code{%option trace}
makes @code{flex} run in @dfn{trace} mode. It will generate a lot of
messages to @file{stderr} concerning the form of the input and the
resultant non-deterministic and deterministic finite automata. This
option is mostly for use in maintaining @code{flex}.

@anchor{option-nowarn}
@opindex -w
@opindex ---nowarn
@opindex nowarn
@item -w, --nowarn, @code{%option nowarn}
suppresses warning messages.

@anchor{option-verbose}
@opindex -v
@opindex ---verbose
@opindex verbose
@item -v, --verbose, @code{%option verbose}
specifies that @code{flex} should write to @file{stderr} a summary of
statistics regarding the scanner it generates. Most of the statistics
are meaningless to the casual @code{flex} user, but the first line
identifies the version of @code{flex} (same as reported by @samp{--version}),
and the next line the flags used when generating the scanner, including
those that are on by default.

@anchor{option-warn}
@opindex ---warn
@opindex warn
@item --warn, @code{%option warn}
warns about certain things. In particular, if the default rule can be
matched but no default rule has been given, @code{flex} will warn you.
We recommend using this option always.

@end table

@node Miscellaneous Options, , Debugging Options, Scanner Options
@section Miscellaneous Options

@table @samp
@opindex -c
@item -c
A do-nothing option included for POSIX compliance.

@opindex -h
@opindex ---help
@item -h, -?, --help
generates a ``help'' summary of @code{flex}'s options to @file{stdout}
and then exits.

@opindex -n
@item -n
Another do-nothing option included for
POSIX compliance.

@opindex -V
@opindex ---version
@item -V, --version
prints the version number to @file{stdout} and exits.

@end table


@node Performance, Cxx, Scanner Options, Top
@chapter Performance Considerations

@cindex performance, considerations
The main design goal of @code{flex} is that it generate high-performance
scanners. It has been optimized for dealing well with large sets of
rules. Aside from the effects on scanner speed of the table compression
@samp{-C} options outlined above, there are a number of options/actions
which degrade performance. These are, from most expensive to least:

@cindex REJECT, performance costs
@cindex yylineno, performance costs
@cindex trailing context, performance costs
@example
@verbatim
    REJECT
    arbitrary trailing context

    pattern sets that require backing up
    %option yylineno
    %array

    %option interactive
    %option always-interactive

    ^ beginning-of-line operator
    yymore()
@end verbatim
@end example

with the first two being quite expensive and the last two being
quite cheap. Note also that @code{unput()} is implemented as a routine
call that potentially does quite a bit of work, while @code{yyless()} is
a quite-cheap macro. So if you are just putting back some excess text
you scanned, use @code{yyless()}.

@code{REJECT} should be avoided at all costs when performance is
important. It is a particularly expensive option.

There is one case when @code{%option yylineno} can be expensive.
That is when
your patterns match long tokens that could @emph{possibly} contain a newline
character. There is no performance penalty for rules that cannot possibly
match newlines, since flex does not need to check them for newlines. In
general, you should avoid rules such as @code{[^f]+}, which match very long
tokens, including newlines, and may possibly match your entire file! A better
approach is to separate @code{[^f]+} into two rules:

@example
@verbatim
%option yylineno
%%
[^f\n]+
\n+
@end verbatim
@end example

The above scanner does not incur a performance penalty.

@cindex patterns, tuning for performance
@cindex performance, backing up
@cindex backing up, example of eliminating
Getting rid of backing up is messy and often may be an enormous amount
of work for a complicated scanner. In principle, one begins by using
the @samp{-b} flag to generate a @file{lex.backup} file. For example,
on the input:

@cindex backing up, eliminating
@example
@verbatim
    %%
    foo        return TOK_KEYWORD;
    foobar     return TOK_KEYWORD;
@end verbatim
@end example

the file looks like:

@example
@verbatim
    State #6 is non-accepting -
     associated rule line numbers:
           2       3
     out-transitions: [ o ]
     jam-transitions: EOF [ \001-n  p-\177 ]

    State #8 is non-accepting -
     associated rule line numbers:
           3
     out-transitions: [ a ]
     jam-transitions: EOF [ \001-`  b-\177 ]

    State #9 is non-accepting -
     associated rule line numbers:
           3
     out-transitions: [ r ]
     jam-transitions: EOF [ \001-q  s-\177 ]

    Compressed tables always back up.
@end verbatim
@end example

The first few lines tell us that there's a scanner state in which it can
make a transition on an 'o' but not on any other character, and that in
that state the currently scanned text does not match any rule. The
state occurs when trying to match the rules found at lines 2 and 3 in
the input file. If the scanner is in that state and then reads
something other than an 'o', it will have to back up to find a rule
which is matched. With a bit of head-scratching one can see that this
must be the state it's in when it has seen @samp{fo}. When this has
happened, if anything other than another @samp{o} is seen, the scanner
will have to back up to simply match the @samp{f} (by the default rule).

The comment regarding State #8 indicates there's a problem when
@samp{foob} has been scanned. Indeed, on any character other than an
@samp{a}, the scanner will have to back up to accept @samp{foo}. Similarly,
the comment for State #9 concerns when @samp{fooba} has been scanned and
an @samp{r} does not follow.

The final comment reminds us that there's no point going to all the
trouble of removing backing up from the rules unless we're using
@samp{-Cf} or @samp{-CF}, since there's no performance gain doing so
with compressed scanners.
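
To reproduce a report like the one above, the relevant options can also
be requested from within the scanner source. The following is only a
sketch: it assumes the documented @code{%option backup} and
@code{%option full} settings (the counterparts of @samp{-b} and
@samp{-f}), and the token value is a hypothetical placeholder rather
than anything defined by @code{flex}.

@example
@verbatim
%option backup full noyywrap
%{
/* Hypothetical token code, defined here only so the sketch is complete. */
#define TOK_KEYWORD 258
%}
%%
foo      return TOK_KEYWORD;
foobar   return TOK_KEYWORD;
%%
@end verbatim
@end example

Because the rules now sit further down the file, the rule line numbers
reported in the resulting @file{lex.backup} should differ from those
shown above.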

@cindex error rules, to eliminate backing up
The way to remove the backing up is to add ``error'' rules:

@cindex backing up, eliminating by adding error rules
@example
@verbatim
    %%
    foo         return TOK_KEYWORD;
    foobar      return TOK_KEYWORD;

    fooba       |
    foob        |
    fo          {
                /* false alarm, not really a keyword */
                return TOK_ID;
                }
@end verbatim
@end example

Eliminating backing up among a list of keywords can also be done using a
``catch-all'' rule:

@cindex backing up, eliminating with catch-all rule
@example
@verbatim
    %%
    foo         return TOK_KEYWORD;
    foobar      return TOK_KEYWORD;

    [a-z]+      return TOK_ID;
@end verbatim
@end example

This is usually the best solution when appropriate.

Backing-up messages tend to cascade. With a complicated set of rules
it's not uncommon to get hundreds of messages. If one can decipher
them, though, it often only takes a dozen or so rules to eliminate the
backing up, though it's easy to make a mistake and have an error rule
accidentally match a valid token. (A possible future @code{flex} feature
will be to automatically add rules to eliminate backing up.)

It's important to keep in mind that you gain the benefits of eliminating
backing up only if you eliminate @emph{every} instance of backing up.
Leaving just one means you gain nothing.

@emph{Variable} trailing context (where both the leading and trailing
parts do not have a fixed length) entails almost the same performance
loss as @code{REJECT} (i.e., substantial).
So when possible a rule
like:

@cindex trailing context, variable length
@example
@verbatim
    %%
    mouse|rat/(cat|dog)   run();
@end verbatim
@end example

is better written:

@example
@verbatim
    %%
    mouse/cat|dog         run();
    rat/cat|dog           run();
@end verbatim
@end example

or as

@example
@verbatim
    %%
    mouse|rat/cat         run();
    mouse|rat/dog         run();
@end verbatim
@end example

Note that here the special '|' action does @emph{not} provide any
savings, and can even make things worse (@pxref{Limitations}).

Another area where the user can increase a scanner's performance (and
one that's easier to implement) arises from the fact that the longer the
tokens matched, the faster the scanner will run. This is because with
long tokens the processing of most input characters takes place in the
(short) inner scanning loop, and does not often have to go through the
additional work of setting up the scanning environment (e.g.,
@code{yytext}) for the action.
Recall the scanner for C comments:

@cindex performance optimization, matching longer tokens
@example
@verbatim
    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\n]*
    <comment>"*"+[^*/\n]*
    <comment>\n             ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);
@end verbatim
@end example

This could be sped up by writing it as:

@example
@verbatim
    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\n]*
    <comment>[^*\n]*\n      ++line_num;
    <comment>"*"+[^*/\n]*
    <comment>"*"+[^*/\n]*\n ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);
@end verbatim
@end example

Now instead of each newline requiring the processing of another action,
recognizing the newlines is distributed over the other rules to keep the
matched text as long as possible. Note that @emph{adding} rules does
@emph{not} slow down the scanner! The speed of the scanner is
independent of the number of rules or (modulo the considerations given
at the beginning of this section) how complicated the rules are with
regard to operators such as @samp{*} and @samp{|}.

@cindex keywords, for performance
@cindex performance, using keywords
A final example in speeding up a scanner: suppose you want to scan
through a file containing identifiers and keywords, one per line
and with no other extraneous characters, and recognize all the
keywords. A natural first approach is:

@cindex performance optimization, recognizing keywords
@example
@verbatim
    %%
    asm      |
    auto     |
    break    |
    ... etc ...
    volatile |
    while    /* it's a keyword */

    .|\n     /* it's not a keyword */
@end verbatim
@end example

To eliminate the backing up, introduce a catch-all rule:

@example
@verbatim
    %%
    asm      |
    auto     |
    break    |
    ... etc ...
    volatile |
    while    /* it's a keyword */

    [a-z]+   |
    .|\n     /* it's not a keyword */
@end verbatim
@end example

Now, if it's guaranteed that there's exactly one word per line, then we
can reduce the total number of matches by half by merging in the
recognition of newlines with that of the other tokens:

@example
@verbatim
    %%
    asm\n    |
    auto\n   |
    break\n  |
    ... etc ...
    volatile\n |
    while\n  /* it's a keyword */

    [a-z]+\n |
    .|\n     /* it's not a keyword */
@end verbatim
@end example

One has to be careful here, as we have now reintroduced backing up
into the scanner. In particular, while
@emph{we}
know that there will never be any characters in the input stream
other than letters or newlines,
@code{flex}
can't figure this out, and it will plan for possibly needing to back up
when it has scanned a token like @samp{auto} and then the next character
is something other than a newline or a letter. Previously it would
then just match the @samp{auto} rule and be done, but now it has no @samp{auto}
rule, only an @samp{auto\n} rule. To eliminate the possibility of backing up,
we could either duplicate all rules but without final newlines, or,
since we never expect to encounter such an input and therefore don't
care how it's classified, we can introduce one more catch-all rule, this
one which doesn't include a newline:

@example
@verbatim
    %%
    asm\n    |
    auto\n   |
    break\n  |
    ... etc ...
    volatile\n |
    while\n  /* it's a keyword */

    [a-z]+\n |
    [a-z]+   |
    .|\n     /* it's not a keyword */
@end verbatim
@end example

Compiled with @samp{-Cf}, this is about as fast as one can get a
@code{flex} scanner to go for this particular problem.
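
Putting the pieces together, a complete input file for this problem
might look like the following sketch. The counters, the
@code{<<EOF>>} report, and the choice of @code{%option main},
@code{full} and @code{nodefault} are illustrative assumptions added to
make the file self-contained; the keyword list is deliberately
shortened.

@example
@verbatim
%option main full nodefault
%{
#include <stdio.h>

/* Illustrative counters; they are not part of the example above. */
static long keywords = 0;
static long others = 0;
%}
%%
asm\n       |
auto\n      |
break\n     |
volatile\n  |
while\n     ++keywords;   /* it's a keyword */

[a-z]+\n    |
[a-z]+      |
.|\n        ++others;     /* it's not a keyword */

<<EOF>>     {
                /* Report the totals once all input has been consumed. */
                printf("%ld keywords, %ld other tokens\n", keywords, others);
                yyterminate();
            }
%%
@end verbatim
@end example

Here @code{%option main} supplies a @code{main()} that simply calls
@code{yylex()}, and @code{%option nodefault} is used so that any hole
in the rule set shows up as an error instead of being echoed silently.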

A final note: @code{flex} is slow when matching @code{NUL}s,
particularly when a token contains multiple @code{NUL}s.  It's best to
write rules which match @emph{short} amounts of text if it's anticipated
that the text will often include @code{NUL}s.

Another final note regarding performance: as mentioned in
@ref{Matching}, dynamically resizing @code{yytext} to accommodate huge
tokens is a slow process because it presently requires that the (huge)
token be rescanned from the beginning.  Thus if performance is vital,
you should attempt to match ``large'' quantities of text but not
``huge'' quantities, where the cutoff between the two is at about 8K
characters per token.

@node Cxx, Reentrant, Performance, Top
@chapter Generating C++ Scanners

@cindex c++, experimental form of scanner class
@cindex experimental form of c++ scanner class
@strong{IMPORTANT}: the present form of the scanning class is @emph{experimental}
and may change considerably between major releases.

@cindex C++
@cindex member functions, C++
@cindex methods, c++
@code{flex} provides two different ways to generate scanners for use
with C++.  The first way is to simply compile a scanner generated by
@code{flex} using a C++ compiler instead of a C compiler.  You should
not encounter any compilation errors (@pxref{Reporting Bugs}).  You can
then use C++ code in your rule actions instead of C code.  Note that the
default input source for your scanner remains @file{yyin}, and default
echoing is still done to @file{yyout}.  Both of these remain @code{FILE *}
variables and not C++ @emph{streams}.

You can also use @code{flex} to generate a C++ scanner class, using the
@samp{-+} option (or, equivalently, @code{%option c++}), which is
automatically specified if the name of the @code{flex} executable ends
in a @samp{+}, such as @code{flex++}.
When using this option, @code{flex}
defaults to generating the scanner to the file @file{lex.yy.cc} instead
of @file{lex.yy.c}.  The generated scanner includes the header file
@file{FlexLexer.h}, which defines the interface to two C++ classes.

The first class in @file{FlexLexer.h}, @code{FlexLexer},
provides an abstract base class defining the general scanner class
interface.  It provides the following member functions:

@table @code
@findex YYText (C++ only)
@item const char* YYText()
returns the text of the most recently matched token, the equivalent of
@code{yytext}.

@findex YYLeng (C++ only)
@item int YYLeng()
returns the length of the most recently matched token, the equivalent of
@code{yyleng}.

@findex lineno (C++ only)
@item int lineno() const
returns the current input line number (see @code{%option yylineno}), or
@code{1} if @code{%option yylineno} was not used.

@findex set_debug (C++ only)
@item void set_debug( int flag )
sets the debugging flag for the scanner, equivalent to assigning to
@code{yy_flex_debug} (@pxref{Scanner Options}).  Note that you must build
the scanner using @code{%option debug} to include debugging information
in it.

@findex debug (C++ only)
@item int debug() const
returns the current setting of the debugging flag.
@end table

Also provided are member functions equivalent to
@code{yy_switch_to_buffer()}, @code{yy_create_buffer()} (though the
first argument is an @code{istream&} object reference and not a
@code{FILE*}), @code{yy_flush_buffer()}, @code{yy_delete_buffer()}, and
@code{yyrestart()} (again, the first argument is an @code{istream&}
object reference).

@tindex yyFlexLexer (C++ only)
@tindex FlexLexer (C++ only)
The second class defined in @file{FlexLexer.h} is @code{yyFlexLexer},
which is derived from @code{FlexLexer}.
It defines the following
additional member functions:

@table @code
@findex yyFlexLexer constructor (C++ only)
@item yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )
@item yyFlexLexer( istream& arg_yyin, ostream& arg_yyout )
constructs a @code{yyFlexLexer} object using the given streams for input
and output.  If not specified, the streams default to @code{cin} and
@code{cout}, respectively.  @code{yyFlexLexer} does not take ownership of
its stream arguments.  It's up to the user to ensure the streams pointed
to remain alive at least as long as the @code{yyFlexLexer} instance.

@findex yylex (C++ version)
@item virtual int yylex()
performs the same role as @code{yylex()} does for ordinary @code{flex}
scanners: it scans the input stream, consuming tokens, until a rule's
action returns a value.  If you derive a subclass @code{S} from
@code{yyFlexLexer} and want to access the member functions and variables
of @code{S} inside @code{yylex()}, then you need to use @code{%option
yyclass="S"} to inform @code{flex} that you will be using that subclass
instead of @code{yyFlexLexer}.  In this case, rather than generating
@code{yyFlexLexer::yylex()}, @code{flex} generates @code{S::yylex()}
(and also generates a dummy @code{yyFlexLexer::yylex()} that calls
@code{yyFlexLexer::LexerError()} if called).

@findex switch_streams (C++ only)
@item virtual void switch_streams(istream* new_in = 0, ostream* new_out = 0)
@item virtual void switch_streams(istream& new_in, ostream& new_out)
reassigns @code{yyin} to @code{new_in} (if non-null) and @code{yyout} to
@code{new_out} (if non-null), deleting the previous input buffer if
@code{yyin} is reassigned.

@item int yylex( istream* new_in, ostream* new_out = 0 )
@item int yylex( istream& new_in, ostream& new_out )
first switches the input streams via @code{switch_streams( new_in,
new_out )} and then returns the value of @code{yylex()}.
@end table

In addition, @code{yyFlexLexer} defines the following protected virtual
functions which you can redefine in derived classes to tailor the
scanner:

@table @code
@findex LexerInput (C++ only)
@item virtual int LexerInput( char* buf, int max_size )
reads up to @code{max_size} characters into @code{buf} and returns the
number of characters read.  To indicate end-of-input, return 0
characters.  Note that @code{interactive} scanners (see the @samp{-B}
and @samp{-I} flags in @ref{Scanner Options}) define the macro
@code{YY_INTERACTIVE}.  If you redefine @code{LexerInput()} and need to
take different actions depending on whether or not the scanner might be
scanning an interactive input source, you can test for the presence of
this name via @code{#ifdef} statements.

@findex LexerOutput (C++ only)
@item virtual void LexerOutput( const char* buf, int size )
writes out @code{size} characters from the buffer @code{buf}, which, while
@code{NUL}-terminated, may also contain internal @code{NUL}s if the
scanner's rules can match text with @code{NUL}s in them.

@cindex error reporting, in C++
@findex LexerError (C++ only)
@item virtual void LexerError( const char* msg )
reports a fatal error message.  The default version of this function
writes the message to the stream @code{cerr} and exits.
@end table

Note that a @code{yyFlexLexer} object contains its @emph{entire}
scanning state.  Thus you can use such objects to create reentrant
scanners, but see also @ref{Reentrant}.
You can instantiate multiple
instances of the same @code{yyFlexLexer} class, and you can also combine
multiple C++ scanner classes together in the same program using the
@samp{-P} option discussed above.

Finally, note that the @code{%array} feature is not available to C++
scanner classes; you must use @code{%pointer} (the default).

Here is an example of a simple C++ scanner:

@cindex C++ scanners, use of
@example
@verbatim
        // An example of using the flex C++ scanner class.

    %{
    #include <iostream>
    using namespace std;
    int mylineno = 0;
    %}

    %option noyywrap c++

    string  \"[^\n"]+\"

    ws      [ \t]+

    alpha   [A-Za-z]
    dig     [0-9]
    name    ({alpha}|{dig}|\$)({alpha}|{dig}|[_.\-/$])*
    num1    [-+]?{dig}+\.?([eE][-+]?{dig}+)?
    num2    [-+]?{dig}*\.{dig}+([eE][-+]?{dig}+)?
    number  {num1}|{num2}

    %%

    {ws}    /* skip blanks and tabs */

    "/*" {
            int c;

            while((c = yyinput()) != 0)
                {
                if(c == '\n')
                    ++mylineno;

                else if(c == '*')
                    {
                    if((c = yyinput()) == '/')
                        break;
                    else
                        unput(c);
                    }
                }
            }

    {number}  cout << "number " << YYText() << '\n';

    \n        mylineno++;

    {name}    cout << "name " << YYText() << '\n';

    {string}  cout << "string " << YYText() << '\n';

    %%

    // This include is required if main() is in another source file.
    //#include <FlexLexer.h>

    int main( int /* argc */, char** /* argv */ )
    {
        FlexLexer* lexer = new yyFlexLexer;
        while(lexer->yylex() != 0)
            ;
        return 0;
    }
@end verbatim
@end example

@cindex C++, multiple different scanners
If you want to create multiple (different) lexer classes, you use the
@samp{-P} flag (or the @code{prefix=} option) to rename each
@code{yyFlexLexer} to some other @samp{xxFlexLexer}.
You can then
include @file{<FlexLexer.h>} in your other sources once per lexer class,
first renaming @code{yyFlexLexer} as follows:

@cindex include files, with C++
@cindex header files, with C++
@cindex C++ scanners, including multiple scanners
@example
@verbatim
    #undef yyFlexLexer
    #define yyFlexLexer xxFlexLexer
    #include <FlexLexer.h>

    #undef yyFlexLexer
    #define yyFlexLexer zzFlexLexer
    #include <FlexLexer.h>
@end verbatim
@end example

if, for example, you used @code{%option prefix="xx"} for one of your
scanners and @code{%option prefix="zz"} for the other.

@node Reentrant, Lex and Posix, Cxx, Top
@chapter Reentrant C Scanners

@cindex reentrant, explanation
@code{flex} has the ability to generate a reentrant C scanner.  This is
accomplished by specifying @code{%option reentrant} (@samp{-R}).  The generated
scanner is both portable and safe to use in one or more separate threads of
control.  The most common use for reentrant scanners is from within
multi-threaded applications.  Any thread may create and execute a reentrant
@code{flex} scanner without the need for synchronization with other threads.

@menu
* Reentrant Uses::
* Reentrant Overview::
* Reentrant Example::
* Reentrant Detail::
* Reentrant Functions::
@end menu

@node Reentrant Uses, Reentrant Overview, Reentrant, Reentrant
@section Uses for Reentrant Scanners

However, there are other uses for a reentrant scanner.  For example, you
could scan two or more files simultaneously to implement a @code{diff} at
the token level (i.e., instead of at the character level):

@cindex reentrant scanners, multiple interleaved scanners
@example
@verbatim
    /* Example of maintaining more than one active scanner.
     */

    /* Declared outside the loop so the loop condition can see them. */
    int tok1, tok2;

    do {
        tok1 = yylex( scanner_1 );
        tok2 = yylex( scanner_2 );

        if( tok1 != tok2 ) {
            printf("Files are different.\n");
            break;
        }

    } while ( tok1 && tok2 );
@end verbatim
@end example

Another use for a reentrant scanner is recursion.
(Note that a recursive scanner can also be created using a non-reentrant scanner and
buffer states.  @xref{Multiple Input Buffers}.)

The following crude scanner supports the @samp{eval} command by invoking
another instance of itself.

@cindex reentrant scanners, recursive invocation
@example
@verbatim
    /* Example of recursive invocation. */

    %option reentrant

    %%
    "eval(".+")"  {
                      yyscan_t scanner;
                      YY_BUFFER_STATE buf;

                      yylex_init( &scanner );
                      yytext[yyleng-1] = ' ';

                      buf = yy_scan_string( yytext + 5, scanner );
                      yylex( scanner );

                      yy_delete_buffer(buf,scanner);
                      yylex_destroy( scanner );
                 }
    ...
    %%
@end verbatim
@end example

@node Reentrant Overview, Reentrant Example, Reentrant Uses, Reentrant
@section An Overview of the Reentrant API

@cindex reentrant, API explanation
The API for reentrant scanners is different from that of non-reentrant
scanners.  Here is a quick overview of the API:

@itemize
@item
@code{%option reentrant} must be specified.

@item
All functions take one additional argument: @code{yyscanner}

@item
All global variables are replaced by their macro equivalents.
(We tell you this because it may be important to you during debugging.)

@item
@code{yylex_init} and @code{yylex_destroy} must be called before and
after @code{yylex}, respectively.

@item
Accessor methods (get/set functions) provide access to common
@code{flex} variables.

@item
User-specific data can be stored in @code{yyextra}.
@end itemize

@node Reentrant Example, Reentrant Detail, Reentrant Overview, Reentrant
@section Reentrant Example

First, an example of a reentrant scanner:
@cindex reentrant, example of
@example
@verbatim
    /* This scanner prints "//" comments. */

    %option reentrant stack noyywrap
    %x COMMENT

    %%

    "//"                 yy_push_state( COMMENT, yyscanner);
    .|\n

    <COMMENT>\n          yy_pop_state( yyscanner );
    <COMMENT>[^\n]+      fprintf( yyout, "%s\n", yytext);

    %%

    int main ( int argc, char * argv[] )
    {
        yyscan_t scanner;

        yylex_init ( &scanner );
        yylex ( scanner );
        yylex_destroy ( scanner );
        return 0;
    }
@end verbatim
@end example

@node Reentrant Detail, Reentrant Functions, Reentrant Example, Reentrant
@section The Reentrant API in Detail

Here are the things you need to do or know to use the reentrant C API of
@code{flex}.

@menu
* Specify Reentrant::
* Extra Reentrant Argument::
* Global Replacement::
* Init and Destroy Functions::
* Accessor Methods::
* Extra Data::
* About yyscan_t::
@end menu

@node Specify Reentrant, Extra Reentrant Argument, Reentrant Detail, Reentrant Detail
@subsection Declaring a Scanner As Reentrant

@code{%option reentrant} (@samp{--reentrant}) must be specified.

Notice that @code{%option reentrant} is specified in the above example
(@pxref{Reentrant Example}).  Had this option not been specified,
@code{flex} would have happily generated a non-reentrant scanner without
complaining.  You may explicitly specify @code{%option noreentrant}, if
you do @emph{not} want a reentrant scanner, although it is not
necessary.  The default is to generate a non-reentrant scanner.

@node Extra Reentrant Argument, Global Replacement, Specify Reentrant, Reentrant Detail
@subsection The Extra Argument

@cindex reentrant, calling functions
@vindex yyscanner (reentrant only)
All functions take one additional argument: @code{yyscanner}.

Notice that the calls to @code{yy_push_state} and @code{yy_pop_state}
both have an argument, @code{yyscanner}, that is not present in a
non-reentrant scanner.  Here are the declarations of
@code{yy_push_state} and @code{yy_pop_state} in the reentrant scanner:

@example
@verbatim
    static void yy_push_state  ( int new_state , yyscan_t yyscanner ) ;
    static void yy_pop_state  ( yyscan_t yyscanner  ) ;
@end verbatim
@end example

Notice that the argument @code{yyscanner} appears in the declaration of
both functions.  In fact, all @code{flex} functions in a reentrant
scanner have this additional argument.  It is always the last argument
in the argument list, it is always of type @code{yyscan_t} (which is
typedef'd to @code{void *}) and it is
always named @code{yyscanner}.  As you may have guessed,
@code{yyscanner} is a pointer to an opaque data structure encapsulating
the current state of the scanner.  For a list of function declarations,
see @ref{Reentrant Functions}.  Note that preprocessor macros, such as
@code{BEGIN}, @code{ECHO}, and @code{REJECT}, do not take this
additional argument.

@node Global Replacement, Init and Destroy Functions, Extra Reentrant Argument, Reentrant Detail
@subsection Global Variables Replaced By Macros

@cindex reentrant, accessing flex variables
All global variables in traditional flex have been replaced by macro equivalents.

Note that in the above example, @code{yyout} and @code{yytext} are
not plain variables.  These are macros that will expand to their equivalent lvalue.
All of the familiar @code{flex} globals have been replaced by their macro
equivalents.  In particular, @code{yytext}, @code{yyleng}, @code{yylineno},
@code{yyin}, @code{yyout}, @code{yyextra}, @code{yylval}, and @code{yylloc}
are macros.  You may safely use these macros in actions as if they were plain
variables.  We only tell you this so you don't expect to link to these variables
externally.  Currently, each macro expands to a member of an internal struct, e.g.,

@example
@verbatim
#define yytext (((struct yyguts_t*)yyscanner)->yytext_r)
@end verbatim
@end example

One important thing to remember about
@code{yytext}
and friends is that
@code{yytext}
is not a global variable in a reentrant
scanner, so you cannot access it directly from outside an action or from
other functions.  You must use an accessor method, e.g.,
@code{yyget_text},
to accomplish this (see below).

@node Init and Destroy Functions, Accessor Methods, Global Replacement, Reentrant Detail
@subsection Init and Destroy Functions

@cindex memory, considerations for reentrant scanners
@cindex reentrant, initialization
@findex yylex_init
@findex yylex_destroy

@code{yylex_init} and @code{yylex_destroy} must be called before and
after @code{yylex}, respectively.

@example
@verbatim
    int yylex_init ( yyscan_t * ptr_yy_globals ) ;
    int yylex_init_extra ( YY_EXTRA_TYPE user_defined, yyscan_t * ptr_yy_globals ) ;
    int yylex ( yyscan_t yyscanner ) ;
    int yylex_destroy ( yyscan_t yyscanner ) ;
@end verbatim
@end example

The function @code{yylex_init} must be called before calling any other
function.  The argument to @code{yylex_init} is the address of an
uninitialized pointer to be filled in by @code{yylex_init}, overwriting
any previous contents.
The function @code{yylex_init_extra} may be used
instead, taking as its first argument a variable of type @code{YY_EXTRA_TYPE}.
See the section on @code{yyextra}, below, for more details.

The value stored in @code{ptr_yy_globals} should
thereafter be passed to @code{yylex} and @code{yylex_destroy}.  Flex
does not save the argument passed to @code{yylex_init}, so it is safe to
pass the address of a local pointer to @code{yylex_init} so long as it remains
in scope for the duration of all calls to the scanner, up to and including
the call to @code{yylex_destroy}.

The function
@code{yylex} should be familiar to you by now.  The reentrant version
takes one argument, which is the value returned (via an argument) by
@code{yylex_init}.  Otherwise, it behaves the same as the non-reentrant
version of @code{yylex}.

Both @code{yylex_init} and @code{yylex_init_extra} return 0 (zero) on success,
or non-zero on failure, in which case @code{errno} is set to one of the
following values:

@itemize
@item ENOMEM
Memory allocation error.  @xref{memory-management}.
@item EINVAL
Invalid argument.
@end itemize


The function @code{yylex_destroy} should be
called to free resources used by the scanner.  After @code{yylex_destroy}
is called, the contents of @code{yyscanner} should not be used.  Of
course, there is no need to destroy a scanner if you plan to reuse it.
A @code{flex} scanner (both reentrant and non-reentrant) may be
restarted by calling @code{yyrestart}.

Below is an example of a program that creates a scanner, uses it, then destroys
it when done:

@example
@verbatim
    int main ()
    {
        yyscan_t scanner;
        int tok;

        yylex_init(&scanner);

        while ((tok=yylex(scanner)) > 0)
            printf("tok=%d  yytext=%s\n", tok, yyget_text(scanner));

        yylex_destroy(scanner);
        return 0;
    }
@end verbatim
@end example

@node Accessor Methods, Extra Data, Init and Destroy Functions, Reentrant Detail
@subsection Accessing Variables with Reentrant Scanners

@cindex reentrant, accessor functions
Accessor methods (get/set functions) provide access to common
@code{flex} variables.

Many scanners that you build will be part of a larger project.  Portions
of your project will need access to @code{flex} values, such as
@code{yytext}.  In a non-reentrant scanner, these values are global, so
there is no problem accessing them.  However, in a reentrant scanner, there are no
global @code{flex} values.  You cannot access them directly.  Instead,
you must access @code{flex} values using accessor methods (get/set
functions).  Each accessor method is named @code{yyget_NAME} or
@code{yyset_NAME}, where @code{NAME} is the name of the @code{flex}
variable you want.  For example:

@cindex accessor functions, use of
@example
@verbatim
    /* Set the last character of yytext to NUL. */
    void chop ( yyscan_t scanner )
    {
        int len = yyget_leng( scanner );
        yyget_text( scanner )[len - 1] = '\0';
    }
@end verbatim
@end example

The above code may be called from within an action like this:

@example
@verbatim
    %%
    .+\n    { chop( yyscanner );}
@end verbatim
@end example

You may find that @code{%option header-file} is particularly useful for generating
prototypes of all the accessor functions.  @xref{option-header}.
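
As a minimal sketch of the same idea in a form that can be exercised
without a generated scanner, the helper below takes the text and its
length as plain parameters.  The name @code{chop_text} is hypothetical and
not part of @code{flex}; in a reentrant scanner it would presumably be fed
the results of @code{yyget_text} and @code{yyget_leng}.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of flex: overwrite the last character
   of a token with NUL and return the new length.  An empty token is
   left untouched. */
static int chop_text(char *text, int len)
{
    assert(text != NULL && len >= 0);

    if (len == 0)
        return 0;

    text[len - 1] = '\0';
    return len - 1;
}

/* Assumed use from code that holds a reentrant scanner handle:
 *
 *     chop_text( yyget_text( scanner ), yyget_leng( scanner ) );
 */
```

Because there is no ``set'' function for @code{yyleng}, the scanner's own
idea of the token length is unchanged by such a helper; only the returned
value reflects the shortened text.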

@node Extra Data, About yyscan_t, Accessor Methods, Reentrant Detail
@subsection Extra Data

@cindex reentrant, extra data
@vindex yyextra
User-specific data can be stored in @code{yyextra}.

In a reentrant scanner, it is unwise to use global variables to
communicate with or maintain state between different pieces of your program.
However, you may need access to external data or invoke external functions
from within the scanner actions.
Likewise, you may need to pass information to your scanner
(e.g., open file descriptors, or database connections).
In a non-reentrant scanner, the only way to do this would be through the
use of global variables.
@code{Flex} allows you to store arbitrary, ``extra'' data in a scanner.
This data is accessible through the accessor methods
@code{yyget_extra} and @code{yyset_extra}
from outside the scanner, and through the shortcut macro
@code{yyextra}
from within the scanner itself.  They are defined as follows:

@tindex YY_EXTRA_TYPE (reentrant only)
@findex yyget_extra
@findex yyset_extra
@example
@verbatim
    #define YY_EXTRA_TYPE  void*
    YY_EXTRA_TYPE  yyget_extra ( yyscan_t scanner );
    void           yyset_extra ( YY_EXTRA_TYPE arbitrary_data , yyscan_t scanner);
@end verbatim
@end example

In addition, an extra form of @code{yylex_init} is provided,
@code{yylex_init_extra}.  This function is provided so that the yyextra value can
be accessed from within the very first yyalloc, used to allocate
the scanner itself.

By default, @code{YY_EXTRA_TYPE} is defined as type @code{void *}.  You
may redefine this type using @code{%option extra-type="your_type"} in
the scanner:

@cindex YY_EXTRA_TYPE, defining your own type
@example
@verbatim
    /* An example of overriding YY_EXTRA_TYPE.
     */
    %{
    #include <sys/stat.h>
    #include <unistd.h>
    %}
    %option reentrant
    %option extra-type="struct stat *"
    %%

    __filesize__     printf( "%ld", (long) yyextra->st_size  );
    __lastmod__      printf( "%ld", (long) yyextra->st_mtime );
    %%
    void scan_file( char* filename )
    {
        yyscan_t scanner;
        struct stat buf;
        FILE *in;

        in = fopen( filename, "r" );
        stat( filename, &buf );

        /* yyextra is a struct stat *, so pass the address of buf. */
        yylex_init_extra( &buf, &scanner );
        yyset_in( in, scanner );
        yylex( scanner );
        yylex_destroy( scanner );

        fclose( in );
    }
@end verbatim
@end example


@node About yyscan_t,  , Extra Data, Reentrant Detail
@subsection About yyscan_t

@tindex yyscan_t (reentrant only)
@code{yyscan_t} is defined as:

@example
@verbatim
     typedef void* yyscan_t;
@end verbatim
@end example

It is initialized by @code{yylex_init()} to point to
an internal structure.  You should never access this value
directly.  In particular, you should never attempt to free it
(use @code{yylex_destroy()} instead).
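
The opaque-handle arrangement described in this section can be
illustrated with a toy sketch.  Everything named @code{toy_*} below is
hypothetical and is @emph{not} the code @code{flex} generates; the sketch only
assumes the contract documented above: a @code{void *} handle filled in by
an init function, a macro that hides the cast to an internal struct,
get/set accessors taking the handle as their last argument, and a non-zero
return with @code{errno} set on failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

typedef void *toy_scan_t;          /* opaque handle, in the style of yyscan_t */

struct toy_guts {                  /* hypothetical internal state */
    const char *text_r;
    int lineno_r;
};

/* Hides the cast, in the style of the yytext macro shown earlier.
   It relies on a variable named "scanner" being in scope. */
#define toy_text (((struct toy_guts *)scanner)->text_r)

/* Allocate the internal state and store it through ptr.
   Returns 0 on success, non-zero with errno set on failure. */
static int toy_init(toy_scan_t *ptr)
{
    struct toy_guts *g;

    if (ptr == NULL) {
        errno = EINVAL;
        return 1;
    }
    g = (struct toy_guts *)calloc(1, sizeof *g);
    if (g == NULL) {
        errno = ENOMEM;
        return 1;
    }
    g->lineno_r = 1;
    *ptr = g;
    return 0;
}

/* Accessors: the handle is always the last argument. */
static const char *toy_get_text(toy_scan_t scanner)
{
    assert(scanner != NULL);
    return toy_text;
}

static void toy_set_text(const char *s, toy_scan_t scanner)
{
    assert(scanner != NULL);
    toy_text = s;
}

/* The caller never frees the handle directly; it calls this instead. */
static int toy_destroy(toy_scan_t scanner)
{
    free(scanner);
    return 0;
}
```

Keeping the struct definition private to one source file is what makes
the handle opaque: callers can only pass it back to the functions that
know its layout.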

@node Reentrant Functions,  , Reentrant Detail, Reentrant
@section Functions and Macros Available in Reentrant C Scanners

The following functions are available in a reentrant scanner:

@findex yyget_text
@findex yyget_leng
@findex yyget_in
@findex yyget_out
@findex yyget_lineno
@findex yyset_in
@findex yyset_out
@findex yyset_lineno
@findex yyget_debug
@findex yyset_debug
@findex yyget_extra
@findex yyset_extra

@example
@verbatim
    char *yyget_text ( yyscan_t scanner );
    int yyget_leng ( yyscan_t scanner );
    FILE *yyget_in ( yyscan_t scanner );
    FILE *yyget_out ( yyscan_t scanner );
    int yyget_lineno ( yyscan_t scanner );
    YY_EXTRA_TYPE yyget_extra ( yyscan_t scanner );
    int  yyget_debug ( yyscan_t scanner );

    void yyset_debug ( int flag, yyscan_t scanner );
    void yyset_in  ( FILE * in_str , yyscan_t scanner );
    void yyset_out  ( FILE * out_str , yyscan_t scanner );
    void yyset_lineno ( int line_number , yyscan_t scanner );
    void yyset_extra ( YY_EXTRA_TYPE user_defined , yyscan_t scanner );
@end verbatim
@end example

There are no ``set'' functions for @code{yytext} and @code{yyleng}.  This is
intentional.

The following macro shortcuts are available in actions in a reentrant
scanner:

@example
@verbatim
    yytext
    yyleng
    yyin
    yyout
    yylineno
    yyextra
    yy_flex_debug
@end verbatim
@end example

@cindex yylineno, in a reentrant scanner
In a reentrant C scanner, support for @code{yylineno} is always present
(i.e., you may access @code{yylineno}), but the value is never modified by
@code{flex} unless @code{%option yylineno} is enabled.  This is to allow
the user to maintain the line count independently of @code{flex}.
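
One way to maintain the line count by hand, sketched here under the
assumption that @code{%option yylineno} is @emph{not} enabled, is to count
the newlines in each matched token and add them to the counter.  The helper
name @code{count_newlines} is hypothetical and not part of @code{flex}; it
takes an explicit length so that tokens containing @code{NUL}s are still
counted correctly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of flex: count the newline characters
   in the first len bytes of text. */
static int count_newlines(const char *text, int len)
{
    int n = 0;
    int i;

    assert(text != NULL && len >= 0);

    for (i = 0; i < len; ++i)
        if (text[i] == '\n')
            ++n;

    return n;
}

/* Assumed use inside an action of a reentrant scanner:
 *
 *     yylineno += count_newlines( yytext, yyleng );
 *
 * and from code outside the scanner, via the accessors:
 *
 *     yyset_lineno( yyget_lineno( scanner ) + n, scanner );
 */
```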

@anchor{bison-functions}
The following functions and macros are made available when @code{%option
bison-bridge} (@samp{--bison-bridge}) is specified:

@example
@verbatim
    YYSTYPE * yyget_lval ( yyscan_t scanner );
    void yyset_lval ( YYSTYPE * yylvalp , yyscan_t scanner );
    yylval
@end verbatim
@end example

The following functions and macros are made available
when @code{%option bison-locations} (@samp{--bison-locations}) is specified:

@example
@verbatim
    YYLTYPE *yyget_lloc ( yyscan_t scanner );
    void yyset_lloc ( YYLTYPE * yyllocp , yyscan_t scanner );
    yylloc
@end verbatim
@end example

Support for @code{yylval} assumes that @code{YYSTYPE} is a valid type.  Support for
@code{yylloc} assumes that @code{YYLTYPE} is a valid type.  Typically, these types are
generated by @code{bison}, and are included in section 1 of the @code{flex}
input.

@node Lex and Posix, Memory Management, Reentrant, Top
@chapter Incompatibilities with Lex and Posix

@cindex POSIX and lex
@cindex lex (traditional) and POSIX

@code{flex} is a rewrite of the AT&T Unix @emph{lex} tool (the two
implementations do not share any code, though), with some extensions and
incompatibilities, both of which are of concern to those who wish to
write scanners acceptable to both implementations.  @code{flex} is fully
compliant with the POSIX @code{lex} specification, except that when
using @code{%pointer} (the default), a call to @code{unput()} destroys
the contents of @code{yytext}, which is counter to the POSIX
specification.  In this section we discuss all of the known areas of
incompatibility between @code{flex}, AT&T @code{lex}, and the POSIX
specification.  @code{flex}'s @samp{-l} option turns on maximum
compatibility with the original AT&T @code{lex} implementation, at the
cost of a major loss in the generated scanner's performance.
We note
below which incompatibilities can be overcome using the @samp{-l}
option.  @code{flex} is fully compatible with @code{lex} with the
following exceptions:

@itemize
@item
The undocumented @code{lex} scanner internal variable @code{yylineno} is
not supported unless @samp{-l} or @code{%option yylineno} is used.

@item
@code{yylineno} should be maintained on a per-buffer basis, rather than
a per-scanner (single global variable) basis.

@item
@code{yylineno} is not part of the POSIX specification.

@item
The @code{input()} routine is not redefinable, though it may be called
to read characters following whatever has been matched by a rule.  If
@code{input()} encounters an end-of-file the normal @code{yywrap()}
processing is done.  A ``real'' end-of-file is returned by
@code{input()} as @code{EOF}.

@item
Input is instead controlled by defining the @code{YY_INPUT()} macro.

@item
The @code{flex} restriction that @code{input()} cannot be redefined is
in accordance with the POSIX specification, which simply does not
specify any way of controlling the scanner's input other than by making
an initial assignment to @file{yyin}.

@item
The @code{unput()} routine is not redefinable.  This restriction is in
accordance with POSIX.

@item
@code{flex} scanners are not as reentrant as @code{lex} scanners.
In
particular, if you have an interactive scanner and an interrupt handler
which long-jumps out of the scanner, and the scanner is subsequently
called again, you may get the following message:

@cindex error messages, end of buffer missed
@example
@verbatim
    fatal flex scanner internal error--end of buffer missed
@end verbatim
@end example

To reenter the scanner, first use:

@cindex restarting the scanner
@example
@verbatim
    yyrestart( yyin );
@end verbatim
@end example

Note that this call will throw away any buffered input; usually this
isn't a problem with an interactive scanner.  @xref{Reentrant}, for
@code{flex}'s reentrant API.

@item
Also note that @code{flex} C++ scanner classes
@emph{are}
reentrant, so if using C++ is an option for you, you should use
them instead.  @xref{Cxx}, and @ref{Reentrant} for details.

@item
@code{output()} is not supported.  Output from the @b{ECHO} macro is
done to the file-pointer @code{yyout} (default @file{stdout}).

@item
@code{output()} is not part of the POSIX specification.

@item
@code{lex} does not support exclusive start conditions (@samp{%x}), though they
are in the POSIX specification.

@item
When definitions are expanded, @code{flex} encloses them in parentheses.
With @code{lex}, the following:

@cindex name definitions, not POSIX
@example
@verbatim
    NAME    [A-Z][A-Z0-9]*
    %%
    foo{NAME}?      printf( "Found it\n" );
    %%
@end verbatim
@end example

will not match the string @samp{foo} because when the macro is expanded
the rule is equivalent to @samp{foo[A-Z][A-Z0-9]*?}  and the precedence
is such that the @samp{?} is associated with @samp{[A-Z0-9]*}.  With
@code{flex}, the rule will be expanded to @samp{foo([A-Z][A-Z0-9]*)?}
and so the string @samp{foo} will match.

@item
Note that if the definition begins with @samp{^} or ends with @samp{$}
then it is @emph{not} expanded with parentheses, to allow these
operators to appear in definitions without losing their special
meanings. But the @samp{<s>}, @samp{/}, and @code{<<EOF>>} operators
cannot be used in a @code{flex} definition.

@item
Using @samp{-l} results in the @code{lex} behavior of no parentheses
around the definition.

@item
The POSIX specification is that the definition be enclosed in parentheses.

@item
Some implementations of @code{lex} allow a rule's action to begin on a
separate line, if the rule's pattern has trailing whitespace:

@cindex patterns and actions on different lines
@example
@verbatim
    %%
    foo|bar<space here>
      { foobar_action();}
@end verbatim
@end example

@code{flex} does not support this feature.

@item
The @code{lex} @code{%r} (generate a Ratfor scanner) option is not
supported. It is not part of the POSIX specification.

@item
After a call to @code{unput()}, @code{yytext} is undefined until the
next token is matched, unless the scanner was built using @code{%array}.
This is not the case with @code{lex} or the POSIX specification. The
@samp{-l} option does away with this incompatibility.

@item
The precedence of the @samp{@{,@}} (numeric range) operator is
different. The AT&T and POSIX specifications of @code{lex}
interpret @samp{abc@{1,3@}} as ``match one, two,
or three occurrences of @samp{abc}'', whereas @code{flex} interprets it
as ``match @samp{ab} followed by one, two, or three occurrences of
@samp{c}''. The @samp{-l} and @samp{--posix} options do away with this
incompatibility.

@item
The precedence of the @samp{^} operator is different.
@code{lex}
interprets @samp{^foo|bar} as ``match either @samp{foo} at the beginning of a
line, or @samp{bar} anywhere'', whereas @code{flex} interprets it as ``match
either @samp{foo} or @samp{bar} if they come at the beginning of a
line''. The latter is in agreement with the POSIX specification.

@item
The special table-size declarations such as @code{%a} supported by
@code{lex} are not required by @code{flex} scanners; @code{flex}
ignores them.

@item
The name @code{FLEX_SCANNER} is @code{#define}d so scanners may be
written for use with either @code{flex} or @code{lex}. Scanners also
include @code{YY_FLEX_MAJOR_VERSION}, @code{YY_FLEX_MINOR_VERSION}
and @code{YY_FLEX_SUBMINOR_VERSION}
indicating which version of @code{flex} generated the scanner. For
example, for the 2.5.22 release, these defines would be 2, 5 and 22
respectively. If the version of @code{flex} being used is a beta
version, then the symbol @code{FLEX_BETA} is defined.

@item
The symbols @samp{[[} and @samp{]]} in the code sections of the input
may conflict with the m4 delimiters. @xref{M4 Dependency}.


@end itemize

@cindex POSIX compliance
@cindex non-POSIX features of flex
The following @code{flex} features are not included in @code{lex} or the
POSIX specification:

@itemize
@item
C++ scanners
@item
%option
@item
start condition scopes
@item
start condition stacks
@item
interactive/non-interactive scanners
@item
yy_scan_string() and friends
@item
yyterminate()
@item
yy_set_interactive()
@item
yy_set_bol()
@item
YY_AT_BOL()
@item
<<EOF>>
@item
<*>
@item
YY_DECL
@item
YY_START
@item
YY_USER_ACTION
@item
YY_USER_INIT
@item
#line directives
@item
%@{@}'s around actions
@item
reentrant C API
@item
multiple actions on a line
@item
almost all of the @code{flex} command-line options
@end itemize

The feature ``multiple actions on a line''
refers to the fact that with @code{flex} you can put multiple actions on
the same line, separated with semicolons, while with @code{lex}, the
following:

@example
@verbatim
    foo    handle_foo(); ++num_foos_seen;
@end verbatim
@end example

is (rather surprisingly) truncated to

@example
@verbatim
    foo    handle_foo();
@end verbatim
@end example

@code{flex} does not truncate the action. Actions that are not enclosed
in braces are simply terminated at the end of the line.

@node Memory Management, Serialized Tables, Lex and Posix, Top
@chapter Memory Management

@cindex memory management
@anchor{memory-management}
This chapter describes how flex handles dynamic memory, and how you can
override the default behavior.

@menu
* The Default Memory Management::
* Overriding The Default Memory Management::
* A Note About yytext And Memory::
@end menu

@node The Default Memory Management, Overriding The Default Memory Management, Memory Management, Memory Management
@section The Default Memory Management

Flex allocates dynamic memory during initialization, and once in a while from
within a call to @code{yylex()}. Initialization takes place during the first call to
@code{yylex()}. Thereafter, flex may reallocate more memory if it needs to enlarge a
buffer. As of version 2.5.9, flex will clean up all memory when you call @code{yylex_destroy}.
@xref{faq-memory-leak}.

Flex allocates dynamic memory for five purposes, listed below.@footnote{The
quantities given here are approximate, and may vary due to host architecture,
compiler configuration, or due to future enhancements to flex.}

@table @asis

@item 16kB for the input buffer.
Flex allocates memory for the character buffer used to perform pattern
matching. Flex must read ahead from the input stream and store it in a large
character buffer. This buffer is typically the largest chunk of dynamic memory
flex consumes. This buffer will grow if necessary, doubling the size each time.
Flex frees this memory when you call @code{yylex_destroy()}. The default size of this
buffer (16384 bytes) is almost always too large. The ideal size for this
buffer is the length of the longest token expected, in bytes, plus a little more. Flex will allocate a few
extra bytes for housekeeping. Currently, to override the size of the input buffer
you must @code{#define YY_BUF_SIZE} to whatever number of bytes you want. We don't plan
to change this in the near future, but we reserve the right to do so if we ever add a more robust memory management
API.

@item 64kB for the REJECT state.
This will only be allocated if you use REJECT.
4821The size is large enough to hold the same number of states as characters in the input buffer. If you override the size of the 4822input buffer (via @code{YY_BUF_SIZE}), then you automatically override the size of this buffer as well. 4823 4824@item 100 bytes for the start condition stack. 4825Flex allocates memory for the start condition stack. This is the stack used 4826for pushing start states, i.e., with yy_push_state(). It will grow if 4827necessary. Since the states are simply integers, this stack doesn't consume 4828much memory. This stack is not present if @code{%option stack} is not 4829specified. You will rarely need to tune this buffer. The ideal size for this 4830stack is the maximum depth expected. The memory for this stack is 4831automatically destroyed when you call yylex_destroy(). @xref{option-stack}. 4832 4833@item 40 bytes for each YY_BUFFER_STATE. 4834Flex allocates memory for each YY_BUFFER_STATE. The buffer state itself 4835is about 40 bytes, plus an additional large character buffer (described above.) 4836The initial buffer state is created during initialization, and with each call 4837to yy_create_buffer(). You can't tune the size of this, but you can tune the 4838character buffer as described above. Any buffer state that you explicitly 4839create by calling yy_create_buffer() is @emph{NOT} destroyed automatically. You 4840must call yy_delete_buffer() to free the memory. The exception to this rule is 4841that flex will delete the current buffer automatically when you call 4842yylex_destroy(). If you delete the current buffer, be sure to set it to NULL. 4843That way, flex will not try to delete the buffer a second time (possibly 4844crashing your program!) At the time of this writing, flex does not provide a 4845growable stack for the buffer states. You have to manage that yourself. 4846@xref{Multiple Input Buffers}. 
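
The buffer-state lifetime rules above can be sketched as follows. This is
only an illustrative sketch for a non-reentrant scanner, not code from the
flex distribution: the helper @code{scan_file()} is a hypothetical name
introduced here, and error handling is kept to a minimum.

@example
@verbatim
%option noyywrap
%{
#include <stdio.h>
%}
%%
.|\n    ;   /* discard all input; only the buffer handling matters */
%%
/* Hypothetical helper: scan one file in its own buffer state. */
int scan_file (const char *path)
{
    FILE *fp = fopen (path, "r");
    YY_BUFFER_STATE buf;

    if (fp == NULL)
        return -1;

    buf = yy_create_buffer (fp, YY_BUF_SIZE);  /* created explicitly ... */
    yy_switch_to_buffer (buf);
    while (yylex () != 0)
        ;
    yy_delete_buffer (buf);                    /* ... so freed explicitly */
    fclose (fp);
    return 0;
}
@end verbatim
@end example

Each call to @code{yy_create_buffer()} is paired with exactly one call to
@code{yy_delete_buffer()}, so no buffer state is left for
@code{yylex_destroy()} to clean up a second time.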
4847 4848@item 84 bytes for the reentrant scanner guts 4849Flex allocates about 84 bytes for the reentrant scanner structure when 4850you call yylex_init(). It is destroyed when the user calls yylex_destroy(). 4851 4852@end table 4853 4854 4855@node Overriding The Default Memory Management, A Note About yytext And Memory, The Default Memory Management, Memory Management 4856@section Overriding The Default Memory Management 4857 4858@cindex yyalloc, overriding 4859@cindex yyrealloc, overriding 4860@cindex yyfree, overriding 4861 4862Flex calls the functions @code{yyalloc}, @code{yyrealloc}, and @code{yyfree} 4863when it needs to allocate or free memory. By default, these functions are 4864wrappers around the standard C functions, @code{malloc}, @code{realloc}, and 4865@code{free}, respectively. You can override the default implementations by telling 4866flex that you will provide your own implementations. 4867 4868To override the default implementations, you must do two things: 4869 4870@enumerate 4871 4872@item Suppress the default implementations by specifying one or more of the 4873following options: 4874 4875@itemize 4876@opindex noyyalloc 4877@item @code{%option noyyalloc} 4878@item @code{%option noyyrealloc} 4879@item @code{%option noyyfree}. 4880@end itemize 4881 4882@item Provide your own implementation of the following functions: @footnote{It 4883is not necessary to override all (or any) of the memory management routines. 
You may, for example, override @code{yyrealloc}, but not @code{yyfree} or
@code{yyalloc}.}

@example
@verbatim
// For a non-reentrant scanner
void * yyalloc (size_t bytes);
void * yyrealloc (void * ptr, size_t bytes);
void   yyfree (void * ptr);

// For a reentrant scanner
void * yyalloc (size_t bytes, void * yyscanner);
void * yyrealloc (void * ptr, size_t bytes, void * yyscanner);
void   yyfree (void * ptr, void * yyscanner);
@end verbatim
@end example

@end enumerate

In the following example, we will override all three memory routines. We assume
that there is a custom allocator with garbage collection. In order to make this
example interesting, we will use a reentrant scanner, passing a pointer to the
custom allocator through @code{yyextra}. Outside of @code{yylex()}, the
@code{yyextra} macro is not usable, so the replacement routines fetch the
pointer with @code{yyget_extra()} instead.

@cindex overriding the memory routines
@example
@verbatim
%{
#include "some_allocator.h"
%}

/* Suppress the default implementations. */
%option noyyalloc noyyrealloc noyyfree
%option reentrant

/* Initialize the allocator. */
%{
#define YY_EXTRA_TYPE  struct allocator*
#define YY_USER_INIT  yyextra = allocator_create();
%}

%%
.|\n   ;
%%

/* Provide our own implementations. */
void * yyalloc (size_t bytes, void* yyscanner) {
    return allocator_alloc (yyget_extra (yyscanner), bytes);
}

void * yyrealloc (void * ptr, size_t bytes, void* yyscanner) {
    /* Pass the old block along so that it can be resized. */
    return allocator_realloc (yyget_extra (yyscanner), ptr, bytes);
}

void yyfree (void * ptr, void * yyscanner) {
    /* Do nothing -- we leave it to the garbage collector.
*/ 4940} 4941 4942@end verbatim 4943@end example 4944 4945 4946@node A Note About yytext And Memory, , Overriding The Default Memory Management, Memory Management 4947@section A Note About yytext And Memory 4948 4949@cindex yytext, memory considerations 4950 4951When flex finds a match, @code{yytext} points to the first character of the 4952match in the input buffer. The string itself is part of the input buffer, and 4953is @emph{NOT} allocated separately. The value of yytext will be overwritten the next 4954time yylex() is called. In short, the value of yytext is only valid from within 4955the matched rule's action. 4956 4957Often, you want the value of yytext to persist for later processing, i.e., by a 4958parser with non-zero lookahead. In order to preserve yytext, you will have to 4959copy it with strdup() or a similar function. But this introduces some headache 4960because your parser is now responsible for freeing the copy of yytext. If you 4961use a yacc or bison parser, (commonly used with flex), you will discover that 4962the error recovery mechanisms can cause memory to be leaked. 4963 4964To prevent memory leaks from strdup'd yytext, you will have to track the memory 4965somehow. Our experience has shown that a garbage collection mechanism or a 4966pooled memory mechanism will save you a lot of grief when writing parsers. 4967 4968@node Serialized Tables, Diagnostics, Memory Management, Top 4969@chapter Serialized Tables 4970@cindex serialization 4971@cindex memory, serialized tables 4972 4973@anchor{serialization} 4974A @code{flex} scanner has the ability to save the DFA tables to a file, and 4975load them at runtime when needed. The motivation for this feature is to reduce 4976the runtime memory footprint. Traditionally, these tables have been compiled into 4977the scanner as C arrays, and are sometimes quite large. Since the tables are 4978compiled into the scanner, the memory used by the tables can never be freed. 
4979This is a waste of memory, especially if an application uses several scanners, 4980but none of them at the same time. 4981 4982The serialization feature allows the tables to be loaded at runtime, before 4983scanning begins. The tables may be discarded when scanning is finished. 4984 4985@menu 4986* Creating Serialized Tables:: 4987* Loading and Unloading Serialized Tables:: 4988* Tables File Format:: 4989@end menu 4990 4991@node Creating Serialized Tables, Loading and Unloading Serialized Tables, Serialized Tables, Serialized Tables 4992@section Creating Serialized Tables 4993@cindex tables, creating serialized 4994@cindex serialization of tables 4995 4996You may create a scanner with serialized tables by specifying: 4997 4998@example 4999@verbatim 5000 %option tables-file=FILE 5001or 5002 --tables-file=FILE 5003@end verbatim 5004@end example 5005 5006These options instruct flex to save the DFA tables to the file @var{FILE}. The tables 5007will @emph{not} be embedded in the generated scanner. The scanner will not 5008function on its own. The scanner will be dependent upon the serialized tables. You must 5009load the tables from this file at runtime before you can scan anything. 5010 5011If you do not specify a filename to @code{--tables-file}, the tables will be 5012saved to @file{lex.yy.tables}, where @samp{yy} is the appropriate prefix. 5013 5014If your project uses several different scanners, you can concatenate the 5015serialized tables into one file, and flex will find the correct set of tables, 5016using the scanner prefix as part of the lookup key. An example follows: 5017 5018@cindex serialized tables, multiple scanners 5019@example 5020@verbatim 5021$ flex --tables-file --prefix=cpp cpp.l 5022$ flex --tables-file --prefix=c c.l 5023$ cat lex.cpp.tables lex.c.tables > all.tables 5024@end verbatim 5025@end example 5026 5027The above example created two scanners, @samp{cpp}, and @samp{c}. 
Since we did 5028not specify a filename, the tables were serialized to @file{lex.c.tables} and 5029@file{lex.cpp.tables}, respectively. Then, we concatenated the two files 5030together into @file{all.tables}, which we will distribute with our project. At 5031runtime, we will open the file and tell flex to load the tables from it. Flex 5032will find the correct tables automatically. (See next section). 5033 5034@node Loading and Unloading Serialized Tables, Tables File Format, Creating Serialized Tables, Serialized Tables 5035@section Loading and Unloading Serialized Tables 5036@cindex tables, loading and unloading 5037@cindex loading tables at runtime 5038@cindex tables, freeing 5039@cindex freeing tables 5040@cindex memory, serialized tables 5041 5042If you've built your scanner with @code{%option tables-file}, then you must 5043load the scanner tables at runtime. This can be accomplished with the following 5044function: 5045 5046@deftypefun int yytables_fload (FILE* @var{fp} [, yyscan_t @var{scanner}]) 5047Locates scanner tables in the stream pointed to by @var{fp} and loads them. 5048Memory for the tables is allocated via @code{yyalloc}. You must call this 5049function before the first call to @code{yylex}. The argument @var{scanner} 5050only appears in the reentrant scanner. 5051This function returns @samp{0} (zero) on success, or non-zero on error. 5052@end deftypefun 5053 5054The loaded tables are @strong{not} automatically destroyed (unloaded) when you 5055call @code{yylex_destroy}. The reason is that you may create several scanners 5056of the same type (in a reentrant scanner), each of which needs access to these 5057tables. To avoid a nasty memory leak, you must call the following function: 5058 5059@deftypefun int yytables_destroy ([yyscan_t @var{scanner}]) 5060Unloads the scanner tables. The tables must be loaded again before you can scan 5061any more data. The argument @var{scanner} only appears in the reentrant 5062scanner. 
This function returns @samp{0} (zero) on success, or non-zero on 5063error. 5064@end deftypefun 5065 5066@strong{The functions @code{yytables_fload} and @code{yytables_destroy} are not 5067thread-safe.} You must ensure that these functions are called exactly once (for 5068each scanner type) in a threaded program, before any thread calls @code{yylex}. 5069After the tables are loaded, they are never written to, and no thread 5070protection is required thereafter -- until you destroy them. 5071 5072@node Tables File Format, , Loading and Unloading Serialized Tables, Serialized Tables 5073@section Tables File Format 5074@cindex tables, file format 5075@cindex file format, serialized tables 5076 5077This section defines the file format of serialized @code{flex} tables. 5078 5079The tables format allows for one or more sets of tables to be 5080specified, where each set corresponds to a given scanner. Scanners are 5081indexed by name, as described below. The file format is as follows: 5082 5083@example 5084@verbatim 5085 TABLE SET 1 5086 +-------------------------------+ 5087 Header | uint32 th_magic; | 5088 | uint32 th_hsize; | 5089 | uint32 th_ssize; | 5090 | uint16 th_flags; | 5091 | char th_version[]; | 5092 | char th_name[]; | 5093 | uint8 th_pad64[]; | 5094 +-------------------------------+ 5095 Table 1 | uint16 td_id; | 5096 | uint16 td_flags; | 5097 | uint32 td_hilen; | 5098 | uint32 td_lolen; | 5099 | void td_data[]; | 5100 | uint8 td_pad64[]; | 5101 +-------------------------------+ 5102 Table 2 | | 5103 . . . 5104 . . . 5105 . . . 5106 . . . 5107 Table n | | 5108 +-------------------------------+ 5109 TABLE SET 2 5110 . 5111 . 5112 . 5113 TABLE SET N 5114@end verbatim 5115@end example 5116 5117The above diagram shows that a complete set of tables consists of a header 5118followed by multiple individual tables. Furthermore, multiple complete sets may 5119be present in the same file, each set with its own header and tables. 
The sets 5120are contiguous in the file. The only way to know if another set follows is to 5121check the next four bytes for the magic number (or check for EOF). The header 5122and tables sections are padded to 64-bit boundaries. Below we describe each 5123field in detail. This format does not specify how the scanner will expand the 5124given data, i.e., data may be serialized as int8, but expanded to an int32 5125array at runtime. This is to reduce the size of the serialized data where 5126possible. Remember, @emph{all integer values are in network byte order}. 5127 5128@noindent 5129Fields of a table header: 5130 5131@table @code 5132@item th_magic 5133Magic number, always 0xF13C57B1. 5134 5135@item th_hsize 5136Size of this entire header, in bytes, including all fields plus any padding. 5137 5138@item th_ssize 5139Size of this entire set, in bytes, including the header, all tables, plus 5140any padding. 5141 5142@item th_flags 5143Bit flags for this table set. Currently unused. 5144 5145@item th_version[] 5146Flex version in NULL-terminated string format. e.g., @samp{2.5.13a}. This is 5147the version of flex that was used to create the serialized tables. 5148 5149@item th_name[] 5150Contains the name of this table set. The default is @samp{yytables}, 5151and is prefixed accordingly, e.g., @samp{footables}. Must be NULL-terminated. 5152 5153@item th_pad64[] 5154Zero or more NULL bytes, padding the entire header to the next 64-bit boundary 5155as calculated from the beginning of the header. 5156@end table 5157 5158@noindent 5159Fields of a table: 5160 5161@table @code 5162@item td_id 5163Specifies the table identifier. 
Possible values are: 5164@table @code 5165@item YYTD_ID_ACCEPT (0x01) 5166@code{yy_accept} 5167@item YYTD_ID_BASE (0x02) 5168@code{yy_base} 5169@item YYTD_ID_CHK (0x03) 5170@code{yy_chk} 5171@item YYTD_ID_DEF (0x04) 5172@code{yy_def} 5173@item YYTD_ID_EC (0x05) 5174@code{yy_ec } 5175@item YYTD_ID_META (0x06) 5176@code{yy_meta} 5177@item YYTD_ID_NUL_TRANS (0x07) 5178@code{yy_NUL_trans} 5179@item YYTD_ID_NXT (0x08) 5180@code{yy_nxt}. This array may be two dimensional. See the @code{td_hilen} 5181field below. 5182@item YYTD_ID_RULE_CAN_MATCH_EOL (0x09) 5183@code{yy_rule_can_match_eol} 5184@item YYTD_ID_START_STATE_LIST (0x0A) 5185@code{yy_start_state_list}. This array is handled specially because it is an 5186array of pointers to structs. See the @code{td_flags} field below. 5187@item YYTD_ID_TRANSITION (0x0B) 5188@code{yy_transition}. This array is handled specially because it is an array of 5189structs. See the @code{td_lolen} field below. 5190@item YYTD_ID_ACCLIST (0x0C) 5191@code{yy_acclist} 5192@end table 5193 5194@item td_flags 5195Bit flags describing how to interpret the data in @code{td_data}. 5196The data arrays are one-dimensional by default, but may be 5197two dimensional as specified in the @code{td_hilen} field. 5198 5199@table @code 5200@item YYTD_DATA8 (0x01) 5201The data is serialized as an array of type int8. 5202@item YYTD_DATA16 (0x02) 5203The data is serialized as an array of type int16. 5204@item YYTD_DATA32 (0x04) 5205The data is serialized as an array of type int32. 5206@item YYTD_PTRANS (0x08) 5207The data is a list of indexes of entries in the expanded @code{yy_transition} 5208array. Each index should be expanded to a pointer to the corresponding entry 5209in the @code{yy_transition} array. We count on the fact that the 5210@code{yy_transition} array has already been seen. 5211@item YYTD_STRUCT (0x10) 5212The data is a list of yy_trans_info structs, each of which consists of 5213two integers. 
There is no padding between struct elements or between structs. 5214The type of each member is determined by the @code{YYTD_DATA*} bits. 5215@end table 5216 5217@item td_hilen 5218If @code{td_hilen} is non-zero, then the data is a two-dimensional array. 5219Otherwise, the data is a one-dimensional array. @code{td_hilen} contains the 5220number of elements in the higher dimensional array, and @code{td_lolen} contains 5221the number of elements in the lowest dimension. 5222 5223Conceptually, @code{td_data} is either @code{sometype td_data[td_lolen]}, or 5224@code{sometype td_data[td_hilen][td_lolen]}, where @code{sometype} is specified 5225by the @code{td_flags} field. It is possible for both @code{td_lolen} and 5226@code{td_hilen} to be zero, in which case @code{td_data} is a zero length 5227array, and no data is loaded, i.e., this table is simply skipped. Flex does not 5228currently generate tables of zero length. 5229 5230@item td_lolen 5231Specifies the number of elements in the lowest dimension array. If this is 5232a one-dimensional array, then it is simply the number of elements in this array. 5233The element size is determined by the @code{td_flags} field. 5234 5235@item td_data[] 5236The table data. This array may be a one- or two-dimensional array, of type 5237@code{int8}, @code{int16}, @code{int32}, @code{struct yy_trans_info}, or 5238@code{struct yy_trans_info*}, depending upon the values in the 5239@code{td_flags}, @code{td_hilen}, and @code{td_lolen} fields. 5240 5241@item td_pad64[] 5242Zero or more NULL bytes, padding the entire table to the next 64-bit boundary as 5243calculated from the beginning of this table. 
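
The byte-order and padding rules of this format can be illustrated with a
small C sketch. It is not taken from the flex sources: the helper names
@code{be32()} and @code{pad64()} are hypothetical, chosen only for this
illustration.

@example
@verbatim
#include <stddef.h>
#include <stdint.h>

/* Decode a 32-bit integer stored in network (big-endian) byte order. */
static uint32_t be32 (const unsigned char *p)
{
    return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16)
         | ((uint32_t) p[2] << 8)  |  (uint32_t) p[3];
}

/* Number of padding bytes needed after `len' bytes to reach the next
   64-bit (8-byte) boundary, counted from the start of the header or
   table. */
static size_t pad64 (size_t len)
{
    return (8 - (len % 8)) % 8;
}
@end verbatim
@end example

With these helpers, the four bytes @code{F1 3C 57 B1} decode to the magic
number 0xF13C57B1, and a table whose fields and data occupy 13 bytes would be
followed by 3 padding bytes.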
@end table

@node Diagnostics, Limitations, Serialized Tables, Top
@chapter Diagnostics

@cindex error reporting, diagnostic messages
@cindex warnings, diagnostic messages

The following is a list of @code{flex} diagnostic messages:

@itemize
@item
@samp{warning, rule cannot be matched} indicates that the given rule
cannot be matched because it follows other rules that will always match
the same text as it. For example, in the following @samp{foo} cannot be
matched because it comes after an identifier ``catch-all'' rule:

@cindex warning, rule cannot be matched
@example
@verbatim
    [a-z]+    got_identifier();
    foo       got_foo();
@end verbatim
@end example

Using @code{REJECT} in a scanner suppresses this warning.

@item
@samp{warning, -s option given but default rule can be matched} means
that it is possible (perhaps only in a particular start condition) that
the default rule (match any single character) is the only one that will
match a particular input. Since @samp{-s} was given, presumably this is
not intended.

@item
@code{reject_used_but_not_detected undefined} or
@code{yymore_used_but_not_detected undefined}. These errors can occur
at compile time. They indicate that the scanner uses @code{REJECT} or
@code{yymore()} but that @code{flex} failed to notice the fact, meaning
that @code{flex} scanned the first two sections looking for occurrences
of these actions and failed to find any, but somehow you snuck some in
(via a @code{#include} file, for example). Use @code{%option reject} or
@code{%option yymore} to indicate to @code{flex} that you really do use
these features.

@item
@samp{flex scanner jammed}. A scanner compiled with
@samp{-s} has encountered an input string which wasn't matched by any of
its rules. This error can also occur due to internal problems.

@item
@samp{token too large, exceeds YYLMAX}. Your scanner uses @code{%array}
and one of its rules matched a string longer than the @code{YYLMAX}
constant (8K bytes by default). You can increase the value by
@code{#define}ing @code{YYLMAX} in the definitions section of your @code{flex}
input.

@item
@samp{scanner requires -8 flag to use the character 'x'}. Your scanner
specification includes recognizing the 8-bit character @samp{'x'} and
you did not specify the @samp{-8} flag, and your scanner defaulted to 7-bit
because you used the @samp{-Cf} or @samp{-CF} table compression options.
See the discussion of the @samp{-7} flag, @ref{Scanner Options}, for
details.

@item
@samp{flex scanner push-back overflow}. You used @code{unput()} to push
back so much text that the scanner's buffer could not hold both the
pushed-back text and the current token in @code{yytext}. Ideally the
scanner should dynamically resize the buffer in this case, but at
present it does not.

@item
@samp{input buffer overflow, can't enlarge buffer because scanner uses
REJECT}. The scanner was working on matching an extremely large token
and needed to expand the input buffer. This doesn't work with scanners
that use @code{REJECT}.

@item
@samp{fatal flex scanner internal error--end of buffer missed}. This can
occur in a scanner which is reentered after a long-jump has jumped out
(or over) the scanner's activation frame. Before reentering the
scanner, use:
@example
@verbatim
    yyrestart( yyin );
@end verbatim
@end example
or, as noted above, switch to using the C++ scanner class.

@item
@samp{too many start conditions in <> construct!} You listed more start
conditions in a <> construct than exist (so you must have listed at
least one of them twice).
5338@end itemize 5339 5340@node Limitations, Bibliography, Diagnostics, Top 5341@chapter Limitations 5342 5343@cindex limitations of flex 5344 5345Some trailing context patterns cannot be properly matched and generate 5346warning messages (@samp{dangerous trailing context}). These are 5347patterns where the ending of the first part of the rule matches the 5348beginning of the second part, such as @samp{zx*/xy*}, where the 'x*' 5349matches the 'x' at the beginning of the trailing context. (Note that 5350the POSIX draft states that the text matched by such patterns is 5351undefined.) For some trailing context rules, parts which are actually 5352fixed-length are not recognized as such, leading to the abovementioned 5353performance loss. In particular, parts using @samp{|} or @samp{@{n@}} 5354(such as @samp{foo@{3@}}) are always considered variable-length. 5355Combining trailing context with the special @samp{|} action can result 5356in @emph{fixed} trailing context being turned into the more expensive 5357@emph{variable} trailing context. For example, in the following: 5358 5359@cindex warning, dangerous trailing context 5360@example 5361@verbatim 5362 %% 5363 abc | 5364 xyz/def 5365@end verbatim 5366@end example 5367 5368Use of @code{unput()} invalidates yytext and yyleng, unless the 5369@code{%array} directive or the @samp{-l} option has been used. 5370Pattern-matching of @code{NUL}s is substantially slower than matching 5371other characters. Dynamic resizing of the input buffer is slow, as it 5372entails rescanning all the text matched so far by the current (generally 5373huge) token. Due to both buffering of input and read-ahead, you cannot 5374intermix calls to @file{<stdio.h>} routines, such as, @b{getchar()}, 5375with @code{flex} rules and expect it to work. Call @code{input()} 5376instead. The total table entries listed by the @samp{-v} flag excludes 5377the number of table entries needed to determine what rule has been 5378matched. 
The number of entries is equal to the number of DFA states if 5379the scanner does not use @code{REJECT}, and somewhat greater than the 5380number of states if it does. @code{REJECT} cannot be used with the 5381@samp{-f} or @samp{-F} options. 5382 5383The @code{flex} internal algorithms need documentation. 5384 5385@node Bibliography, FAQ, Limitations, Top 5386@chapter Additional Reading 5387 5388You may wish to read more about the following programs: 5389@itemize 5390@item lex 5391@item yacc 5392@item sed 5393@item awk 5394@end itemize 5395 5396The following books may contain material of interest: 5397 5398John Levine, Tony Mason, and Doug Brown, 5399@emph{Lex & Yacc}, 5400O'Reilly and Associates. Be sure to get the 2nd edition. 5401 5402M. E. Lesk and E. Schmidt, 5403@emph{LEX -- Lexical Analyzer Generator} 5404 5405Alfred Aho, Ravi Sethi and Jeffrey Ullman, @emph{Compilers: Principles, 5406Techniques and Tools}, Addison-Wesley (1986). Describes the 5407pattern-matching techniques used by @code{flex} (deterministic finite 5408automata). 5409 5410@node FAQ, Appendices, Bibliography, Top 5411@unnumbered FAQ 5412 5413From time to time, the @code{flex} maintainer receives certain 5414questions. Rather than repeat answers to well-understood problems, we 5415publish them here. 

@menu
* When was flex born?::
* How do I expand backslash-escape sequences in C-style quoted strings?::
* Why do flex scanners call fileno if it is not ANSI compatible?::
* Does flex support recursive pattern definitions?::
* How do I skip huge chunks of input (tens of megabytes) while using flex?::
* Flex is not matching my patterns in the same order that I defined them.::
* My actions are executing out of order or sometimes not at all.::
* How can I have multiple input sources feed into the same scanner at the same time?::
* Can I build nested parsers that work with the same input file?::
* How can I match text only at the end of a file?::
* How can I make REJECT cascade across start condition boundaries?::
* Why cant I use fast or full tables with interactive mode?::
* How much faster is -F or -f than -C?::
* If I have a simple grammar cant I just parse it with flex?::
* Why doesn't yyrestart() set the start state back to INITIAL?::
* How can I match C-style comments?::
* The period isn't working the way I expected.::
* Can I get the flex manual in another format?::
* Does there exist a "faster" NDFA->DFA algorithm?::
* How does flex compile the DFA so quickly?::
* How can I use more than 8192 rules?::
* How do I abandon a file in the middle of a scan and switch to a new file?::
* How do I execute code only during initialization (only before the first scan)?::
* How do I execute code at termination?::
* Where else can I find help?::
* Can I include comments in the "rules" section of the file?::
* I get an error about undefined yywrap().::
* How can I change the matching pattern at run time?::
* How can I expand macros in the input?::
* How can I build a two-pass scanner?::
* How do I match any string not matched in the preceding rules?::
* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
* Is there a way to make flex treat NULL like a regular character?::
* Whenever flex can not match the input it says "flex scanner jammed".::
* Why doesn't flex have non-greedy operators like perl does?::
* Memory leak - 16386 bytes allocated by malloc.::
* How do I track the byte offset for lseek()?::
* How do I use my own I/O classes in a C++ scanner?::
* How do I skip as many chars as possible?::
* deleteme00::
* Are certain equivalent patterns faster than others?::
* Is backing up a big deal?::
* Can I fake multi-byte character support?::
* deleteme01::
* Can you discuss some flex internals?::
* unput() messes up yy_at_bol::
* The | operator is not doing what I want::
* Why can't flex understand this variable trailing context pattern?::
* The ^ operator isn't working::
* Trailing context is getting confused with trailing optional patterns::
* Is flex GNU or not?::
* ERASEME53::
* I need to scan if-then-else blocks and while loops::
* ERASEME55::
* ERASEME56::
* ERASEME57::
* Is there a repository for flex scanners?::
* How can I conditionally compile or preprocess my flex input file?::
* Where can I find grammars for lex and yacc?::
* I get an end-of-buffer message for each character scanned.::
* unnamed-faq-62::
* unnamed-faq-63::
* unnamed-faq-64::
* unnamed-faq-65::
* unnamed-faq-66::
* unnamed-faq-67::
* unnamed-faq-68::
* unnamed-faq-69::
* unnamed-faq-70::
* unnamed-faq-71::
* unnamed-faq-72::
* unnamed-faq-73::
* unnamed-faq-74::
* unnamed-faq-75::
* unnamed-faq-76::
* unnamed-faq-77::
* unnamed-faq-78::
* unnamed-faq-79::
* unnamed-faq-80::
* unnamed-faq-81::
* unnamed-faq-82::
* unnamed-faq-83::
* unnamed-faq-84::
* unnamed-faq-85::
* unnamed-faq-86::
* unnamed-faq-87::
* unnamed-faq-88::
* unnamed-faq-90::
* unnamed-faq-91::
* unnamed-faq-92::
* unnamed-faq-93::
* unnamed-faq-94::
* unnamed-faq-95::
* unnamed-faq-96::
* unnamed-faq-97::
* unnamed-faq-98::
* unnamed-faq-99::
* unnamed-faq-100::
* unnamed-faq-101::
* What is the difference between YYLEX_PARAM and YY_DECL?::
* Why do I get "conflicting types for yylex" error?::
* How do I access the values set in a Flex action from within a Bison action?::
@end menu

@node When was flex born?
@unnumberedsec When was flex born?

Vern Paxson took over
the @cite{Software Tools} lex project from Jef Poskanzer in 1982. At that point it
was written in Ratfor. Around 1987 or so, Paxson translated it into C, and
a legend was born :-).

@node How do I expand backslash-escape sequences in C-style quoted strings?
@unnumberedsec How do I expand backslash-escape sequences in C-style quoted strings?

A key point when scanning quoted strings is that you cannot (easily) write
a single rule that will precisely match the string if you allow things
like embedded escape sequences and newlines. If you try to match strings
with a single rule then you'll wind up having to rescan the string anyway
to find any escape sequences.

Instead you can use exclusive start conditions and a set of rules, one for
matching non-escaped text, one for matching a single escape, one for
matching an embedded newline, and one for recognizing the end of the
string. Each of these rules is then faced with the question of where to
put its intermediary results. The best solution is for the rules to
append their local value of @code{yytext} to the end of a ``string literal''
buffer. A rule like the escape-matcher will append to the buffer the
meaning of the escape sequence rather than the literal text in @code{yytext}.
In this way, @code{yytext} does not need to be modified at all.
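
As a rough sketch of this approach (not taken from any particular
scanner), the rules might look like the following. The start condition
@code{STR}, the buffer @code{str_buf}, and the helper
@code{str_append()} are hypothetical names chosen for illustration, and
the fixed-size buffer simply truncates overlong strings; a real scanner
would probably grow the buffer instead.

@example
@verbatim
%option noyywrap
%x STR
%{
#include <stdio.h>

/* Hypothetical ``string literal'' buffer (assumption: fixed size). */
#define STR_MAX 1024
static char str_buf[STR_MAX];
static size_t str_len;

/* Append n bytes, silently truncating once the buffer is full. */
static void str_append(const char *s, size_t n)
{
    while (n-- > 0 && str_len < STR_MAX - 1)
        str_buf[str_len++] = *s++;
}
%}
%%
\"                { str_len = 0; BEGIN(STR); }
<STR>\"           { /* end of string: terminate and report it */
                    str_buf[str_len] = '\0';
                    BEGIN(INITIAL);
                    printf("string: %s\n", str_buf); }
<STR>\\n          str_append("\n", 1);          /* meaning of \n */
<STR>\\t          str_append("\t", 1);          /* meaning of \t */
<STR>\\(.|\n)     str_append(yytext + 1, 1);    /* any other escape */
<STR>\n           str_append(yytext, 1);        /* embedded newline */
<STR>[^\\\n\"]+   str_append(yytext, (size_t) yyleng);
<STR><<EOF>>      { fprintf(stderr, "unterminated string\n");
                    yyterminate(); }
.|\n              { /* ignore text outside strings */ }
%%
int main(void) { return yylex(); }
@end verbatim
@end example

Only two escapes are translated here; any other escaped character is
copied through without its backslash. None of the rules writes to
@code{yytext}, which is the point of the technique.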

@node Why do flex scanners call fileno if it is not ANSI compatible?
@unnumberedsec Why do flex scanners call fileno if it is not ANSI compatible?

Flex scanners call @code{fileno()} in order to get the file descriptor
corresponding to @code{yyin}. The file descriptor may be passed to
@code{isatty()} or @code{read()}, depending upon which @code{%option}s you specified.
If your system does not have @code{fileno()} support, to get rid of the
@code{read()} call, do not specify @code{%option read}. To get rid of the @code{isatty()}
call, you must specify one of @code{%option always-interactive} or
@code{%option never-interactive}.

@node Does flex support recursive pattern definitions?
@unnumberedsec Does flex support recursive pattern definitions?

e.g.,

@example
@verbatim
%%
block   "{"({block}|{statement})*"}"
@end verbatim
@end example

No. You cannot have recursive definitions. The pattern-matching power of
regular expressions in general (and therefore flex scanners, too) is
limited. In particular, regular expressions cannot ``balance'' parentheses
to an arbitrary degree. For example, it's impossible to write a regular
expression that matches all strings containing the same number of @samp{@{}s
as @samp{@}}s. For more powerful pattern matching, you need a parser, such
as @cite{GNU bison}.

@node How do I skip huge chunks of input (tens of megabytes) while using flex?
@unnumberedsec How do I skip huge chunks of input (tens of megabytes) while using flex?

Use @code{fseek()} (or @code{lseek()}) to position @code{yyin}, then call @code{yyrestart()}.

@node Flex is not matching my patterns in the same order that I defined them.
@unnumberedsec Flex is not matching my patterns in the same order that I defined them.

@code{flex} picks the
rule that matches the most text (i.e., the longest possible input string).
This is because @code{flex} uses an entirely different matching technique
(``deterministic finite automata'') that actually does all of the matching
simultaneously, in parallel. (Seems impossible, but it's actually a fairly
simple technique once you understand the principles.)

A side-effect of this parallel matching is that when the input matches more
than one rule, @code{flex} scanners pick the rule that matched the @emph{most} text. This
is explained further in the manual, in the section @ref{Matching}.

If you want @code{flex} to choose a shorter match, then you can work around this
behavior by expanding your short
rule to match more text, then putting back the extra:

@example
@verbatim
data_.*        yyless( 5 ); BEGIN BLOCKIDSTATE;
@end verbatim
@end example

Another fix would be to make the second rule active only during the
@code{<BLOCKIDSTATE>} start condition, and make that start condition exclusive
by declaring it with @code{%x} instead of @code{%s}.

A final fix is to change the input language so that the ambiguity for
@samp{data_} is removed, by adding characters to it that don't match the
identifier rule, or by removing characters (such as @samp{_}) from the
identifier rule so it no longer matches @samp{data_}. (Of course, you might
also not have the option of changing the input language.)

@node My actions are executing out of order or sometimes not at all.
@unnumberedsec My actions are executing out of order or sometimes not at all.

Most likely, you have (in error) placed the opening @samp{@{} of the action
block on a different line than the rule, e.g.,

@example
@verbatim
^(foo|bar)
{  <<<--- WRONG!

}
@end verbatim
@end example

@code{flex} requires that the opening @samp{@{} of an action associated with a rule
begin on the same line as does the rule.
You need instead to write your rules
as follows:

@example
@verbatim
^(foo|bar)  {  // CORRECT!

}
@end verbatim
@end example

@node How can I have multiple input sources feed into the same scanner at the same time?
@unnumberedsec How can I have multiple input sources feed into the same scanner at the same time?

If @dots{}
@itemize
@item
your scanner is free of backtracking (verified using @code{flex}'s @samp{-b} flag),
@item
AND you run your scanner interactively (@samp{-I} option; default unless using special table
compression options),
@item
AND you feed it one character at a time by redefining @code{YY_INPUT} to do so,
@end itemize

then every time it matches a token, it will have exhausted its input
buffer (because the scanner is free of backtracking). This means you
can safely use @code{select()} at that point and only call @code{yylex()} for another
token if @code{select()} indicates there's data available.

That is, move the @code{select()} out from the input function to a point where
it determines whether @code{yylex()} gets called for the next token.

With this approach, you will still have problems if your input can arrive
piecemeal; @code{select()} could inform you that the beginning of a token is
available, you call @code{yylex()} to get it, but it winds up blocking waiting
for the later characters in the token.

Here's another way: Move your input multiplexing inside of @code{YY_INPUT}. That
is, whenever @code{YY_INPUT} is called, it @code{select()}'s to see where input is
available. If input is available for the scanner, it reads and returns the
next byte. If input is available from another source, it calls whatever
function is responsible for reading from that source. (If no input is
available, it blocks until some input is available.)
I've used this technique in an
interpreter I wrote that reads both keyboard input (using a @code{flex} scanner) and
IPC traffic from sockets, and it works fine.

@node Can I build nested parsers that work with the same input file?
@unnumberedsec Can I build nested parsers that work with the same input file?

This is not going to work without some additional effort. The reason is
that @code{flex} block-buffers the input it reads from @code{yyin}. This means that the
``outermost'' @code{yylex()}, when called, will automatically slurp up the first 8K
of input available on @code{yyin}, and subsequent calls to other @code{yylex()}'s won't
see that input. You might be tempted to work around this problem by
redefining @code{YY_INPUT} to only return a small amount of text, but it turns out
that that approach is quite difficult. Instead, the best solution is to
combine all of your scanners into one large scanner, using a different
exclusive start condition for each.

@node How can I match text only at the end of a file?
@unnumberedsec How can I match text only at the end of a file?

There is no way to write a rule which is ``match this text, but only if
it comes at the end of the file''. You can fake it, though, if you happen
to have a character lying around that you don't allow in your input.
Then you redefine @code{YY_INPUT} to call your own routine which, if it sees
an @code{EOF}, returns the magic character first (and remembers to return a
real @code{EOF} next time it's called). Then you could write:

@example
@verbatim
<COMMENT>(.|\n)*{EOF_CHAR}    /* saw comment at EOF */
@end verbatim
@end example

@node How can I make REJECT cascade across start condition boundaries?
@unnumberedsec How can I make REJECT cascade across start condition boundaries?

You can do this as follows.
Suppose you have a start condition @samp{A}, and
after exhausting all of the possible matches in @samp{<A>}, you want to try
matches in @samp{<INITIAL>}. Then you could use the following:

@example
@verbatim
%x A
%%
<A>rule_that_is_long    ...; REJECT;
<A>rule                 ...; REJECT; /* shorter rule */
<A>etc.
...
<A>.|\n  {
    /* Shortest and last rule in <A>, so
     * cascaded REJECTs will eventually
     * wind up matching this rule.  We want
     * to now switch to the initial state
     * and try matching from there instead.
     */
    yyless(0);    /* put back matched text */
    BEGIN(INITIAL);
}
@end verbatim
@end example

@node Why cant I use fast or full tables with interactive mode?
@unnumberedsec Why can't I use fast or full tables with interactive mode?

One of the assumptions
flex makes is that interactive applications are inherently slow (they're
waiting on a human after all).
The restriction has to do with how the scanner detects that it must be finished scanning
a token. For interactive scanners, after scanning each character the current
state is looked up in a table (essentially) to see whether there's a chance
of another input character possibly extending the length of the match. If
not, the scanner halts. For non-interactive scanners, the end-of-token test
is much simpler, basically a compare with 0, so no memory bus cycles. Since
the test occurs in the innermost scanning loop, one would like to make it go
as fast as possible.

Still, it seems reasonable to allow the user to choose to trade off a bit
of performance in this area to gain the corresponding flexibility. There
might be another reason, though, why fast scanners don't support the
interactive option.

@node How much faster is -F or -f than -C?
@unnumberedsec How much faster is -F or -f than -C?

Much faster (factor of 2-3).

@node If I have a simple grammar cant I just parse it with flex?
@unnumberedsec If I have a simple grammar can't I just parse it with flex?

Is your grammar recursive? That's almost always a sign that you're
better off using a parser/scanner rather than just trying to use a scanner
alone.

@node Why doesn't yyrestart() set the start state back to INITIAL?
@unnumberedsec Why doesn't yyrestart() set the start state back to INITIAL?

There are two reasons. The first is that there might
be programs that rely on the start state not changing across file changes.
The second is that beginning with @code{flex} version 2.4, use of @code{yyrestart()} is no longer required,
so fixing the problem there doesn't solve the more general problem.

@node How can I match C-style comments?
@unnumberedsec How can I match C-style comments?

You might be tempted to try something like this:

@example
@verbatim
"/*".*"*/"       // WRONG!
@end verbatim
@end example

or, worse, this:

@example
@verbatim
"/*"(.|\n)*"*/"  // WRONG!
@end verbatim
@end example

The above rules will eat too much input, and blow up on things like:

@example
@verbatim
/* a comment */ do_my_thing( "oops */" );
@end verbatim
@end example

Here is one way which allows you to track line information (it assumes
that @code{IN_COMMENT} has been declared as an exclusive start condition
with @code{%x}):

@example
@verbatim
<INITIAL>{
"/*"      BEGIN(IN_COMMENT);
}
<IN_COMMENT>{
"*/"      BEGIN(INITIAL);
[^*\n]+   // eat comment in chunks
"*"       // eat the lone star
\n        yylineno++;
}
@end verbatim
@end example

@node The period isn't working the way I expected.
@unnumberedsec The '.' isn't working the way I expected.

Here are some tips for using @samp{.}:

@itemize
@item
A common mistake is to place the grouping parenthesis AFTER an operator, when
you really meant to place the parenthesis BEFORE the operator, e.g., you
probably want this @code{(foo|bar)+} and NOT this @code{(foo|bar+)}.

The first pattern matches the words @samp{foo} or @samp{bar} one or more
times, e.g., it matches the text @samp{barfoofoobarfoo}. The
second pattern matches a single instance of @samp{foo} or a single instance of
@samp{ba} followed by one or more @samp{r}s, e.g., it matches the text @samp{barrrr}.
@item
A @samp{.} inside @samp{[]}'s just means a literal @samp{.} (period),
and NOT ``any character except newline''.
@item
Remember that @samp{.} matches any character EXCEPT @samp{\n} (and @code{EOF}).
If you really want to match ANY character, including newlines, then use @code{(.|\n)}.
Beware that the regex @code{(.|\n)+} will match your entire input!
@item
Finally, if you want to match a literal @samp{.} (a period), then use @samp{[.]} or @samp{"."}.
@end itemize

@node Can I get the flex manual in another format?
@unnumberedsec Can I get the flex manual in another format?

The @code{flex} source distribution includes a texinfo manual. You are
free to convert that texinfo into whatever format you desire. The
@code{texinfo} package includes tools for conversion to a number of formats.

@node Does there exist a "faster" NDFA->DFA algorithm?
@unnumberedsec Does there exist a "faster" NDFA->DFA algorithm?

There's no way around the potential exponential running time - it
can take you exponential time just to enumerate all of the DFA states.
In practice, though, the running time is closer to linear, or sometimes
quadratic.

@node How does flex compile the DFA so quickly?
@unnumberedsec How does flex compile the DFA so quickly?

There are two big speed wins that @code{flex} uses:

@enumerate
@item
It analyzes the input rules to construct equivalence classes for those
characters that always make the same transitions. It then rewrites the NFA
using equivalence classes for transitions instead of characters. This cuts
down the NFA->DFA computation time dramatically, to the point where, for
uncompressed DFA tables, the DFA generation is often I/O bound in writing out
the tables.
@item
It maintains hash values for previously computed DFA states, so testing
whether a newly constructed DFA state is equivalent to a previously constructed
state can be done very quickly, by first comparing hash values.
@end enumerate

@node How can I use more than 8192 rules?
@unnumberedsec How can I use more than 8192 rules?

@code{Flex} is compiled with an upper limit of 8192 rules per scanner.
If you need more than 8192 rules in your scanner, you'll have to recompile @code{flex}
with the following changes in @file{flexdef.h}:

@example
@verbatim
<    #define YY_TRAILING_MASK 0x2000
<    #define YY_TRAILING_HEAD_MASK 0x4000
--
>    #define YY_TRAILING_MASK 0x20000000
>    #define YY_TRAILING_HEAD_MASK 0x40000000
@end verbatim
@end example

This should work okay as long as your C compiler uses 32 bit integers.
But you might want to think about whether using such a huge number of rules
is the best way to solve your problem.

The following may also be relevant:

With luck, you should be able to increase the definitions in @file{flexdef.h} for:

@example
@verbatim
#define JAMSTATE -32766 /* marks a reference to the state that always jams */
#define MAXIMUM_MNS 31999
#define BAD_SUBSCRIPT -32767
@end verbatim
@end example

recompile everything, and it'll all work.
Flex only has these 16-bit-like
values built into it because a long time ago it was developed on a machine
with 16-bit ints. I've given this advice to others in the past but haven't
heard back from them whether it worked okay or not...

@node How do I abandon a file in the middle of a scan and switch to a new file?
@unnumberedsec How do I abandon a file in the middle of a scan and switch to a new file?

Just call @code{yyrestart(newfile)}. Be sure to reset the start state if you want a
``fresh start'', since @code{yyrestart()} does NOT reset the start state back to @code{INITIAL}.

@node How do I execute code only during initialization (only before the first scan)?
@unnumberedsec How do I execute code only during initialization (only before the first scan)?

You can specify an initial action by defining the macro @code{YY_USER_INIT} (though
note that @code{yyout} may not be available at the time this macro is executed). Or you
can add to the beginning of your rules section:

@example
@verbatim
%%
    /* Must be indented! */
    static int did_init = 0;

    if ( ! did_init ){
        do_my_init();
        did_init = 1;
    }
@end verbatim
@end example

@node How do I execute code at termination?
@unnumberedsec How do I execute code at termination?

You can specify an action for the @code{<<EOF>>} rule.

@node Where else can I find help?
@unnumberedsec Where else can I find help?

You can find the flex homepage on the web at
@uref{http://flex.sourceforge.net/}. See that page for details about flex
mailing lists as well.

@node Can I include comments in the "rules" section of the file?
@unnumberedsec Can I include comments in the "rules" section of the file?

Yes, just about anywhere you want to. See the manual for the specific syntax.

@node I get an error about undefined yywrap().
@unnumberedsec I get an error about undefined yywrap().

You must supply a @code{yywrap()} function of your own, or link to @file{libfl.a}
(which provides one), or use

@example
@verbatim
%option noyywrap
@end verbatim
@end example

in your source to say you don't want a @code{yywrap()} function.

@node How can I change the matching pattern at run time?
@unnumberedsec How can I change the matching pattern at run time?

You can't; the patterns are compiled into a static table when flex builds the scanner.

@node How can I expand macros in the input?
@unnumberedsec How can I expand macros in the input?

The best way to approach this problem is at a higher level, e.g., in the parser.

However, you can do this using multiple input buffers.

@example
@verbatim
%%
macro/[a-z]+    {
    /* Saw the macro "macro" followed by extra stuff. */
    main_buffer = YY_CURRENT_BUFFER;
    expansion_buffer = yy_scan_string(expand(yytext));
    yy_switch_to_buffer(expansion_buffer);
}

<<EOF>>         {
    if ( expansion_buffer )
    {
        // We were doing an expansion, return to where
        // we were.
        yy_switch_to_buffer(main_buffer);
        yy_delete_buffer(expansion_buffer);
        expansion_buffer = 0;
    }
    else
        yyterminate();
}
@end verbatim
@end example

You probably will want a stack of expansion buffers to allow nested macros.
Hopefully, though, the idea is clear from the above.

@node How can I build a two-pass scanner?
@unnumberedsec How can I build a two-pass scanner?

One way to do it is to filter the first pass to a temporary file,
then process the temporary file on the second pass. You will probably see a
performance hit, due to all the disk I/O.

When you need to look far ahead like this, it almost always means
that the right solution is to build a parse tree of the entire input, then
walk it after the parse in order to generate the output. In a sense, this
is a two-pass approach, once through the text and once through the parse
tree, but the performance hit for the latter is usually an order of magnitude
smaller, since everything is already classified, in binary format, and
residing in memory.

@node How do I match any string not matched in the preceding rules?
@unnumberedsec How do I match any string not matched in the preceding rules?

One way to assign precedence is to place the more specific rules first. If
two rules would match the same input (same sequence of characters) then the
first rule listed in the @code{flex} input wins, e.g.,

@example
@verbatim
%%
foo[a-zA-Z_]+    return FOO_ID;
bar[a-zA-Z_]+    return BAR_ID;
[a-zA-Z_]+       return GENERIC_ID;
@end verbatim
@end example

Note that the rule @code{[a-zA-Z_]+} must come @emph{after} the others. It will match the
same amount of text as the more specific rules, and in that case the
@code{flex} scanner will pick the first rule listed in your scanner as the
one to match.

@node I am trying to port code from AT&T lex that uses yysptr and yysbuf.
@unnumberedsec I am trying to port code from AT&T lex that uses yysptr and yysbuf.

Those are internal variables pointing into the AT&T scanner's input buffer. I
imagine they're being manipulated in user versions of the @code{input()} and @code{unput()}
functions. If so, what you need to do is analyze those functions to figure out
what they're doing, and then replace @code{input()} with an appropriate definition of
@code{YY_INPUT}. You shouldn't need to (and must not) replace
@code{flex}'s @code{unput()} function.
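
As an illustration only, a replacement @code{YY_INPUT} could look
something like the sketch below. The function @code{my_read()} and the
in-memory string @code{src} are hypothetical stand-ins for whatever the
old @code{input()} routine actually read from; only the @code{YY_INPUT}
macro itself, with its @code{buf}, @code{result}, and @code{max_size}
arguments, is part of @code{flex}.

@example
@verbatim
%option noyywrap
%{
#include <stdio.h>
#include <string.h>

/* Hypothetical input source standing in for the old input() logic. */
static const char *src = "hello world\n";
static size_t src_pos;

/* Copy up to max_size bytes into buf; return the count, 0 at end. */
static int my_read(char *buf, int max_size)
{
    size_t left = strlen(src) - src_pos;
    size_t n = left < (size_t) max_size ? left : (size_t) max_size;

    memcpy(buf, src + src_pos, n);
    src_pos += n;
    return (int) n;
}

/* A result of 0 is YY_NULL, which tells the scanner it has hit EOF. */
#define YY_INPUT(buf, result, max_size) \
    { (result) = my_read((buf), (int) (max_size)); }
%}
%%
[a-z]+    printf("word: %s\n", yytext);
.|\n      { /* ignore everything else */ }
%%
int main(void) { return yylex(); }
@end verbatim
@end example

Because the scanner still owns its buffer, @code{flex}'s own
@code{unput()} keeps working with this definition and needs no
replacement.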

@node Is there a way to make flex treat NULL like a regular character?
@unnumberedsec Is there a way to make flex treat NULL like a regular character?

Yes, @samp{\0} and @samp{\x00} should both do the trick. Perhaps you have an ancient
version of @code{flex}. The latest release is version @value{VERSION}.

@node Whenever flex can not match the input it says "flex scanner jammed".
@unnumberedsec Whenever flex can not match the input it says "flex scanner jammed".

You need to add a rule that matches the otherwise-unmatched text,
e.g.,

@example
@verbatim
%option yylineno
%%
[[a bunch of rules here]]

.       printf("bad input character '%s' at line %d\n", yytext, yylineno);
@end verbatim
@end example

See @code{%option default} for more information.

@node Why doesn't flex have non-greedy operators like perl does?
@unnumberedsec Why doesn't flex have non-greedy operators like perl does?

A DFA can do a non-greedy match by stopping
the first time it enters an accepting state, instead of consuming input until
it determines that no further matching is possible (a ``jam'' state). This
is actually easier to implement than longest leftmost match (which flex does).

But it's also much less useful than longest leftmost match. In general,
when you find yourself wishing for non-greedy matching, that's usually a
sign that you're trying to make the scanner do some parsing. That's
generally the wrong approach, since the scanner lacks the power to do a decent job.
It is better either to introduce a separate parser, or to split the scanner
into multiple scanners using (exclusive) start conditions.

You might have
a separate start state once you've seen the @samp{BEGIN}.
In that state, you
might then have a regex that will match @samp{END} (to kick you out of the
state), and perhaps @samp{(.|\n)} to get a single character within the chunk @dots{}

This approach also has much better error-reporting properties.

@node Memory leak - 16386 bytes allocated by malloc.
@unnumberedsec Memory leak - 16386 bytes allocated by malloc.
@anchor{faq-memory-leak}

UPDATED 2002-07-10: As of @code{flex} version 2.5.9, this leak means that you did not
call @code{yylex_destroy()}. If you are using an earlier version of @code{flex}, then read
on.

The leak is about 16426 bytes. That is, (8192 * 2 + 2) for the read-buffer, and
about 40 for @code{struct yy_buffer_state} (depending upon alignment). The leak is in
the non-reentrant C scanner only (NOT in the reentrant scanner, NOT in the C++
scanner). Since @code{flex} doesn't know when you are done, the buffer is never freed.

However, the leak won't multiply since the buffer is reused no matter how many
times you call @code{yylex()}.

If you want to reclaim the memory when you are completely done scanning, then
you might try this:

@example
@verbatim
/* For non-reentrant C scanner only. */
yy_delete_buffer(YY_CURRENT_BUFFER);
yy_init = 1;
@end verbatim
@end example

Note: @code{yy_init} is an ``internal variable'', and hasn't been tested in this
situation. It is possible that some other globals may need resetting as well.

@node How do I track the byte offset for lseek()?
@unnumberedsec How do I track the byte offset for lseek()?

@example
@verbatim
>   We thought that it would be possible to have this number through the
>   evaluation of the following expression:
>
>   seek_position = (no_buffers)*YY_READ_BUF_SIZE + yy_c_buf_p - YY_CURRENT_BUFFER->yy_ch_buf
@end verbatim
@end example

While this is the right idea, it has two problems. The first is that
it's possible that @code{flex} will request less than @code{YY_READ_BUF_SIZE} during
an invocation of @code{YY_INPUT} (or that your input source will return less
even though @code{YY_READ_BUF_SIZE} bytes were requested). The second problem
is that when refilling its internal buffer, @code{flex} keeps some characters
from the previous buffer (because usually it's in the middle of a match,
and needs those characters to construct @code{yytext} for the match once it's
done). Because of this, @code{yy_c_buf_p - YY_CURRENT_BUFFER->yy_ch_buf} won't
be exactly the number of characters already read from the current buffer.

An alternative solution is to count the number of characters you've matched
since starting to scan. This can be done by using @code{YY_USER_ACTION}. For
example,

@example
@verbatim
#define YY_USER_ACTION num_chars += yyleng;
@end verbatim
@end example

(You need to be careful to update your bookkeeping if you use @code{yymore()},
@code{yyless()}, @code{unput()}, or @code{input()}.)

@node How do I use my own I/O classes in a C++ scanner?
@unnumberedsec How do I use my own I/O classes in a C++ scanner?

When the flex C++ scanning class rewrite finally happens, then this sort of thing should become much easier.

@cindex LexerOutput, overriding
@cindex LexerInput, overriding
@cindex overriding LexerOutput
@cindex overriding LexerInput
@cindex customizing I/O in C++ scanners
@cindex C++ I/O, customizing
You can do this by passing the various functions (such as @code{LexerInput()}
and @code{LexerOutput()}) NULL @code{iostream*}'s, and then
dealing with your own I/O classes surreptitiously (i.e., stashing them in
special member variables). This works because the only assumption the
lexer makes about the iostreams is that they're
ultimately passed to @code{LexerInput()} and @code{LexerOutput()}, which then do whatever
is necessary with them.

@c faq edit stopped here
@node How do I skip as many chars as possible?
@unnumberedsec How do I skip as many chars as possible?

How do I skip as many chars as possible -- without interfering with the other
patterns?

In the example below, we want to skip over characters until we see the phrase
"endskip". The following will @emph{NOT} work correctly (do you see why not?)

@example
@verbatim
/* INCORRECT SCANNER */
%x SKIP
%%
<INITIAL>startskip   BEGIN(SKIP);
...
<SKIP>"endskip"       BEGIN(INITIAL);
<SKIP>.*             ;
@end verbatim
@end example

The problem is that the pattern @code{.*} matches the rest of the line,
including the word "endskip", and @code{flex} always prefers the longest match.
The simplest (but slow) fix is:

@example
@verbatim
<SKIP>"endskip"      BEGIN(INITIAL);
<SKIP>.              ;
@end verbatim
@end example

A faster fix makes the second rule match more text at a time, without
letting it match "endskip" plus something else. So for example:

@example
@verbatim
<SKIP>"endskip"      BEGIN(INITIAL);
<SKIP>[^e]+          ;
<SKIP>.              ;/* so you eat up e's, too */
@end verbatim
@end example

@c TODO: Evaluate this faq.
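
The matching behavior behind the three @code{<SKIP>} rules of the skipping
FAQ above can be checked by hand. The following is only a sketch in plain
C++, not code generated by @code{flex}: the helper name
@code{skip_until_endskip} is hypothetical, and it assumes the usual
@code{flex} policy that the longest match wins and that ties go to the
earlier rule.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of flex). It simulates the three <SKIP> rules
//     "endskip"     [^e]+     .
// under the longest-match policy, with ties going to the earlier rule.
// Returns the offset just past "endskip", or std::string::npos if the
// input ends before "endskip" is seen.
std::size_t skip_until_endskip(const std::string& in, std::size_t pos = 0)
{
    const std::string kEnd = "endskip";
    while (pos < in.size()) {
        // Rule 1: the literal "endskip".
        std::size_t len1 =
            (in.compare(pos, kEnd.size(), kEnd) == 0) ? kEnd.size() : 0;

        // Rule 2: [^e]+ is a run of characters other than 'e'
        // (newlines included, since a negated class matches them).
        std::size_t len2 = 0;
        while (pos + len2 < in.size() && in[pos + len2] != 'e')
            ++len2;

        // Rule 3: '.' is any single character except newline.
        std::size_t len3 = (in[pos] != '\n') ? 1 : 0;

        if (len1 > 0 && len1 >= len2 && len1 >= len3)
            return pos + len1;              // corresponds to BEGIN(INITIAL)

        // Otherwise discard the longer of rule 2 and rule 3.
        pos += (len2 >= len3) ? len2 : len3;
    }
    return std::string::npos;
}
```

Because @code{[^e]+} can never run across an @samp{e}, every @samp{e} starts
a new token, so the literal "endskip" is always tried at the position where
it begins.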
@node deleteme00
@unnumberedsec deleteme00
@example
@verbatim
QUESTION:
When was flex born?

Vern Paxson took over
the Software Tools lex project from Jef Poskanzer in 1982.  At that point it
was written in Ratfor.  Around 1987 or so, Paxson translated it into C, and
a legend was born :-).
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Are certain equivalent patterns faster than others?
@unnumberedsec Are certain equivalent patterns faster than others?
@example
@verbatim
To: Adoram Rogel <adoram@orna.hybridge.com>
Subject: Re: Flex 2.5.2 performance questions
In-reply-to: Your message of Wed, 18 Sep 96 11:12:17 EDT.
Date: Wed, 18 Sep 96 10:51:02 PDT
From: Vern Paxson <vern>

[Note, the most recent flex release is 2.5.4, which you can get from
ftp.ee.lbl.gov.  It has bug fixes over 2.5.2 and 2.5.3.]

> 1. Using the pattern
>    ([Ff](oot)?)?[Nn](ote)?(\.)?
>    instead of
>    (((F|f)oot(N|n)ote)|((N|n)ote)|((N|n)\.)|((F|f)(N|n)(\.)))
>    (in a very complicated flex program) caused the program to slow from
>    300K+/min to 100K/min (no other changes were done).

These two are not equivalent.  For example, the first can match "footnote."
but the second can only match "footnote".  This is almost certainly the
cause of the discrepancy - the slower scanner run is matching more tokens,
and/or having to do more backing up.

> 2. Which of these two are better: [Ff]oot or (F|f)oot ?

From a performance point of view, they're equivalent (modulo presumably
minor effects such as memory cache hit rates; and the presence of trailing
context, see below).  From a space point of view, the first is slightly
preferable.

> 3. I have a pattern that look like this:
>    pats {p1}|{p2}|{p3}|...|{p50}     (50 patterns ORd)
>
>    running yet another complicated program that includes the following rule:
>    <snext>{and}/{no4}{bb}{pats}
>
>    gets me to "too complicated - over 32,000 states"...

I can't tell from this example whether the trailing context is variable-length
or fixed-length (it could be the latter if {and} is fixed-length).  If it's
variable length, which flex -p will tell you, then this reflects a basic
performance problem, and if you can eliminate it by restructuring your
scanner, you will see significant improvement.

>    so I divided {pats} to {pats1}, {pats2},..., {pats5} each consists of about
>    10 patterns and changed the rule to be 5 rules.
>    This did compile, but what is the rule of thumb here ?

The rule is to avoid trailing context other than fixed-length, in which for
a/b, either the 'a' pattern or the 'b' pattern have a fixed length.  Use
of the '|' operator automatically makes the pattern variable length, so in
this case '[Ff]oot' is preferred to '(F|f)oot'.

> 4. I changed a rule that looked like this:
>    <snext8>{and}{bb}/{ROMAN}[^A-Za-z] { BEGIN...
>
>    to the next 2 rules:
>    <snext8>{and}{bb}/{ROMAN}[A-Za-z] { ECHO;}
>    <snext8>{and}{bb}/{ROMAN}         { BEGIN...
>
>    Again, I understand the using [^...] will cause a great performance loss

Actually, it doesn't cause any sort of performance loss.  It's a surprising
fact about regular expressions that they always match in linear time
regardless of how complex they are.

>    but are there any specific rules about it ?

See the "Performance Considerations" section of the man page, and also
the example in MISC/fastwc/.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Is backing up a big deal?
@unnumberedsec Is backing up a big deal?
@example
@verbatim
To: Adoram Rogel <adoram@hybridge.com>
Subject: Re: Flex 2.5.2 performance questions
In-reply-to: Your message of Thu, 19 Sep 96 10:16:04 EDT.
Date: Thu, 19 Sep 96 09:58:00 PDT
From: Vern Paxson <vern>

> a lot about the backing up problem.
> I believe that there lies my biggest problem, and I'll try to improve
> it.

Since you have variable trailing context, this is a bigger performance
problem.  Fixing it is usually easier than fixing backing up, which in a
complicated scanner (yours seems to fit the bill) can be extremely
difficult to do correctly.

You also don't mention what flags you are using for your scanner.
-f makes a large speed difference, and -Cfe buys you nearly as much
speed but the resulting scanner is considerably smaller.

> I have an | operator in {and} and in {pats} so both of them are variable
> length.

-p should have reported this.

> Is changing one of them to fixed-length is enough ?

Yes.

> Is it possible to change the 32,000 states limit ?

Yes.  I've appended instructions on how.  Before you make this change,
though, you should think about whether there are ways to fundamentally
simplify your scanner - those are certainly preferable!

                Vern

To increase the 32K limit (on a machine with 32 bit integers), you increase
the magnitude of the following in flexdef.h:

#define JAMSTATE -32766 /* marks a reference to the state that always jams */
#define MAXIMUM_MNS 31999
#define BAD_SUBSCRIPT -32767
#define MAX_SHORT 32700

Adding a 0 or two after each should do the trick.
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Can I fake multi-byte character support?
@unnumberedsec Can I fake multi-byte character support?
@example
@verbatim
To: Heeman_Lee@hp.com
Subject: Re: flex - multi-byte support?
In-reply-to: Your message of Thu, 03 Oct 1996 17:24:04 PDT.
Date: Fri, 04 Oct 1996 11:42:18 PDT
From: Vern Paxson <vern>

>      I assume as long as my *.l file defines the
>      range of expected character code values (in octal format), flex will
>      scan the file and read multi-byte characters correctly. But I have no
>      confidence in this assumption.

Your lack of confidence is justified - this won't work.

Flex has in it a widespread assumption that the input is processed
one byte at a time.  Fixing this is on the to-do list, but is involved,
so it won't happen any time soon.  In the interim, the best I can suggest
(unless you want to try fixing it yourself) is to write your rules in
terms of pairs of bytes, using definitions in the first section:

        X       \xfe\xc2
        ...
        %%
        foo{X}bar       found_foo_fe_c2_bar();

etc.  Definitely a pain - sorry about that.

By the way, the email address you used for me is ancient, indicating you
have a very old version of flex.  You can get the most recent, 2.5.4, from
ftp.ee.lbl.gov.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node deleteme01
@unnumberedsec deleteme01
@example
@verbatim
To: moleary@primus.com
Subject: Re: Flex / Unicode compatibility question
In-reply-to: Your message of Tue, 22 Oct 1996 10:15:42 PDT.
Date: Tue, 22 Oct 1996 11:06:13 PDT
From: Vern Paxson <vern>

Unfortunately flex at the moment has a widespread assumption within it
that characters are processed 8 bits at a time.  I don't see any easy
fix for this (other than writing your rules in terms of double characters -
a pain).
I also don't know of a wider lex, though you might try surfing
the Plan 9 stuff because I know it's a Unicode system, and also the PCCT
toolkit (try searching say Alta Vista for "Purdue Compiler Construction
Toolkit").

Fixing flex to handle wider characters is on the long-term to-do list.
But since flex is a strictly spare-time project these days, this probably
won't happen for quite a while, unless someone else does it first.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Can you discuss some flex internals?
@unnumberedsec Can you discuss some flex internals?
@example
@verbatim
To: Johan Linde <jl@theophys.kth.se>
Subject: Re: translation of flex
In-reply-to: Your message of Sun, 10 Nov 1996 09:16:36 PST.
Date: Mon, 11 Nov 1996 10:33:50 PST
From: Vern Paxson <vern>

> I'm working for the Swedish team translating GNU program, and I'm currently
> working with flex. I have a few questions about some of the messages which
> I hope you can answer.

All of the things you're wondering about, by the way, concerning flex
internals - probably the only person who understands what they mean in
English is me!  So I wouldn't worry too much about getting them right.
That said ...

> #: main.c:545
> msgid "  %d protos created\n"
>
> Does proto mean prototype?

Yes - prototypes of state compression tables.

> #: main.c:539
> msgid "  %d/%d (peak %d) template nxt-chk entries created\n"
>
> Here I'm mainly puzzled by 'nxt-chk'. I guess it means 'next-check'. (?)
> However, 'template next-check entries' doesn't make much sense to me. To be
> able to find a good translation I need to know a little bit more about it.

There is a scheme in the Aho/Sethi/Ullman compiler book for compressing
scanner tables.  It involves creating two pairs of tables.
The first has
"base" and "default" entries, the second has "next" and "check" entries.
The "base" entry is indexed by the current state and yields an index into
the next/check table.  The "default" entry gives what to do if the state
transition isn't found in next/check.  The "next" entry gives the next
state to enter, but only if the "check" entry verifies that this entry is
correct for the current state.  Flex creates templates of series of
next/check entries and then encodes differences from these templates as a
way to compress the tables.

> #: main.c:533
> msgid "  %d/%d base-def entries created\n"
>
> The same problem here for 'base-def'.

See above.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unput() messes up yy_at_bol
@unnumberedsec unput() messes up yy_at_bol
@example
@verbatim
To: Xinying Li <xli@npac.syr.edu>
Subject: Re: FLEX ?
In-reply-to: Your message of Wed, 13 Nov 1996 17:28:38 PST.
Date: Wed, 13 Nov 1996 19:51:54 PST
From: Vern Paxson <vern>

> "unput()" them to input flow, question occurs. If I do this after I scan
> a carriage, the variable "YY_CURRENT_BUFFER->yy_at_bol" is changed. That
> means the carriage flag has gone.

You can control this by calling yy_set_bol().  It's described in the manual.

>      And if in pre-reading it goes to the end of file, is anything done
> to control the end of current buffer and end of file?

No, there's no way to put back an end-of-file.

>      By the way I am using flex 2.5.2 and using the "-l".

The latest release is 2.5.4, by the way.  It fixes some bugs in 2.5.2 and
2.5.3.  You can get it from ftp.ee.lbl.gov.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node The | operator is not doing what I want
@unnumberedsec The | operator is not doing what I want
@example
@verbatim
To: Alain.ISSARD@st.com
Subject: Re: Start condition with FLEX
In-reply-to: Your message of Mon, 18 Nov 1996 09:45:02 PST.
Date: Mon, 18 Nov 1996 10:41:34 PST
From: Vern Paxson <vern>

> I am not able to use the start condition scope and to use the | (OR) with
> rules having start conditions.

The problem is that if you use '|' as a regular expression operator, for
example "a|b" meaning "match either 'a' or 'b'", then it must *not* have
any blanks around it.  If you instead want the special '|' *action* (which
from your scanner appears to be the case), which is a way of giving two
different rules the same action:

        foo     |
        bar     matched_foo_or_bar();

then '|' *must* be separated from the first rule by whitespace and *must*
be followed by a new line.  You *cannot* write it as:

        foo | bar       matched_foo_or_bar();

even though you might think you could because yacc supports this syntax.
The reason for this unfortunate incompatibility is historical, but it's
unlikely to be changed.

Your problems with start condition scope are simply due to syntax errors
from your use of '|' later confusing flex.

Let me know if you still have problems.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Why can't flex understand this variable trailing context pattern?
@unnumberedsec Why can't flex understand this variable trailing context pattern?
@example
@verbatim
To: Gregory Margo <gmargo@newton.vip.best.com>
Subject: Re: flex-2.5.3 bug report
In-reply-to: Your message of Sat, 23 Nov 1996 16:50:09 PST.
Date: Sat, 23 Nov 1996 17:07:32 PST
From: Vern Paxson <vern>

> Enclosed is a lex file that "real" lex will process, but I cannot get
> flex to process it.  Could you try it and maybe point me in the right direction?

Your problem is that some of the definitions in the scanner use the '/'
trailing context operator, and have it enclosed in ()'s.  Flex does not
allow this operator to be enclosed in ()'s because doing so allows undefined
regular expressions such as "(a/b)+".  So the solution is to remove the
parentheses.  Note that you must also be building the scanner with the -l
option for AT&T lex compatibility.  Without this option, flex automatically
encloses the definitions in parentheses.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node The ^ operator isn't working
@unnumberedsec The ^ operator isn't working
@example
@verbatim
To: Thomas Hadig <hadig@toots.physik.rwth-aachen.de>
Subject: Re: Flex Bug ?
In-reply-to: Your message of Tue, 26 Nov 1996 14:35:01 PST.
Date: Tue, 26 Nov 1996 11:15:05 PST
From: Vern Paxson <vern>

> In my lexer code, i have the line :
> ^\*.*          { }
>
> Thus all lines starting with an asterisk (*) are comment lines.
> This does not work !

I can't get this problem to reproduce - it works fine for me.  Note
though that if what you have is slightly different:

        COMMENT ^\*.*
        %%
        {COMMENT}       { }

then it won't work, because flex pushes back macro definitions enclosed
in ()'s, so the rule becomes

        (^\*.*)         { }

and now that the '^' operator is not at the immediate beginning of the
line, it's interpreted as just a regular character.  You can avoid this
behavior by using the "-l" lex-compatibility flag, or "%option lex-compat".

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Trailing context is getting confused with trailing optional patterns
@unnumberedsec Trailing context is getting confused with trailing optional patterns
@example
@verbatim
To: Adoram Rogel <adoram@hybridge.com>
Subject: Re: Flex 2.5.4 BOF ???
In-reply-to: Your message of Tue, 26 Nov 1996 16:10:41 PST.
Date: Wed, 27 Nov 1996 10:56:25 PST
From: Vern Paxson <vern>

>     Organization(s)?/[a-z]
>
> This matched "Organizations" (looking in debug mode, the trailing s
> was matched with trailing context instead of the optional (s) in the
> end of the word.

That should only happen with lex.  Flex can properly match this pattern.
(That might be what you're saying, I'm just not sure.)

> Is there a way to avoid this dangerous trailing context problem ?

Unfortunately, there's no easy way.  On the other hand, I don't see why
it should be a problem.  Lex's matching is clearly wrong, and I'd hope
that usually the intent remains the same as expressed with the pattern,
so flex's matching will be correct.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Is flex GNU or not?
@unnumberedsec Is flex GNU or not?
@example
@verbatim
To: Cameron MacKinnon <mackin@interlog.com>
Subject: Re: Flex documentation bug
In-reply-to: Your message of Mon, 02 Dec 1996 00:07:08 PST.
Date: Sun, 01 Dec 1996 22:29:39 PST
From: Vern Paxson <vern>

> I'm not sure how or where to submit bug reports (documentation or
> otherwise) for the GNU project stuff ...

Well, strictly speaking flex isn't part of the GNU project.  They just
distribute it because no one's written a decent GPL'd lex replacement.
So you should send bugs directly to me.  Those sent to the GNU folks
sometimes find their way to me, but some may drop between the cracks.

> In GNU Info, under the section 'Start Conditions', and also in the man
> page (mine's dated April '95) is a nice little snippet showing how to
> parse C quoted strings into a buffer, defined to be MAX_STR_CONST in
> size. Unfortunately, no overflow checking is ever done ...

This is already mentioned in the manual:

Finally, here's an example of how to  match  C-style  quoted
strings using exclusive start conditions, including expanded
escape sequences (but not including checking  for  a  string
that's too long):

The reason for not doing the overflow checking is that it will needlessly
clutter up an example whose main purpose is just to demonstrate how to
use flex.

The latest release is 2.5.4, by the way, available from ftp.ee.lbl.gov.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node ERASEME53
@unnumberedsec ERASEME53
@example
@verbatim
To: tsv@cs.UManitoba.CA
Subject: Re: Flex (reg)..
In-reply-to: Your message of Thu, 06 Mar 1997 23:50:16 PST.
Date: Thu, 06 Mar 1997 15:54:19 PST
From: Vern Paxson <vern>

> [:alpha:] ([:alnum:] | \\_)*

If your rule really has embedded blanks as shown above, then it won't
work, as the first blank delimits the rule from the action.  (It wouldn't
even compile ...)  You need instead:

[:alpha:]([:alnum:]|\\_)*

and that should work fine - there's no restriction on what can go inside
of ()'s except for the trailing context operator, '/'.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node I need to scan if-then-else blocks and while loops
@unnumberedsec I need to scan if-then-else blocks and while loops
@example
@verbatim
To: "Mike Stolnicki" <mstolnic@ford.com>
Subject: Re: FLEX help
In-reply-to: Your message of Fri, 30 May 1997 13:33:27 PDT.
Date: Fri, 30 May 1997 10:46:35 PDT
From: Vern Paxson <vern>

> We'd like to add "if-then-else", "while", and "for" statements to our
> language ...
> We've investigated many possible solutions.  The one solution that seems
> the most reasonable involves knowing the position of a TOKEN in yyin.

I strongly advise you to instead build a parse tree (abstract syntax tree)
and loop over that instead.  You'll find this has major benefits in keeping
your interpreter simple and extensible.

That said, the functionality you mention for get_position and set_position
have been on the to-do list for a while.  As flex is a purely spare-time
project for me, no guarantees when this will be added (in particular, it
for sure won't be for many months to come).

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node ERASEME55
@unnumberedsec ERASEME55
@example
@verbatim
To: Colin Paul Adams <colin@colina.demon.co.uk>
Subject: Re: Flex C++ classes and Bison
In-reply-to: Your message of 09 Aug 1997 17:11:41 PDT.
Date: Fri, 15 Aug 1997 10:48:19 PDT
From: Vern Paxson <vern>

> #define YY_DECL   int yylex (YYSTYPE *lvalp, struct parser_control
> *parm)
>
> I have been trying to get this to work as a C++ scanner, but it does
> not appear to be possible (warning that it matches no declarations in
> yyFlexLexer, or something like that).
>
> Is this supposed to be possible, or is it being worked on (I DID
> notice the comment that scanner classes are still experimental, so I'm
> not too hopeful)?

What you need to do is derive a subclass from yyFlexLexer that provides
the above yylex() method, squirrels away lvalp and parm into member
variables, and then invokes yyFlexLexer::yylex() to do the regular scanning.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
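
The subclassing advice in the reply above can be sketched roughly as follows.
This is an illustration under stated assumptions, not the generated scanner:
@code{BaseLexer} is a stand-in for @code{yyFlexLexer} so that the fragment
compiles on its own, the class name @code{ParserLexer} and the contents of
@code{YYSTYPE} and @code{parser_control} are hypothetical, and the
no-argument @code{yylex()} body only imitates what @code{flex} would
generate for the subclass (for instance with @code{%option yyclass}).

```cpp
// Stand-in for flex's yyFlexLexer so this sketch is self-contained.
// A real scanner would include FlexLexer.h and derive from yyFlexLexer.
class BaseLexer {
public:
    virtual ~BaseLexer() {}
    virtual int yylex() { return 0; }
};

struct YYSTYPE { int ival; };          // hypothetical semantic value type
struct parser_control { int depth; };  // hypothetical parser state

class ParserLexer : public BaseLexer {
public:
    // Entry point with the signature the parser expects: stash the
    // arguments in member variables, then run the regular scanner.
    int yylex(YYSTYPE* lvalp, parser_control* parm)
    {
        lvalp_ = lvalp;
        parm_ = parm;
        return yylex();
    }

    // Imitates the generated scanning routine. Rule actions placed here
    // can reach the stashed pointers because they are member variables.
    int yylex() override
    {
        lvalp_->ival = 42 + parm_->depth;  // what an action might do
        return 1;                          // hypothetical token code
    }

private:
    YYSTYPE* lvalp_ = nullptr;
    parser_control* parm_ = nullptr;
};
```

A parser would then call @code{lexer.yylex(&value, &control)} and read the
semantic value back from @code{value} after each token.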
@node ERASEME56
@unnumberedsec ERASEME56
@example
@verbatim
To: Mikael.Latvala@lmf.ericsson.se
Subject: Re: Possible mistake in Flex v2.5 document
In-reply-to: Your message of Fri, 05 Sep 1997 16:07:24 PDT.
Date: Fri, 05 Sep 1997 10:01:54 PDT
From: Vern Paxson <vern>

> In that example you show how to count comment lines when using
> C style /* ... */ comments. My question is, shouldn't you take into
> account a scenario where end of a comment marker occurs inside
> character or string literals?

The scanner certainly needs to also scan character and string literals.
However it does that (there's an example in the man page for strings), the
lexer will recognize the beginning of the literal before it runs across the
embedded "/*".  Consequently, it will finish scanning the literal before it
even considers the possibility of matching "/*".

Example:

        '([^']*|{ESCAPE_SEQUENCE})'

will match all the text between the ''s (inclusive).  So the lexer
considers this as a token beginning at the first ', and doesn't even
attempt to match other tokens inside it.

I think this subtlety is not worth putting in the manual, as I suspect
it would confuse more people than it would enlighten.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node ERASEME57
@unnumberedsec ERASEME57
@example
@verbatim
To: "Marty Leisner" <leisner@sdsp.mc.xerox.com>
Subject: Re: flex limitations
In-reply-to: Your message of Sat, 06 Sep 1997 11:27:21 PDT.
Date: Mon, 08 Sep 1997 11:38:08 PDT
From: Vern Paxson <vern>

> %%
> [a-zA-Z]+       /* skip a line */
>                {  printf("got %s\n", yytext); }
> %%

What version of flex are you using?
If I feed this to 2.5.4, it complains:

        "bug.l", line 5: EOF encountered inside an action
        "bug.l", line 5: unrecognized rule
        "bug.l", line 5: fatal parse error

Not the world's greatest error message, but it manages to flag the problem.

(With the introduction of start condition scopes, flex can't accommodate
an action on a separate line, since it's ambiguous with an indented rule.)

You can get 2.5.4 from ftp.ee.lbl.gov.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Is there a repository for flex scanners?
@unnumberedsec Is there a repository for flex scanners?

Not that we know of.  You might try asking on comp.compilers.

@c TODO: Evaluate this faq.
@node How can I conditionally compile or preprocess my flex input file?
@unnumberedsec How can I conditionally compile or preprocess my flex input file?


Flex doesn't have a preprocessor like C does.  You might try using m4, or the C
preprocessor plus a sed script to clean up the result.


@c TODO: Evaluate this faq.
@node Where can I find grammars for lex and yacc?
@unnumberedsec Where can I find grammars for lex and yacc?

In the sources for flex and bison.

@c TODO: Evaluate this faq.
@node I get an end-of-buffer message for each character scanned.
@unnumberedsec I get an end-of-buffer message for each character scanned.

This will happen if your LexerInput() function returns only one character
at a time, which can happen either if your scanner is "interactive", or
if the streams library on your platform always returns 1 for yyin->gcount().

Solution: override LexerInput() with a version that returns whole buffers.

@c TODO: Evaluate this faq.
@node unnamed-faq-62
@unnumberedsec unnamed-faq-62
@example
@verbatim
To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE
Subject: Re: Flex maximums
In-reply-to: Your message of Mon, 17 Nov 1997 17:16:06 PST.
Date: Mon, 17 Nov 1997 17:16:15 PST
From: Vern Paxson <vern>

> I took a quick look into the flex-sources and altered some #defines in
> flexdefs.h:
>
>       #define INITIAL_MNS 64000
>       #define MNS_INCREMENT 1024000
>       #define MAXIMUM_MNS 64000

The things to fix are to add a couple of zeroes to:

#define JAMSTATE -32766 /* marks a reference to the state that always jams */
#define MAXIMUM_MNS 31999
#define BAD_SUBSCRIPT -32767
#define MAX_SHORT 32700

and, if you get complaints about too many rules, make the following change too:

        #define YY_TRAILING_MASK 0x200000
        #define YY_TRAILING_HEAD_MASK 0x400000

- Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-63
@unnumberedsec unnamed-faq-63
@example
@verbatim
To: jimmey@lexis-nexis.com (Jimmey Todd)
Subject: Re: FLEX question regarding istream vs ifstream
In-reply-to: Your message of Mon, 08 Dec 1997 15:54:15 PST.
Date: Mon, 15 Dec 1997 13:21:35 PST
From: Vern Paxson <vern>

>         stdin_handle = YY_CURRENT_BUFFER;
>         ifstream fin( "aFile" );
>         yy_switch_to_buffer( yy_create_buffer( fin, YY_BUF_SIZE ) );
>
> What I'm wanting to do, is pass the contents of a file thru one set
> of rules and then pass stdin thru another set... It works great if, I
> don't use the C++ classes. But since everything else that I'm doing is
> in C++, I thought I'd be consistent.
>
> The problem is that 'yy_create_buffer' is expecting an istream* as its
> first argument (as stated in the man page). However, fin is a ifstream
> object. Any ideas on what I might be doing wrong? Any help would be
> appreciated. Thanks!!

You need to pass &fin, to turn it into an ifstream* instead of an ifstream.
Then its type will be compatible with the expected istream*, because ifstream
is derived from istream.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-64
@unnumberedsec unnamed-faq-64
@example
@verbatim
To: Enda Fadian <fadiane@piercom.ie>
Subject: Re: Question related to Flex man page?
In-reply-to: Your message of Tue, 16 Dec 1997 15:17:34 PST.
Date: Tue, 16 Dec 1997 14:17:09 PST
From: Vern Paxson <vern>

> Can you explain to me what is meant by a long-jump in relation to flex?

Using the longjmp() function while inside yylex() or a routine called by it.

> what is the flex activation frame.

Just yylex()'s stack frame.

> As far as I can see yyrestart will bring me back to the start of the input
> file and using flex++ is not really an option!

No, yyrestart() doesn't imply a rewind, even though its name might sound
like it does.  It tells the scanner to flush its internal buffers and
start reading from the given file at its present location.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-65
@unnumberedsec unnamed-faq-65
@example
@verbatim
To: hassan@larc.info.uqam.ca (Hassan Alaoui)
Subject: Re: Need urgent Help
In-reply-to: Your message of Sat, 20 Dec 1997 19:38:19 PST.
Date: Sun, 21 Dec 1997 21:30:46 PST
From: Vern Paxson <vern>

> /usr/lib/yaccpar: In function `int yyparse()':
> /usr/lib/yaccpar:184: warning: implicit declaration of function `int yylex(...)'
>
> ld: Undefined symbol
>    _yylex
>    _yyparse
>    _yyin

This is a known problem with Solaris C++ (and/or Solaris yacc).
I believe
the fix is to explicitly insert some 'extern "C"' statements for the
corresponding routines/symbols.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-66
@unnumberedsec unnamed-faq-66
@example
@verbatim
To: mc0307@mclink.it
Cc: gnu@prep.ai.mit.edu
Subject: Re: [mc0307@mclink.it: Help request]
In-reply-to: Your message of Fri, 12 Dec 1997 17:57:29 PST.
Date: Sun, 21 Dec 1997 22:33:37 PST
From: Vern Paxson <vern>

> This is my definition for float and integer types:
> . . .
> NZD          [1-9]
> ...
> I've tested my program on other lex version (on UNIX Sun Solaris an HP
> UNIX) and it work well, so I think that my definitions are correct.
> There are any differences between Lex and Flex?

There are indeed differences, as discussed in the man page.  The one
you are probably running into is that when flex expands a name definition,
it puts parentheses around the expansion, while lex does not.  There's
an example in the man page of how this can lead to different matching.
Flex's behavior complies with the POSIX standard (or at least with the
last POSIX draft I saw).

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-67
@unnumberedsec unnamed-faq-67
@example
@verbatim
To: hassan@larc.info.uqam.ca (Hassan Alaoui)
Subject: Re: Thanks
In-reply-to: Your message of Mon, 22 Dec 1997 16:06:35 PST.
Date: Mon, 22 Dec 1997 14:35:05 PST
From: Vern Paxson <vern>

> Thank you very much for your help. I compile and link well with C++ while
> declaring 'yylex ...' extern, But a little problem remains. I get a
> segmentation default when executing ( I linked with lfl library) while it
> works well when using LEX instead of flex. Do you have some ideas about the
> reason for this ?
7068 7069The one possible reason for this that comes to mind is if you've defined 7070yytext as "extern char yytext[]" (which is what lex uses) instead of 7071"extern char *yytext" (which is what flex uses). If it's not that, then 7072I'm afraid I don't know what the problem might be. 7073 7074 Vern 7075@end verbatim 7076@end example 7077 7078@c TODO: Evaluate this faq. 7079@node unnamed-faq-68 7080@unnumberedsec unnamed-faq-68 7081@example 7082@verbatim 7083To: "Bart Niswonger" <NISWONGR@almaden.ibm.com> 7084Subject: Re: flex 2.5: c++ scanners & start conditions 7085In-reply-to: Your message of Tue, 06 Jan 1998 10:34:21 PST. 7086Date: Tue, 06 Jan 1998 19:19:30 PST 7087From: Vern Paxson <vern> 7088 7089> The problem is that when I do this (using %option c++) start 7090> conditions seem to not apply. 7091 7092The BEGIN macro modifies the yy_start variable. For C scanners, this 7093is a static with scope visible through the whole file. For C++ scanners, 7094it's a member variable, so it only has visible scope within a member 7095function. Your lexbegin() routine is not a member function when you 7096build a C++ scanner, so it's not modifying the correct yy_start. The 7097diagnostic that indicates this is that you found you needed to add 7098a declaration of yy_start in order to get your scanner to compile when 7099using C++; instead, the correct fix is to make lexbegin() a member 7100function (by deriving from yyFlexLexer). 7101 7102 Vern 7103@end verbatim 7104@end example 7105 7106@c TODO: Evaluate this faq. 7107@node unnamed-faq-69 7108@unnumberedsec unnamed-faq-69 7109@example 7110@verbatim 7111To: "Boris Zinin" <boris@ippe.rssi.ru> 7112Subject: Re: current position in flex buffer 7113In-reply-to: Your message of Mon, 12 Jan 1998 18:58:23 PST. 7114Date: Mon, 12 Jan 1998 12:03:15 PST 7115From: Vern Paxson <vern> 7116 7117> The problem is how to determine the current position in flex active 7118> buffer when a rule is matched.... 

You will need to keep track of this explicitly, such as by redefining
YY_USER_ACTION to count the number of characters matched.

The latest flex release, by the way, is 2.5.4, available from ftp.ee.lbl.gov.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-70
@unnumberedsec unnamed-faq-70
@example
@verbatim
To: Bik.Dhaliwal@bis.org
Subject: Re: Flex question
In-reply-to: Your message of Mon, 26 Jan 1998 13:05:35 PST.
Date: Tue, 27 Jan 1998 22:41:52 PST
From: Vern Paxson <vern>

> That requirement involves knowing
> the character position at which a particular token was matched
> in the lexer.

The way you have to do this is by explicitly keeping track of where
you are in the file, by counting the number of characters scanned
for each token (available in yyleng).  It may prove convenient to
do this by redefining YY_USER_ACTION, as described in the manual.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-71
@unnumberedsec unnamed-faq-71
@example
@verbatim
To: Vladimir Alexiev <vladimir@cs.ualberta.ca>
Subject: Re: flex: how to control start condition from parser?
In-reply-to: Your message of Mon, 26 Jan 1998 05:50:16 PST.
Date: Tue, 27 Jan 1998 22:45:37 PST
From: Vern Paxson <vern>

> It seems useful for the parser to be able to tell the lexer about such
> context dependencies, because then they don't have to be limited to
> local or sequential context.

One way to do this is to have the parser call a stub routine that's
included in the scanner's .l file, and consequently that has access to
BEGIN.
The only ugliness is that the parser can't pass in the state
it wants, because those aren't visible - but if you don't have many
such states, then using a different set of names doesn't seem like
too much of a burden.

While generating a .h file like you suggest is certainly cleaner,
flex development has come to a virtual stand-still :-(, so a workaround
like the above is much more pragmatic than waiting for a new feature.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-72
@unnumberedsec unnamed-faq-72
@example
@verbatim
To: Barbara Denny <denny@3com.com>
Subject: Re: freebsd flex bug?
In-reply-to: Your message of Fri, 30 Jan 1998 12:00:43 PST.
Date: Fri, 30 Jan 1998 12:42:32 PST
From: Vern Paxson <vern>

> lex.yy.c:1996: parse error before `='

This is the key, identifying this error.  (It may help to pinpoint
it by using flex -L, so it doesn't generate #line directives in its
output.)  I will bet you heavy money that you have a start condition
name that is also a variable name, or something like that; flex spits
out #define's for each start condition name, mapping them to a number,
so you can wind up with:

        %x foo
        %%
                ...
        %%
        void bar()
                {
                int foo = 3;
                }

and the penultimate line will turn into "int 1 = 3" after C preprocessing,
since flex will put "#define foo 1" in the generated scanner.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-73
@unnumberedsec unnamed-faq-73
@example
@verbatim
To: Maurice Petrie <mpetrie@infoscigroup.com>
Subject: Re: Lost flex .l file
In-reply-to: Your message of Mon, 02 Feb 1998 14:10:01 PST.
7227Date: Mon, 02 Feb 1998 11:15:12 PST 7228From: Vern Paxson <vern> 7229 7230> I am curious as to 7231> whether there is a simple way to backtrack from the generated source to 7232> reproduce the lost list of tokens we are searching on. 7233 7234In theory, it's straight-forward to go from the DFA representation 7235back to a regular-expression representation - the two are isomorphic. 7236In practice, a huge headache, because you have to unpack all the tables 7237back into a single DFA representation, and then write a program to munch 7238on that and translate it into an RE. 7239 7240Sorry for the less-than-happy news ... 7241 7242 Vern 7243@end verbatim 7244@end example 7245 7246@c TODO: Evaluate this faq. 7247@node unnamed-faq-74 7248@unnumberedsec unnamed-faq-74 7249@example 7250@verbatim 7251To: jimmey@lexis-nexis.com (Jimmey Todd) 7252Subject: Re: Flex performance question 7253In-reply-to: Your message of Thu, 19 Feb 1998 11:01:17 PST. 7254Date: Thu, 19 Feb 1998 08:48:51 PST 7255From: Vern Paxson <vern> 7256 7257> What I have found, is that the smaller the data chunk, the faster the 7258> program executes. This is the opposite of what I expected. Should this be 7259> happening this way? 7260 7261This is exactly what will happen if your input file has embedded NULs. 7262From the man page: 7263 7264A final note: flex is slow when matching NUL's, particularly 7265when a token contains multiple NUL's. It's best to write 7266rules which match short amounts of text if it's anticipated 7267that the text will often include NUL's. 7268 7269So that's the first thing to look for. 7270 7271 Vern 7272@end verbatim 7273@end example 7274 7275@c TODO: Evaluate this faq. 7276@node unnamed-faq-75 7277@unnumberedsec unnamed-faq-75 7278@example 7279@verbatim 7280To: jimmey@lexis-nexis.com (Jimmey Todd) 7281Subject: Re: Flex performance question 7282In-reply-to: Your message of Thu, 19 Feb 1998 11:01:17 PST. 
7283Date: Thu, 19 Feb 1998 15:42:25 PST 7284From: Vern Paxson <vern> 7285 7286So there are several problems. 7287 7288First, to go fast, you want to match as much text as possible, which 7289your scanners don't in the case that what they're scanning is *not* 7290a <RN> tag. So you want a rule like: 7291 7292 [^<]+ 7293 7294Second, C++ scanners are particularly slow if they're interactive, 7295which they are by default. Using -B speeds it up by a factor of 3-4 7296on my workstation. 7297 7298Third, C++ scanners that use the istream interface are slow, because 7299of how poorly implemented istream's are. I built two versions of 7300the following scanner: 7301 7302 %% 7303 .*\n 7304 .* 7305 %% 7306 7307and the C version inhales a 2.5MB file on my workstation in 0.8 seconds. 7308The C++ istream version, using -B, takes 3.8 seconds. 7309 7310 Vern 7311@end verbatim 7312@end example 7313 7314@c TODO: Evaluate this faq. 7315@node unnamed-faq-76 7316@unnumberedsec unnamed-faq-76 7317@example 7318@verbatim 7319To: "Frescatore, David (CRD, TAD)" <frescatore@exc01crdge.crd.ge.com> 7320Subject: Re: FLEX 2.5 & THE YEAR 2000 7321In-reply-to: Your message of Wed, 03 Jun 1998 11:26:22 PDT. 7322Date: Wed, 03 Jun 1998 10:22:26 PDT 7323From: Vern Paxson <vern> 7324 7325> I am researching the Y2K problem with General Electric R&D 7326> and need to know if there are any known issues concerning 7327> the above mentioned software and Y2K regardless of version. 7328 7329There shouldn't be, all it ever does with the date is ask the system 7330for it and then print it out. 7331 7332 Vern 7333@end verbatim 7334@end example 7335 7336@c TODO: Evaluate this faq. 7337@node unnamed-faq-77 7338@unnumberedsec unnamed-faq-77 7339@example 7340@verbatim 7341To: "Hans Dermot Doran" <htd@ibhdoran.com> 7342Subject: Re: flex problem 7343In-reply-to: Your message of Wed, 15 Jul 1998 21:30:13 PDT. 
7344Date: Tue, 21 Jul 1998 14:23:34 PDT 7345From: Vern Paxson <vern> 7346 7347> To overcome this, I gets() the stdin into a string and lex the string. The 7348> string is lexed OK except that the end of string isn't lexed properly 7349> (yy_scan_string()), that is the lexer dosn't recognise the end of string. 7350 7351Flex doesn't contain mechanisms for recognizing buffer endpoints. But if 7352you use fgets instead (which you should anyway, to protect against buffer 7353overflows), then the final \n will be preserved in the string, and you can 7354scan that in order to find the end of the string. 7355 7356 Vern 7357@end verbatim 7358@end example 7359 7360@c TODO: Evaluate this faq. 7361@node unnamed-faq-78 7362@unnumberedsec unnamed-faq-78 7363@example 7364@verbatim 7365To: soumen@almaden.ibm.com 7366Subject: Re: Flex++ 2.5.3 instance member vs. static member 7367In-reply-to: Your message of Mon, 27 Jul 1998 02:10:04 PDT. 7368Date: Tue, 28 Jul 1998 01:10:34 PDT 7369From: Vern Paxson <vern> 7370 7371> %{ 7372> int mylineno = 0; 7373> %} 7374> ws [ \t]+ 7375> alpha [A-Za-z] 7376> dig [0-9] 7377> %% 7378> 7379> Now you'd expect mylineno to be a member of each instance of class 7380> yyFlexLexer, but is this the case? A look at the lex.yy.cc file seems to 7381> indicate otherwise; unless I am missing something the declaration of 7382> mylineno seems to be outside any class scope. 7383> 7384> How will this work if I want to run a multi-threaded application with each 7385> thread creating a FlexLexer instance? 7386 7387Derive your own subclass and make mylineno a member variable of it. 7388 7389 Vern 7390@end verbatim 7391@end example 7392 7393@c TODO: Evaluate this faq. 7394@node unnamed-faq-79 7395@unnumberedsec unnamed-faq-79 7396@example 7397@verbatim 7398To: Adoram Rogel <adoram@hybridge.com> 7399Subject: Re: More than 32K states change hangs 7400In-reply-to: Your message of Tue, 04 Aug 1998 16:55:39 PDT. 
7401Date: Tue, 04 Aug 1998 22:28:45 PDT 7402From: Vern Paxson <vern> 7403 7404> Vern Paxson, 7405> 7406> I followed your advice, posted on Usenet bu you, and emailed to me 7407> personally by you, on how to overcome the 32K states limit. I'm running 7408> on Linux machines. 7409> I took the full source of version 2.5.4 and did the following changes in 7410> flexdef.h: 7411> #define JAMSTATE -327660 7412> #define MAXIMUM_MNS 319990 7413> #define BAD_SUBSCRIPT -327670 7414> #define MAX_SHORT 327000 7415> 7416> and compiled. 7417> All looked fine, including check and bigcheck, so I installed. 7418 7419Hmmm, you shouldn't increase MAX_SHORT, though looking through my email 7420archives I see that I did indeed recommend doing so. Try setting it back 7421to 32700; that should suffice that you no longer need -Ca. If it still 7422hangs, then the interesting question is - where? 7423 7424> Compiling the same hanged program with a out-of-the-box (RedHat 4.2 7425> distribution of Linux) 7426> flex 2.5.4 binary works. 7427 7428Since Linux comes with source code, you should diff it against what 7429you have to see what problems they missed. 7430 7431> Should I always compile with the -Ca option now ? even short and simple 7432> filters ? 7433 7434No, definitely not. It's meant to be for those situations where you 7435absolutely must squeeze every last cycle out of your scanner. 7436 7437 Vern 7438@end verbatim 7439@end example 7440 7441@c TODO: Evaluate this faq. 7442@node unnamed-faq-80 7443@unnumberedsec unnamed-faq-80 7444@example 7445@verbatim 7446To: "Schmackpfeffer, Craig" <Craig.Schmackpfeffer@usa.xerox.com> 7447Subject: Re: flex output for static code portion 7448In-reply-to: Your message of Tue, 11 Aug 1998 11:55:30 PDT. 7449Date: Mon, 17 Aug 1998 23:57:42 PDT 7450From: Vern Paxson <vern> 7451 7452> I would like to use flex under the hood to generate a binary file 7453> containing the data structures that control the parse. 
7454 7455This has been on the wish-list for a long time. In principle it's 7456straight-forward - you redirect mkdata() et al's I/O to another file, 7457and modify the skeleton to have a start-up function that slurps these 7458into dynamic arrays. The concerns are (1) the scanner generation code 7459is hairy and full of corner cases, so it's easy to get surprised when 7460going down this path :-( ; and (2) being careful about buffering so 7461that when the tables change you make sure the scanner starts in the 7462correct state and reading at the right point in the input file. 7463 7464> I was wondering if you know of anyone who has used flex in this way. 7465 7466I don't - but it seems like a reasonable project to undertake (unlike 7467numerous other flex tweaks :-). 7468 7469 Vern 7470@end verbatim 7471@end example 7472 7473@c TODO: Evaluate this faq. 7474@node unnamed-faq-81 7475@unnumberedsec unnamed-faq-81 7476@example 7477@verbatim 7478Received: from 131.173.17.11 (131.173.17.11 [131.173.17.11]) 7479 by ee.lbl.gov (8.9.1/8.9.1) with ESMTP id AAA03838 7480 for <vern@ee.lbl.gov>; Thu, 20 Aug 1998 00:47:57 -0700 (PDT) 7481Received: from hal.cl-ki.uni-osnabrueck.de (hal.cl-ki.Uni-Osnabrueck.DE [131.173.141.2]) 7482 by deimos.rz.uni-osnabrueck.de (8.8.7/8.8.8) with ESMTP id JAA34694 7483 for <vern@ee.lbl.gov>; Thu, 20 Aug 1998 09:47:55 +0200 7484Received: (from georg@localhost) by hal.cl-ki.uni-osnabrueck.de (8.6.12/8.6.12) id JAA34834 for vern@ee.lbl.gov; Thu, 20 Aug 1998 09:47:54 +0200 7485From: Georg Rehm <georg@hal.cl-ki.uni-osnabrueck.de> 7486Message-Id: <199808200747.JAA34834@hal.cl-ki.uni-osnabrueck.de> 7487Subject: "flex scanner push-back overflow" 7488To: vern@ee.lbl.gov 7489Date: Thu, 20 Aug 1998 09:47:54 +0200 (MEST) 7490Reply-To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE 7491X-NoJunk: Do NOT send commercial mail, spam or ads to this address! 
7492X-URL: http://www.cl-ki.uni-osnabrueck.de/~georg/ 7493X-Mailer: ELM [version 2.4ME+ PL28 (25)] 7494MIME-Version: 1.0 7495Content-Type: text/plain; charset=US-ASCII 7496Content-Transfer-Encoding: 7bit 7497 7498Hi Vern, 7499 7500Yesterday, I encountered a strange problem: I use the macro processor m4 7501to include some lengthy lists into a .l file. Following is a flex macro 7502definition that causes some serious pain in my neck: 7503 7504AUTHOR ("A. Boucard / L. Boucard"|"A. Dastarac / M. Levent"|"A.Boucaud / L.Boucaud"|"Abderrahim Lamchichi"|"Achmat Dangor"|"Adeline Toullier"|"Adewale Maja-Pearce"|"Ahmed Ziri"|"Akram Ellyas"|"Alain Bihr"|"Alain Gresh"|"Alain Guillemoles"|"Alain Joxe"|"Alain Morice"|"Alain Renon"|"Alain Zecchini"|"Albert Memmi"|"Alberto Manguel"|"Alex De Waal"|"Alfonso Artico"| [...]) 7505 7506The complete list contains about 10kB. When I try to "flex" this file 7507(on a Solaris 2.6 machine, using a modified flex 2.5.4 (I only increased 7508some of the predefined values in flexdefs.h) I get the error: 7509 7510myflex/flex -8 sentag.tmp.l 7511flex scanner push-back overflow 7512 7513When I remove the slashes in the macro definition everything works fine. 7514As I understand it, the double quotes escape the slash-character so it 7515really means "/" and not "trailing context". Furthermore, I tried to 7516escape the slashes with backslashes, but with no use, the same error message 7517appeared when flexing the code. 7518 7519Do you have an idea what's going on here? 7520 7521Greetings from Germany, 7522 Georg 7523-- 7524Georg Rehm georg@cl-ki.uni-osnabrueck.de 7525Institute for Semantic Information Processing, University of Osnabrueck, FRG 7526@end verbatim 7527@end example 7528 7529@c TODO: Evaluate this faq. 
7530@node unnamed-faq-82 7531@unnumberedsec unnamed-faq-82 7532@example 7533@verbatim 7534To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE 7535Subject: Re: "flex scanner push-back overflow" 7536In-reply-to: Your message of Thu, 20 Aug 1998 09:47:54 PDT. 7537Date: Thu, 20 Aug 1998 07:05:35 PDT 7538From: Vern Paxson <vern> 7539 7540> myflex/flex -8 sentag.tmp.l 7541> flex scanner push-back overflow 7542 7543Flex itself uses a flex scanner. That scanner is running out of buffer 7544space when it tries to unput() the humongous macro you've defined. When 7545you remove the '/'s, you make it small enough so that it fits in the buffer; 7546removing spaces would do the same thing. 7547 7548The fix is to either rethink how come you're using such a big macro and 7549perhaps there's another/better way to do it; or to rebuild flex's own 7550scan.c with a larger value for 7551 7552 #define YY_BUF_SIZE 16384 7553 7554- Vern 7555@end verbatim 7556@end example 7557 7558@c TODO: Evaluate this faq. 7559@node unnamed-faq-83 7560@unnumberedsec unnamed-faq-83 7561@example 7562@verbatim 7563To: Jan Kort <jan@research.techforce.nl> 7564Subject: Re: Flex 7565In-reply-to: Your message of Fri, 04 Sep 1998 12:18:43 +0200. 7566Date: Sat, 05 Sep 1998 00:59:49 PDT 7567From: Vern Paxson <vern> 7568 7569> %% 7570> 7571> "TEST1\n" { fprintf(stderr, "TEST1\n"); yyless(5); } 7572> ^\n { fprintf(stderr, "empty line\n"); } 7573> . { } 7574> \n { fprintf(stderr, "new line\n"); } 7575> 7576> %% 7577> -- input --------------------------------------- 7578> TEST1 7579> -- output -------------------------------------- 7580> TEST1 7581> empty line 7582> ------------------------------------------------ 7583 7584IMHO, it's not clear whether or not this is in fact a bug. It depends 7585on whether you view yyless() as backing up in the input stream, or as 7586pushing new characters onto the beginning of the input stream. 
Flex 7587interprets it as the latter (for implementation convenience, I'll admit), 7588and so considers the newline as in fact matching at the beginning of a 7589line, as after all the last token scanned an entire line and so the 7590scanner is now at the beginning of a new line. 7591 7592I agree that this is counter-intuitive for yyless(), given its 7593functional description (it's less so for unput(), depending on whether 7594you're unput()'ing new text or scanned text). But I don't plan to 7595change it any time soon, as it's a pain to do so. Consequently, 7596you do indeed need to use yy_set_bol() and YY_AT_BOL() to tweak 7597your scanner into the behavior you desire. 7598 7599Sorry for the less-than-completely-satisfactory answer. 7600 7601 Vern 7602@end verbatim 7603@end example 7604 7605@c TODO: Evaluate this faq. 7606@node unnamed-faq-84 7607@unnumberedsec unnamed-faq-84 7608@example 7609@verbatim 7610To: Patrick Krusenotto <krusenot@mac-info-link.de> 7611Subject: Re: Problems with restarting flex-2.5.2-generated scanner 7612In-reply-to: Your message of Thu, 24 Sep 1998 10:14:07 PDT. 7613Date: Thu, 24 Sep 1998 23:28:43 PDT 7614From: Vern Paxson <vern> 7615 7616> I am using flex-2.5.2 and bison 1.25 for Solaris and I am desperately 7617> trying to make my scanner restart with a new file after my parser stops 7618> with a parse error. When my compiler restarts, the parser always 7619> receives the token after the token (in the old file!) that caused the 7620> parser error. 7621 7622I suspect the problem is that your parser has read ahead in order 7623to attempt to resolve an ambiguity, and when it's restarted it picks 7624up with that token rather than reading a fresh one. If you're using 7625yacc, then the special "error" production can sometimes be used to 7626consume tokens in an attempt to get the parser into a consistent state. 7627 7628 Vern 7629@end verbatim 7630@end example 7631 7632@c TODO: Evaluate this faq. 
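The scanner half of such a restart can be sketched as follows. This is
only a sketch under stated assumptions, not code from the original
exchange: @code{scan_new_file} is a hypothetical helper name, and it is
assumed to live in the user code section of the @file{.l} file so that
@code{BEGIN} is visible. It relies on @code{yyrestart()} flushing the
scanner's buffered input, and on the fact that @code{yyrestart()} does
not reset the start condition. Any token the parser has already read
ahead still has to be discarded on the parser side.

@example
@verbatim
/* Hypothetical helper for the user code section of the .l file:
   point the scanner at a new file after a parse error. */
void scan_new_file( FILE *fp )
        {
        yyrestart( fp );        /* drop text buffered from the old file */
        BEGIN(INITIAL);         /* yyrestart() leaves the start condition alone */
        }
@end verbatim
@end example
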
@node unnamed-faq-85
@unnumberedsec unnamed-faq-85
@example
@verbatim
To: Henric Jungheim <junghelh@pe-nelson.com>
Subject: Re: flex 2.5.4a
In-reply-to: Your message of Tue, 27 Oct 1998 16:41:42 PST.
Date: Tue, 27 Oct 1998 16:50:14 PST
From: Vern Paxson <vern>

> This brings up a feature request:  How about a command line
> option to specify the filename when reading from stdin?  That way one
> doesn't need to create a temporary file in order to get the "#line"
> directives to make sense.

Use -o combined with -t (per the man page description of -o).

> P.S., Is there any simple way to use non-blocking IO to parse multiple
> streams?

Simple, no.

One approach might be to return a magic character on EWOULDBLOCK and
have a rule

        .*<magic-character>     // put back .*, eat magic character

This is off the top of my head, not sure it'll work.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-86
@unnumberedsec unnamed-faq-86
@example
@verbatim
To: "Repko, Billy D" <billy.d.repko@intel.com>
Subject: Re: Compiling scanners
In-reply-to: Your message of Wed, 13 Jan 1999 10:52:47 PST.
Date: Thu, 14 Jan 1999 00:25:30 PST
From: Vern Paxson <vern>

> It appears that maybe it cannot find the lfl library.

The Makefile in the distribution builds it, so you should have it.
It's exceedingly trivial, just a main() that calls yylex() and
a yywrap() that always returns 1.

> %%
>       \n      ++num_lines; ++num_chars;
>       .       ++num_chars;

You can't indent your rules like this - that's where the errors are coming
from.  Flex copies indented text to the output file, it's how you do things
like

        int num_lines_seen = 0;

to declare local variables.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
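The layout rule described above can be illustrated with a complete input
file. This is a minimal sketch, assuming the two counters from the
question; the @code{%option noyywrap} line and the @code{main()} routine
are additions made here so that the example does not depend on
@samp{-lfl}. Patterns start in the first column, while the indented
declaration ahead of the first rule is copied into @code{yylex()} as a
local variable.

@example
@verbatim
%{
#include <stdio.h>
int num_lines = 0, num_chars = 0;
%}
%option noyywrap
%%
        /* Indented text before the first rule is copied into
           yylex(), so this declares a local variable. */
        int num_lines_seen = 0;

\n      ++num_lines; ++num_chars; ++num_lines_seen;
.       ++num_chars;
%%
int main( void )
        {
        yylex();
        printf( "# of lines = %d, # of chars = %d\n",
                num_lines, num_chars );
        return 0;
        }
@end verbatim
@end example
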
@node unnamed-faq-87
@unnumberedsec unnamed-faq-87
@example
@verbatim
To: Erick Branderhorst <Erick.Branderhorst@asml.nl>
Subject: Re: flex input buffer
In-reply-to: Your message of Tue, 09 Feb 1999 13:53:46 PST.
Date: Tue, 09 Feb 1999 21:03:37 PST
From: Vern Paxson <vern>

> In the flex.skl file the size of the default input buffers is set.  Can you
> explain why this size is set and why it is such a high number.

It's large to optimize performance when scanning large files.  You can
safely make it a lot lower if needed.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-88
@unnumberedsec unnamed-faq-88
@example
@verbatim
To: "Guido Minnen" <guidomi@cogs.susx.ac.uk>
Subject: Re: Flex error message
In-reply-to: Your message of Wed, 24 Feb 1999 15:31:46 PST.
Date: Thu, 25 Feb 1999 00:11:31 PST
From: Vern Paxson <vern>

> I'm extending a larger scanner written in Flex and I keep running into
> problems. More specifically, I get the error message:
> "flex: input rules are too complicated (>= 32000 NFA states)"

Increase the definitions in flexdef.h for:

#define JAMSTATE -32766 /* marks a reference to the state that always jams */
#define MAXIMUM_MNS 31999
#define BAD_SUBSCRIPT -32767

recompile everything, and it should all work.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-90
@unnumberedsec unnamed-faq-90
@example
@verbatim
To: "Dmitriy Goldobin" <gold@ems.chel.su>
Subject: Re: FLEX trouble
In-reply-to: Your message of Mon, 31 May 1999 18:44:49 PDT.
Date: Tue, 01 Jun 1999 00:15:07 PDT
From: Vern Paxson <vern>

>   I have a trouble with FLEX. Why rule "/*".*"*/" work properly,
> but rule "/*"(.|\n)*"*/" don't work ?
7761 7762The second of these will have to scan the entire input stream (because 7763"(.|\n)*" matches an arbitrary amount of any text) in order to see if 7764it ends with "*/", terminating the comment. That potentially will overflow 7765the input buffer. 7766 7767> More complex rule "/*"([^*]|(\*/[^/]))*"*/ give an error 7768> 'unrecognized rule'. 7769 7770You can't use the '/' operator inside parentheses. It's not clear 7771what "(a/b)*" actually means. 7772 7773> I now use workaround with state <comment>, but single-rule is 7774> better, i think. 7775 7776Single-rule is nice but will always have the problem of either setting 7777restrictions on comments (like not allowing multi-line comments) and/or 7778running the risk of consuming the entire input stream, as noted above. 7779 7780 Vern 7781@end verbatim 7782@end example 7783 7784@c TODO: Evaluate this faq. 7785@node unnamed-faq-91 7786@unnumberedsec unnamed-faq-91 7787@example 7788@verbatim 7789Received: from mc-qout4.whowhere.com (mc-qout4.whowhere.com [209.185.123.18]) 7790 by ee.lbl.gov (8.9.3/8.9.3) with SMTP id IAA05100 7791 for <vern@ee.lbl.gov>; Tue, 15 Jun 1999 08:56:06 -0700 (PDT) 7792Received: from Unknown/Local ([?.?.?.?]) by my-deja.com; Tue Jun 15 08:55:43 1999 7793To: vern@ee.lbl.gov 7794Date: Tue, 15 Jun 1999 08:55:43 -0700 7795From: "Aki Niimura" <neko@my-deja.com> 7796Message-ID: <KNONDOHDOBGAEAAA@my-deja.com> 7797Mime-Version: 1.0 7798Cc: 7799X-Sent-Mail: on 7800Reply-To: 7801X-Mailer: MailCity Service 7802Subject: A question on flex C++ scanner 7803X-Sender-Ip: 12.72.207.61 7804Organization: My Deja Email (http://www.my-deja.com:80) 7805Content-Type: text/plain; charset=us-ascii 7806Content-Transfer-Encoding: 7bit 7807 7808Dear Dr. Paxon, 7809 7810I have been using flex for years. 7811It works very well on many projects. 7812Most case, I used it to generate a scanner on C language. 7813However, one project I needed to generate a scanner 7814on C++ lanuage. 
Thanks to your enhancement, flex did 7815the job. 7816 7817Currently, I'm working on enhancing my previous project. 7818I need to deal with multiple input streams (recursive 7819inclusion) in this scanner (C++). 7820I did similar thing for another scanner (C) as you 7821explained in your documentation. 7822 7823The generated scanner (C++) has necessary methods: 7824- switch_to_buffer(struct yy_buffer_state *b) 7825- yy_create_buffer(istream *is, int sz) 7826- yy_delete_buffer(struct yy_buffer_state *b) 7827 7828However, I couldn't figure out how to access current 7829buffer (yy_current_buffer). 7830 7831yy_current_buffer is a protected member of yyFlexLexer. 7832I can't access it directly. 7833Then, I thought yy_create_buffer() with is = 0 might 7834return current stream buffer. But it seems not as far 7835as I checked the source. (flex 2.5.4) 7836 7837I went through the Web in addition to Flex documentation. 7838However, it hasn't been successful, so far. 7839 7840It is not my intention to bother you, but, can you 7841comment about how to obtain the current stream buffer? 7842 7843Your response would be highly appreciated. 7844 7845Best regards, 7846Aki Niimura 7847 7848--== Sent via Deja.com http://www.deja.com/ ==-- 7849Share what you know. Learn what you don't. 7850@end verbatim 7851@end example 7852 7853@c TODO: Evaluate this faq. 7854@node unnamed-faq-92 7855@unnumberedsec unnamed-faq-92 7856@example 7857@verbatim 7858To: neko@my-deja.com 7859Subject: Re: A question on flex C++ scanner 7860In-reply-to: Your message of Tue, 15 Jun 1999 08:55:43 PDT. 7861Date: Tue, 15 Jun 1999 09:04:24 PDT 7862From: Vern Paxson <vern> 7863 7864> However, I couldn't figure out how to access current 7865> buffer (yy_current_buffer). 7866 7867Derive your own subclass from yyFlexLexer. 7868 7869 Vern 7870@end verbatim 7871@end example 7872 7873@c TODO: Evaluate this faq. 
@node unnamed-faq-93
@unnumberedsec unnamed-faq-93
@example
@verbatim
To: "Stones, Darren" <Darren.Stones@nectech.co.uk>
Subject: Re: You're the man to see?
In-reply-to: Your message of Wed, 23 Jun 1999 11:10:29 PDT.
Date: Wed, 23 Jun 1999 09:01:40 PDT
From: Vern Paxson <vern>

> I hope you can help me.  I am using Flex and Bison to produce an interpreted
> language.  However all goes well until I try to implement an IF statement or
> a WHILE.  I cannot get this to work as the parser parses all the conditions
> eg. the TRUE and FALSE conditons to check for a rule match.  So I cannot
> make a decision!!

You need to use the parser to build a parse tree (= abstract syntax tree),
and when that's all done you recursively evaluate the tree, binding variables
to values at that time.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-94
@unnumberedsec unnamed-faq-94
@example
@verbatim
To: Petr Danecek <petr@ics.cas.cz>
Subject: Re: flex - question
In-reply-to: Your message of Mon, 28 Jun 1999 19:21:41 PDT.
Date: Fri, 02 Jul 1999 16:52:13 PDT
From: Vern Paxson <vern>

> file, it takes an enormous amount of time. It is funny, because the
> source code has only 12 rules!!! I think it looks like an exponencial
> growth.

Right, that's the problem - some patterns (those with a lot of
ambiguity, which yours has because at any given time the scanner can
be in the middle of all sorts of combinations of the different
rules) blow up exponentially.

For your rules, there is an easy fix.  Change the ".*" that comes after
the directory name to "[^ ]*".  With that in place, the rules are no
longer nearly so ambiguous, because then once one of the directories
has been matched, no other can be matched (since they all require a
leading blank).
7923 7924If that's not an acceptable solution, then you can enter a start state 7925to pick up the .*\n after each directory is matched. 7926 7927Also note that for speed, you'll want to add a ".*" rule at the end, 7928otherwise rules that don't match any of the patterns will be matched 7929very slowly, a character at a time. 7930 7931 Vern 7932@end verbatim 7933@end example 7934 7935@c TODO: Evaluate this faq. 7936@node unnamed-faq-95 7937@unnumberedsec unnamed-faq-95 7938@example 7939@verbatim 7940To: Tielman Koekemoer <tielman@spi.co.za> 7941Subject: Re: Please help. 7942In-reply-to: Your message of Thu, 08 Jul 1999 13:20:37 PDT. 7943Date: Thu, 08 Jul 1999 08:20:39 PDT 7944From: Vern Paxson <vern> 7945 7946> I was hoping you could help me with my problem. 7947> 7948> I tried compiling (gnu)flex on a Solaris 2.4 machine 7949> but when I ran make (after configure) I got an error. 7950> 7951> -------------------------------------------------------------- 7952> gcc -c -I. -I. -g -O parse.c 7953> ./flex -t -p ./scan.l >scan.c 7954> sh: ./flex: not found 7955> *** Error code 1 7956> make: Fatal error: Command failed for target `scan.c' 7957> ------------------------------------------------------------- 7958> 7959> What's strange to me is that I'm only 7960> trying to install flex now. I then edited the Makefile to 7961> and changed where it says "FLEX = flex" to "FLEX = lex" 7962> ( lex: the native Solaris one ) but then it complains about 7963> the "-p" option. Is there any way I can compile flex without 7964> using flex or lex? 7965> 7966> Thanks so much for your time. 7967 7968You managed to step on the bootstrap sequence, which first copies 7969initscan.c to scan.c in order to build flex. Try fetching a fresh 7970distribution from ftp.ee.lbl.gov. (Or you can first try removing 7971".bootstrap" and doing a make again.) 7972 7973 Vern 7974@end verbatim 7975@end example 7976 7977@c TODO: Evaluate this faq. 
@node unnamed-faq-96
@unnumberedsec unnamed-faq-96
@example
@verbatim
To: Tielman Koekemoer <tielman@spi.co.za>
Subject: Re: Please help.
In-reply-to: Your message of Fri, 09 Jul 1999 09:16:14 PDT.
Date: Fri, 09 Jul 1999 00:27:20 PDT
From: Vern Paxson <vern>

> First I removed .bootstrap (and ran make) - no luck. I downloaded the
> software but I still have the same problem. Is there anything else I
> could try.

Try:

        cp initscan.c scan.c
        touch scan.c
        make scan.o

If this last tries to first build scan.c from scan.l using ./flex, then
your "make" is broken, in which case compile scan.c to scan.o by hand.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-97
@unnumberedsec unnamed-faq-97
@example
@verbatim
To: Sumanth Kamenani <skamenan@crl.nmsu.edu>
Subject: Re: Error
In-reply-to: Your message of Mon, 19 Jul 1999 23:08:41 PDT.
Date: Tue, 20 Jul 1999 00:18:26 PDT
From: Vern Paxson <vern>

> I am getting a compilation error. The error is given as "unknown symbol- yylex".

The parser relies on calling yylex(), but you're instead using the C++ scanning
class, so you need to supply a yylex() "glue" function that calls yylex() on an
instance of the scanner class (e.g., "scanner->yylex()").

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-98
@unnumberedsec unnamed-faq-98
@example
@verbatim
To: daniel@synchrods.synchrods.COM (Daniel Senderowicz)
Subject: Re: lex
In-reply-to: Your message of Mon, 22 Nov 1999 11:19:04 PST.
Date: Tue, 23 Nov 1999 15:54:30 PST
From: Vern Paxson <vern>

Well, your problem is the

switch (yybgin-yysvec-1) {      /* witchcraft */

at the beginning of lex rules.  "witchcraft" == "non-portable".
It's assuming knowledge of AT&T lex's internal variables.

For flex, you can probably do the equivalent using a switch on YYSTATE.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-99
@unnumberedsec unnamed-faq-99
@example
@verbatim
To: archow@hss.hns.com
Subject: Re: Regarding distribution of flex and yacc based grammars
In-reply-to: Your message of Sun, 19 Dec 1999 17:50:24 +0530.
Date: Wed, 22 Dec 1999 01:56:24 PST
From: Vern Paxson <vern>

> When we provide the customer with an object code distribution, is it
> necessary for us to provide source
> for the generated C files from flex and bison since they are generated by
> flex and bison ?

For flex, no.  I don't know what the current state of this is for bison.

> Also, is there any requirement for us to necessarily provide source for
> the grammar files which are fed into flex and bison ?

Again, for flex, no.

See the file "COPYING" in the flex distribution for the legalese.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-100
@unnumberedsec unnamed-faq-100
@example
@verbatim
To: Martin Gallwey <gallweym@hyperion.moe.ul.ie>
Subject: Re: Flex, and self referencing rules
In-reply-to: Your message of Sun, 20 Feb 2000 01:01:21 PST.
Date: Sat, 19 Feb 2000 18:33:16 PST
From: Vern Paxson <vern>

> However, I do not use unput anywhere. I do use self-referencing
> rules like this:
>
> UnaryExpr               ({UnionExpr})|("-"{UnaryExpr})

You can't do this - flex is *not* a parser like yacc (which does indeed
allow recursion), it is a scanner that's confined to regular expressions.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-101
@unnumberedsec unnamed-faq-101
@example
@verbatim
To: slg3@lehigh.edu (SAMUEL L. GULDEN)
Subject: Re: Flex problem
In-reply-to: Your message of Thu, 02 Mar 2000 12:29:04 PST.
Date: Thu, 02 Mar 2000 23:00:46 PST
From: Vern Paxson <vern>

If this is exactly your program:

> digit [0-9]
> digits {digit}+
> whitespace [ \t\n]+
>
> %%
> "[" { printf("open_brac\n");}
> "]" { printf("close_brac\n");}
> "+" { printf("addop\n");}
> "*" { printf("multop\n");}
> {digits} { printf("NUMBER = %s\n", yytext);}
> whitespace ;

then the problem is that the last rule needs to be "{whitespace}" !

                Vern
@end verbatim
@end example

@node What is the difference between YYLEX_PARAM and YY_DECL?
@unnumberedsec What is the difference between YYLEX_PARAM and YY_DECL?

YYLEX_PARAM is not a flex symbol. It is for Bison. It tells Bison to pass extra
params when it calls yylex() from the parser.

YY_DECL is the Flex declaration of yylex. The default is similar to this:

@example
@verbatim
#define YY_DECL int yylex (void)
@end verbatim
@end example


@node Why do I get "conflicting types for yylex" error?
@unnumberedsec Why do I get "conflicting types for yylex" error?

This is a compiler error regarding a generated Bison parser, not a Flex scanner.
It means you need a prototype of yylex() at the top of the Bison file.
Be sure the prototype matches YY_DECL.

@node How do I access the values set in a Flex action from within a Bison action?
@unnumberedsec How do I access the values set in a Flex action from within a Bison action?

With @code{$1}, @code{$2}, @code{$3}, etc. These are called "Semantic Values" in the Bison manual.
See @ref{Top, , , bison, the GNU Bison Manual}.
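
The following is a minimal sketch of that round trip, not a complete program.
It assumes a non-reentrant scanner, bison's default @code{YYSTYPE} of
@code{int}, and a hypothetical token named @code{NUMBER}. The file names
@file{scan.l} and @file{parse.y} are likewise only illustrative.

@example
@verbatim
    /* scan.l (sketch): each action stores the token's value in yylval. */
    [0-9]+    { yylval = atoi(yytext); return NUMBER; }
    "+"       { return '+'; }

    /* parse.y (sketch): $1 and $3 are the values that the scanner
       stored in yylval when it returned the two NUMBER tokens. */
    sum: NUMBER '+' NUMBER   { printf("%d\n", $1 + $3); }
       ;
@end verbatim
@end example

With a @code{%union} declaration in the parser, the scanner would instead
assign to a member of @code{yylval}, and the parser would declare which member
each token uses.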

@node Appendices, Indices, FAQ, Top
@appendix Appendices

@menu
* Makefiles and Flex::
* Bison Bridge::
* M4 Dependency::
* Common Patterns::
@end menu

@node Makefiles and Flex, Bison Bridge, Appendices, Appendices
@appendixsec Makefiles and Flex

@cindex Makefile, syntax

In this appendix, we provide tips for writing Makefiles to build your scanners.

In a traditional build environment, we say that the @file{.c} files are the
sources, and the @file{.o} files are the intermediate files. When using
@code{flex}, however, the @file{.l} files are the sources, and the generated
@file{.c} files (along with the @file{.o} files) are the intermediate files.
This requires you to carefully plan your Makefile.

Modern @command{make} programs understand that @file{foo.l} is intended to
generate @file{lex.yy.c} or @file{foo.c}, and will behave
accordingly@footnote{GNU @command{make} and GNU @command{automake} are two such
programs that provide implicit rules for flex-generated scanners.}@footnote{GNU @command{automake}
may generate code to execute flex in lex-compatible mode, or to stdout. If this is not what you want,
then you should provide an explicit rule in your Makefile.am.}.  The
following Makefile does not explicitly instruct @command{make} how to build
@file{scan.c} from @file{scan.l}. Instead, it relies on the implicit rules of the
@command{make} program to build the intermediate file, @file{scan.c}:

@cindex Makefile, example of implicit rules
@example
@verbatim
    # Basic Makefile -- relies on implicit rules
    # Creates "myprogram" from "scan.l" and "myprogram.c"
    #
    LEX=flex
    myprogram: scan.o myprogram.o
    scan.o: scan.l

@end verbatim
@end example


For simple cases, the above may be sufficient.
For other cases,
you may have to explicitly instruct @command{make} how to build your scanner.
The following is an example of a Makefile containing explicit rules:

@cindex Makefile, explicit example
@example
@verbatim
    # Basic Makefile -- provides explicit rules
    # Creates "myprogram" from "scan.l" and "myprogram.c"
    #
    LEX=flex
    myprogram: scan.o myprogram.o
            $(CC) -o $@  $(LDFLAGS) $^

    myprogram.o: myprogram.c
            $(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^

    scan.o: scan.c
            $(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^

    scan.c: scan.l
            $(LEX) $(LFLAGS) -o $@ $^

    clean:
            $(RM) *.o scan.c

@end verbatim
@end example

Notice in the above example that @file{scan.c} is in the @code{clean} target.
This is because we consider the file @file{scan.c} to be an intermediate file.

Finally, we provide a realistic example of a @code{flex} scanner used with a
@code{bison} parser@footnote{This example also applies to yacc parsers.}.
There is a tricky problem we have to deal with. Since a @code{flex} scanner
will typically include a header file (e.g., @file{y.tab.h}) generated by the
parser, we need to be sure that the header file is generated BEFORE the scanner
is compiled. We handle this case in the following example:

@example
@verbatim
    # Makefile example -- scanner and parser.
    # Creates "myprogram" from "scan.l", "parse.y", and "myprogram.c"
    #
    LEX     = flex
    YACC    = bison -y
    YFLAGS  = -d
    objects = scan.o parse.o myprogram.o

    myprogram: $(objects)
    scan.o: scan.l parse.c
    parse.o: parse.y
    myprogram.o: myprogram.c

@end verbatim
@end example

In the above example, notice the line,

@example
@verbatim
    scan.o: scan.l parse.c
@end verbatim
@end example

, which lists the file @file{parse.c} (the generated parser) as a dependency of
@file{scan.o}. We want to ensure that the parser is created before the scanner
is compiled, and the above line seems to do the trick. Feel free to experiment
with your specific implementation of @command{make}.


For more details on writing Makefiles, see @ref{Top, , , make, The
GNU Make Manual}.

@node Bison Bridge, M4 Dependency, Makefiles and Flex, Appendices
@section C Scanners with Bison Parsers

@cindex bison, bridging with flex
@vindex yylval
@vindex yylloc
@tindex YYLTYPE
@tindex YYSTYPE

This section describes the @code{flex} features useful when integrating
@code{flex} with @code{GNU bison}@footnote{The features described here are
purely optional, and are by no means the only way to use flex with bison.
We merely provide some glue to ease development of your parser-scanner pair.}.
Skip this section if you are not using
@code{bison} with your scanner.  Here we discuss only the @code{flex}
half of the @code{flex} and @code{bison} pair.  We do not discuss
@code{bison} in any detail.  For more information about generating
@code{bison} parsers, see @ref{Top, , , bison, the GNU Bison Manual}.

A @code{bison}-compatible scanner is generated by declaring @samp{%option
bison-bridge} or by supplying @samp{--bison-bridge} when invoking @code{flex}
from the command line.
This instructs @code{flex} that the macro
@code{yylval} may be used. The data type for
@code{yylval}, @code{YYSTYPE},
is typically defined in a header file, included in section 1 of the
@code{flex} input file.  For a list of functions and macros
available, see @ref{bison-functions}.

The declaration of yylex becomes,

@findex yylex (reentrant version)
@example
@verbatim
      int yylex ( YYSTYPE * lvalp, yyscan_t scanner );
@end verbatim
@end example

If @code{%option bison-locations} is specified, then the declaration
becomes,

@findex yylex (reentrant version)
@example
@verbatim
      int yylex ( YYSTYPE * lvalp, YYLTYPE * llocp, yyscan_t scanner );
@end verbatim
@end example

Note that the macros @code{yylval} and @code{yylloc} evaluate to pointers.
Support for @code{yylloc} is optional in @code{bison}, so it is optional in
@code{flex} as well. The following is an example of a @code{flex} scanner that
is compatible with @code{bison}.

@cindex bison, scanner to be called from bison
@example
@verbatim
    /* Scanner for "C" assignment statements... sort of. */
    %{
    #include <stdlib.h>  /* atoi */
    #include <string.h>  /* strdup */
    #include "y.tab.h"   /* Generated by bison. */
    %}

    %option bison-bridge bison-locations
    %%

    [[:digit:]]+  { yylval->num = atoi(yytext);   return NUMBER;}
    [[:alnum:]]+  { yylval->str = strdup(yytext); return STRING;}
    "="|";"       { return yytext[0];}
    .  {}
    %%
@end verbatim
@end example

As you can see, there really is no magic here. We just use
@code{yylval} as we would any other variable. The data type of
@code{yylval} is generated by @code{bison}, and included in the file
@file{y.tab.h}. Here is the corresponding @code{bison} parser:

@cindex bison, parser
@example
@verbatim
    /* Parser to convert "C" assignments to lisp. */
    %{
    /* Pass the argument to yyparse through to yylex.
     */
    #include <stdio.h>   /* printf */
    #define YYPARSE_PARAM scanner
    #define YYLEX_PARAM   scanner
    %}
    %locations
    %pure_parser
    %union {
        int num;
        char* str;
    }
    %token <str> STRING
    %token <num> NUMBER
    %%
    assignment:
        STRING '=' NUMBER ';' {
            printf( "(setf %s %d)", $1, $3 );
       }
    ;
@end verbatim
@end example

@node M4 Dependency, Common Patterns, Bison Bridge, Appendices
@section M4 Dependency
@cindex m4
The macro processor @code{m4}@footnote{The use of m4 is subject to change in
future revisions of flex. It is not part of the public API of flex. Do not depend on it.}
must be installed wherever flex is installed.
@code{flex} invokes @samp{m4}, found by searching the directories in the
@code{PATH} environment variable. Any code you place in section 1 or in the
actions will be sent through m4. Please follow these rules to protect your
code from unwanted @code{m4} processing.

@itemize

@item Do not use symbols that begin with @samp{m4_}, such as @samp{m4_define}
or @samp{m4_include}, since those are reserved for @code{m4} macro names. If for
some reason you need @samp{m4_} as a prefix, use a preprocessor @code{#define} to get your
symbol past @code{m4} unmangled.

@item Do not use the strings @samp{[[} or @samp{]]} anywhere in your code. The
former is not valid in C, except within comments and strings, but the latter is valid in
code such as @code{x[y[z]]}. The solution is simple. To get the literal string
@code{"]]"}, use @code{"]""]"}. To get the array notation @code{x[y[z]]},
use @code{x[y[z] ]}. Flex will attempt to detect these sequences in user code, and
escape them. However, it's best to avoid this complexity where possible, by
removing such sequences from your code.

@end itemize

@code{m4} is only required at the time you run @code{flex}.
The generated
scanner is ordinary C or C++, and does @emph{not} require @code{m4}.

@node Common Patterns, ,M4 Dependency, Appendices
@section Common Patterns
@cindex patterns, common

This appendix provides examples of common regular expressions you might use
in your scanner.

@menu
* Numbers::
* Identifiers::
* Quoted Constructs::
* Addresses::
@end menu


@node Numbers, Identifiers, ,Common Patterns
@subsection Numbers

@table @asis

@item C99 decimal constant
@code{([[:digit:]]@{-@}[0])[[:digit:]]*}

@item C99 hexadecimal constant
@code{0[xX][[:xdigit:]]+}

@item C99 octal constant
@code{0[01234567]*}

@item C99 floating point constant
@verbatim
 dseq      ([[:digit:]]+)
 dseq_opt  ([[:digit:]]*)
 frac      (({dseq_opt}"."{dseq})|{dseq}".")
 exp       ([eE][+-]?{dseq})
 exp_opt   ({exp}?)
 fsuff     [flFL]
 fsuff_opt ({fsuff}?)
 hpref     (0[xX])
 hdseq     ([[:xdigit:]]+)
 hdseq_opt ([[:xdigit:]]*)
 hfrac     (({hdseq_opt}"."{hdseq})|({hdseq}"."))
 bexp      ([pP][+-]?{dseq})
 dfc       (({frac}{exp_opt}{fsuff_opt})|({dseq}{exp}{fsuff_opt}))
 hfc       (({hpref}{hfrac}{bexp}{fsuff_opt})|({hpref}{hdseq}{bexp}{fsuff_opt}))

 c99_floating_point_constant  ({dfc}|{hfc})
@end verbatim

See C99 section 6.4.4.2 for the gory details.

@end table

@node Identifiers, Quoted Constructs, Numbers, Common Patterns
@subsection Identifiers

@table @asis

@item C99 Identifier
@verbatim
ucn        ((\\u([[:xdigit:]]{4}))|(\\U([[:xdigit:]]{8})))
nondigit    [_[:alpha:]]
c99_id     ([_[:alpha:]]|{ucn})([_[:alnum:]]|{ucn})*
@end verbatim

Technically, the above pattern does not encompass all possible C99 identifiers, since C99 allows for
"implementation-defined" characters.
In practice, C compilers follow the above pattern, with the
addition of the @samp{$} character.

@item UTF-8 Encoded Unicode Code Point
@verbatim
[\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF]([\x80-\xBF]{2})|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF]([\x80-\xBF]{2})|[\xF1-\xF3]([\x80-\xBF]{3})|\xF4[\x80-\x8F]([\x80-\xBF]{2})
@end verbatim

@end table

@node Quoted Constructs, Addresses, Identifiers, Common Patterns
@subsection Quoted Constructs

@table @asis
@item C99 String Literal
@code{L?\"([^\"\\\n]|(\\['\"?\\abfnrtv])|(\\([0-7]@{1,3@}))|(\\x[[:xdigit:]]+)|(\\u([[:xdigit:]]@{4@}))|(\\U([[:xdigit:]]@{8@})))*\"}

@item C99 Comment
@code{("/*"([^*]|"*"+[^*/])*"*"+"/")|("/"(\\\n)*"/"[^\n]*)}

Note that in C99, a @samp{//}-style comment may be split across lines, and, contrary to popular belief,
does not include the trailing @samp{\n} character.

A better way to scan @samp{/* */} comments is by line, rather than matching
possibly huge comments all at once. This will allow you to scan comments of
unlimited length, as long as line breaks appear at sane intervals. This is also
more efficient when used with automatic line number processing. @xref{option-yylineno}.

@verbatim
<INITIAL>{
    "/*"      BEGIN(COMMENT);
}
<COMMENT>{
    "*/"      BEGIN(INITIAL);
    [^*\n]+   ;
    "*"       ;
    \n        ;
}
@end verbatim

@end table

@node Addresses, ,Quoted Constructs, Common Patterns
@subsection Addresses

@table @asis

@item IPv4 Address
@verbatim
dec-octet     [0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]
IPv4address   {dec-octet}\.{dec-octet}\.{dec-octet}\.{dec-octet}
@end verbatim

@item IPv6 Address
@verbatim
h16           [0-9A-Fa-f]{1,4}
ls32          {h16}:{h16}|{IPv4address}
IPv6address   ({h16}:){6}{ls32}|
              ::({h16}:){5}{ls32}|
              ({h16})?::({h16}:){4}{ls32}|
              (({h16}:){0,1}{h16})?::({h16}:){3}{ls32}|
              (({h16}:){0,2}{h16})?::({h16}:){2}{ls32}|
              (({h16}:){0,3}{h16})?::{h16}:{ls32}|
              (({h16}:){0,4}{h16})?::{ls32}|
              (({h16}:){0,5}{h16})?::{h16}|
              (({h16}:){0,6}{h16})?::
@end verbatim

See @uref{http://www.ietf.org/rfc/rfc2373.txt, RFC 2373} for details.
Note that you have to fold the definition of @code{IPv6address} into one
line and that it also matches the ``unspecified address'' ``::''.

@item URI
@code{(([^:/?#]+):)?("//"([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?}

This pattern is nearly useless, since it allows just about any character
to appear in a URI, including spaces and control characters.  See
@uref{http://www.ietf.org/rfc/rfc2396.txt, RFC 2396} for details.

@end table


@node Indices, , Appendices, Top
@unnumbered Indices

@menu
* Concept Index::
* Index of Functions and Macros::
* Index of Variables::
* Index of Data Types::
* Index of Hooks::
* Index of Scanner Options::
@end menu

@node Concept Index, Index of Functions and Macros, Indices, Indices
@unnumberedsec Concept Index

@printindex cp

@node Index of Functions and Macros, Index of Variables, Concept Index, Indices
@unnumberedsec Index of Functions and Macros

This is an index of functions and preprocessor macros that look like functions.
For macros that expand to variables or constants, see @ref{Index of Variables}.

@printindex fn

@node Index of Variables, Index of Data Types, Index of Functions and Macros, Indices
@unnumberedsec Index of Variables

This is an index of variables, constants, and preprocessor macros
that expand to variables or constants.

@printindex vr

@node Index of Data Types, Index of Hooks, Index of Variables, Indices
@unnumberedsec Index of Data Types
@printindex tp

@node Index of Hooks, Index of Scanner Options, Index of Data Types, Indices
@unnumberedsec Index of Hooks

This is an index of "hooks" that the user may define. These hooks typically correspond
to specific locations in the generated scanner, and may be used to insert arbitrary code.

@printindex hk

@node Index of Scanner Options, , Index of Hooks, Indices
@unnumberedsec Index of Scanner Options

@printindex op

@c A vim script to name the faq entries. delete this when faqs are no longer
@c named "unnamed-faq-XXX".
@c
@c fu! Faq2 () range abort
@c     let @r=input("Rename to: ")
@c     exe "%s/" . @w . "/" . @r . "/g"
@c     normal 'f
@c endf
@c nnoremap <F5> 1G/@node\s\+unnamed-faq-\d\+<cr>mfww"wy5ezt:call Faq2()<cr>

@bye