flex.texi revision 1.1.1.3
\input texinfo.tex @c -*-texinfo-*-
@c %**start of header
@setfilename flex.info
@include version.texi
@settitle Lexical Analysis With Flex, for Flex @value{VERSION}
@set authors Vern Paxson, Will Estes and John Millaway
@c "Macro Hooks" index
@defindex hk
@c "Options" index
@defindex op
@dircategory Programming
@direntry
* flex: (flex).      Fast lexical analyzer generator (lex replacement).
@end direntry
@c %**end of header

@copying

The flex manual is placed under the same licensing conditions as the
rest of flex:

Copyright @copyright{} 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2012
The Flex Project.

Copyright @copyright{} 1990, 1997 The Regents of the University of California.
All rights reserved.

This code is derived from software contributed to Berkeley by
Vern Paxson.

The United States Government has rights in this work pursuant
to contract no. DE-AC03-76SF00098 between the United States
Department of Energy and the University of California.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

@enumerate
@item
Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

@item
Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
@end enumerate

Neither the name of the University nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.

THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE.
@end copying

@titlepage
@title Lexical Analysis with Flex
@subtitle Edition @value{EDITION}, @value{UPDATED}
@author @value{authors}
@page
@vskip 0pt plus 1filll
@insertcopying
@end titlepage
@contents
@ifnottex
@node Top, Copyright, (dir), (dir)
@top flex

This manual describes @code{flex}, a tool for generating programs that
perform pattern-matching on text.  The manual includes both tutorial and
reference sections.

This edition of @cite{The flex Manual} documents @code{flex} version
@value{VERSION}.  It was last updated on @value{UPDATED}.

This manual was written by @value{authors}.

@menu
* Copyright::
* Reporting Bugs::
* Introduction::
* Simple Examples::
* Format::
* Patterns::
* Matching::
* Actions::
* Generated Scanner::
* Start Conditions::
* Multiple Input Buffers::
* EOF::
* Misc Macros::
* User Values::
* Yacc::
* Scanner Options::
* Performance::
* Cxx::
* Reentrant::
* Lex and Posix::
* Memory Management::
* Serialized Tables::
* Diagnostics::
* Limitations::
* Bibliography::
* FAQ::
* Appendices::
* Indices::

@detailmenu
 --- The Detailed Node Listing ---

Format of the Input File

* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::

Scanner Options

* Options for Specifying Filenames::
* Options Affecting Scanner Behavior::
* Code-Level And API Options::
* Options for Scanner Speed and Size::
* Debugging Options::
* Miscellaneous Options::

Reentrant C Scanners

* Reentrant Uses::
* Reentrant Overview::
* Reentrant Example::
* Reentrant Detail::
* Reentrant Functions::

The Reentrant API in Detail

* Specify Reentrant::
* Extra Reentrant Argument::
* Global Replacement::
* Init and Destroy Functions::
* Accessor Methods::
* Extra Data::
* About yyscan_t::

Memory Management

* The Default Memory Management::
* Overriding The Default Memory Management::
* A Note About yytext And Memory::

Serialized Tables

* Creating Serialized Tables::
* Loading and Unloading Serialized Tables::
* Tables File Format::

FAQ

* When was flex born?::
* How do I expand backslash-escape sequences in C-style quoted strings?::
* Why do flex scanners call fileno if it is not ANSI compatible?::
* Does flex support recursive pattern definitions?::
* How do I skip huge chunks of input (tens of megabytes) while using flex?::
* Flex is not matching my patterns in the same order that I defined them.::
* My actions are executing out of order or sometimes not at all.::
* How can I have multiple input sources feed into the same scanner at the same time?::
* Can I build nested parsers that work with the same input file?::
* How can I match text only at the end of a file?::
* How can I make REJECT cascade across start condition boundaries?::
* Why cant I use fast or full tables with interactive mode?::
* How much faster is -F or -f than -C?::
* If I have a simple grammar cant I just parse it with flex?::
* Why doesn't yyrestart() set the start state back to INITIAL?::
* How can I match C-style comments?::
* The period isn't working the way I expected.::
* Can I get the flex manual in another format?::
* Does there exist a "faster" NDFA->DFA algorithm?::
* How does flex compile the DFA so quickly?::
* How can I use more than 8192 rules?::
* How do I abandon a file in the middle of a scan and switch to a new file?::
* How do I execute code only during initialization (only before the first scan)?::
* How do I execute code at termination?::
* Where else can I find help?::
* Can I include comments in the "rules" section of the file?::
* I get an error about undefined yywrap().::
* How can I change the matching pattern at run time?::
* How can I expand macros in the input?::
* How can I build a two-pass scanner?::
* How do I match any string not matched in the preceding rules?::
* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
* Is there a way to make flex treat NULL like a regular character?::
* Whenever flex can not match the input it says "flex scanner jammed".::
* Why doesn't flex have non-greedy operators like perl does?::
* Memory leak - 16386 bytes allocated by malloc.::
* How do I track the byte offset for lseek()?::
* How do I use my own I/O classes in a C++ scanner?::
* How do I skip as many chars as possible?::
* deleteme00::
* Are certain equivalent patterns faster than others?::
* Is backing up a big deal?::
* Can I fake multi-byte character support?::
* deleteme01::
* Can you discuss some flex internals?::
* unput() messes up yy_at_bol::
* The | operator is not doing what I want::
* Why can't flex understand this variable trailing context pattern?::
* The ^ operator isn't working::
* Trailing context is getting confused with trailing optional patterns::
* Is flex GNU or not?::
* ERASEME53::
* I need to scan if-then-else blocks and while loops::
* ERASEME55::
* ERASEME56::
* ERASEME57::
* Is there a repository for flex scanners?::
* How can I conditionally compile or preprocess my flex input file?::
* Where can I find grammars for lex and yacc?::
* I get an end-of-buffer message for each character scanned.::
* unnamed-faq-62::
* unnamed-faq-63::
* unnamed-faq-64::
* unnamed-faq-65::
* unnamed-faq-66::
* unnamed-faq-67::
* unnamed-faq-68::
* unnamed-faq-69::
* unnamed-faq-70::
* unnamed-faq-71::
* unnamed-faq-72::
* unnamed-faq-73::
* unnamed-faq-74::
* unnamed-faq-75::
* unnamed-faq-76::
* unnamed-faq-77::
* unnamed-faq-78::
* unnamed-faq-79::
* unnamed-faq-80::
* unnamed-faq-81::
* unnamed-faq-82::
* unnamed-faq-83::
* unnamed-faq-84::
* unnamed-faq-85::
* unnamed-faq-86::
* unnamed-faq-87::
* unnamed-faq-88::
* unnamed-faq-90::
* unnamed-faq-91::
* unnamed-faq-92::
* unnamed-faq-93::
* unnamed-faq-94::
* unnamed-faq-95::
* unnamed-faq-96::
* unnamed-faq-97::
* unnamed-faq-98::
* unnamed-faq-99::
* unnamed-faq-100::
* unnamed-faq-101::
* What is the difference between YYLEX_PARAM and YY_DECL?::
* Why do I get "conflicting types for yylex" error?::
* How do I access the values set in a Flex action from within a Bison action?::

Appendices

* Makefiles and Flex::
* Bison Bridge::
* M4 Dependency::
* Common Patterns::

Indices

* Concept Index::
* Index of Functions and Macros::
* Index of Variables::
* Index of Data Types::
* Index of Hooks::
* Index of Scanner Options::

@end detailmenu
@end menu
@end ifnottex
@node Copyright, Reporting Bugs, Top, Top
@chapter Copyright

@cindex copyright of flex
@cindex distributing flex
@insertcopying

@node Reporting Bugs, Introduction, Copyright, Top
@chapter Reporting Bugs

@cindex bugs, reporting
@cindex reporting bugs

If you find a bug in @code{flex}, please report it using
the SourceForge Bug Tracking facilities which can be found on
@url{http://sourceforge.net/projects/flex,flex's SourceForge Page}.

@node Introduction, Simple Examples, Reporting Bugs, Top
@chapter Introduction

@cindex scanner, definition of
@code{flex} is a tool for generating @dfn{scanners}.  A scanner is a
program which recognizes lexical patterns in text.  The @code{flex}
program reads the given input files, or its standard input if no file
names are given, for a description of a scanner to generate.  The
description is in the form of pairs of regular expressions and C code,
called @dfn{rules}.
@code{flex} generates as output a C source file,
@file{lex.yy.c} by default, which defines a routine @code{yylex()}.
This file can be compiled and linked with the flex runtime library to
produce an executable.  When the executable is run, it analyzes its
input for occurrences of the regular expressions.  Whenever it finds
one, it executes the corresponding C code.

@node Simple Examples, Format, Introduction, Top
@chapter Some Simple Examples

First some simple examples to get the flavor of how one uses
@code{flex}.

@cindex username expansion
The following @code{flex} input specifies a scanner which, when it
encounters the string @samp{username} will replace it with the user's
login name:

@example
@verbatim
    %%
    username    printf( "%s", getlogin() );
@end verbatim
@end example

@cindex default rule
@cindex rules, default
By default, any text not matched by a @code{flex} scanner is copied to
the output, so the net effect of this scanner is to copy its input file
to its output with each occurrence of @samp{username} expanded.  In this
input, there is just one rule.  @samp{username} is the @dfn{pattern} and
the @samp{printf} is the @dfn{action}.  The @samp{%%} symbol marks the
beginning of the rules.

Here's another simple example:

@cindex counting characters and lines
@example
@verbatim
            int num_lines = 0, num_chars = 0;

    %%
    \n      ++num_lines; ++num_chars;
    .       ++num_chars;

    %%

    int main()
            {
            yylex();
            printf( "# of lines = %d, # of chars = %d\n",
                    num_lines, num_chars );
            }
@end verbatim
@end example

This scanner counts the number of characters and the number of lines in
its input.  It produces no output other than the final report on the
character and line counts.
The first line declares two globals,
@code{num_lines} and @code{num_chars}, which are accessible both inside
@code{yylex()} and in the @code{main()} routine declared after the
second @samp{%%}.  There are two rules, one which matches a newline
(@samp{\n}) and increments both the line count and the character count,
and one which matches any character other than a newline (indicated by
the @samp{.} regular expression).

A somewhat more complicated example:

@cindex Pascal-like language
@example
@verbatim
    /* scanner for a toy Pascal-like language */

    %{
    /* need this for the calls to atoi() and atof() below */
    #include <stdlib.h>
    %}

    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*

    %%

    {DIGIT}+    {
                printf( "An integer: %s (%d)\n", yytext,
                        atoi( yytext ) );
                }

    {DIGIT}+"."{DIGIT}*        {
                printf( "A float: %s (%g)\n", yytext,
                        atof( yytext ) );
                }

    if|then|begin|end|procedure|function        {
                printf( "A keyword: %s\n", yytext );
                }

    {ID}        printf( "An identifier: %s\n", yytext );

    "+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );

    "{"[^{}\n]*"}"     /* eat up one-line comments */

    [ \t\n]+          /* eat up whitespace */

    .           printf( "Unrecognized character: %s\n", yytext );

    %%

    int main( int argc, char **argv )
        {
        ++argv, --argc;  /* skip over program name */
        if ( argc > 0 )
                yyin = fopen( argv[0], "r" );
        else
                yyin = stdin;

        yylex();
        }
@end verbatim
@end example

This is the beginning of a simple scanner for a language like Pascal.
It identifies different types of @dfn{tokens} and reports on what it has
seen.

The details of this example will be explained in the following
sections.
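
To make the behavior of the line-and-character counting scanner above
concrete, here is a hand-written C sketch of what its two rules amount
to.  It is not the code that @code{flex} generates; the names
@code{struct counts} and @code{count_lines_and_chars} are invented for
this illustration, and the input is assumed to be already in memory.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch (not part of flex). */
struct counts {
    int lines;
    int chars;
};

/* Walk the input one character at a time, applying whichever of the
 * two rules from the counting example matches. */
struct counts count_lines_and_chars(const char *input, size_t len)
{
    struct counts c = {0, 0};
    size_t i;

    for (i = 0; i < len; i++) {
        if (input[i] == '\n') {
            /* rule:  \n   ++num_lines; ++num_chars; */
            ++c.lines;
            ++c.chars;
        } else {
            /* rule:  .    ++num_chars; */
            ++c.chars;
        }
    }
    return c;
}
```

Every input character is consumed by exactly one of the two rules, so
the default rule never fires and nothing is echoed to the output.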

@node Format, Patterns, Simple Examples, Top
@chapter Format of the Input File


@cindex format of flex input
@cindex input, format of
@cindex file format
@cindex sections of flex input

The @code{flex} input file consists of three sections, separated by a
line containing only @samp{%%}.

@cindex format of input file
@example
@verbatim
    definitions
    %%
    rules
    %%
    user code
@end verbatim
@end example

@menu
* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::
@end menu

@node Definitions Section, Rules Section, Format, Format
@section Format of the Definitions Section

@cindex input file, Definitions section
@cindex Definitions, in flex input
The @dfn{definitions section} contains declarations of simple @dfn{name}
definitions to simplify the scanner specification, and declarations of
@dfn{start conditions}, which are explained in a later section.

@cindex aliases, how to define
@cindex pattern aliases, how to define
Name definitions have the form:

@example
@verbatim
    name definition
@end verbatim
@end example

The @samp{name} is a word beginning with a letter or an underscore
(@samp{_}) followed by zero or more letters, digits, @samp{_}, or
@samp{-} (dash).  The definition is taken to begin at the first
non-whitespace character following the name and continuing to the end of
the line.  The definition can subsequently be referred to using
@samp{@{name@}}, which will expand to @samp{(definition)}.
For example,

@cindex pattern aliases, defining
@cindex defining pattern aliases
@example
@verbatim
    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*
@end verbatim
@end example

defines @samp{DIGIT} to be a regular expression which matches a single
digit, and @samp{ID} to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits.  A subsequent reference to

@cindex pattern aliases, use of
@example
@verbatim
    {DIGIT}+"."{DIGIT}*
@end verbatim
@end example

is identical to

@example
@verbatim
    ([0-9])+"."([0-9])*
@end verbatim
@end example

and matches one-or-more digits followed by a @samp{.} followed by
zero-or-more digits.

@cindex comments in flex input
An unindented comment (i.e., a line
beginning with @samp{/*}) is copied verbatim to the output up
to the next @samp{*/}.

@cindex %@{ and %@}, in Definitions Section
@cindex embedding C code in flex input
@cindex C code in flex input
Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}}
is also copied verbatim to the output (with the %@{ and %@} symbols
removed).  The %@{ and %@} symbols must appear unindented on lines by
themselves.

@cindex %top

A @code{%top} block is similar to a @samp{%@{} ... @samp{%@}} block, except
that the code in a @code{%top} block is relocated to the @emph{top} of the
generated file, before any flex definitions @footnote{Actually,
@code{yyIN_HEADER} is defined before the @samp{%top} block.}.
The @code{%top} block is useful when you want certain preprocessor macros to be
defined or certain files to be included before the generated code.
The single characters, @samp{@{} and @samp{@}} are used to delimit the
@code{%top} block, as shown in the example below:

@example
@verbatim
    %top{
        /* This code goes at the "top" of the generated file. */
        #include <stdint.h>
        #include <inttypes.h>
    }
@end verbatim
@end example

Multiple @code{%top} blocks are allowed, and their order is preserved.

@node Rules Section, User Code Section, Definitions Section, Format
@section Format of the Rules Section

@cindex input file, Rules Section
@cindex rules, in flex input
The @dfn{rules} section of the @code{flex} input contains a series of
rules of the form:

@example
@verbatim
    pattern   action
@end verbatim
@end example

where the pattern must be unindented and the action must begin
on the same line.
@xref{Patterns}, for a further description of patterns and actions.

In the rules section, any indented or %@{ %@} enclosed text appearing
before the first rule may be used to declare variables which are local
to the scanning routine and (after the declarations) code which is to be
executed whenever the scanning routine is entered.  Other indented or
%@{ %@} text in the rule section is still copied to the output, but its
meaning is not well-defined and it may well cause compile-time errors
(this feature is present for @acronym{POSIX} compliance.  @xref{Lex and
Posix}, for other such features).

Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}}
is copied verbatim to the output (with the %@{ and %@} symbols removed).
The %@{ and %@} symbols must appear unindented on lines by themselves.

@node User Code Section, Comments in the Input, Rules Section, Format
@section Format of the User Code Section

@cindex input file, user code Section
@cindex user code, in flex input
The user code section is simply copied to @file{lex.yy.c} verbatim.  It
is used for companion routines which call or are called by the scanner.
The presence of this section is optional; if it is missing, the second
@samp{%%} in the input file may be skipped, too.
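
Returning briefly to the name definitions of the definitions section:
the rule that a reference to a name in braces expands to the
parenthesized definition can be sketched in C as a plain textual
substitution.  This is only an illustration under simplifying
assumptions, not how @code{flex} itself is implemented: the names
@code{struct name_def} and @code{expand_names} are hypothetical, and the
sketch does not treat quoted strings, character classes, or backslash
escapes specially.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: one "name definition" line. */
struct name_def {
    const char *name;
    const char *definition;
};

/* Copy pattern to out, replacing each {name} with (definition).
 * A '{' that is not followed by a letter or underscore, such as the
 * repeat count in r{2,5}, is copied unchanged.
 * Returns 0 on success, -1 on an unknown name or if out is too small. */
int expand_names(const char *pattern, const struct name_def *defs,
                 size_t ndefs, char *out, size_t outsize)
{
    size_t o = 0;
    const char *p = pattern;

    if (outsize == 0)
        return -1;

    while (*p != '\0') {
        const char *close = NULL;

        if (*p == '{' && (isalpha((unsigned char)p[1]) || p[1] == '_'))
            close = strchr(p, '}');

        if (close != NULL) {
            size_t namelen = (size_t)(close - p - 1);
            const char *def = NULL;
            size_t deflen, i;

            for (i = 0; i < ndefs; i++) {
                if (strlen(defs[i].name) == namelen &&
                    strncmp(defs[i].name, p + 1, namelen) == 0) {
                    def = defs[i].definition;
                    break;
                }
            }
            if (def == NULL)
                return -1;                  /* unknown name */

            deflen = strlen(def);
            if (o + deflen + 2 >= outsize)
                return -1;                  /* no room for "(def)" + NUL */
            out[o++] = '(';
            memcpy(out + o, def, deflen);
            o += deflen;
            out[o++] = ')';
            p = close + 1;
        } else {
            if (o + 1 >= outsize)
                return -1;                  /* no room for char + NUL */
            out[o++] = *p++;
        }
    }
    out[o] = '\0';
    return 0;
}
```

The added parentheses are what keep a trailing operator such as
@samp{+} applying to the whole definition rather than to its last
element.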

@node Comments in the Input, , User Code Section, Format
@section Comments in the Input

@cindex comments, syntax of
Flex supports C-style comments, that is, anything between @samp{/*} and
@samp{*/} is
considered a comment.  Whenever flex encounters a comment, it copies the
entire comment verbatim to the generated source code.  Comments may
appear just about anywhere, but with the following exceptions:

@itemize
@cindex comments, in rules section
@item
Comments may not appear in the Rules Section wherever flex is expecting
a regular expression.  This means comments may not appear at the
beginning of a line, or immediately following a list of scanner states.
@item
Comments may not appear on an @samp{%option} line in the Definitions
Section.
@end itemize

If you want to follow a simple rule, then always begin a comment on a
new line, with one or more whitespace characters before the initial
@samp{/*}.  This rule will work anywhere in the input file.

All the comments in the following example are valid:

@cindex comments, valid uses of
@cindex comments in the input
@example
@verbatim
%{
/* code block */
%}

/* Definitions Section */
%x STATE_X

%%
    /* Rules Section */
ruleA   /* after regex */ { /* code block */ } /* after code block */
        /* Rules Section (indented) */
<STATE_X>{
ruleC   ECHO;
ruleD   ECHO;
%{
/* code block */
%}
}
%%
/* User Code Section */

@end verbatim
@end example

@node Patterns, Matching, Format, Top
@chapter Patterns

@cindex patterns, in rules section
@cindex regular expressions, in patterns
The patterns in the input (see @ref{Rules Section}) are written using an
extended set of regular expressions.  These are:

@cindex patterns, syntax
@cindex patterns, syntax
@table @samp
@item x
match the character 'x'

@item .
any character (byte) except newline

@cindex [] in patterns
@cindex character classes in patterns, syntax of
@cindex POSIX, character classes in patterns, syntax of
@item [xyz]
a @dfn{character class}; in this case, the pattern
matches either an 'x', a 'y', or a 'z'

@cindex ranges in patterns
@item [abj-oZ]
a "character class" with a range in it; matches
an 'a', a 'b', any letter from 'j' through 'o',
or a 'Z'

@cindex ranges in patterns, negating
@cindex negating ranges in patterns
@item [^A-Z]
a "negated character class", i.e., any character
but those in the class.  In this case, any
character EXCEPT an uppercase letter.

@item [^A-Z\n]
any character EXCEPT an uppercase letter or
a newline

@item [a-z]@{-@}[aeiou]
the lowercase consonants

@item r*
zero or more r's, where r is any regular expression

@item r+
one or more r's

@item r?
zero or one r's (that is, ``an optional r'')

@cindex braces in patterns
@item r@{2,5@}
anywhere from two to five r's

@item r@{2,@}
two or more r's

@item r@{4@}
exactly 4 r's

@cindex pattern aliases, expansion of
@item @{name@}
the expansion of the @samp{name} definition
(@pxref{Format}).

@cindex literal text in patterns, syntax of
@cindex verbatim text in patterns, syntax of
@item "[xyz]\"foo"
the literal string: @samp{[xyz]"foo}

@cindex escape sequences in patterns, syntax of
@item \X
if X is @samp{a}, @samp{b}, @samp{f}, @samp{n}, @samp{r}, @samp{t}, or
@samp{v}, then the ANSI-C interpretation of @samp{\x}.
Otherwise, a
literal @samp{X} (used to escape operators such as @samp{*})

@cindex NULL character in patterns, syntax of
@item \0
a NUL character (ASCII code 0)

@cindex octal characters in patterns
@item \123
the character with octal value 123

@item \x2a
the character with hexadecimal value 2a

@item (r)
match an @samp{r}; parentheses are used to override precedence (see below)

@item (?r-s:pattern)
apply option @samp{r} and omit option @samp{s} while interpreting pattern.
Options may be zero or more of the characters @samp{i}, @samp{s}, or @samp{x}.

@samp{i} means case-insensitive.  @samp{-i} means case-sensitive.

@samp{s} alters the meaning of the @samp{.} syntax to match any single byte whatsoever.
@samp{-s} alters the meaning of @samp{.} to match any byte except @samp{\n}.

@samp{x} ignores comments and whitespace in patterns.  Whitespace is ignored unless
it is backslash-escaped, contained within @samp{""}s, or appears inside a
character class.

The following are all valid:

@verbatim
(?:foo)         same as  (foo)
(?i:ab7)        same as  ([aA][bB]7)
(?-i:ab)        same as  (ab)
(?s:.)          same as  [\x00-\xFF]
(?-s:.)         same as  [^\n]
(?ix-s: a . b)  same as  ([Aa][^\n][bB])
(?x:a  b)       same as  ("ab")
(?x:a\ b)       same as  ("a b")
(?x:a" "b)      same as  ("a b")
(?x:a[ ]b)      same as  ("a b")
(?x:a
    /* comment */
    b
    c)          same as  (abc)
@end verbatim

@item (?# comment )
omit everything within @samp{()}.  The first @samp{)}
character encountered ends the pattern.  It is not possible for the comment
to contain a @samp{)} character.  The comment may span lines.

@cindex concatenation, in patterns
@item rs
the regular expression @samp{r} followed by the regular expression @samp{s}; called
@dfn{concatenation}

@item r|s
either an @samp{r} or an @samp{s}

@cindex trailing context, in patterns
@item r/s
an @samp{r} but only if it is followed by an @samp{s}.  The text matched by @samp{s} is
included when determining whether this rule is the longest match, but is
then returned to the input before the action is executed.  So the action
only sees the text matched by @samp{r}.  This type of pattern is called
@dfn{trailing context}.  (There are some combinations of @samp{r/s} that flex
cannot match correctly.  @xref{Limitations}, regarding dangerous trailing
context.)

@cindex beginning of line, in patterns
@cindex BOL, in patterns
@item ^r
an @samp{r}, but only at the beginning of a line (i.e.,
when just starting to scan, or right after a
newline has been scanned).

@cindex end of line, in patterns
@cindex EOL, in patterns
@item r$
an @samp{r}, but only at the end of a line (i.e., just before a
newline).  Equivalent to @samp{r/\n}.

@cindex newline, matching in patterns
Note that @code{flex}'s notion of ``newline'' is exactly
whatever the C compiler used to compile @code{flex}
interprets @samp{\n} as; in particular, on some DOS
systems you must either filter out @samp{\r}s in the
input yourself, or explicitly use @samp{r/\r\n} for @samp{r$}.

@cindex start conditions, in patterns
@item <s>r
an @samp{r}, but only in start condition @code{s} (see @ref{Start
Conditions} for discussion of start conditions).

@item <s1,s2,s3>r
same, but in any of start conditions @code{s1}, @code{s2}, or @code{s3}.

@item <*>r
an @samp{r} in any start condition, even an exclusive one.

@cindex end of file, in patterns
@cindex EOF in patterns, syntax of
@item <<EOF>>
an end-of-file.

@item <s1,s2><<EOF>>
an end-of-file when in start condition @code{s1} or @code{s2}
@end table

Note that inside of a character class, all regular expression operators
lose their special meaning except escape (@samp{\}) and the character class
operators, @samp{-}, @samp{]}, and, at the beginning of the class, @samp{^}.

@cindex patterns, precedence of operators
The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence (see special note on the
precedence of the repeat operator, @samp{@{@}}, under the documentation
for the @samp{--posix} POSIX compliance option).  For example,

@cindex patterns, grouping and precedence
@example
@verbatim
    foo|bar*
@end verbatim
@end example

is the same as

@example
@verbatim
    (foo)|(ba(r*))
@end verbatim
@end example

since the @samp{*} operator has higher precedence than concatenation,
and concatenation higher than alternation (@samp{|}).  This pattern
therefore matches @emph{either} the string @samp{foo} @emph{or} the
string @samp{ba} followed by zero-or-more @samp{r}'s.  To match
@samp{foo} or zero-or-more repetitions of the string @samp{bar}, use:

@example
@verbatim
    foo|(bar)*
@end verbatim
@end example

And to match a sequence of zero or more repetitions of @samp{foo} and
@samp{bar}:

@cindex patterns, repetitions with grouping
@example
@verbatim
    (foo|bar)*
@end verbatim
@end example

@cindex character classes in patterns
In addition to characters and ranges of characters, character classes
can also contain @dfn{character class expressions}.  These are
expressions enclosed inside @samp{[:} and @samp{:]} delimiters (which
themselves must appear between the @samp{[} and @samp{]} of the
character class.
Other elements may occur inside the character class,
too).  The valid expressions are:

@cindex patterns, valid character classes
@example
@verbatim
    [:alnum:] [:alpha:] [:blank:]
    [:cntrl:] [:digit:] [:graph:]
    [:lower:] [:print:] [:punct:]
    [:space:] [:upper:] [:xdigit:]
@end verbatim
@end example

These expressions all designate a set of characters equivalent to the
corresponding standard C @code{isXXX} function.  For example,
@samp{[:alnum:]} designates those characters for which @code{isalnum()}
returns true - i.e., any alphabetic or numeric character.  Some systems
don't provide @code{isblank()}, so flex defines @samp{[:blank:]} as a
blank or a tab.

For example, the following character classes are all equivalent:

@cindex character classes, equivalence of
@cindex patterns, character class equivalence
@example
@verbatim
    [[:alnum:]]
    [[:alpha:][:digit:]]
    [[:alpha:][0-9]]
    [a-zA-Z0-9]
@end verbatim
@end example

A word of caution.  Character classes are expanded immediately when seen in the @code{flex} input.
This means the character classes are sensitive to the locale in which @code{flex}
is executed, and the resulting scanner will not be sensitive to the runtime locale.
This may or may not be desirable.


@itemize
@cindex case-insensitive, effect on character classes
@item If your scanner is case-insensitive (the @samp{-i} flag), then
@samp{[:upper:]} and @samp{[:lower:]} are equivalent to
@samp{[:alpha:]}.

@anchor{case and character ranges}
@item Character classes with ranges, such as @samp{[a-Z]}, should be used with
caution in a case-insensitive scanner if the range spans upper or lowercase
characters.  Flex does not know if you want to fold all upper and lowercase
characters together, or if you want the literal numeric range specified (with
no case folding).
When in doubt, flex will assume that you meant the literal
numeric range, and will issue a warning.  The exception to this rule is a
character range such as @samp{[a-z]} or @samp{[S-W]} where it is obvious that you
want case-folding to occur.  Here are some examples with the @samp{-i} flag
enabled:

@multitable {@samp{[a-zA-Z]}} {ambiguous} {@samp{[A-Z\[\\\]_`a-t]}} {@samp{[@@A-Z\[\\\]_`abc]}}
@item Range @tab Result @tab Literal Range @tab Alternate Range
@item @samp{[a-t]} @tab ok @tab @samp{[a-tA-T]} @tab
@item @samp{[A-T]} @tab ok @tab @samp{[a-tA-T]} @tab
@item @samp{[A-t]} @tab ambiguous @tab @samp{[A-Z\[\\\]_`a-t]} @tab @samp{[a-tA-T]}
@item @samp{[_-@{]} @tab ambiguous @tab @samp{[_`a-z@{]} @tab @samp{[_`a-zA-Z@{]}
@item @samp{[@@-C]} @tab ambiguous @tab @samp{[@@ABC]} @tab @samp{[@@A-Z\[\\\]_`abc]}
@end multitable

@cindex end of line, in negated character classes
@cindex EOL, in negated character classes
@item
A negated character class such as the example @samp{[^A-Z]} above
@emph{will} match a newline unless @samp{\n} (or an equivalent escape
sequence) is one of the characters explicitly present in the negated
character class (e.g., @samp{[^A-Z\n]}).  This is unlike how many other
regular expression tools treat negated character classes, but
unfortunately the inconsistency is historically entrenched.  Matching
newlines means that a pattern like @samp{[^"]*} can match the entire
input unless there's another quote in the input.

Flex allows negation of character class expressions by prepending @samp{^} to
the POSIX character class name.

@example
@verbatim
    [:^alnum:] [:^alpha:] [:^blank:]
    [:^cntrl:] [:^digit:] [:^graph:]
    [:^lower:] [:^print:] [:^punct:]
    [:^space:] [:^upper:] [:^xdigit:]
@end verbatim
@end example

Flex will issue a warning if the expressions @samp{[:^upper:]} and
@samp{[:^lower:]} appear in a case-insensitive scanner, since their meaning is
unclear.  The current behavior is to skip them entirely, but this may change
without notice in future revisions of flex.

@item

The @samp{@{-@}} operator computes the difference of two character classes.  For
example, @samp{[a-c]@{-@}[b-z]} represents all the characters in the class
@samp{[a-c]} that are not in the class @samp{[b-z]} (which in this case, is
just the single character @samp{a}).  The @samp{@{-@}} operator is left
associative, so @samp{[abc]@{-@}[b]@{-@}[c]} is the same as @samp{[a]}.  Be careful
not to accidentally create an empty set, which will never match.

@item

The @samp{@{+@}} operator computes the union of two character classes.  For
example, @samp{[a-z]@{+@}[0-9]} is the same as @samp{[a-z0-9]}.  This operator
is useful when preceded by the result of a difference operation, as in,
@samp{[[:alpha:]]@{-@}[[:lower:]]@{+@}[q]}, which is equivalent to
@samp{[A-Zq]} in the "C" locale.

@cindex trailing context, limits of
@cindex ^ as non-special character in patterns
@cindex $ as normal character in patterns
@item
A rule can have at most one instance of trailing context (the @samp{/} operator
or the @samp{$} operator).  The start condition, @samp{^}, and @samp{<<EOF>>} patterns
can only occur at the beginning of a pattern, and, as well as with @samp{/} and @samp{$},
cannot be grouped inside parentheses.  A @samp{^} which does not occur at
the beginning of a rule or a @samp{$} which does not occur at the end of
a rule loses its special properties and is treated as a normal character.

@item
The following are invalid:

@cindex patterns, invalid trailing context
@example
@verbatim
    foo/bar$
    <sc1>foo<sc2>bar
@end verbatim
@end example

Note that the first of these can be written @samp{foo/bar\n}.

@item
The following will result in @samp{$} or @samp{^} being treated as a normal character:

@cindex patterns, special characters treated as non-special
@example
@verbatim
    foo|(bar$)
    foo|^bar
@end verbatim
@end example

If the desired meaning is a @samp{foo} or a
@samp{bar}-followed-by-a-newline, the following could be used (the
special @code{|} action is explained below, @pxref{Actions}):

@cindex patterns, end of line
@example
@verbatim
    foo      |
    bar$     /* action goes here */
@end verbatim
@end example

A similar trick will work for matching a @samp{foo} or a
@samp{bar}-at-the-beginning-of-a-line.
@end itemize

@node Matching, Actions, Patterns, Top
@chapter How the Input Is Matched

@cindex patterns, matching
@cindex input, matching
@cindex trailing context, matching
@cindex matching, and trailing context
@cindex matching, length of
@cindex matching, multiple matches
When the generated scanner is run, it analyzes its input looking for
strings which match any of its patterns. If it finds more than one
match, it takes the one matching the most text (for trailing context
rules, this includes the length of the trailing part, even though it
will then be returned to the input). If it finds two or more matches of
the same length, the rule listed first in the @code{flex} input file is
chosen.
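
As a rough illustration of these two tie-breaking rules, consider the
following sketch. The rule set and the sample input are made up for this
example, and @code{%option noyywrap} is used only so that the scanner
links without @samp{-lfl}:

@example
@verbatim
    %option noyywrap
    %{
    #include <stdio.h>
    %}
    %%
    if          printf( "keyword: %s\n", yytext );
    [a-z]+      printf( "identifier: %s\n", yytext );
    [ \t\n]+    /* skip whitespace */
    %%
    int main( void )
        {
        yylex();
        return 0;
        }
@end verbatim
@end example

Given the input @samp{if iffy}, both rules match the two characters of
@samp{if}, so the rule listed first wins and the scanner prints
@samp{keyword: if}. For @samp{iffy}, the second rule matches four
characters against the first rule's two, so the longer match wins and
the scanner prints @samp{identifier: iffy}.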

@cindex token
@cindex yytext
@cindex yyleng
Once the match is determined, the text corresponding to the match
(called the @dfn{token}) is made available in the global character
pointer @code{yytext}, and its length in the global integer
@code{yyleng}. The @dfn{action} corresponding to the matched pattern is
then executed (@pxref{Actions}), and then the remaining input is scanned
for another match.

@cindex default rule
If no match is found, then the @dfn{default rule} is executed: the next
character in the input is considered matched and copied to the standard
output. Thus, the simplest valid @code{flex} input is:

@cindex minimal scanner
@example
@verbatim
    %%
@end verbatim
@end example

which generates a scanner that simply copies its input (one character at
a time) to its output.

@cindex yytext, two types of
@cindex %array, use of
@cindex %pointer, use of
@vindex yytext
Note that @code{yytext} can be defined in two different ways: either as
a character @emph{pointer} or as a character @emph{array}. You can
control which definition @code{flex} uses by including one of the
special directives @code{%pointer} or @code{%array} in the first
(definitions) section of your flex input. The default is
@code{%pointer}, unless you use the @samp{-l} lex compatibility option,
in which case @code{yytext} will be an array. The advantage of using
@code{%pointer} is substantially faster scanning and no buffer overflow
when matching very large tokens (unless you run out of dynamic memory).
The disadvantage is that you are restricted in how your actions can
modify @code{yytext} (@pxref{Actions}), and calls to the @code{unput()}
function destroy the present contents of @code{yytext}, which can be a
considerable porting headache when moving between different @code{lex}
versions.
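
The placement of the directive can be sketched as follows. This is only
an illustration: the upper-casing action is invented for the example,
and it changes @code{yytext} in place without lengthening it, which is
the kind of modification that should be acceptable under either
definition:

@example
@verbatim
    %array
    %option noyywrap
    %{
    #include <ctype.h>
    %}
    %%
    [a-z]+      {
                int i;

                /* overwrite the token in place; its length is unchanged */
                for ( i = 0; i < (int) yyleng; ++i )
                    yytext[i] = toupper( (unsigned char) yytext[i] );
                ECHO;
                }
    %%
    int main( void )
        {
        yylex();
        return 0;
        }
@end verbatim
@end example

Replacing @code{%array} with @code{%pointer} (or omitting the directive)
selects the pointer definition instead; nothing else in this sketch
needs to change.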

@cindex %array, advantages of
The advantage of @code{%array} is that you can then modify @code{yytext}
to your heart's content, and calls to @code{unput()} do not destroy
@code{yytext} (@pxref{Actions}). Furthermore, existing @code{lex}
programs sometimes access @code{yytext} externally using declarations of
the form:

@example
@verbatim
    extern char yytext[];
@end verbatim
@end example

This definition is erroneous when used with @code{%pointer}, but correct
for @code{%array}.

The @code{%array} declaration defines @code{yytext} to be an array of
@code{YYLMAX} characters, which defaults to a fairly large value. You
can change the size by simply #define'ing @code{YYLMAX} to a different
value in the first section of your @code{flex} input. As mentioned
above, with @code{%pointer} @code{yytext} grows dynamically to accommodate
large tokens. While this means your @code{%pointer} scanner can
accommodate very large tokens (such as matching entire blocks of
comments), bear in mind that each time the scanner must resize
@code{yytext} it also must rescan the entire token from the beginning,
so matching such tokens can prove slow. @code{yytext} presently does
@emph{not} dynamically grow if a call to @code{unput()} results in too
much text being pushed back; instead, a run-time error results.

@cindex %array, with C++
Also note that you cannot use @code{%array} with C++ scanner classes
(@pxref{Cxx}).

@node Actions, Generated Scanner, Matching, Top
@chapter Actions

@cindex actions
Each pattern in a rule has a corresponding @dfn{action}, which can be
any arbitrary C statement. The pattern ends at the first non-escaped
whitespace character; the remainder of the line is its action. If the
action is empty, then when the pattern is matched the input token is
simply discarded.
For example, here is the specification for a program
which deletes all occurrences of @samp{zap me} from its input:

@cindex deleting lines from input
@example
@verbatim
    %%
    "zap me"
@end verbatim
@end example

This example will copy all other characters in the input to the output
since they will be matched by the default rule.

Here is a program which compresses multiple blanks and tabs down to a
single blank, and throws away whitespace found at the end of a line:

@cindex whitespace, compressing
@cindex compressing whitespace
@example
@verbatim
    %%
    [ \t]+        putchar( ' ' );
    [ \t]+$       /* ignore this token */
@end verbatim
@end example

@cindex %@{ and %@}, in Rules Section
@cindex actions, use of @{ and @}
@cindex actions, embedded C strings
@cindex C-strings, in actions
@cindex comments, in actions
If the action contains a @samp{@{}, then the action spans till the
balancing @samp{@}} is found, and the action may cross multiple lines.
@code{flex} knows about C strings and comments and won't be fooled by
braces found within them, but also allows actions to begin with
@samp{%@{} and will consider the action to be all the text up to the
next @samp{%@}} (regardless of ordinary braces inside the action).

@cindex |, in actions
An action consisting solely of a vertical bar (@samp{|}) means ``same as the
action for the next rule''. See below for an illustration.

Actions can include arbitrary C code, including @code{return} statements
to return a value to whatever routine called @code{yylex()}. Each time
@code{yylex()} is called it continues processing tokens from where it
last left off until it either reaches the end of the file or executes a
return.
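
The interplay between @code{return} statements and repeated calls to
@code{yylex()} can be sketched as below. The token codes
@code{TOK_NUMBER} and @code{TOK_WORD} are hypothetical names chosen for
this example; a real scanner would normally take them from its parser:

@example
@verbatim
    %option noyywrap
    %{
    #include <stdio.h>

    /* hypothetical token codes; 0 is left free to mean end of input */
    enum { TOK_NUMBER = 1, TOK_WORD };
    %}
    %%
    [0-9]+      return TOK_NUMBER;
    [a-zA-Z]+   return TOK_WORD;
    .|\n        /* discard everything else */
    %%
    int main( void )
        {
        int tok;

        /* each call resumes where the previous one stopped */
        while ( (tok = yylex()) != 0 )
            printf( "token %d: %s\n", tok, yytext );
        return 0;
        }
@end verbatim
@end example

Each call to @code{yylex()} here returns as soon as one of the first two
rules matches, and the loop ends when the scanner reaches the end of the
input and returns 0.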

@cindex yytext, modification of
Actions are free to modify @code{yytext} except for lengthening it
(adding characters to its end--these will overwrite later characters in
the input stream). This, however, does not apply when using @code{%array}
(@pxref{Matching}). In that case, @code{yytext} may be freely modified
in any way.

@cindex yyleng, modification of
@cindex yymore, and yyleng
Actions are free to modify @code{yyleng} except they should not do so if
the action also includes use of @code{yymore()} (see below).

@cindex preprocessor macros, for use in actions
There are a number of special directives which can be included within an
action:

@table @code
@item ECHO
@cindex ECHO
copies @code{yytext} to the scanner's output.

@item BEGIN
@cindex BEGIN
followed by the name of a start condition places the scanner in the
corresponding start condition (see below).

@item REJECT
@cindex REJECT
directs the scanner to proceed on to the ``second best'' rule which
matched the input (or a prefix of the input). The rule is chosen as
described above in @ref{Matching}, and @code{yytext} and @code{yyleng}
are set up appropriately. It may either be one which matched as much text
as the originally chosen rule but came later in the @code{flex} input
file, or one which matched less text. For example, the following will
both count the words in the input and call the routine @code{special()}
whenever @samp{frob} is seen:

@example
@verbatim
            int word_count = 0;
    %%

    frob        special(); REJECT;
    [^ \t\n]+   ++word_count;
@end verbatim
@end example

Without the @code{REJECT}, any occurrences of @samp{frob} in the input
would not be counted as words, since the scanner normally executes only
one action per token.
Multiple uses of @code{REJECT} are allowed, each
one finding the next best choice to the currently active rule. For
example, when the following scanner scans the token @samp{abcd}, it will
write @samp{abcdabcaba} to the output:

@cindex REJECT, calling multiple times
@cindex |, use of
@example
@verbatim
    %%
    a        |
    ab       |
    abc      |
    abcd     ECHO; REJECT;
    .|\n     /* eat up any unmatched character */
@end verbatim
@end example

The first three rules share the fourth's action since they use the
special @samp{|} action.

@code{REJECT} is a particularly expensive feature in terms of scanner
performance; if it is used in @emph{any} of the scanner's actions it
will slow down @emph{all} of the scanner's matching. Furthermore,
@code{REJECT} cannot be used with the @samp{-Cf} or @samp{-CF} options
(@pxref{Scanner Options}).

Note also that unlike the other special actions, @code{REJECT} is a
@emph{branch}. Code immediately following it in the action will
@emph{not} be executed.

@item yymore()
@cindex yymore()
tells the scanner that the next time it matches a rule, the
corresponding token should be @emph{appended} onto the current value of
@code{yytext} rather than replacing it. For example, given the input
@samp{mega-kludge} the following will write @samp{mega-mega-kludge} to
the output:

@cindex yymore(), mega-kludge
@cindex yymore() to append token to previous token
@example
@verbatim
    %%
    mega-    ECHO; yymore();
    kludge   ECHO;
@end verbatim
@end example

First @samp{mega-} is matched and echoed to the output. Then @samp{kludge}
is matched, but the previous @samp{mega-} is still hanging around at the
beginning of @code{yytext} so the @code{ECHO} for the @samp{kludge} rule
will actually write @samp{mega-kludge}.
@end table

@cindex yymore, performance penalty of
Two notes regarding use of @code{yymore()}. First, @code{yymore()}
depends on the value of @code{yyleng} correctly reflecting the size of
the current token, so you must not modify @code{yyleng} if you are using
@code{yymore()}. Second, the presence of @code{yymore()} in the
scanner's action entails a minor performance penalty in the scanner's
matching speed.

@cindex yyless()
@code{yyless(n)} returns all but the first @code{n} characters of the
current token back to the input stream, where they will be rescanned
when the scanner looks for the next match. @code{yytext} and
@code{yyleng} are adjusted appropriately (e.g., @code{yyleng} will now
be equal to @code{n}). For example, on the input @samp{foobar} the
following will write out @samp{foobarbar}:

@cindex yyless(), pushing back characters
@cindex pushing back characters with yyless
@example
@verbatim
    %%
    foobar    ECHO; yyless(3);
    [a-z]+    ECHO;
@end verbatim
@end example

An argument of 0 to @code{yyless()} will cause the entire current input
string to be scanned again. Unless you've changed how the scanner will
subsequently process its input (using @code{BEGIN}, for example), this
will result in an endless loop.

Note that @code{yyless()} is a macro and can only be used in the flex
input file, not from other source files.

@cindex unput()
@cindex pushing back characters with unput
@code{unput(c)} puts the character @code{c} back onto the input stream.
It will be the next character scanned. The following action will take
the current token and cause it to be rescanned enclosed in parentheses.

@cindex unput(), pushing back characters
@cindex pushing back characters with unput()
@example
@verbatim
    {
    int i;
    /* Copy yytext because unput() trashes yytext */
    char *yycopy = strdup( yytext );
    unput( ')' );
    for ( i = yyleng - 1; i >= 0; --i )
        unput( yycopy[i] );
    unput( '(' );
    free( yycopy );
    }
@end verbatim
@end example

Note that since each @code{unput()} puts the given character back at the
@emph{beginning} of the input stream, pushing back strings must be done
back-to-front.

@cindex %pointer, and unput()
@cindex unput(), and %pointer
An important potential problem when using @code{unput()} is that if you
are using @code{%pointer} (the default), a call to @code{unput()}
@emph{destroys} the contents of @code{yytext}, starting with its
rightmost character and devouring one character to the left with each
call. If you need the value of @code{yytext} preserved after a call to
@code{unput()} (as in the above example), you must either first copy it
elsewhere, or build your scanner using @code{%array} instead
(@pxref{Matching}).

@cindex pushing back EOF
@cindex EOF, pushing back
Finally, note that you cannot put back @samp{EOF} to attempt to mark the
input stream with an end-of-file.

@cindex input()
@code{input()} reads the next character from the input stream.
For example, the following is one way to eat up C comments:

@cindex comments, discarding
@cindex discarding C comments
@example
@verbatim
    %%
    "/*"        {
                register int c;

                for ( ; ; )
                    {
                    while ( (c = input()) != '*' &&
                            c != EOF )
                        ;    /* eat up text of comment */

                    if ( c == '*' )
                        {
                        while ( (c = input()) == '*' )
                            ;
                        if ( c == '/' )
                            break;    /* found the end */
                        }

                    if ( c == EOF )
                        {
                        error( "EOF in comment" );
                        break;
                        }
                    }
                }
@end verbatim
@end example

@cindex input(), and C++
@cindex yyinput()
(Note that if the scanner is compiled using @code{C++}, then
@code{input()} is instead referred to as @code{yyinput()}, in order to
avoid a name clash with the @code{C++} stream by the name of
@code{input}.)

@cindex flushing the internal buffer
@cindex YY_FLUSH_BUFFER
@code{YY_FLUSH_BUFFER;} flushes the scanner's internal buffer so that
the next time the scanner attempts to match a token, it will first
refill the buffer using @code{YY_INPUT()} (@pxref{Generated Scanner}).
This action is a special case of the more general
@code{yy_flush_buffer()} function, described below (@pxref{Multiple
Input Buffers}).

@cindex yyterminate()
@cindex terminating with yyterminate()
@cindex exiting with yyterminate()
@cindex halting with yyterminate()
@code{yyterminate()} can be used in lieu of a return statement in an
action. It terminates the scanner and returns a 0 to the scanner's
caller, indicating ``all done''. By default, @code{yyterminate()} is
also called when an end-of-file is encountered. It is a macro and may
be redefined.
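
As a small sketch of @code{yyterminate()} in an action, the following
scanner copies its input to its output until it sees the word
@samp{quit} followed by a newline. The choice of @samp{quit} as a
stopping marker is an assumption made purely for illustration:

@example
@verbatim
    %option noyywrap
    %%
    quit\n      yyterminate();   /* stop here; yylex() returns 0 */
    .|\n        ECHO;
    %%
    int main( void )
        {
        yylex();
        return 0;
        }
@end verbatim
@end example

Because @samp{quit\n} is a longer match than the single character
matched by the second rule, the first rule is chosen whenever both
apply. Any input after the marker is left unscanned.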

@node Generated Scanner, Start Conditions, Actions, Top
@chapter The Generated Scanner

@cindex yylex(), in generated scanner
The output of @code{flex} is the file @file{lex.yy.c}, which contains
the scanning routine @code{yylex()}, a number of tables used by it for
matching tokens, and a number of auxiliary routines and macros. By
default, @code{yylex()} is declared as follows:

@example
@verbatim
    int yylex()
        {
        ... various definitions and the actions in here ...
        }
@end verbatim
@end example

@cindex yylex(), overriding
(If your environment supports function prototypes, then it will be
@code{int yylex( void )}.) This definition may be changed by defining
the @code{YY_DECL} macro. For example, you could use:

@cindex yylex, overriding the prototype of
@example
@verbatim
    #define YY_DECL float lexscan( a, b ) float a, b;
@end verbatim
@end example

to give the scanning routine the name @code{lexscan}, returning a float,
and taking two floats as arguments. Note that if you give arguments to
the scanning routine using a K&R-style/non-prototyped function
declaration, you must terminate the definition with a semicolon (@samp{;}).

@code{flex} generates @samp{C99} function definitions by
default. However, @code{flex} does have the ability to generate obsolete, er,
@samp{traditional}, function definitions. This is to support
bootstrapping gcc on old systems. Unfortunately, traditional
definitions prevent us from using any standard data types smaller than
int (such as short, char, or bool) as function arguments. For this
reason, future versions of @code{flex} may generate standard C99 code
only, leaving K&R-style functions to the historians. Currently, if you
do @strong{not} want @samp{C99} definitions, then you must use
@code{%option noansi-definitions}.
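
With prototyped definitions, a @code{YY_DECL} override might look like
the sketch below. The name @code{lexscan} follows the example above; the
@code{verbose} parameter and the rules are hypothetical and serve only
to show that the arguments are visible inside the actions:

@example
@verbatim
    %option noyywrap
    %{
    #include <stdio.h>

    /* hypothetical prototyped replacement for the default yylex() */
    #define YY_DECL int lexscan( int verbose )
    %}
    %%
    [0-9]+      {
                if ( verbose )
                    printf( "number: %s\n", yytext );
                return 1;
                }
    .|\n        /* skip everything else */
    %%
    int main( void )
        {
        /* lexscan() returns 0 at end of input, like yylex() */
        while ( lexscan( 1 ) != 0 )
            ;
        return 0;
        }
@end verbatim
@end example

Note that no trailing semicolon is needed in this form, since the
parameter is declared inside the parentheses.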

@cindex stdin, default for yyin
@cindex yyin
Whenever @code{yylex()} is called, it scans tokens from the global input
file @file{yyin} (which defaults to stdin). It continues until it
either reaches an end-of-file (at which point it returns the value 0) or
one of its actions executes a @code{return} statement.

@cindex EOF and yyrestart()
@cindex end-of-file, and yyrestart()
@cindex yyrestart()
If the scanner reaches an end-of-file, subsequent calls are undefined
unless either @file{yyin} is pointed at a new input file (in which case
scanning continues from that file), or @code{yyrestart()} is called.
@code{yyrestart()} takes one argument, a @code{FILE *} pointer (which
can be NULL, if you've set up @code{YY_INPUT} to scan from a source other
than @code{yyin}), and initializes @file{yyin} for scanning from that
file. Essentially there is no difference between just assigning
@file{yyin} to a new input file or using @code{yyrestart()} to do so;
the latter is available for compatibility with previous versions of
@code{flex}, and because it can be used to switch input files in the
middle of scanning. It can also be used to throw away the current input
buffer, by calling it with an argument of @file{yyin}; but it would be
better to use @code{YY_FLUSH_BUFFER} (@pxref{Actions}). Note that
@code{yyrestart()} does @emph{not} reset the start condition to
@code{INITIAL} (@pxref{Start Conditions}).

@cindex RETURN, within actions
If @code{yylex()} stops scanning due to executing a @code{return}
statement in one of the actions, the scanner may then be called again
and it will resume scanning where it left off.

@cindex YY_INPUT
By default (and for purposes of efficiency), the scanner uses
block-reads rather than simple @code{getc()} calls to read characters
from @file{yyin}.
The nature of how it gets its input can be controlled
by defining the @code{YY_INPUT} macro. The calling sequence for
@code{YY_INPUT()} is @code{YY_INPUT(buf,result,max_size)}. Its action
is to place up to @code{max_size} characters in the character array
@code{buf} and return in the integer variable @code{result} either the
number of characters read or the constant @code{YY_NULL} (0 on Unix
systems) to indicate @samp{EOF}. The default @code{YY_INPUT} reads from
the global file-pointer @file{yyin}.

@cindex YY_INPUT, overriding
Here is a sample definition of @code{YY_INPUT} (in the definitions
section of the input file):

@example
@verbatim
    %{
    #define YY_INPUT(buf,result,max_size) \
        { \
        int c = getchar(); \
        result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
        }
    %}
@end verbatim
@end example

This definition will change the input processing to occur one character
at a time.

@cindex yywrap()
When the scanner receives an end-of-file indication from @code{YY_INPUT}, it
then checks the @code{yywrap()} function. If @code{yywrap()} returns
false (zero), then it is assumed that the function has gone ahead and
set up @file{yyin} to point to another input file, and scanning
continues. If it returns true (non-zero), then the scanner terminates,
returning 0 to its caller. Note that in either case, the start
condition remains unchanged; it does @emph{not} revert to
@code{INITIAL}.

@cindex yywrap, default for
@cindex noyywrap, %option
@cindex %option noyywrap
If you do not supply your own version of @code{yywrap()}, then you must
either use @code{%option noyywrap} (in which case the scanner behaves as
though @code{yywrap()} returned 1), or you must link with @samp{-lfl} to
obtain the default version of the routine, which always returns 1.
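
A user-supplied @code{yywrap()} that moves on to the next input file
could be sketched as follows. The variables @code{file_list},
@code{file_count} and @code{file_index} are hypothetical helpers
introduced for this example, and files that cannot be opened are
silently skipped to keep the sketch short:

@example
@verbatim
    %{
    #include <stdio.h>

    /* hypothetical bookkeeping for the files named on the command line */
    static char **file_list;
    static int file_count;
    static int file_index;
    %}
    %%
    .|\n        ECHO;
    %%
    /* Open the next file; return 0 to continue scanning, 1 when done. */
    int yywrap( void )
        {
        if ( yyin != NULL && yyin != stdin )
            fclose( yyin );

        while ( file_index < file_count )
            {
            yyin = fopen( file_list[file_index++], "r" );
            if ( yyin != NULL )
                return 0;    /* more input is available */
            }

        return 1;            /* no files left */
        }

    int main( int argc, char **argv )
        {
        file_list = argv + 1;
        file_count = argc - 1;

        /* reuse yywrap() to open the first file, if there is one */
        if ( yywrap() )
            return 0;

        yylex();
        return 0;
        }
@end verbatim
@end example

Run with several file names as arguments, this scanner behaves roughly
like @code{cat}: the files are scanned one after another as a single
input, and @code{yylex()} returns only after the last one is exhausted.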

For scanning from in-memory buffers (e.g., scanning strings), see
@ref{Scanning Strings}. @xref{Multiple Input Buffers}.

@cindex ECHO, and yyout
@cindex yyout
@cindex stdout, as default for yyout
The scanner writes its @code{ECHO} output to the @file{yyout} global
(default, @file{stdout}), which may be redefined by the user simply by
assigning it to some other @code{FILE} pointer.

@node Start Conditions, Multiple Input Buffers, Generated Scanner, Top
@chapter Start Conditions

@cindex start conditions
@code{flex} provides a mechanism for conditionally activating rules.
Any rule whose pattern is prefixed with @samp{<sc>} will only be active
when the scanner is in the @dfn{start condition} named @code{sc}. For
example,

@c proofread edit stopped here
@example
@verbatim
    <STRING>[^"]*        { /* eat up the string body ... */
                ...
                }
@end verbatim
@end example

will be active only when the scanner is in the @code{STRING} start
condition, and

@cindex start conditions, multiple
@example
@verbatim
    <INITIAL,STRING,QUOTE>\.        { /* handle an escape ... */
                ...
                }
@end verbatim
@end example

will be active only when the current start condition is one of
@code{INITIAL}, @code{STRING}, or @code{QUOTE}.

@cindex start conditions, inclusive vs.@: exclusive
Start conditions are declared in the definitions (first) section of the
input using unindented lines beginning with either @samp{%s} or
@samp{%x} followed by a list of names. The former declares
@dfn{inclusive} start conditions, the latter @dfn{exclusive} start
conditions. A start condition is activated using the @code{BEGIN}
action. Until the next @code{BEGIN} action is executed, rules with the
given start condition will be active and rules with other start
conditions will be inactive.
If the start condition is inclusive, then
rules with no start conditions at all will also be active. If it is
exclusive, then @emph{only} rules qualified with the start condition
will be active. A set of rules contingent on the same exclusive start
condition describes a scanner which is independent of any of the other
rules in the @code{flex} input. Because of this, exclusive start
conditions make it easy to specify ``mini-scanners'' which scan portions
of the input that are syntactically different from the rest (e.g.,
comments).

If the distinction between inclusive and exclusive start conditions
is still a little vague, here's a simple example illustrating the
connection between the two. The set of rules:

@cindex start conditions, inclusive
@example
@verbatim
    %s example
    %%

    <example>foo   do_something();

    bar            something_else();
@end verbatim
@end example

is equivalent to

@cindex start conditions, exclusive
@example
@verbatim
    %x example
    %%

    <example>foo   do_something();

    <INITIAL,example>bar    something_else();
@end verbatim
@end example

Without the @code{<INITIAL,example>} qualifier, the @code{bar} pattern in
the second example wouldn't be active (i.e., couldn't match) when in
start condition @code{example}. If we just used @code{<example>} to
qualify @code{bar}, though, then it would only be active in
@code{example} and not in @code{INITIAL}, while in the first example
it's active in both, because in the first example the @code{example}
start condition is an inclusive (@code{%s}) start condition.

@cindex start conditions, special wildcard condition
Also note that the special start-condition specifier
@code{<*>}
matches every start condition.
Thus, the above example could also
have been written:

@cindex start conditions, use of wildcard condition (<*>)
@example
@verbatim
    %x example
    %%

    <example>foo   do_something();

    <*>bar    something_else();
@end verbatim
@end example

The default rule (to @code{ECHO} any unmatched character) remains active
in start conditions. It is equivalent to:

@cindex start conditions, behavior of default rule
@example
@verbatim
    <*>.|\n     ECHO;
@end verbatim
@end example

@cindex BEGIN, explanation
@findex BEGIN
@vindex INITIAL
@code{BEGIN(0)} returns to the original state where only the rules with
no start conditions are active. This state can also be referred to as
the start-condition @code{INITIAL}, so @code{BEGIN(INITIAL)} is
equivalent to @code{BEGIN(0)}. (The parentheses around the start
condition name are not required but are considered good style.)

@code{BEGIN} actions can also be given as indented code at the beginning
of the rules section. For example, the following will cause the scanner
to enter the @code{SPECIAL} start condition whenever @code{yylex()} is
called and the global variable @code{enter_special} is true:

@cindex start conditions, using BEGIN
@example
@verbatim
            int enter_special;

    %x SPECIAL
    %%
            if ( enter_special )
                BEGIN(SPECIAL);

    <SPECIAL>blahblahblah
    ...more rules follow...
@end verbatim
@end example

To illustrate the uses of start conditions, here is a scanner which
provides two different interpretations of a string like @samp{123.456}.
By default it will treat it as three tokens, the integer @samp{123}, a
dot (@samp{.}), and the integer @samp{456}.
But if the string is
preceded earlier in the line by the string @samp{expect-floats} it will
treat it as a single token, the floating-point number @samp{123.456}:

@cindex start conditions, for different interpretations of same input
@example
@verbatim
    %{
    #include <math.h>
    #include <stdlib.h>   /* atof(), atoi() */
    %}
    %s expect

    %%
    expect-floats        BEGIN(expect);

    <expect>[0-9]+"."[0-9]+      {
                printf( "found a float, = %f\n",
                        atof( yytext ) );
                }
    <expect>\n           {
                /* that's the end of the line, so
                 * we need another "expect-floats"
                 * before we'll recognize any more
                 * numbers
                 */
                BEGIN(INITIAL);
                }

    [0-9]+      {
                printf( "found an integer, = %d\n",
                        atoi( yytext ) );
                }

    "."         printf( "found a dot\n" );
@end verbatim
@end example

@cindex comments, example of scanning C comments
Here is a scanner which recognizes (and discards) C comments while
maintaining a count of the current input line.

@cindex recognizing C comments
@example
@verbatim
    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\n]*        /* eat anything that's not a '*' */
    <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
    <comment>\n             ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);
@end verbatim
@end example

This scanner goes to a bit of trouble to match as much
text as possible with each rule. In general, when attempting to write
a high-speed scanner try to match as much as possible in each rule, as
it's a big win.

Note that start condition names are really integer values and
can be stored as such.
Thus, the above could be extended in the
following fashion:

@cindex start conditions, integer values
@cindex using integer values of start condition names
@example
@verbatim
    %x comment foo
    %%
            int line_num = 1;
            int comment_caller;

    "/*"         {
                 comment_caller = INITIAL;
                 BEGIN(comment);
                 }

    ...

    <foo>"/*"    {
                 comment_caller = foo;
                 BEGIN(comment);
                 }

    <comment>[^*\n]*        /* eat anything that's not a '*' */
    <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
    <comment>\n             ++line_num;
    <comment>"*"+"/"        BEGIN(comment_caller);
@end verbatim
@end example

@cindex YY_START, example
Furthermore, you can access the current start condition using the
integer-valued @code{YY_START} macro. For example, the above
assignments to @code{comment_caller} could instead be written

@cindex getting current start state with YY_START
@example
@verbatim
    comment_caller = YY_START;
@end verbatim
@end example

@vindex YY_START
Flex provides @code{YYSTATE} as an alias for @code{YY_START} (since that
is what's used by AT&T @code{lex}).

For historical reasons, start conditions do not have their own
name-space within the generated scanner. The start condition names are
unmodified in the generated scanner and generated header.
@xref{option-header}. @xref{option-prefix}.

Finally, here's an example of how to match C-style quoted strings using
exclusive start conditions, including expanded escape sequences (but
not including checking for a string that's too long):

@cindex matching C-style double-quoted strings
@example
@verbatim
    %x str

    %%
            char string_buf[MAX_STR_CONST];
            char *string_buf_ptr;


    \"      string_buf_ptr = string_buf; BEGIN(str);

    <str>\"        { /* saw closing quote - all done */
            BEGIN(INITIAL);
            *string_buf_ptr = '\0';
            /* return string constant token type and
             * value to parser
             */
            }

    <str>\n        {
            /* error - unterminated string constant */
            /* generate error message */
            }

    <str>\\[0-7]{1,3} {
            /* octal escape sequence */
            unsigned int result;    /* "%o" expects an unsigned int */

            (void) sscanf( yytext + 1, "%o", &result );

            if ( result > 0xff )
                {
                /* error, constant is out-of-bounds */
                }

            *string_buf_ptr++ = result;
            }

    <str>\\[0-9]+ {
            /* generate error - bad escape sequence; something
             * like '\48' or '\0777777'
             */
            }

    <str>\\n  *string_buf_ptr++ = '\n';
    <str>\\t  *string_buf_ptr++ = '\t';
    <str>\\r  *string_buf_ptr++ = '\r';
    <str>\\b  *string_buf_ptr++ = '\b';
    <str>\\f  *string_buf_ptr++ = '\f';

    <str>\\(.|\n)  *string_buf_ptr++ = yytext[1];

    <str>[^\\\n\"]+        {
            char *yptr = yytext;

            while ( *yptr )
                    *string_buf_ptr++ = *yptr++;
            }
@end verbatim
@end example

@cindex start condition, applying to multiple patterns
Often, such as in some of the examples above, you wind up writing a
whole bunch of rules all preceded by the same start condition(s). Flex
makes this a little easier and cleaner by introducing a notion of start
condition @dfn{scope}.
A start condition scope is begun with:

@example
@verbatim
    <SCs>{
@end verbatim
@end example

where @code{SCs} is a list of one or more start conditions. Inside the
start condition scope, every rule automatically has the prefix
@code{<SCs>} applied to it, until a @samp{@}} which matches the initial
@samp{@{}. So, for example,

@cindex extended scope of start conditions
@example
@verbatim
    <ESC>{
        "\\n"   return '\n';
        "\\r"   return '\r';
        "\\f"   return '\f';
        "\\0"   return '\0';
    }
@end verbatim
@end example

is equivalent to:

@example
@verbatim
    <ESC>"\\n"  return '\n';
    <ESC>"\\r"  return '\r';
    <ESC>"\\f"  return '\f';
    <ESC>"\\0"  return '\0';
@end verbatim
@end example

Start condition scopes may be nested.

@cindex stacks, routines for manipulating
@cindex start conditions, use of a stack

The following routines are available for manipulating stacks of start conditions:

@deftypefun void yy_push_state ( int @code{new_state} )
pushes the current start condition onto the top of the start condition
stack and switches to
@code{new_state}
as though you had used
@code{BEGIN new_state}
(recall that start condition names are also integers).
@end deftypefun

@deftypefun void yy_pop_state ()
pops the top of the stack and switches to it via
@code{BEGIN}.
@end deftypefun

@deftypefun int yy_top_state ()
returns the top of the stack without altering the stack's contents.
@end deftypefun

@cindex memory, for start condition stacks
The start condition stack grows dynamically and so has no built-in size
limitation. If memory is exhausted, program execution aborts.

To use start condition stacks, your scanner must include a @code{%option
stack} directive (@pxref{Scanner Options}).
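
The stack discipline described above can be pictured with the following
standalone C model. It is only a sketch under stated assumptions: the
names @code{sc_current}, @code{sc_push_state()}, @code{sc_pop_state()}
and @code{sc_top_state()} are hypothetical stand-ins, and the code is not
taken from a generated scanner. Aborting on an empty stack is a choice
made for this sketch.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the scanner's current start condition. */
static int sc_current = 0;

/* Dynamically grown stack of saved start conditions. */
static int *sc_stack = NULL;
static size_t sc_depth = 0;
static size_t sc_capacity = 0;

/* Save the current condition, then switch to new_state. */
void sc_push_state(int new_state)
{
    if (sc_depth == sc_capacity) {
        size_t cap = sc_capacity ? sc_capacity * 2 : 8;
        int *grown = realloc(sc_stack, cap * sizeof *grown);
        if (grown == NULL)
            abort();            /* out of memory: give up, as the text says */
        sc_stack = grown;
        sc_capacity = cap;
    }
    sc_stack[sc_depth++] = sc_current;
    sc_current = new_state;
}

/* Switch back to the most recently saved condition. */
void sc_pop_state(void)
{
    if (sc_depth == 0)
        abort();                /* sketch: popping an empty stack is fatal */
    sc_current = sc_stack[--sc_depth];
}

/* Report the most recently saved condition without changing anything. */
int sc_top_state(void)
{
    if (sc_depth == 0)
        abort();                /* sketch: nothing has been saved */
    return sc_stack[sc_depth - 1];
}
```

Note that in this model the top of the stack is the condition that was
active @emph{before} the last push, not the condition currently in
effect.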

@node Multiple Input Buffers, EOF, Start Conditions, Top
@chapter Multiple Input Buffers

@cindex multiple input streams
Some scanners (such as those which support ``include'' files) require
reading from several input streams. As @code{flex} scanners do a large
amount of buffering, one cannot control where the next input will be
read from by simply writing a @code{YY_INPUT()} which is sensitive to
the scanning context. @code{YY_INPUT()} is only called when the scanner
reaches the end of its buffer, which may be a long time after scanning a
statement such as an @code{include} statement which requires switching
the input source.

To negotiate these sorts of problems, @code{flex} provides a mechanism
for creating and switching between multiple input buffers. An input
buffer is created by using:

@cindex memory, allocating input buffers
@deftypefun YY_BUFFER_STATE yy_create_buffer ( FILE *file, int size )
@end deftypefun

which takes a @code{FILE} pointer and a size and creates a buffer
associated with the given file and large enough to hold @code{size}
characters (when in doubt, use @code{YY_BUF_SIZE} for the size). It
returns a @code{YY_BUFFER_STATE} handle, which may then be passed to
other routines (see below).
@tindex YY_BUFFER_STATE
The @code{YY_BUFFER_STATE} type is a
pointer to an opaque @code{struct yy_buffer_state} structure, so you may
safely initialize @code{YY_BUFFER_STATE} variables to @code{((YY_BUFFER_STATE)
0)} if you wish, and also refer to the opaque structure in order to
correctly declare input buffers in source files other than that of your
scanner. Note that the @code{FILE} pointer in the call to
@code{yy_create_buffer} is only used as the value of @file{yyin} seen by
@code{YY_INPUT}.
If you redefine @code{YY_INPUT()} so it no longer uses
@file{yyin}, then you can safely pass a NULL @code{FILE} pointer to
@code{yy_create_buffer}. You select a particular buffer to scan from
using:

@deftypefun void yy_switch_to_buffer ( YY_BUFFER_STATE new_buffer )
@end deftypefun

The above function switches the scanner's input buffer so subsequent tokens
will come from @code{new_buffer}. Note that @code{yy_switch_to_buffer()} may
be used by @code{yywrap()} to set things up for continued scanning, instead of
opening a new file and pointing @file{yyin} at it. If you are looking for a
stack of input buffers, then you want to use @code{yypush_buffer_state()}
instead of this function. Note also that switching input sources via either
@code{yy_switch_to_buffer()} or @code{yywrap()} does @emph{not} change the
start condition.

@cindex memory, deleting input buffers
@deftypefun void yy_delete_buffer ( YY_BUFFER_STATE buffer )
@end deftypefun

is used to reclaim the storage associated with a buffer. (@code{buffer}
can be NULL, in which case the routine does nothing.) You can also clear
the current contents of a buffer with @code{yy_flush_buffer()}, described
below.

@cindex pushing an input buffer
@cindex stack, input buffer push
@deftypefun void yypush_buffer_state ( YY_BUFFER_STATE buffer )
@end deftypefun

This function pushes the new buffer state onto an internal stack. The pushed
state becomes the new current state. The stack is maintained by flex and will
grow as required. This function is intended to be used instead of
@code{yy_switch_to_buffer}, when you want to change states, but preserve the
current state for later use.

@cindex popping an input buffer
@cindex stack, input buffer pop
@deftypefun void yypop_buffer_state ( )
@end deftypefun

This function removes the current state from the top of the stack, and deletes
it by calling @code{yy_delete_buffer}. The next state on the stack, if any,
becomes the new current state.

@cindex clearing an input buffer
@cindex flushing an input buffer
@deftypefun void yy_flush_buffer ( YY_BUFFER_STATE buffer )
@end deftypefun

This function discards the buffer's contents,
so the next time the scanner attempts to match a token from the
buffer, it will first fill the buffer anew using
@code{YY_INPUT()}.

@deftypefun YY_BUFFER_STATE yy_new_buffer ( FILE *file, int size )
@end deftypefun

is an alias for @code{yy_create_buffer()},
provided for compatibility with the C++ use of @code{new} and
@code{delete} for creating and destroying dynamic objects.

@cindex YY_CURRENT_BUFFER, and multiple buffers
Finally, the @code{YY_CURRENT_BUFFER} macro returns a
@code{YY_BUFFER_STATE} handle to the
current buffer. It should not be used as an lvalue.

@cindex EOF, example using multiple input buffers
Here are two examples of using these features for writing a scanner
which expands include files (the
@code{<<EOF>>}
feature is discussed below).

This first example uses @code{yypush_buffer_state} and
@code{yypop_buffer_state}. Flex maintains the stack internally.

@cindex handling include files with multiple input buffers
@example
@verbatim
    /* the "incl" state is used for picking up the name
     * of an include file
     */
    %x incl
    %%
    include             BEGIN(incl);

    [a-z]+              ECHO;
    [^a-z\n]*\n?        ECHO;

    <incl>[ \t]*      /* eat the whitespace */
    <incl>[^ \t\n]+   { /* got the include file name */
            yyin = fopen( yytext, "r" );

            if ( ! yyin )
                error( ...
                );

            yypush_buffer_state(yy_create_buffer( yyin, YY_BUF_SIZE ));

            BEGIN(INITIAL);
            }

    <<EOF>> {
            yypop_buffer_state();

            if ( !YY_CURRENT_BUFFER )
                {
                yyterminate();
                }
            }
@end verbatim
@end example

The second example, below, does the same thing as the previous example did, but
manages its own input buffer stack manually (instead of letting flex do it).

@cindex handling include files with multiple input buffers
@example
@verbatim
    /* the "incl" state is used for picking up the name
     * of an include file
     */
    %x incl

    %{
    #define MAX_INCLUDE_DEPTH 10
    YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
    int include_stack_ptr = 0;
    %}

    %%
    include             BEGIN(incl);

    [a-z]+              ECHO;
    [^a-z\n]*\n?        ECHO;

    <incl>[ \t]*      /* eat the whitespace */
    <incl>[^ \t\n]+   { /* got the include file name */
            if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
                {
                fprintf( stderr, "Includes nested too deeply" );
                exit( 1 );
                }

            include_stack[include_stack_ptr++] =
                YY_CURRENT_BUFFER;

            yyin = fopen( yytext, "r" );

            if ( ! yyin )
                error( ... );

            yy_switch_to_buffer(
                yy_create_buffer( yyin, YY_BUF_SIZE ) );

            BEGIN(INITIAL);
            }

    <<EOF>> {
            if ( --include_stack_ptr < 0 )
                {
                yyterminate();
                }

            else
                {
                yy_delete_buffer( YY_CURRENT_BUFFER );
                yy_switch_to_buffer(
                     include_stack[include_stack_ptr] );
                }
            }
@end verbatim
@end example

@anchor{Scanning Strings}
@cindex strings, scanning strings instead of files
The following routines are available for setting up input buffers for
scanning in-memory strings instead of files.
All of them create a new
input buffer for scanning the string, and return a corresponding
@code{YY_BUFFER_STATE} handle (which you should delete with
@code{yy_delete_buffer()} when done with it). They also switch to the
new buffer using @code{yy_switch_to_buffer()}, so the next call to
@code{yylex()} will start scanning the string.

@deftypefun YY_BUFFER_STATE yy_scan_string ( const char *str )
scans a NUL-terminated string.
@end deftypefun

@deftypefun YY_BUFFER_STATE yy_scan_bytes ( const char *bytes, int len )
scans @code{len} bytes (including possibly @code{NUL}s) starting at location
@code{bytes}.
@end deftypefun

Note that both of these functions create and scan a @emph{copy} of the
string or bytes. (This may be desirable, since @code{yylex()} modifies
the contents of the buffer it is scanning.) You can avoid the copy by
using:

@vindex YY_END_OF_BUFFER_CHAR
@deftypefun YY_BUFFER_STATE yy_scan_buffer (char *base, yy_size_t size)
which scans in place the buffer starting at @code{base}, consisting of
@code{size} bytes, the last two bytes of which @emph{must} be
@code{YY_END_OF_BUFFER_CHAR} (ASCII NUL). These last two bytes are not
scanned; thus, scanning consists of @code{base[0]} through
@code{base[size-2]}, inclusive.
@end deftypefun

If you fail to set up @code{base} in this manner (i.e., forget the final
two @code{YY_END_OF_BUFFER_CHAR} bytes), then @code{yy_scan_buffer()}
returns a NULL pointer instead of creating a new input buffer.

@deftp {Data type} yy_size_t
is an integral type to which you can cast an integer expression
reflecting the size of the buffer.
@end deftp

@node EOF, Misc Macros, Multiple Input Buffers, Top
@chapter End-of-File Rules

@cindex EOF, explanation
The special rule @code{<<EOF>>} indicates
actions which are to be taken when an end-of-file is
encountered and @code{yywrap()} returns non-zero (i.e., indicates
no further files to process). The action must finish
by doing one of the following things:

@itemize
@item
@findex YY_NEW_FILE (now obsolete)
assigning @file{yyin} to a new input file (in previous versions of
@code{flex}, after doing the assignment you had to call the special
action @code{YY_NEW_FILE}. This is no longer necessary.)

@item
executing a @code{return} statement;

@item
executing the special @code{yyterminate()} action.

@item
or, switching to a new buffer using @code{yy_switch_to_buffer()} as
shown in the example above.
@end itemize

@code{<<EOF>>} rules may not be used with other patterns; they may only be
qualified with a list of start conditions. If an unqualified @code{<<EOF>>}
rule is given, it applies to @emph{all} start conditions which do not
already have @code{<<EOF>>} actions. To specify an @code{<<EOF>>} rule for only the
initial start condition, use:

@example
@verbatim
    <INITIAL><<EOF>>
@end verbatim
@end example

These rules are useful for catching things like unclosed comments. An
example:

@cindex <<EOF>>, use of
@example
@verbatim
    %x quote
    %%

    ...other rules for dealing with quotes...

    <quote><<EOF>>   {
             error( "unterminated quote" );
             yyterminate();
             }
    <<EOF>>  {
             if ( *++filelist )
                 yyin = fopen( *filelist, "r" );
             else
                yyterminate();
             }
@end verbatim
@end example

@node Misc Macros, User Values, EOF, Top
@chapter Miscellaneous Macros

@hkindex YY_USER_ACTION
The macro @code{YY_USER_ACTION} can be defined to provide an action
which is always executed prior to the matched rule's action. For
example, it could be #define'd to call a routine to convert yytext to
lower-case. When @code{YY_USER_ACTION} is invoked, the variable
@code{yy_act} gives the number of the matched rule (rules are numbered
starting with 1). Suppose you want to profile how often each of your
rules is matched. The following would do the trick:

@cindex YY_USER_ACTION to track each time a rule is matched
@example
@verbatim
    #define YY_USER_ACTION ++ctr[yy_act]
@end verbatim
@end example

@vindex YY_NUM_RULES
where @code{ctr} is an array to hold the counts for the different rules.
Note that the macro @code{YY_NUM_RULES} gives the total number of rules
(including the default rule), even if you use @samp{-s}, so a correct
declaration for @code{ctr} is:

@example
@verbatim
    int ctr[YY_NUM_RULES];
@end verbatim
@end example

@hkindex YY_USER_INIT
The macro @code{YY_USER_INIT} may be defined to provide an action which
is always executed before the first scan (and before the scanner's
internal initializations are done). For example, it could be used to
call a routine to read in a data table or open a logging file.

@findex yy_set_interactive
The macro @code{yy_set_interactive(is_interactive)} can be used to
control whether the current buffer is considered @dfn{interactive}.
An interactive buffer is processed more slowly, but must be used when the
scanner's input source is indeed interactive to avoid problems due to
waiting to fill buffers (see the discussion of the @samp{-I} flag in
@ref{Scanner Options}). A non-zero value in the macro invocation marks
the buffer as interactive, a zero value as non-interactive. Note that
use of this macro overrides @code{%option always-interactive} or
@code{%option never-interactive} (@pxref{Scanner Options}).
@code{yy_set_interactive()} must be invoked prior to beginning to scan
the buffer that is (or is not) to be considered interactive.

@cindex BOL, setting it
@findex yy_set_bol
The macro @code{yy_set_bol(at_bol)} can be used to control whether the
current buffer's scanning context for the next token match is done as
though at the beginning of a line. A non-zero macro argument makes
rules anchored with @samp{^} active, while a zero argument makes
@samp{^} rules inactive.

@cindex BOL, checking the BOL flag
@findex YY_AT_BOL
The macro @code{YY_AT_BOL()} returns true if the next token scanned from
the current buffer will have @samp{^} rules active, false otherwise.

@cindex actions, redefining YY_BREAK
@hkindex YY_BREAK
In the generated scanner, the actions are all gathered in one large
switch statement and separated using @code{YY_BREAK}, which may be
redefined. By default, it is simply a @code{break}, to separate each
rule's action from the following rule's. Redefining @code{YY_BREAK}
allows, for example, C++ users to #define YY_BREAK to do nothing (while
being very careful that every rule ends with a @code{break} or a
@code{return}!). This avoids unreachable-statement warnings in the cases
where a rule's action ends with @code{return}, which leaves the
@code{YY_BREAK} that follows it unreachable.
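
The effect of an empty @code{YY_BREAK} can be pictured with the
following standalone C model of such a switch statement. It is a sketch
only: @code{MODEL_BREAK}, @code{run_action()} and the token codes 257
and 258 are hypothetical names and values chosen for illustration, and
the layout is not copied from a generated scanner.

```c
/* Hypothetical stand-in for YY_BREAK, defined to nothing as a C++ user
 * might do.  Each action below must therefore end its own case. */
#define MODEL_BREAK

/* Runs the action of rule number act.  Returns a token code, or 0 when
 * the rule's action produces no token. */
int run_action(int act)
{
    int token = 0;

    switch (act) {
    case 1:
        return 257;     /* ends with return: a break here would be unreachable */
        MODEL_BREAK
    case 2:
        token = 0;      /* e.g. whitespace that yields no token */
        break;          /* written by hand, or control falls into case 3 */
        MODEL_BREAK
    case 3:
        return 258;
        MODEL_BREAK
    default:
        break;
    }
    return token;
}
```

With the separator defined as @code{break;} instead, the hand-written
@code{break} in rule 2 would be redundant and the separators after
rules 1 and 3 would be the unreachable statements that some compilers
warn about.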

@node User Values, Yacc, Misc Macros, Top
@chapter Values Available To the User

This chapter summarizes the various values available to the user in the
rule actions.

@table @code
@vindex yytext
@item char *yytext
holds the text of the current token. It may be modified but not
lengthened (you cannot append characters to the end).

@cindex yytext, default array size
@cindex array, default size for yytext
@vindex YYLMAX
If the special directive @code{%array} appears in the first section of
the scanner description, then @code{yytext} is instead declared
@code{char yytext[YYLMAX]}, where @code{YYLMAX} is a macro definition
that you can redefine in the first section if you don't like the default
value (generally 8KB). Using @code{%array} results in somewhat slower
scanners, but the value of @code{yytext} becomes immune to calls to
@code{unput()}, which potentially destroy its value when @code{yytext} is
a character pointer. The opposite of @code{%array} is @code{%pointer},
which is the default.

@cindex C++ and %array
You cannot use @code{%array} when generating C++ scanner classes (the
@samp{-+} flag).

@vindex yyleng
@item int yyleng
holds the length of the current token.

@vindex yyin
@item FILE *yyin
is the file which by default @code{flex} reads from. It may be
redefined but doing so only makes sense before scanning begins or after
an EOF has been encountered. Changing it in the midst of scanning will
have unexpected results since @code{flex} buffers its input; use
@code{yyrestart()} instead. Once scanning terminates because an
end-of-file has been seen, you can point @file{yyin} at the new input
file and then call the scanner again to continue scanning.

@findex yyrestart
@item void yyrestart( FILE *new_file )
may be called to point @file{yyin} at the new input file.
The switch-over to the new file is immediate (any previously buffered-up
input is lost). Note that calling @code{yyrestart()} with @file{yyin}
as an argument thus throws away the current input buffer and continues
scanning the same input file.

@vindex yyout
@item FILE *yyout
is the file to which @code{ECHO} actions are done. It can be reassigned
by the user.

@vindex YY_CURRENT_BUFFER
@item YY_CURRENT_BUFFER
returns a @code{YY_BUFFER_STATE} handle to the current buffer.

@vindex YY_START
@item YY_START
returns an integer value corresponding to the current start condition.
You can subsequently use this value with @code{BEGIN} to return to that
start condition.
@end table

@node Yacc, Scanner Options, User Values, Top
@chapter Interfacing with Yacc

@cindex yacc, interface

@vindex yylval, with yacc
One of the main uses of @code{flex} is as a companion to the @code{yacc}
parser-generator. @code{yacc} parsers expect to call a routine named
@code{yylex()} to find the next input token. The routine is supposed to
return the type of the next token as well as putting any associated
value in the global @code{yylval}. To use @code{flex} with @code{yacc},
one specifies the @samp{-d} option to @code{yacc} to instruct it to
generate the file @file{y.tab.h} containing definitions of all the
@code{%token}s appearing in the @code{yacc} input. This file is then
included in the @code{flex} scanner.
For example, if one of the tokens
is @code{TOK_NUMBER}, part of the scanner might look like:

@cindex yacc interface
@example
@verbatim
    %{
    #include "y.tab.h"
    %}

    %%

    [0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;
@end verbatim
@end example

@node Scanner Options, Performance, Yacc, Top
@chapter Scanner Options

@cindex command-line options
@cindex options, command-line
@cindex arguments, command-line

The various @code{flex} options are categorized by function in the following
menu. If you want to look up a particular option by name, see
@ref{Index of Scanner Options}.

@menu
* Options for Specifying Filenames::
* Options Affecting Scanner Behavior::
* Code-Level And API Options::
* Options for Scanner Speed and Size::
* Debugging Options::
* Miscellaneous Options::
@end menu

Even though there are many scanner options, a typical scanner might only
specify the following options:

@example
@verbatim
%option   8bit reentrant bison-bridge
%option   warn nodefault
%option   yylineno
%option   outfile="scanner.c" header-file="scanner.h"
@end verbatim
@end example

The first line specifies the general type of scanner we want. The second line
specifies that we are being careful. The third line asks flex to track line
numbers. The last line tells flex what to name the files. (The options can be
specified in any order. We just divided them.)

@code{flex} also provides a mechanism for controlling options within the
scanner specification itself, rather than from the flex command-line.
This is done by including @code{%option} directives in the first section
of the scanner specification. You can specify multiple options with a
single @code{%option} directive, and multiple directives in the first
section of your flex input file.

Most options are given simply as names, optionally preceded by the
word @samp{no} (with no intervening whitespace) to negate their meaning.
The names are the same as their long-option equivalents (but without the
leading @samp{--} ).

@code{flex} scans your rule actions to determine whether you use the
@code{REJECT} or @code{yymore()} features. The @code{reject} and
@code{yymore} options are available to override its decision as to
whether you use these features, either by setting them (e.g., @code{%option
reject}) to indicate the feature is indeed used, or unsetting them to
indicate it actually is not used (e.g., @code{%option noyymore}).


A number of options are available for lint purists who want to suppress
the appearance of unneeded routines in the generated scanner. Each of
the following, if unset (e.g., @code{%option nounput}), results in the
corresponding routine not appearing in the generated scanner:

@example
@verbatim
    input, unput
    yy_push_state, yy_pop_state, yy_top_state
    yy_scan_buffer, yy_scan_bytes, yy_scan_string

    yyget_extra, yyset_extra, yyget_leng, yyget_text,
    yyget_lineno, yyset_lineno, yyget_in, yyset_in,
    yyget_out, yyset_out, yyget_lval, yyset_lval,
    yyget_lloc, yyset_lloc, yyget_debug, yyset_debug
@end verbatim
@end example

(though @code{yy_push_state()} and friends won't appear anyway unless
you use @code{%option stack}).

@node Options for Specifying Filenames, Options Affecting Scanner Behavior, Scanner Options, Scanner Options
@section Options for Specifying Filenames

@table @samp

@anchor{option-header}
@opindex ---header-file
@opindex header-file
@item --header-file=FILE, @code{%option header-file="FILE"}
instructs flex to write a C header to @file{FILE}. This file contains
function prototypes, extern variables, and types used by the scanner.
Only the external API is exported by the header file. Many macros that
are usable from within scanner actions are not exported to the header
file. This is due to namespace problems and the goal of a clean
external API.

While in the header, the macro @code{yyIN_HEADER} is defined, where @samp{yy}
is substituted with the appropriate prefix.

The @samp{--header-file} option is not compatible with the @samp{--c++} option,
since the C++ scanner provides its own header in @file{yyFlexLexer.h}.



@anchor{option-outfile}
@opindex -o
@opindex ---outfile
@opindex outfile
@item -oFILE, --outfile=FILE, @code{%option outfile="FILE"}
directs flex to write the scanner to the file @file{FILE} instead of
@file{lex.yy.c}. If you combine @samp{--outfile} with the @samp{--stdout} option,
then the scanner is written to @file{stdout} but its @code{#line}
directives (see the @samp{-L} option) refer to the file
@file{FILE}.



@anchor{option-stdout}
@opindex -t
@opindex ---stdout
@opindex stdout
@item -t, --stdout, @code{%option stdout}
instructs @code{flex} to write the scanner it generates to standard
output instead of @file{lex.yy.c}.



@opindex ---skel
@item -SFILE, --skel=FILE
overrides the default skeleton file from which
@code{flex}
constructs its scanners. You'll never need this option unless you are doing
@code{flex}
maintenance or development.

@opindex ---tables-file
@opindex tables-file
@item --tables-file=FILE
Write serialized scanner DFA tables to @file{FILE}. The generated scanner will not
contain the tables, and requires them to be loaded at runtime.
@xref{serialization}.

@opindex ---tables-verify
@opindex tables-verify
@item --tables-verify
This option is for flex development.
We document it here in case you stumble
upon it by accident or in case you suspect some inconsistency in the serialized
tables. Flex will serialize the scanner DFA tables but will also generate the
in-code tables as it normally does. At runtime, the scanner will verify that
the serialized tables match the in-code tables, instead of loading them.

@end table

@node Options Affecting Scanner Behavior, Code-Level And API Options, Options for Specifying Filenames, Scanner Options
@section Options Affecting Scanner Behavior

@table @samp
@anchor{option-case-insensitive}
@opindex -i
@opindex ---case-insensitive
@opindex case-insensitive
@item -i, --case-insensitive, @code{%option case-insensitive}
instructs @code{flex} to generate a @dfn{case-insensitive} scanner. The
case of letters given in the @code{flex} input patterns will be ignored,
and tokens in the input will be matched regardless of case. The matched
text given in @code{yytext} will have the preserved case (i.e., it will
not be folded). For tricky behavior, see @ref{case and character ranges}.



@anchor{option-lex-compat}
@opindex -l
@opindex ---lex-compat
@opindex lex-compat
@item -l, --lex-compat, @code{%option lex-compat}
turns on maximum compatibility with the original AT&T @code{lex}
implementation. Note that this does not mean @emph{full} compatibility.
Use of this option costs a considerable amount of performance, and it
cannot be used with the @samp{--c++}, @samp{--full}, @samp{--fast}, @samp{-Cf}, or
@samp{-CF} options. For details on the compatibilities it provides, see
@ref{Lex and Posix}. This option also results in the name
@code{YY_FLEX_LEX_COMPAT} being @code{#define}'d in the generated scanner.



@anchor{option-batch}
@opindex -B
@opindex ---batch
@opindex batch
@item -B, --batch, @code{%option batch}
instructs @code{flex} to generate a @dfn{batch} scanner, the opposite of
@emph{interactive} scanners generated by @samp{--interactive} (see below). In
general, you use @samp{-B} when you are @emph{certain} that your scanner
will never be used interactively, and you want to squeeze a
@emph{little} more performance out of it. If your goal is instead to
squeeze out a @emph{lot} more performance, you should be using the
@samp{-Cf} or @samp{-CF} options, which turn on @samp{--batch} automatically
anyway.



@anchor{option-interactive}
@opindex -I
@opindex ---interactive
@opindex interactive
@item -I, --interactive, @code{%option interactive}
instructs @code{flex} to generate an @i{interactive} scanner. An
interactive scanner is one that only looks ahead to decide what token
has been matched if it absolutely must. It turns out that always
looking one extra character ahead, even if the scanner has already seen
enough text to disambiguate the current token, is a bit faster than only
looking ahead when necessary. But scanners that always look ahead give
dreadful interactive performance; for example, when a user types a
newline, it is not recognized as a newline token until they enter
@emph{another} token, which often means typing in another whole line.

@code{flex} scanners default to @code{interactive} unless you use the
@samp{-Cf} or @samp{-CF} table-compression options
(@pxref{Performance}). That's because if you're looking for
high-performance you should be using one of these options, so if you
didn't, @code{flex} assumes you'd rather trade off a bit of run-time
performance for intuitive interactive behavior.
Note also that you
@emph{cannot} use @samp{--interactive} in conjunction with @samp{-Cf} or
@samp{-CF}. Thus, this option is not really needed; it is on by default
for all those cases in which it is allowed.

You can force a scanner to
@emph{not}
be interactive by using
@samp{--batch}.



@anchor{option-7bit}
@opindex -7
@opindex ---7bit
@opindex 7bit
@item -7, --7bit, @code{%option 7bit}
instructs @code{flex} to generate a 7-bit scanner, i.e., one which can
only recognize 7-bit characters in its input. The advantage of using
@samp{--7bit} is that the scanner's tables can be up to half the size of
those generated using the @samp{--8bit} option. The disadvantage is that such
scanners often hang or crash if their input contains an 8-bit character.

Note, however, that unless you generate your scanner using the
@samp{-Cf} or @samp{-CF} table compression options, use of @samp{--7bit}
will save only a small amount of table space, and make your scanner
considerably less portable. @code{Flex}'s default behavior is to
generate an 8-bit scanner unless you use the @samp{-Cf} or @samp{-CF}
options, in which case @code{flex} defaults to generating 7-bit scanners unless
your site was always configured to generate 8-bit scanners (as will
often be the case with non-USA sites). You can tell whether flex
generated a 7-bit or an 8-bit scanner by inspecting the flag summary in
the @samp{--verbose} output as described above.

Note that if you use @samp{-Cfe} or @samp{-CFe} @code{flex} still
defaults to generating an 8-bit scanner, since usually with these
compression options full 8-bit tables are not much more expensive than
7-bit tables.



@anchor{option-8bit}
@opindex -8
@opindex ---8bit
@opindex 8bit
@item -8, --8bit, @code{%option 8bit}
instructs @code{flex} to generate an 8-bit scanner, i.e., one which can
recognize 8-bit characters. This flag is only needed for scanners
generated using @samp{-Cf} or @samp{-CF}, as otherwise flex defaults to
generating an 8-bit scanner anyway.

See the discussion of
@samp{--7bit}
above for @code{flex}'s default behavior and the tradeoffs between 7-bit
and 8-bit scanners.



@anchor{option-default}
@opindex ---default
@opindex default
@item --default, @code{%option default}
generates the default rule.



@anchor{option-always-interactive}
@opindex ---always-interactive
@opindex always-interactive
@item --always-interactive, @code{%option always-interactive}
instructs flex to generate a scanner which always considers its input
@emph{interactive}. Normally, on each new input file the scanner calls
@code{isatty()} in an attempt to determine whether the scanner's input
source is interactive and thus should be read a character at a time.
When this option is used, however, then no such call is made.



@opindex ---never-interactive
@item --never-interactive, @code{%option never-interactive}
instructs flex to generate a scanner which never considers its input
interactive. This is the opposite of @code{always-interactive}.


@anchor{option-posix}
@opindex -X
@opindex ---posix
@opindex posix
@item -X, --posix, @code{%option posix}
turns on maximum compatibility with the POSIX 1003.2-1992 definition of
@code{lex}. Since @code{flex} was originally designed to implement the
POSIX definition of @code{lex} this generally involves very few changes
in behavior.
At the time of writing, the known differences between
@code{flex} and the POSIX standard are:

@itemize
@item
In POSIX and AT&T @code{lex}, the repeat operator, @samp{@{@}}, has lower
precedence than concatenation (thus @samp{ab@{3@}} yields @samp{ababab}).
Most POSIX utilities use an Extended Regular Expression (ERE) precedence
that has the precedence of the repeat operator higher than concatenation
(which causes @samp{ab@{3@}} to yield @samp{abbb}). By default, @code{flex}
places the precedence of the repeat operator higher than concatenation
which matches the ERE processing of other POSIX utilities. When either
@samp{--posix} or @samp{-l} is specified, @code{flex} will use the
traditional AT&T and POSIX-compliant precedence for the repeat operator
where concatenation has higher precedence than the repeat operator.
@end itemize

@anchor{option-stack}
@opindex ---stack
@opindex stack
@item --stack, @code{%option stack}
enables the use of
start condition stacks (@pxref{Start Conditions}).

@anchor{option-stdinit}
@opindex ---stdinit
@opindex stdinit
@item --stdinit, @code{%option stdinit}
if set (i.e., @b{%option stdinit}) initializes @code{yyin} and
@code{yyout} to @file{stdin} and @file{stdout}, instead of the default of
@file{NULL}. Some existing @code{lex} programs depend on this behavior,
even though it is not compliant with ANSI C, which does not require
@file{stdin} and @file{stdout} to be compile-time constant. In a
reentrant scanner, however, this is not a problem since initialization
is performed in @code{yylex_init} at runtime.

@anchor{option-yylineno}
@opindex ---yylineno
@opindex yylineno
@item --yylineno, @code{%option yylineno}
directs @code{flex} to generate a scanner
that maintains the number of the current line read from its input in the
global variable @code{yylineno}. This option is implied by @code{%option
lex-compat}. In a reentrant C scanner, the macro @code{yylineno} is
accessible regardless of the value of @code{%option yylineno}; however, its
value is not modified by @code{flex} unless @code{%option yylineno} is enabled.

@anchor{option-yywrap}
@opindex ---yywrap
@opindex yywrap
@item --yywrap, @code{%option yywrap}
if unset (i.e., @code{--noyywrap}), makes the scanner not call
@code{yywrap()} upon an end-of-file, but simply assume that there are no
more files to scan (until the user points @file{yyin} at a new file and
calls @code{yylex()} again).

@end table

@node Code-Level And API Options, Options for Scanner Speed and Size, Options Affecting Scanner Behavior, Scanner Options
@section Code-Level And API Options

@table @samp

@anchor{option-ansi-definitions}
@opindex ---ansi-definitions
@opindex ansi-definitions
@item --ansi-definitions, @code{%option ansi-definitions}
instructs flex to generate ANSI C99 definitions for functions.
This option is enabled by default.
If @code{%option noansi-definitions} is specified, then the obsolete style
is generated.

@anchor{option-ansi-prototypes}
@opindex ---ansi-prototypes
@opindex ansi-prototypes
@item --ansi-prototypes, @code{%option ansi-prototypes}
instructs flex to generate ANSI C99 prototypes for functions.
This option is enabled by default.
If @code{noansi-prototypes} is specified, then
prototypes will have empty parameter lists.

@anchor{option-bison-bridge}
@opindex ---bison-bridge
@opindex bison-bridge
@item --bison-bridge, @code{%option bison-bridge}
instructs flex to generate a C scanner that is
meant to be called by a
@code{GNU bison}
parser. The scanner has minor API changes for
@code{bison}
compatibility. In particular, the declaration of
@code{yylex}
is modified to take an additional parameter,
@code{yylval}.
@xref{Bison Bridge}.

@anchor{option-bison-locations}
@opindex ---bison-locations
@opindex bison-locations
@item --bison-locations, @code{%option bison-locations}
instructs flex that
@code{GNU bison} @code{%locations} are being used.
This means @code{yylex} will be passed
an additional parameter, @code{yylloc}. This option
implies @code{%option bison-bridge}.
@xref{Bison Bridge}.

@anchor{option-noline}
@opindex -L
@opindex ---noline
@opindex noline
@item -L, --noline, @code{%option noline}
instructs
@code{flex}
not to generate
@code{#line}
directives. Without this option,
@code{flex}
peppers the generated scanner
with @code{#line} directives so error messages in the actions will be correctly
located with respect to either the original
@code{flex}
input file (if the errors are due to code in the input file), or
@file{lex.yy.c}
(if the errors are
@code{flex}'s
fault; you should report these sorts of errors to the email address
given in @ref{Reporting Bugs}).

@anchor{option-reentrant}
@opindex -R
@opindex ---reentrant
@opindex reentrant
@item -R, --reentrant, @code{%option reentrant}
instructs flex to generate a reentrant C scanner. The generated scanner
may safely be used in a multi-threaded environment. The API for a
reentrant scanner is different from that of a non-reentrant scanner
(@pxref{Reentrant}).
Because of the API difference between
reentrant and non-reentrant @code{flex} scanners, non-reentrant flex
code must be modified before it is suitable for use with this option.
This option is not compatible with the @samp{--c++} option.

The option @samp{--reentrant} does not affect the performance of
the scanner.

@anchor{option-c++}
@opindex -+
@opindex ---c++
@opindex c++
@item -+, --c++, @code{%option c++}
specifies that you want flex to generate a C++
scanner class. @xref{Cxx}, for
details.

@anchor{option-array}
@opindex ---array
@opindex array
@item --array, @code{%option array}
specifies that you want @code{yytext} to be an array instead of a
@code{char *}.

@anchor{option-pointer}
@opindex ---pointer
@opindex pointer
@item --pointer, @code{%option pointer}
specifies that @code{yytext} should be a @code{char *}, not an array.
The default is @code{char *}.

@anchor{option-prefix}
@opindex -P
@opindex ---prefix
@opindex prefix
@item -PPREFIX, --prefix=PREFIX, @code{%option prefix="PREFIX"}
changes the default @samp{yy} prefix used by @code{flex} for all
globally-visible variable and function names to instead be
@samp{PREFIX}. For example, @samp{--prefix=foo} changes the name of
@code{yytext} to @code{footext}. It also changes the name of the default
output file from @file{lex.yy.c} to @file{lex.foo.c}.
Here is a partial
list of the names affected:

@example
@verbatim
    yy_create_buffer
    yy_delete_buffer
    yy_flex_debug
    yy_init_buffer
    yy_flush_buffer
    yy_load_buffer_state
    yy_switch_to_buffer
    yyin
    yyleng
    yylex
    yylineno
    yyout
    yyrestart
    yytext
    yywrap
    yyalloc
    yyrealloc
    yyfree
@end verbatim
@end example

(If you are using a C++ scanner, then only @code{yywrap} and
@code{yyFlexLexer} are affected.) Within your scanner itself, you can
still refer to the global variables and functions using either version
of their name; but externally, they have the modified name.

This option lets you easily link together multiple
@code{flex}
programs into the same executable. Note, though, that using this
option also renames
@code{yywrap()},
so you now
@emph{must}
either
provide your own (appropriately-named) version of the routine for your
scanner, or use
@code{%option noyywrap},
as linking with
@samp{-lfl}
no longer provides one for you by default.

@anchor{option-main}
@opindex ---main
@opindex main
@item --main, @code{%option main}
directs flex to provide a default @code{main()} program for the
scanner, which simply calls @code{yylex()}. This option implies
@code{noyywrap} (@pxref{option-yywrap}).

@anchor{option-nounistd}
@opindex ---nounistd
@opindex nounistd
@item --nounistd, @code{%option nounistd}
suppresses inclusion of the non-ANSI header file @file{unistd.h}. This option
is meant to target environments in which @file{unistd.h} does not exist. Be aware
that certain options may cause flex to generate code that relies on functions
normally found in @file{unistd.h} (e.g., @code{isatty()}, @code{read()}).
If you wish to use these functions, you will have to inform your compiler where
to find them.
@xref{option-always-interactive}. @xref{option-read}.

@anchor{option-yyclass}
@opindex ---yyclass
@opindex yyclass
@item --yyclass=NAME, @code{%option yyclass="NAME"}
only applies when generating a C++ scanner (the @samp{--c++} option). It
informs @code{flex} that you have derived @code{NAME} as a subclass of
@code{yyFlexLexer}, so @code{flex} will place your actions in the member
function @code{NAME::yylex()} instead of @code{yyFlexLexer::yylex()}. It
also generates a @code{yyFlexLexer::yylex()} member function that emits
a run-time error (by invoking @code{yyFlexLexer::LexerError()}) if
called. @xref{Cxx}.

@end table

@node Options for Scanner Speed and Size, Debugging Options, Code-Level And API Options, Scanner Options
@section Options for Scanner Speed and Size

@table @samp

@item -C[aefFmr]
controls the degree of table compression and, more generally, trade-offs
between small scanners and fast scanners.

@table @samp
@opindex -C
@item -C
A lone @samp{-C} specifies that the scanner tables should be compressed
but neither equivalence classes nor meta-equivalence classes should be
used.

@anchor{option-align}
@opindex -Ca
@opindex ---align
@opindex align
@item -Ca, --align, @code{%option align}
(``align'') instructs flex to trade off larger tables in the
generated scanner for faster performance because the elements of
the tables are better aligned for memory access and computation. On some
RISC architectures, fetching and manipulating longwords is more efficient
than with smaller-sized units such as shortwords. This option can
quadruple the size of the tables used by your scanner.

@anchor{option-ecs}
@opindex -Ce
@opindex ---ecs
@opindex ecs
@item -Ce, --ecs, @code{%option ecs}
directs @code{flex} to construct @dfn{equivalence classes}, i.e., sets
of characters which have identical lexical properties (for example, if
the only appearance of digits in the @code{flex} input is in the
character class ``[0-9]'' then the digits '0', '1', ..., '9' will all be
put in the same equivalence class). Equivalence classes usually give
dramatic reductions in the final table/object file sizes (typically a
factor of 2-5) and are pretty cheap performance-wise (one array look-up
per character scanned).

@opindex -Cf
@item -Cf
specifies that the @dfn{full} scanner tables should be generated;
@code{flex} should not compress the tables by taking advantage of
similar transition functions for different states.

@opindex -CF
@item -CF
specifies that the alternate fast scanner representation (described
above under the @samp{--fast} flag) should be used. This option cannot be
used with @samp{--c++}.

@anchor{option-meta-ecs}
@opindex -Cm
@opindex ---meta-ecs
@opindex meta-ecs
@item -Cm, --meta-ecs, @code{%option meta-ecs}
directs
@code{flex}
to construct
@dfn{meta-equivalence classes},
which are sets of equivalence classes (or characters, if equivalence
classes are not being used) that are commonly used together. Meta-equivalence
classes are often a big win when using compressed tables, but they
have a moderate performance impact (one or two @code{if} tests and one
array look-up per character scanned).

@anchor{option-read}
@opindex -Cr
@opindex ---read
@opindex read
@item -Cr, --read, @code{%option read}
causes the generated scanner to @emph{bypass} use of the standard I/O
library (@code{stdio}) for input.
Instead of calling @code{fread()} or
@code{getc()}, the scanner will use the @code{read()} system call,
resulting in a performance gain which varies from system to system, but
in general is probably negligible unless you are also using @samp{-Cf}
or @samp{-CF}. Using @samp{-Cr} can cause strange behavior if, for
example, you read from @file{yyin} using @code{stdio} prior to calling
the scanner (because the scanner will miss whatever text your previous
reads left in the @code{stdio} input buffer). @samp{-Cr} has no effect
if you define @code{YY_INPUT()} (@pxref{Generated Scanner}).
@end table

The options @samp{-Cf} or @samp{-CF} and @samp{-Cm} do not make sense
together, because there is no opportunity for meta-equivalence classes if the
table is not being compressed. Otherwise the options may be freely
mixed, and are cumulative.

The default setting is @samp{-Cem}, which specifies that @code{flex}
should generate equivalence classes and meta-equivalence classes. This
setting provides the highest degree of table compression. You can trade
off faster-executing scanners at the cost of larger tables with the
following generally being true:

@example
@verbatim
    slowest & smallest
          -Cem
          -Cm
          -Ce
          -C
          -C{f,F}e
          -C{f,F}
          -C{f,F}a
    fastest & largest
@end verbatim
@end example

Note that scanners with the smallest tables are usually generated and
compiled the quickest, so during development you will usually want to
use the default, maximal compression.

@samp{-Cfe} is often a good compromise between speed and size for
production scanners.

@anchor{option-full}
@opindex -f
@opindex ---full
@opindex full
@item -f, --full, @code{%option full}
specifies a
@dfn{fast scanner}.
No table compression is done and @code{stdio} is bypassed.
The result is large but fast.
This option is equivalent to
@samp{-Cfr}.

@anchor{option-fast}
@opindex -F
@opindex ---fast
@opindex fast
@item -F, --fast, @code{%option fast}
specifies that the @emph{fast} scanner table representation should be
used (and @code{stdio} bypassed). This representation is about as fast
as the full table representation @samp{--full}, and for some sets of
patterns will be considerably smaller (and for others, larger). In
general, if the pattern set contains both @emph{keywords} and a
catch-all, @emph{identifier} rule, such as in the set:

@example
@verbatim
    "case"    return TOK_CASE;
    "switch"  return TOK_SWITCH;
    ...
    "default" return TOK_DEFAULT;
    [a-z]+    return TOK_ID;
@end verbatim
@end example

then you're better off using the full table representation. If only
the @emph{identifier} rule is present and you then use a hash table or some such
to detect the keywords, you're better off using
@samp{--fast}.

This option is equivalent to @samp{-CFr}. It cannot be used
with @samp{--c++}.

@end table

@node Debugging Options, Miscellaneous Options, Options for Scanner Speed and Size, Scanner Options
@section Debugging Options

@table @samp

@anchor{option-backup}
@opindex -b
@opindex ---backup
@opindex backup
@item -b, --backup, @code{%option backup}
Generate backing-up information to @file{lex.backup}. This is a list of
scanner states which require backing up and the input characters on
which they do so. By adding rules one can remove backing-up states. If
@emph{all} backing-up states are eliminated and @samp{-Cf} or @samp{-CF}
is used, the generated scanner will run faster (see the @samp{--perf-report} flag).
Only users who wish to squeeze every last cycle out of their scanners
need worry about this option (@pxref{Performance}).

@anchor{option-debug}
@opindex -d
@opindex ---debug
@opindex debug
@item -d, --debug, @code{%option debug}
makes the generated scanner run in @dfn{debug} mode. Whenever a pattern
is recognized and the global variable @code{yy_flex_debug} is non-zero
(which is the default), the scanner will write to @file{stderr} a line
of the form:

@example
@verbatim
    --accepting rule at line 53 ("the matched text")
@end verbatim
@end example

The line number refers to the location of the rule in the file defining
the scanner (i.e., the file that was fed to flex). Messages are also
generated when the scanner backs up, accepts the default rule, reaches
the end of its input buffer (or encounters a NUL; at this point, the two
look the same as far as the scanner's concerned), or reaches an
end-of-file.

@anchor{option-perf-report}
@opindex -p
@opindex ---perf-report
@opindex perf-report
@item -p, --perf-report, @code{%option perf-report}
generates a performance report to @file{stderr}. The report consists of
comments regarding features of the @code{flex} input file which will
cause a serious loss of performance in the resulting scanner. If you
give the flag twice, you will also get comments regarding features that
lead to minor performance losses.

Note that the use of @code{REJECT} and of
variable trailing context (@pxref{Limitations}) entails a substantial
performance penalty; use of @code{yymore()}, the @samp{^} operator, and
the @samp{--interactive} flag entails minor performance penalties.

@anchor{option-nodefault}
@opindex -s
@opindex ---nodefault
@opindex nodefault
@item -s, --nodefault, @code{%option nodefault}
causes the @emph{default rule} (that unmatched scanner input is echoed
to @file{stdout}) to be suppressed.
If the scanner encounters input
that does not match any of its rules, it aborts with an error. This
option is useful for finding holes in a scanner's rule set.

@anchor{option-trace}
@opindex -T
@opindex ---trace
@opindex trace
@item -T, --trace, @code{%option trace}
makes @code{flex} run in @dfn{trace} mode. It will generate a lot of
messages to @file{stderr} concerning the form of the input and the
resultant non-deterministic and deterministic finite automata. This
option is mostly for use in maintaining @code{flex}.

@anchor{option-nowarn}
@opindex -w
@opindex ---nowarn
@opindex nowarn
@item -w, --nowarn, @code{%option nowarn}
suppresses warning messages.

@anchor{option-verbose}
@opindex -v
@opindex ---verbose
@opindex verbose
@item -v, --verbose, @code{%option verbose}
specifies that @code{flex} should write to @file{stderr} a summary of
statistics regarding the scanner it generates. Most of the statistics
are meaningless to the casual @code{flex} user, but the first line
identifies the version of @code{flex} (same as reported by @samp{--version}),
and the next line the flags used when generating the scanner, including
those that are on by default.

@anchor{option-warn}
@opindex ---warn
@opindex warn
@item --warn, @code{%option warn}
warns about certain things. In particular, if the default rule can be
matched but no default rule has been given, @code{flex} will warn you.
We recommend always using this option.

@end table

@node Miscellaneous Options, , Debugging Options, Scanner Options
@section Miscellaneous Options

@table @samp
@opindex -c
@item -c
A do-nothing option included for POSIX compliance.

@opindex -h
@opindex ---help
@item -h, -?, --help
generates a ``help'' summary of @code{flex}'s options to @file{stdout}
and then exits.

@opindex -n
@item -n
Another do-nothing option included for
POSIX compliance.

@opindex -V
@opindex ---version
@item -V, --version
prints the version number to @file{stdout} and exits.

@end table


@node Performance, Cxx, Scanner Options, Top
@chapter Performance Considerations

@cindex performance, considerations
The main design goal of @code{flex} is that it generate high-performance
scanners. It has been optimized for dealing well with large sets of
rules. Aside from the effects on scanner speed of the table compression
@samp{-C} options outlined above, there are a number of options/actions
which degrade performance. These are, from most expensive to least:

@cindex REJECT, performance costs
@cindex yylineno, performance costs
@cindex trailing context, performance costs
@example
@verbatim
    REJECT
    arbitrary trailing context

    pattern sets that require backing up
    %option yylineno
    %array

    %option interactive
    %option always-interactive

    ^ beginning-of-line operator
    yymore()
@end verbatim
@end example

with the first two being quite expensive and the last two being
quite cheap. Note also that @code{unput()} is implemented as a routine
call that potentially does quite a bit of work, while @code{yyless()} is
a quite-cheap macro. So if you are just putting back some excess text
you scanned, use @code{yyless()}.

@code{REJECT} should be avoided at all costs when performance is
important. It is a particularly expensive option.

There is one case when @code{%option yylineno} can be expensive.
That is when
your patterns match long tokens that could @emph{possibly} contain a newline
character. There is no performance penalty for rules that cannot possibly
match newlines, since flex does not need to check them for newlines. In
general, you should avoid rules such as @code{[^f]+}, which match very long
tokens, including newlines, and may possibly match your entire file! A better
approach is to separate @code{[^f]+} into two rules:

@example
@verbatim
%option yylineno
%%
    [^f\n]+
    \n+
@end verbatim
@end example

The above scanner does not incur a performance penalty.

@cindex patterns, tuning for performance
@cindex performance, backing up
@cindex backing up, example of eliminating
Getting rid of backing up is messy and often may be an enormous amount
of work for a complicated scanner. In principle, one begins by using
the @samp{-b} flag to generate a @file{lex.backup} file. For example,
on the input:

@cindex backing up, eliminating
@example
@verbatim
    %%
    foo        return TOK_KEYWORD;
    foobar     return TOK_KEYWORD;
@end verbatim
@end example

the file looks like:

@example
@verbatim
    State #6 is non-accepting -
     associated rule line numbers:
           2       3
     out-transitions: [ o ]
     jam-transitions: EOF [ \001-n  p-\177 ]

    State #8 is non-accepting -
     associated rule line numbers:
           3
     out-transitions: [ a ]
     jam-transitions: EOF [ \001-`  b-\177 ]

    State #9 is non-accepting -
     associated rule line numbers:
           3
     out-transitions: [ r ]
     jam-transitions: EOF [ \001-q  s-\177 ]

    Compressed tables always back up.
@end verbatim
@end example

The first few lines tell us that there's a scanner state in which it can
make a transition on an 'o' but not on any other character, and that in
that state the currently scanned text does not match any rule. The
state occurs when trying to match the rules found at lines 2 and 3 in
the input file. If the scanner is in that state and then reads
something other than an 'o', it will have to back up to find a rule
which is matched. With a bit of head-scratching one can see that this
must be the state it's in when it has seen @samp{fo}. When this has
happened, if anything other than another @samp{o} is seen, the scanner
will have to back up to simply match the @samp{f} (by the default rule).

The comment regarding State #8 indicates there's a problem when
@samp{foob} has been scanned. Indeed, on any character other than an
@samp{a}, the scanner will have to back up to accept @samp{foo}. Similarly,
the comment for State #9 concerns when @samp{fooba} has been scanned and
an @samp{r} does not follow.

The final comment reminds us that there's no point going to all the
trouble of removing backing up from the rules unless we're using
@samp{-Cf} or @samp{-CF}, since there's no performance gain doing so
with compressed scanners.
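
To experiment with such a report, the two-rule input can be made
self-contained. The following is only a sketch: the file name
@file{backup.l}, the value given to @code{TOK_KEYWORD} and the small
@code{main()} driver are assumptions made for illustration and are not
defined by @code{flex}.

@example
@verbatim
%{
#include <stdio.h>

/* Illustrative token code; any non-zero value would do. */
enum { TOK_KEYWORD = 258 };
%}

%option noyywrap

%%
foo        return TOK_KEYWORD;
foobar     return TOK_KEYWORD;
%%

int main(void)
{
    int tok;

    /* yylex() returns 0 at end-of-file; unmatched text is echoed
       by the default rule. */
    while ((tok = yylex()) != 0)
        printf("token %d: %s\n", tok, yytext);
    return 0;
}
@end verbatim
@end example

Running @samp{flex -b backup.l} should write @file{lex.backup} next to
the generated scanner. The state numbers and rule line numbers in that
report depend on the exact input file, so they need not match the
listing shown earlier.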

@cindex error rules, to eliminate backing up
The way to remove the backing up is to add ``error'' rules:

@cindex backing up, eliminating by adding error rules
@example
@verbatim
    %%
    foo         return TOK_KEYWORD;
    foobar      return TOK_KEYWORD;

    fooba       |
    foob        |
    fo          {
                /* false alarm, not really a keyword */
                return TOK_ID;
                }
@end verbatim
@end example

Eliminating backing up among a list of keywords can also be done using a
``catch-all'' rule:

@cindex backing up, eliminating with catch-all rule
@example
@verbatim
    %%
    foo         return TOK_KEYWORD;
    foobar      return TOK_KEYWORD;

    [a-z]+      return TOK_ID;
@end verbatim
@end example

This is usually the best solution when appropriate.

Backing up messages tend to cascade. With a complicated set of rules
it's not uncommon to get hundreds of messages. If one can decipher
them, though, it often only takes a dozen or so rules to eliminate the
backing up, though it's easy to make a mistake and have an error rule
accidentally match a valid token. (A possible future @code{flex} feature
will be to automatically add rules to eliminate backing up.)

It's important to keep in mind that you gain the benefits of eliminating
backing up only if you eliminate @emph{every} instance of backing up.
Leaving just one means you gain nothing.

@emph{Variable} trailing context (where both the leading and trailing
parts do not have a fixed length) entails almost the same performance
loss as @code{REJECT} (i.e., substantial).
So when possible a rule
like:

@cindex trailing context, variable length
@example
@verbatim
    %%
    mouse|rat/(cat|dog)   run();
@end verbatim
@end example

is better written:

@example
@verbatim
    %%
    mouse/cat|dog         run();
    rat/cat|dog           run();
@end verbatim
@end example

or as

@example
@verbatim
    %%
    mouse|rat/cat         run();
    mouse|rat/dog         run();
@end verbatim
@end example

Note that here the special '|' action does @emph{not} provide any
savings, and can even make things worse (@pxref{Limitations}).

Another area where the user can increase a scanner's performance (and
one that's easier to implement) arises from the fact that the longer the
tokens matched, the faster the scanner will run. This is because with
long tokens the processing of most input characters takes place in the
(short) inner scanning loop, and does not often have to go through the
additional work of setting up the scanning environment (e.g.,
@code{yytext}) for the action.
Recall the scanner for C comments:

@cindex performance optimization, matching longer tokens
@example
@verbatim
    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\n]*
    <comment>"*"+[^*/\n]*
    <comment>\n             ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);
@end verbatim
@end example

This could be sped up by writing it as:

@example
@verbatim
    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\n]*
    <comment>[^*\n]*\n      ++line_num;
    <comment>"*"+[^*/\n]*
    <comment>"*"+[^*/\n]*\n ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);
@end verbatim
@end example

Now instead of each newline requiring the processing of another action,
recognizing the newlines is distributed over the other rules to keep the
matched text as long as possible. Note that @emph{adding} rules does
@emph{not} slow down the scanner! The speed of the scanner is
independent of the number of rules or (modulo the considerations given
at the beginning of this section) how complicated the rules are with
regard to operators such as @samp{*} and @samp{|}.

@cindex keywords, for performance
@cindex performance, using keywords
A final example in speeding up a scanner: suppose you want to scan
through a file containing identifiers and keywords, one per line
and with no other extraneous characters, and recognize all the
keywords. A natural first approach is:

@cindex performance optimization, recognizing keywords
@example
@verbatim
    %%
    asm      |
    auto     |
    break    |
    ... etc ...
    volatile |
    while    /* it's a keyword */

    .|\n     /* it's not a keyword */
@end verbatim
@end example

To eliminate the back-tracking, introduce a catch-all rule:

@example
@verbatim
    %%
    asm      |
    auto     |
    break    |
    ... etc ...
    volatile |
    while    /* it's a keyword */

    [a-z]+   |
    .|\n     /* it's not a keyword */
@end verbatim
@end example

Now, if it's guaranteed that there's exactly one word per line, then we
can reduce the total number of matches by half by merging in the
recognition of newlines with that of the other tokens:

@example
@verbatim
    %%
    asm\n      |
    auto\n     |
    break\n    |
    ... etc ...
    volatile\n |
    while\n    /* it's a keyword */

    [a-z]+\n   |
    .|\n       /* it's not a keyword */
@end verbatim
@end example

One has to be careful here, as we have now reintroduced backing up
into the scanner. In particular, while
@emph{we}
know that there will never be any characters in the input stream
other than letters or newlines,
@code{flex}
can't figure this out, and it will plan for possibly needing to back up
when it has scanned a token like @samp{auto} and then the next character
is something other than a newline or a letter. Previously it would
then just match the @samp{auto} rule and be done, but now it has no @samp{auto}
rule, only an @samp{auto\n} rule. To eliminate the possibility of backing up,
we could either duplicate all rules but without final newlines, or,
since we never expect to encounter such an input and therefore don't
care how it's classified, we can introduce one more catch-all rule, this
one which doesn't include a newline:

@example
@verbatim
    %%
    asm\n      |
    auto\n     |
    break\n    |
    ... etc ...
    volatile\n |
    while\n    /* it's a keyword */

    [a-z]+\n   |
    [a-z]+     |
    .|\n       /* it's not a keyword */
@end verbatim
@end example

Compiled with @samp{-Cf}, this is about as fast as one can get a
@code{flex} scanner to go for this particular problem.
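
As a rough illustration only, the final rule set can be turned into a
complete program by giving the rules real actions and adding a
@code{main()}. The sketch below assumes an abbreviated keyword list;
the counters @code{keywords} and @code{others} and the file name
@file{keywords.l} are names invented for this example.

@example
@verbatim
%{
#include <stdio.h>

/* Counters invented for this sketch; they are not part of flex. */
static long keywords = 0;
static long others = 0;
%}

%option noyywrap

%%
asm\n       |
auto\n      |
break\n     |
volatile\n  |
while\n     ++keywords;   /* it's a keyword */

[a-z]+\n    |
[a-z]+      |
.|\n        ++others;     /* it's not a keyword */
%%

int main(void)
{
    yylex();   /* returns once the input is exhausted */
    printf("%ld keywords, %ld other tokens\n", keywords, others);
    return 0;
}
@end verbatim
@end example

Generated with something like @samp{flex -Cf keywords.l} and then
compiled from @file{lex.yy.c}, the program reads standard input and
reports the two counts. @code{%option noyywrap} is used here so that no
@code{yywrap()} routine has to be supplied.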

A final note: @code{flex} is slow when matching @code{NUL}s,
particularly when a token contains multiple @code{NUL}s. It's best to
write rules which match @emph{short} amounts of text if it's anticipated
that the text will often include @code{NUL}s.

Another final note regarding performance: as mentioned in
@ref{Matching}, dynamically resizing @code{yytext} to accommodate huge
tokens is a slow process because it presently requires that the (huge)
token be rescanned from the beginning. Thus if performance is vital,
you should attempt to match ``large'' quantities of text but not
``huge'' quantities, where the cutoff between the two is at about 8K
characters per token.

@node Cxx, Reentrant, Performance, Top
@chapter Generating C++ Scanners

@cindex c++, experimental form of scanner class
@cindex experimental form of c++ scanner class
@strong{IMPORTANT}: the present form of the scanning class is @emph{experimental}
and may change considerably between major releases.

@cindex C++
@cindex member functions, C++
@cindex methods, c++
@code{flex} provides two different ways to generate scanners for use
with C++. The first way is to simply compile a scanner generated by
@code{flex} using a C++ compiler instead of a C compiler. You should
not encounter any compilation errors (@pxref{Reporting Bugs}). You can
then use C++ code in your rule actions instead of C code. Note that the
default input source for your scanner remains @file{yyin}, and default
echoing is still done to @file{yyout}. Both of these remain
@code{FILE *} variables and not C++ @emph{streams}.

You can also use @code{flex} to generate a C++ scanner class, using the
@samp{-+} option (or, equivalently, @code{%option c++}), which is
automatically specified if the name of the @code{flex} executable ends
in a @samp{+}, such as @code{flex++}.
When using this option, @code{flex}
defaults to generating the scanner to the file @file{lex.yy.cc} instead
of @file{lex.yy.c}. The generated scanner includes the header file
@file{FlexLexer.h}, which defines the interface to two C++ classes.

The first class,
@code{FlexLexer},
provides an abstract base class defining the general scanner class
interface. It provides the following member functions:

@table @code
@findex YYText (C++ only)
@item const char* YYText()
returns the text of the most recently matched token, the equivalent of
@code{yytext}.

@findex YYLeng (C++ only)
@item int YYLeng()
returns the length of the most recently matched token, the equivalent of
@code{yyleng}.

@findex lineno (C++ only)
@item int lineno() const
returns the current input line number (see @code{%option yylineno}), or
@code{1} if @code{%option yylineno} was not used.

@findex set_debug (C++ only)
@item void set_debug( int flag )
sets the debugging flag for the scanner, equivalent to assigning to
@code{yy_flex_debug} (@pxref{Scanner Options}). Note that you must build
the scanner using @code{%option debug} to include debugging information
in it.

@findex debug (C++ only)
@item int debug() const
returns the current setting of the debugging flag.
@end table

Also provided are member functions equivalent to
@code{yy_switch_to_buffer()}, @code{yy_create_buffer()} (though the
first argument is an @code{istream*} object pointer and not a
@code{FILE*}), @code{yy_flush_buffer()}, @code{yy_delete_buffer()}, and
@code{yyrestart()} (again, the first argument is an @code{istream*}
object pointer).

@tindex yyFlexLexer (C++ only)
@tindex FlexLexer (C++ only)
The second class defined in @file{FlexLexer.h} is @code{yyFlexLexer},
which is derived from @code{FlexLexer}.
It defines the following additional member functions:

@table @code
@findex yyFlexLexer constructor (C++ only)
@item yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )
constructs a @code{yyFlexLexer} object using the given streams for input
and output. If not specified, the streams default to @code{cin} and
@code{cout}, respectively.

@findex yylex (C++ version)
@item virtual int yylex()
performs the same role as @code{yylex()} does for ordinary @code{flex}
scanners: it scans the input stream, consuming tokens, until a rule's
action returns a value. If you derive a subclass @code{S} from
@code{yyFlexLexer} and want to access the member functions and variables
of @code{S} inside @code{yylex()}, then you need to use @code{%option
yyclass="S"} to inform @code{flex} that you will be using that subclass
instead of @code{yyFlexLexer}. In this case, rather than generating
@code{yyFlexLexer::yylex()}, @code{flex} generates @code{S::yylex()}
(and also generates a dummy @code{yyFlexLexer::yylex()} that calls
@code{yyFlexLexer::LexerError()} if called).

@findex switch_streams (C++ only)
@item virtual void switch_streams(istream* new_in = 0, ostream* new_out = 0)
reassigns @code{yyin} to @code{new_in} (if non-null) and @code{yyout} to
@code{new_out} (if non-null), deleting the previous input buffer if
@code{yyin} is reassigned.

@item int yylex( istream* new_in, ostream* new_out = 0 )
first switches the input streams via @code{switch_streams( new_in,
new_out )} and then returns the value of @code{yylex()}.
@end table

In addition, @code{yyFlexLexer} defines the following protected virtual
functions which you can redefine in derived classes to tailor the
scanner:

@table @code
@findex LexerInput (C++ only)
@item virtual int LexerInput( char* buf, int max_size )
reads up to @code{max_size} characters into @code{buf} and returns the
number of characters read. To indicate end-of-input, return 0
characters. Note that @code{interactive} scanners (see the @samp{-B}
and @samp{-I} flags in @ref{Scanner Options}) define the macro
@code{YY_INTERACTIVE}. If you redefine @code{LexerInput()} and need to
take different actions depending on whether or not the scanner might be
scanning an interactive input source, you can test for the presence of
this name via @code{#ifdef} statements.

@findex LexerOutput (C++ only)
@item virtual void LexerOutput( const char* buf, int size )
writes out @code{size} characters from the buffer @code{buf}, which, while
@code{NUL}-terminated, may also contain internal @code{NUL}s if the
scanner's rules can match text with @code{NUL}s in them.

@cindex error reporting, in C++
@findex LexerError (C++ only)
@item virtual void LexerError( const char* msg )
reports a fatal error message. The default version of this function
writes the message to the stream @code{cerr} and exits.
@end table

Note that a @code{yyFlexLexer} object contains its @emph{entire}
scanning state. Thus you can use such objects to create reentrant
scanners, but see also @ref{Reentrant}. You can instantiate multiple
instances of the same @code{yyFlexLexer} class, and you can also combine
multiple C++ scanner classes together in the same program using the
@samp{-P} option discussed above.

Finally, note that the @code{%array} feature is not available to C++
scanner classes; you must use @code{%pointer} (the default).

Here is an example of a simple C++ scanner:

@cindex C++ scanners, use of
@example
@verbatim
        // An example of using the flex C++ scanner class.

    %{
    #include <iostream>
    using namespace std;
    int mylineno = 0;
    %}

    %option noyywrap

    string  \"[^\n"]+\"

    ws      [ \t]+

    alpha   [A-Za-z]
    dig     [0-9]
    name    ({alpha}|{dig}|\$)({alpha}|{dig}|[_.\-/$])*
    num1    [-+]?{dig}+\.?([eE][-+]?{dig}+)?
    num2    [-+]?{dig}*\.{dig}+([eE][-+]?{dig}+)?
    number  {num1}|{num2}

    %%

    {ws}    /* skip blanks and tabs */

    "/*"    {
            int c;

            while((c = yyinput()) != 0)
                {
                if(c == '\n')
                    ++mylineno;

                else if(c == '*')
                    {
                    if((c = yyinput()) == '/')
                        break;
                    else
                        unput(c);
                    }
                }
            }

    {number}  cout << "number " << YYText() << '\n';

    \n        mylineno++;

    {name}    cout << "name " << YYText() << '\n';

    {string}  cout << "string " << YYText() << '\n';

    %%

    int main( int /* argc */, char** /* argv */ )
        {
        FlexLexer* lexer = new yyFlexLexer;
        while(lexer->yylex() != 0)
            ;
        return 0;
        }
@end verbatim
@end example

@cindex C++, multiple different scanners
If you want to create multiple (different) lexer classes, you use the
@samp{-P} flag (or the @code{prefix=} option) to rename each
@code{yyFlexLexer} to some other @samp{xxFlexLexer}.
You then can include @file{<FlexLexer.h>} in your other sources once per
lexer class, first renaming @code{yyFlexLexer} as follows:

@cindex include files, with C++
@cindex header files, with C++
@cindex C++ scanners, including multiple scanners
@example
@verbatim
    #undef yyFlexLexer
    #define yyFlexLexer xxFlexLexer
    #include <FlexLexer.h>

    #undef yyFlexLexer
    #define yyFlexLexer zzFlexLexer
    #include <FlexLexer.h>
@end verbatim
@end example

if, for example, you used @code{%option prefix="xx"} for one of your
scanners and @code{%option prefix="zz"} for the other.

@node Reentrant, Lex and Posix, Cxx, Top
@chapter Reentrant C Scanners

@cindex reentrant, explanation
@code{flex} has the ability to generate a reentrant C scanner. This is
accomplished by specifying @code{%option reentrant} (@samp{-R}). The generated
scanner is both portable and safe to use in one or more separate threads of
control. The most common use for reentrant scanners is from within
multi-threaded applications. Any thread may create and execute a reentrant
@code{flex} scanner without the need for synchronization with other threads.

@menu
* Reentrant Uses::
* Reentrant Overview::
* Reentrant Example::
* Reentrant Detail::
* Reentrant Functions::
@end menu

@node Reentrant Uses, Reentrant Overview, Reentrant, Reentrant
@section Uses for Reentrant Scanners

However, there are other uses for a reentrant scanner. For example, you
could scan two or more files simultaneously to implement a @code{diff} at
the token level (i.e., instead of at the character level):

@cindex reentrant scanners, multiple interleaved scanners
@example
@verbatim
    /* Example of maintaining more than one active scanner.
     */

    /* Declared outside the loop so that both tokens are still in
       scope when the loop condition is evaluated. */
    int tok1, tok2;

    do {
        tok1 = yylex( scanner_1 );
        tok2 = yylex( scanner_2 );

        if( tok1 != tok2 )
            printf("Files are different.");

    } while ( tok1 && tok2 );
@end verbatim
@end example

Another use for a reentrant scanner is recursion.
(Note that a recursive scanner can also be created using a non-reentrant scanner and
buffer states. @xref{Multiple Input Buffers}.)

The following crude scanner supports the @samp{eval} command by invoking
another instance of itself.

@cindex reentrant scanners, recursive invocation
@example
@verbatim
    /* Example of recursive invocation. */

    %option reentrant

    %%
    "eval(".+")"  {
                      yyscan_t scanner;
                      YY_BUFFER_STATE buf;

                      yylex_init( &scanner );
                      yytext[yyleng-1] = ' ';

                      buf = yy_scan_string( yytext + 5, scanner );
                      yylex( scanner );

                      yy_delete_buffer(buf,scanner);
                      yylex_destroy( scanner );
                 }
    ...
    %%
@end verbatim
@end example

@node Reentrant Overview, Reentrant Example, Reentrant Uses, Reentrant
@section An Overview of the Reentrant API

@cindex reentrant, API explanation
The API for reentrant scanners is different from that for non-reentrant
scanners. Here is a quick overview of the API:

@itemize
@item
@code{%option reentrant} must be specified.

@item
All functions take one additional argument: @code{yyscanner}

@item
All global variables are replaced by their macro equivalents.
(We tell you this because it may be important to you during debugging.)

@item
@code{yylex_init} and @code{yylex_destroy} must be called before and
after @code{yylex}, respectively.

@item
Accessor methods (get/set functions) provide access to common
@code{flex} variables.

@item
User-specific data can be stored in @code{yyextra}.
@end itemize

@node Reentrant Example, Reentrant Detail, Reentrant Overview, Reentrant
@section Reentrant Example

First, an example of a reentrant scanner:
@cindex reentrant, example of
@example
@verbatim
    /* This scanner prints "//" comments. */

    %option reentrant stack noyywrap
    %x COMMENT

    %%

    "//"                 yy_push_state( COMMENT, yyscanner);
    .|\n

    <COMMENT>\n          yy_pop_state( yyscanner );
    <COMMENT>[^\n]+      fprintf( yyout, "%s\n", yytext);

    %%

    int main ( int argc, char * argv[] )
    {
        yyscan_t scanner;

        yylex_init ( &scanner );
        yylex ( scanner );
        yylex_destroy ( scanner );
        return 0;
    }
@end verbatim
@end example

@node Reentrant Detail, Reentrant Functions, Reentrant Example, Reentrant
@section The Reentrant API in Detail

Here are the things you need to do or know to use the reentrant C API of
@code{flex}.

@menu
* Specify Reentrant::
* Extra Reentrant Argument::
* Global Replacement::
* Init and Destroy Functions::
* Accessor Methods::
* Extra Data::
* About yyscan_t::
@end menu

@node Specify Reentrant, Extra Reentrant Argument, Reentrant Detail, Reentrant Detail
@subsection Declaring a Scanner As Reentrant

@code{%option reentrant} (@samp{--reentrant}) must be specified.

Notice that @code{%option reentrant} is specified in the above example
(@pxref{Reentrant Example}). Had this option not been specified,
@code{flex} would have happily generated a non-reentrant scanner without
complaining. You may explicitly specify @code{%option noreentrant}, if
you do @emph{not} want a reentrant scanner, although it is not
necessary. The default is to generate a non-reentrant scanner.

@node Extra Reentrant Argument, Global Replacement, Specify Reentrant, Reentrant Detail
@subsection The Extra Argument

@cindex reentrant, calling functions
@vindex yyscanner (reentrant only)
All functions take one additional argument: @code{yyscanner}.

Notice that the calls to @code{yy_push_state} and @code{yy_pop_state}
both have an argument, @code{yyscanner}, that is not present in a
non-reentrant scanner. Here are the declarations of
@code{yy_push_state} and @code{yy_pop_state} in the reentrant scanner:

@example
@verbatim
    static void yy_push_state  ( int new_state , yyscan_t yyscanner ) ;
    static void yy_pop_state  ( yyscan_t yyscanner ) ;
@end verbatim
@end example

Notice that the argument @code{yyscanner} appears in the declaration of
both functions. In fact, all @code{flex} functions in a reentrant
scanner have this additional argument. It is always the last argument
in the argument list, it is always of type @code{yyscan_t} (which is
typedef'd to @code{void *}) and it is
always named @code{yyscanner}. As you may have guessed,
@code{yyscanner} is a pointer to an opaque data structure encapsulating
the current state of the scanner. For a list of function declarations,
see @ref{Reentrant Functions}. Note that preprocessor macros, such as
@code{BEGIN}, @code{ECHO}, and @code{REJECT}, do not take this
additional argument.

@node Global Replacement, Init and Destroy Functions, Extra Reentrant Argument, Reentrant Detail
@subsection Global Variables Replaced By Macros

@cindex reentrant, accessing flex variables
All global variables in traditional flex have been replaced by macro equivalents.

Note that in the above example, @code{yyout} and @code{yytext} are
not plain variables. These are macros that will expand to their equivalent lvalue.
All of the familiar @code{flex} globals have been replaced by their macro
equivalents. In particular, @code{yytext}, @code{yyleng}, @code{yylineno},
@code{yyin}, @code{yyout}, @code{yyextra}, @code{yylval}, and @code{yylloc}
are macros. You may safely use these macros in actions as if they were plain
variables. We only tell you this so you don't expect to link to these variables
externally. Currently, each macro expands to a member of an internal struct, e.g.,

@example
@verbatim
#define yytext (((struct yyguts_t*)yyscanner)->yytext_r)
@end verbatim
@end example

One important thing to remember about
@code{yytext}
and friends is that, because
@code{yytext}
is not a global variable in a reentrant
scanner, you cannot access it directly from outside an action or from
other functions. You must use an accessor method, e.g.,
@code{yyget_text},
to accomplish this. (See below.)

@node Init and Destroy Functions, Accessor Methods, Global Replacement, Reentrant Detail
@subsection Init and Destroy Functions

@cindex memory, considerations for reentrant scanners
@cindex reentrant, initialization
@findex yylex_init
@findex yylex_destroy

@code{yylex_init} and @code{yylex_destroy} must be called before and
after @code{yylex}, respectively.

@example
@verbatim
    int yylex_init ( yyscan_t * ptr_yy_globals ) ;
    int yylex_init_extra ( YY_EXTRA_TYPE user_defined, yyscan_t * ptr_yy_globals ) ;
    int yylex ( yyscan_t yyscanner ) ;
    int yylex_destroy ( yyscan_t yyscanner ) ;
@end verbatim
@end example

The function @code{yylex_init} must be called before calling any other
function. The argument to @code{yylex_init} is the address of an
uninitialized pointer to be filled in by @code{yylex_init}, overwriting
any previous contents.
The function @code{yylex_init_extra} may be used
instead, taking as its first argument a variable of type @code{YY_EXTRA_TYPE}.
See the section on @code{yyextra}, below, for more details.

The value stored in @code{ptr_yy_globals} should
thereafter be passed to @code{yylex} and @code{yylex_destroy}. Flex
does not save the argument passed to @code{yylex_init}, so it is safe to
pass the address of a local pointer to @code{yylex_init} so long as it remains
in scope for the duration of all calls to the scanner, up to and including
the call to @code{yylex_destroy}.

The function
@code{yylex} should be familiar to you by now. The reentrant version
takes one argument, which is the value returned (via an argument) by
@code{yylex_init}. Otherwise, it behaves the same as the non-reentrant
version of @code{yylex}.

Both @code{yylex_init} and @code{yylex_init_extra} return 0 (zero) on success,
or non-zero on failure, in which case @code{errno} is set to one of the
following values:

@itemize
@item ENOMEM
Memory allocation error. @xref{memory-management}.
@item EINVAL
Invalid argument.
@end itemize


The function @code{yylex_destroy} should be
called to free resources used by the scanner. After @code{yylex_destroy}
is called, the contents of @code{yyscanner} should not be used. Of
course, there is no need to destroy a scanner if you plan to reuse it.
A @code{flex} scanner (both reentrant and non-reentrant) may be
restarted by calling @code{yyrestart}.
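
Since @code{yylex_init} reports failure through its return value and
@code{errno}, a caller may wish to check the result before scanning.
The following is a minimal sketch of such a check; the single rule is
only a placeholder, and the error message format is an arbitrary choice
made for illustration.

@example
@verbatim
    /* Sketch only: check the result of yylex_init() before use. */
    %{
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    %}

    %option reentrant noyywrap

    %%
    .|\n    ECHO;
    %%

    int main( void )
        {
        yyscan_t scanner;

        if ( yylex_init( &scanner ) != 0 )
            {
            /* errno is ENOMEM or EINVAL, as listed above. */
            fprintf( stderr, "yylex_init: %s\n", strerror( errno ) );
            return 1;
            }

        yylex( scanner );
        yylex_destroy( scanner );
        return 0;
        }
@end verbatim
@end example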

Below is an example of a program that creates a scanner, uses it, then destroys
it when done:

@example
@verbatim
    int main ()
    {
        yyscan_t scanner;
        int tok;

        yylex_init(&scanner);

        while ((tok=yylex(scanner)) > 0)
            printf("tok=%d  yytext=%s\n", tok, yyget_text(scanner));

        yylex_destroy(scanner);
        return 0;
    }
@end verbatim
@end example

@node Accessor Methods, Extra Data, Init and Destroy Functions, Reentrant Detail
@subsection Accessing Variables with Reentrant Scanners

@cindex reentrant, accessor functions
Accessor methods (get/set functions) provide access to common
@code{flex} variables.

Many scanners that you build will be part of a larger project. Portions
of your project will need access to @code{flex} values, such as
@code{yytext}. In a non-reentrant scanner, these values are global, so
there is no problem accessing them. However, in a reentrant scanner, there are no
global @code{flex} values. You cannot access them directly. Instead,
you must access @code{flex} values using accessor methods (get/set
functions). Each accessor method is named @code{yyget_NAME} or
@code{yyset_NAME}, where @code{NAME} is the name of the @code{flex}
variable you want. For example:

@cindex accessor functions, use of
@example
@verbatim
    /* Set the last character of yytext to NUL. */
    void chop ( yyscan_t scanner )
    {
        int len = yyget_leng( scanner );
        yyget_text( scanner )[len - 1] = '\0';
    }
@end verbatim
@end example

The above code may be called from within an action like this:

@example
@verbatim
    %%
    .+\n    { chop( yyscanner );}
@end verbatim
@end example

You may find that @code{%option header-file} is particularly useful for generating
prototypes of all the accessor functions. @xref{option-header}.
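
To illustrate how the generated header might be used from a separate
source file, here is a small sketch in two parts. The file names
@file{scanner.l}, @file{scanner.h}, and @file{main.c}, the rules, and
the token value are all assumptions made for this illustration.

@example
@verbatim
    /* scanner.l (sketch): also writes prototypes to scanner.h */
    %option reentrant noyywrap
    %option header-file="scanner.h"

    %%
    [a-z]+    return 1;       /* hypothetical token value */
    .|\n      /* skip everything else */
    %%
@end verbatim
@end example

@example
@verbatim
    /* main.c (sketch): uses only the declarations in scanner.h */
    #include <stdio.h>
    #include "scanner.h"

    int main( void )
        {
        yyscan_t scanner;
        int tok;

        yylex_init( &scanner );

        while ( (tok = yylex( scanner )) > 0 )
            printf( "tok=%d text=%s\n", tok, yyget_text( scanner ) );

        yylex_destroy( scanner );
        return 0;
        }
@end verbatim
@end example

Here @file{main.c} never refers to @code{yytext} directly; it reaches
the matched text only through @code{yyget_text}.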

@node Extra Data, About yyscan_t, Accessor Methods, Reentrant Detail
@subsection Extra Data

@cindex reentrant, extra data
@vindex yyextra
User-specific data can be stored in @code{yyextra}.

In a reentrant scanner, it is unwise to use global variables to
communicate with or maintain state between different pieces of your program.
However, you may need access to external data or invoke external functions
from within the scanner actions.
Likewise, you may need to pass information to your scanner
(e.g., open file descriptors, or database connections).
In a non-reentrant scanner, the only way to do this would be through the
use of global variables.
@code{Flex} allows you to store arbitrary, ``extra'' data in a scanner.
This data is accessible through the accessor methods
@code{yyget_extra} and @code{yyset_extra}
from outside the scanner, and through the shortcut macro
@code{yyextra}
from within the scanner itself. They are defined as follows:

@tindex YY_EXTRA_TYPE (reentrant only)
@findex yyget_extra
@findex yyset_extra
@example
@verbatim
    #define YY_EXTRA_TYPE  void*
    YY_EXTRA_TYPE  yyget_extra ( yyscan_t scanner );
    void           yyset_extra ( YY_EXTRA_TYPE arbitrary_data , yyscan_t scanner);
@end verbatim
@end example

In addition, an extra form of @code{yylex_init} is provided,
@code{yylex_init_extra}. This function is provided so that the yyextra value can
be accessed from within the very first yyalloc, used to allocate
the scanner itself.

By default, @code{YY_EXTRA_TYPE} is defined as type @code{void *}. You
may redefine this type using @code{%option extra-type="your_type"} in
the scanner:

@cindex YY_EXTRA_TYPE, defining your own type
@example
@verbatim
    /* An example of overriding YY_EXTRA_TYPE.
     */
    %{
    #include <sys/stat.h>
    #include <unistd.h>
    %}
    %option reentrant
    %option extra-type="struct stat *"
    %%

    __filesize__     printf( "%ld", (long) yyextra->st_size  );
    __lastmod__      printf( "%ld", (long) yyextra->st_mtime );
    %%
    void scan_file( char* filename )
    {
        yyscan_t scanner;
        struct stat buf;
        FILE *in;

        in = fopen( filename, "r" );
        stat( filename, &buf );

        /* The extra data is a pointer, so pass the address of buf. */
        yylex_init_extra( &buf, &scanner );
        yyset_in( in, scanner );
        yylex( scanner );
        yylex_destroy( scanner );

        fclose( in );
    }
@end verbatim
@end example


@node About yyscan_t, , Extra Data, Reentrant Detail
@subsection About yyscan_t

@tindex yyscan_t (reentrant only)
@code{yyscan_t} is defined as:

@example
@verbatim
     typedef void* yyscan_t;
@end verbatim
@end example

It is initialized by @code{yylex_init()} to point to
an internal structure. You should never access this value
directly. In particular, you should never attempt to free it
(use @code{yylex_destroy()} instead).
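
Because each @code{yyscan_t} value refers to its own internal structure,
two scanners created from the same generated code do not share any
state. The following sketch illustrates this by scanning two different
strings with two handles; the token names, the rules, and the input
strings are invented for the illustration.

@example
@verbatim
    /* Sketch only: two independent scanners from one specification. */
    %{
    #include <stdio.h>
    enum { NUMBER = 1, WORD = 2 };   /* hypothetical token codes */
    %}

    %option reentrant noyywrap

    %%
    [0-9]+     return NUMBER;
    [a-z]+     return WORD;
    .|\n       /* skip everything else */
    %%

    int main( void )
        {
        yyscan_t a, b;
        int tok_a, tok_b;

        yylex_init( &a );
        yylex_init( &b );

        /* Each handle gets its own input buffer. */
        yy_scan_string( "12 ab", a );
        yy_scan_string( "ab 12", b );

        tok_a = yylex( a );     /* first token of "12 ab" */
        tok_b = yylex( b );     /* first token of "ab 12" */
        printf( "a: %d (%s)  b: %d (%s)\n",
                tok_a, yyget_text( a ), tok_b, yyget_text( b ) );

        /* Release each scanner through the API, never with free(). */
        yylex_destroy( a );
        yylex_destroy( b );
        return 0;
        }
@end verbatim
@end example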

@node Reentrant Functions, , Reentrant Detail, Reentrant
@section Functions and Macros Available in Reentrant C Scanners

The following functions are available in a reentrant scanner:

@findex yyget_text
@findex yyget_leng
@findex yyget_in
@findex yyget_out
@findex yyget_lineno
@findex yyset_in
@findex yyset_out
@findex yyset_lineno
@findex yyget_debug
@findex yyset_debug
@findex yyget_extra
@findex yyset_extra

@example
@verbatim
    char *yyget_text ( yyscan_t scanner );
    int yyget_leng ( yyscan_t scanner );
    FILE *yyget_in ( yyscan_t scanner );
    FILE *yyget_out ( yyscan_t scanner );
    int yyget_lineno ( yyscan_t scanner );
    YY_EXTRA_TYPE yyget_extra ( yyscan_t scanner );
    int  yyget_debug ( yyscan_t scanner );

    void yyset_debug ( int flag, yyscan_t scanner );
    void yyset_in  ( FILE * in_str , yyscan_t scanner );
    void yyset_out  ( FILE * out_str , yyscan_t scanner );
    void yyset_lineno ( int line_number , yyscan_t scanner );
    void yyset_extra ( YY_EXTRA_TYPE user_defined , yyscan_t scanner );
@end verbatim
@end example

There are no ``set'' functions for @code{yytext} and @code{yyleng}. This is
intentional.

The following macro shortcuts are available in actions in a reentrant
scanner:

@example
@verbatim
    yytext
    yyleng
    yyin
    yyout
    yylineno
    yyextra
    yy_flex_debug
@end verbatim
@end example

@cindex yylineno, in a reentrant scanner
In a reentrant C scanner, support for @code{yylineno} is always present
(i.e., you may access @code{yylineno}), but the value is never modified by
@code{flex} unless @code{%option yylineno} is enabled. This is to allow
the user to maintain the line count independently of @code{flex}.
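
As a sketch of maintaining the line count by hand, the scanner below
increments @code{yylineno} in an action and reads it back afterwards
with @code{yyget_lineno}. The rules and the @code{main()} function are
assumptions made for illustration, and the starting value of the
counter is whatever @code{flex} assigns when the input buffer is
created.

@example
@verbatim
    /* Sketch only: %option yylineno is deliberately not used,
       so the action below is the only code that changes yylineno. */
    %{
    #include <stdio.h>
    %}

    %option reentrant noyywrap

    %%
    \n      ++yylineno;
    .       /* ignore everything else */
    %%

    int main( void )
        {
        yyscan_t scanner;

        yylex_init( &scanner );
        yylex( scanner );
        printf( "final line number: %d\n", yyget_lineno( scanner ) );
        yylex_destroy( scanner );
        return 0;
        }
@end verbatim
@end example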

@anchor{bison-functions}
The following functions and macros are made available when @code{%option
bison-bridge} (@samp{--bison-bridge}) is specified:

@example
@verbatim
    YYSTYPE * yyget_lval ( yyscan_t scanner );
    void yyset_lval ( YYSTYPE * yylvalp , yyscan_t scanner );
    yylval
@end verbatim
@end example

The following functions and macros are made available
when @code{%option bison-locations} (@samp{--bison-locations}) is specified:

@example
@verbatim
    YYLTYPE *yyget_lloc ( yyscan_t scanner );
    void yyset_lloc ( YYLTYPE * yyllocp , yyscan_t scanner );
    yylloc
@end verbatim
@end example

Support for @code{yylval} assumes that @code{YYSTYPE} is a valid type. Support for
@code{yylloc} assumes that @code{YYLTYPE} is a valid type. Typically, these types are
generated by @code{bison}, and are included in section 1 of the @code{flex}
input.

@node Lex and Posix, Memory Management, Reentrant, Top
@chapter Incompatibilities with Lex and Posix

@cindex POSIX and lex
@cindex lex (traditional) and POSIX

@code{flex} is a rewrite of the AT&T Unix @emph{lex} tool (the two
implementations do not share any code, though), with some extensions and
incompatibilities, both of which are of concern to those who wish to
write scanners acceptable to both implementations. @code{flex} is fully
compliant with the POSIX @code{lex} specification, except that when
using @code{%pointer} (the default), a call to @code{unput()} destroys
the contents of @code{yytext}, which is counter to the POSIX
specification. In this section we discuss all of the known areas of
incompatibility between @code{flex}, AT&T @code{lex}, and the POSIX
specification. @code{flex}'s @samp{-l} option turns on maximum
compatibility with the original AT&T @code{lex} implementation, at the
cost of a major loss in the generated scanner's performance.
We note below which incompatibilities can be overcome using the @samp{-l}
option. @code{flex} is fully compatible with @code{lex} with the
following exceptions:

@itemize
@item
The undocumented @code{lex} scanner internal variable @code{yylineno} is
not supported unless @samp{-l} or @code{%option yylineno} is used.

@item
@code{yylineno} should be maintained on a per-buffer basis, rather than
a per-scanner (single global variable) basis.

@item
@code{yylineno} is not part of the POSIX specification.

@item
The @code{input()} routine is not redefinable, though it may be called
to read characters following whatever has been matched by a rule. If
@code{input()} encounters an end-of-file the normal @code{yywrap()}
processing is done. A ``real'' end-of-file is returned by
@code{input()} as @code{EOF}.

@item
Input is instead controlled by defining the @code{YY_INPUT()} macro.

@item
The @code{flex} restriction that @code{input()} cannot be redefined is
in accordance with the POSIX specification, which simply does not
specify any way of controlling the scanner's input other than by making
an initial assignment to @file{yyin}.

@item
The @code{unput()} routine is not redefinable. This restriction is in
accordance with POSIX.

@item
@code{flex} scanners are not as reentrant as @code{lex} scanners.
In particular, if you have an interactive scanner and an interrupt handler
which long-jumps out of the scanner, and the scanner is subsequently
called again, you may get the following message:

@cindex error messages, end of buffer missed
@example
@verbatim
    fatal flex scanner internal error--end of buffer missed
@end verbatim
@end example

To reenter the scanner, first use:

@cindex restarting the scanner
@example
@verbatim
    yyrestart( yyin );
@end verbatim
@end example

Note that this call will throw away any buffered input; usually this
isn't a problem with an interactive scanner. @xref{Reentrant}, for
@code{flex}'s reentrant API.

@item
Also note that @code{flex} C++ scanner classes
@emph{are}
reentrant, so if using C++ is an option for you, you should use
them instead. @xref{Cxx}, and @ref{Reentrant} for details.

@item
@code{output()} is not supported. Output from the @b{ECHO} macro is
done to the file-pointer @code{yyout} (default @file{stdout}).

@item
@code{output()} is not part of the POSIX specification.

@item
@code{lex} does not support exclusive start conditions (@samp{%x}), though they
are in the POSIX specification.

@item
When definitions are expanded, @code{flex} encloses them in parentheses.
With @code{lex}, the following:

@cindex name definitions, not POSIX
@example
@verbatim
    NAME    [A-Z][A-Z0-9]*
    %%
    foo{NAME}?      printf( "Found it\n" );
    %%
@end verbatim
@end example

will not match the string @samp{foo} because when the macro is expanded
the rule is equivalent to @samp{foo[A-Z][A-Z0-9]*?} and the precedence
is such that the @samp{?} is associated with @samp{[A-Z0-9]*}. With
@code{flex}, the rule will be expanded to @samp{foo([A-Z][A-Z0-9]*)?}
and so the string @samp{foo} will match.

@item
Note that if the definition begins with @samp{^} or ends with @samp{$}
then it is @emph{not} expanded with parentheses, to allow these
operators to appear in definitions without losing their special
meanings.  But the @samp{<s>}, @samp{/}, and @code{<<EOF>>} operators
cannot be used in a @code{flex} definition.

@item
Using @samp{-l} results in the @code{lex} behavior of no parentheses
around the definition.

@item
The POSIX specification is that the definition be enclosed in parentheses.

@item
Some implementations of @code{lex} allow a rule's action to begin on a
separate line, if the rule's pattern has trailing whitespace:

@cindex patterns and actions on different lines
@example
@verbatim
    %%
    foo|bar<space here>
      { foobar_action();}
@end verbatim
@end example

@code{flex} does not support this feature.

@item
The @code{lex} @code{%r} (generate a Ratfor scanner) option is not
supported.  It is not part of the POSIX specification.

@item
After a call to @code{unput()}, @code{yytext} is undefined until the
next token is matched, unless the scanner was built using @code{%array}.
This is not the case with @code{lex} or the POSIX specification.  The
@samp{-l} option does away with this incompatibility.

@item
The precedence of the @samp{@{,@}} (numeric range) operator is
different.  The AT&T and POSIX specifications of @code{lex}
interpret @samp{abc@{1,3@}} as ``match one, two,
or three occurrences of @samp{abc}'', whereas @code{flex} interprets it
as ``match @samp{ab} followed by one, two, or three occurrences of
@samp{c}''.  The @samp{-l} and @samp{--posix} options do away with this
incompatibility.

@item
The precedence of the @samp{^} operator is different.
@code{lex}
interprets @samp{^foo|bar} as ``match either @samp{foo} at the beginning of a
line, or @samp{bar} anywhere'', whereas @code{flex} interprets it as ``match
either @samp{foo} or @samp{bar} if they come at the beginning of a
line''.  The latter is in agreement with the POSIX specification.

@item
The special table-size declarations such as @code{%a} supported by
@code{lex} are not required by @code{flex} scanners.  @code{flex}
ignores them.

@item
The name @code{FLEX_SCANNER} is @code{#define}'d so scanners may be
written for use with either @code{flex} or @code{lex}.  Scanners also
define @code{YY_FLEX_MAJOR_VERSION}, @code{YY_FLEX_MINOR_VERSION}
and @code{YY_FLEX_SUBMINOR_VERSION},
indicating which version of @code{flex} generated the scanner.  For
example, for the 2.5.22 release, these defines would be 2, 5 and 22
respectively.  If the version of @code{flex} being used is a beta
version, then the symbol @code{FLEX_BETA} is defined.

@item
The symbols @samp{[[} and @samp{]]} in the code sections of the input
may conflict with the m4 delimiters.  @xref{M4 Dependency}.


@end itemize

@cindex POSIX compliance
@cindex non-POSIX features of flex
The following @code{flex} features are not included in @code{lex} or the
POSIX specification:

@itemize
@item
C++ scanners
@item
%option
@item
start condition scopes
@item
start condition stacks
@item
interactive/non-interactive scanners
@item
yy_scan_string() and friends
@item
yyterminate()
@item
yy_set_interactive()
@item
yy_set_bol()
@item
YY_AT_BOL()
@item
<<EOF>>
@item
<*>
@item
YY_DECL
@item
YY_START
@item
YY_USER_ACTION
@item
YY_USER_INIT
@item
#line directives
@item
%@{@}'s around actions
@item
reentrant C API
@item
multiple actions on a line
@item
almost all of the @code{flex} command-line options
@end itemize

The feature ``multiple actions on a line''
refers to the fact that with @code{flex} you can put multiple actions on
the same line, separated with semi-colons, while with @code{lex}, the
following:

@example
@verbatim
    foo    handle_foo(); ++num_foos_seen;
@end verbatim
@end example

is (rather surprisingly) truncated to

@example
@verbatim
    foo    handle_foo();
@end verbatim
@end example

@code{flex} does not truncate the action.  Actions that are not enclosed
in braces are simply terminated at the end of the line.

@node Memory Management, Serialized Tables, Lex and Posix, Top
@chapter Memory Management

@cindex memory management
@anchor{memory-management}
This chapter describes how flex handles dynamic memory, and how you can
override the default behavior.
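
As a preview of the override mechanism covered later in this chapter, here
is a minimal sketch of replacement allocation routines for a non-reentrant
scanner.  It assumes the prototypes for @code{yyalloc}, @code{yyrealloc},
and @code{yyfree} as this manual gives them, and a scanner built with
@code{%option noyyalloc noyyrealloc noyyfree}.  The block counter and the
helper @code{yy_live_blocks} are hypothetical names introduced only for
illustration; they are not part of flex.

@example
@verbatim
```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bookkeeping: number of blocks currently allocated. */
static size_t live_blocks = 0;

void *yyalloc(size_t bytes)
{
    void *p;

    if (bytes == 0)
        bytes = 1;              /* avoid implementation-defined malloc(0) */
    p = malloc(bytes);
    if (p != NULL)
        live_blocks++;
    return p;
}

void *yyrealloc(void *ptr, size_t bytes)
{
    void *p;

    if (bytes == 0)
        bytes = 1;              /* avoid implementation-defined realloc(p, 0) */
    p = realloc(ptr, bytes);
    /* realloc(NULL, n) behaves like malloc(n), so it creates a new block. */
    if (ptr == NULL && p != NULL)
        live_blocks++;
    return p;
}

void yyfree(void *ptr)
{
    if (ptr != NULL) {
        live_blocks--;
        free(ptr);
    }
}

/* Hypothetical helper: report how many blocks are still outstanding. */
size_t yy_live_blocks(void)
{
    return live_blocks;
}
```
@end verbatim
@end example

A test program could call @code{yy_live_blocks()} after
@code{yylex_destroy()} to see whether any blocks are still outstanding.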

@menu
* The Default Memory Management::
* Overriding The Default Memory Management::
* A Note About yytext And Memory::
@end menu

@node The Default Memory Management, Overriding The Default Memory Management, Memory Management, Memory Management
@section The Default Memory Management

Flex allocates dynamic memory during initialization, and once in a while from
within a call to yylex(). Initialization takes place during the first call to
yylex(). Thereafter, flex may reallocate more memory if it needs to enlarge a
buffer. As of version 2.5.9 Flex will clean up all memory when you call @code{yylex_destroy}.
@xref{faq-memory-leak}.

Flex allocates dynamic memory for five purposes, listed below.@footnote{The
quantities given here are approximate, and may vary due to host architecture,
compiler configuration, or due to future enhancements to flex.}

@table @asis

@item 16kB for the input buffer.
Flex allocates memory for the character buffer used to perform pattern
matching.  Flex must read ahead from the input stream and store it in a large
character buffer.  This buffer is typically the largest chunk of dynamic memory
flex consumes. This buffer will grow if necessary, doubling the size each time.
Flex frees this memory when you call yylex_destroy().  The default size of this
buffer (16384 bytes) is almost always too large.  The ideal size for this
buffer is the length of the longest token expected, in bytes, plus a little more.  Flex will allocate a few
extra bytes for housekeeping. Currently, to override the size of the input buffer
you must @code{#define YY_BUF_SIZE} to whatever number of bytes you want. We don't plan
to change this in the near future, but we reserve the right to do so if we ever add a more robust memory management
API.

@item 64kB for the REJECT state. This will only be allocated if you use REJECT.
4815The size is large enough to hold the same number of states as characters in the input buffer. If you override the size of the 4816input buffer (via @code{YY_BUF_SIZE}), then you automatically override the size of this buffer as well. 4817 4818@item 100 bytes for the start condition stack. 4819Flex allocates memory for the start condition stack. This is the stack used 4820for pushing start states, i.e., with yy_push_state(). It will grow if 4821necessary. Since the states are simply integers, this stack doesn't consume 4822much memory. This stack is not present if @code{%option stack} is not 4823specified. You will rarely need to tune this buffer. The ideal size for this 4824stack is the maximum depth expected. The memory for this stack is 4825automatically destroyed when you call yylex_destroy(). @xref{option-stack}. 4826 4827@item 40 bytes for each YY_BUFFER_STATE. 4828Flex allocates memory for each YY_BUFFER_STATE. The buffer state itself 4829is about 40 bytes, plus an additional large character buffer (described above.) 4830The initial buffer state is created during initialization, and with each call 4831to yy_create_buffer(). You can't tune the size of this, but you can tune the 4832character buffer as described above. Any buffer state that you explicitly 4833create by calling yy_create_buffer() is @emph{NOT} destroyed automatically. You 4834must call yy_delete_buffer() to free the memory. The exception to this rule is 4835that flex will delete the current buffer automatically when you call 4836yylex_destroy(). If you delete the current buffer, be sure to set it to NULL. 4837That way, flex will not try to delete the buffer a second time (possibly 4838crashing your program!) At the time of this writing, flex does not provide a 4839growable stack for the buffer states. You have to manage that yourself. 4840@xref{Multiple Input Buffers}. 
4841 4842@item 84 bytes for the reentrant scanner guts 4843Flex allocates about 84 bytes for the reentrant scanner structure when 4844you call yylex_init(). It is destroyed when the user calls yylex_destroy(). 4845 4846@end table 4847 4848 4849@node Overriding The Default Memory Management, A Note About yytext And Memory, The Default Memory Management, Memory Management 4850@section Overriding The Default Memory Management 4851 4852@cindex yyalloc, overriding 4853@cindex yyrealloc, overriding 4854@cindex yyfree, overriding 4855 4856Flex calls the functions @code{yyalloc}, @code{yyrealloc}, and @code{yyfree} 4857when it needs to allocate or free memory. By default, these functions are 4858wrappers around the standard C functions, @code{malloc}, @code{realloc}, and 4859@code{free}, respectively. You can override the default implementations by telling 4860flex that you will provide your own implementations. 4861 4862To override the default implementations, you must do two things: 4863 4864@enumerate 4865 4866@item Suppress the default implementations by specifying one or more of the 4867following options: 4868 4869@itemize 4870@opindex noyyalloc 4871@item @code{%option noyyalloc} 4872@item @code{%option noyyrealloc} 4873@item @code{%option noyyfree}. 4874@end itemize 4875 4876@item Provide your own implementation of the following functions: @footnote{It 4877is not necessary to override all (or any) of the memory management routines. 
You may, for example, override @code{yyrealloc}, but not @code{yyfree} or
@code{yyalloc}.}

@example
@verbatim
// For a non-reentrant scanner
void * yyalloc (size_t bytes);
void * yyrealloc (void * ptr, size_t bytes);
void   yyfree (void * ptr);

// For a reentrant scanner
void * yyalloc (size_t bytes, void * yyscanner);
void * yyrealloc (void * ptr, size_t bytes, void * yyscanner);
void   yyfree (void * ptr, void * yyscanner);
@end verbatim
@end example

@end enumerate

In the following example, we will override all three memory routines. We assume
that there is a custom allocator with garbage collection. In order to make this
example interesting, we will use a reentrant scanner, passing a pointer to the
custom allocator through @code{yyextra}.

@cindex overriding the memory routines
@example
@verbatim
%{
#include "some_allocator.h"
%}

/* Suppress the default implementations. */
%option noyyalloc noyyrealloc noyyfree
%option reentrant

/* Initialize the allocator. */
%{
#define YY_EXTRA_TYPE  struct allocator*
#define YY_USER_INIT  yyextra = allocator_create();
%}

%%
.|\n   ;
%%

/* Provide our own implementations. */
void * yyalloc (size_t bytes, void* yyscanner) {
    return allocator_alloc (yyextra, bytes);
}

void * yyrealloc (void * ptr, size_t bytes, void* yyscanner) {
    /* Pass the old block along so the allocator can resize it. */
    return allocator_realloc (yyextra, ptr, bytes);
}

void yyfree (void * ptr, void * yyscanner) {
    /* Do nothing -- we leave it to the garbage collector.
 */
}

@end verbatim
@end example


@node A Note About yytext And Memory, , Overriding The Default Memory Management, Memory Management
@section A Note About yytext And Memory

@cindex yytext, memory considerations

When flex finds a match, @code{yytext} points to the first character of the
match in the input buffer. The string itself is part of the input buffer, and
is @emph{NOT} allocated separately. The value of yytext will be overwritten the next
time yylex() is called. In short, the value of yytext is only valid from within
the matched rule's action.

Often, you want the value of yytext to persist for later processing, e.g., by a
parser with non-zero lookahead. In order to preserve yytext, you will have to
copy it with strdup() or a similar function. But this introduces some headache
because your parser is now responsible for freeing the copy of yytext. If you
use a yacc or bison parser (commonly used with flex), you will discover that
the error recovery mechanisms can cause memory to be leaked.

To prevent memory leaks from strdup'd yytext, you will have to track the memory
somehow. Our experience has shown that a garbage collection mechanism or a
pooled memory mechanism will save you a lot of grief when writing parsers.

@node Serialized Tables, Diagnostics, Memory Management, Top
@chapter Serialized Tables
@cindex serialization
@cindex memory, serialized tables

@anchor{serialization}
A @code{flex} scanner has the ability to save the DFA tables to a file, and
load them at runtime when needed.  The motivation for this feature is to reduce
the runtime memory footprint.  Traditionally, these tables have been compiled into
the scanner as C arrays, and are sometimes quite large.  Since the tables are
compiled into the scanner, the memory used by the tables can never be freed.
4971This is a waste of memory, especially if an application uses several scanners, 4972but none of them at the same time. 4973 4974The serialization feature allows the tables to be loaded at runtime, before 4975scanning begins. The tables may be discarded when scanning is finished. 4976 4977@menu 4978* Creating Serialized Tables:: 4979* Loading and Unloading Serialized Tables:: 4980* Tables File Format:: 4981@end menu 4982 4983@node Creating Serialized Tables, Loading and Unloading Serialized Tables, Serialized Tables, Serialized Tables 4984@section Creating Serialized Tables 4985@cindex tables, creating serialized 4986@cindex serialization of tables 4987 4988You may create a scanner with serialized tables by specifying: 4989 4990@example 4991@verbatim 4992 %option tables-file=FILE 4993or 4994 --tables-file=FILE 4995@end verbatim 4996@end example 4997 4998These options instruct flex to save the DFA tables to the file @var{FILE}. The tables 4999will @emph{not} be embedded in the generated scanner. The scanner will not 5000function on its own. The scanner will be dependent upon the serialized tables. You must 5001load the tables from this file at runtime before you can scan anything. 5002 5003If you do not specify a filename to @code{--tables-file}, the tables will be 5004saved to @file{lex.yy.tables}, where @samp{yy} is the appropriate prefix. 5005 5006If your project uses several different scanners, you can concatenate the 5007serialized tables into one file, and flex will find the correct set of tables, 5008using the scanner prefix as part of the lookup key. An example follows: 5009 5010@cindex serialized tables, multiple scanners 5011@example 5012@verbatim 5013$ flex --tables-file --prefix=cpp cpp.l 5014$ flex --tables-file --prefix=c c.l 5015$ cat lex.cpp.tables lex.c.tables > all.tables 5016@end verbatim 5017@end example 5018 5019The above example created two scanners, @samp{cpp}, and @samp{c}. 
Since we did
not specify a filename, the tables were serialized to @file{lex.cpp.tables} and
@file{lex.c.tables}, respectively. Then, we concatenated the two files
together into @file{all.tables}, which we will distribute with our project. At
runtime, we will open the file and tell flex to load the tables from it.  Flex
will find the correct tables automatically. (See the next section.)

@node Loading and Unloading Serialized Tables, Tables File Format, Creating Serialized Tables, Serialized Tables
@section Loading and Unloading Serialized Tables
@cindex tables, loading and unloading
@cindex loading tables at runtime
@cindex tables, freeing
@cindex freeing tables
@cindex memory, serialized tables

If you've built your scanner with @code{%option tables-file}, then you must
load the scanner tables at runtime. This can be accomplished with the following
function:

@deftypefun int yytables_fload (FILE* @var{fp} [, yyscan_t @var{scanner}])
Locates scanner tables in the stream pointed to by @var{fp} and loads them.
Memory for the tables is allocated via @code{yyalloc}.  You must call this
function before the first call to @code{yylex}. The argument @var{scanner}
only appears in the reentrant scanner.
This function returns @samp{0} (zero) on success, or non-zero on error.
@end deftypefun

The loaded tables are @strong{not} automatically destroyed (unloaded) when you
call @code{yylex_destroy}. The reason is that you may create several scanners
of the same type (in a reentrant scanner), each of which needs access to these
tables.  To avoid a nasty memory leak, you must call the following function:

@deftypefun int yytables_destroy ([yyscan_t @var{scanner}])
Unloads the scanner tables. The tables must be loaded again before you can scan
any more data.  The argument @var{scanner} only appears in the reentrant
scanner.
This function returns @samp{0} (zero) on success, or non-zero on 5055error. 5056@end deftypefun 5057 5058@strong{The functions @code{yytables_fload} and @code{yytables_destroy} are not 5059thread-safe.} You must ensure that these functions are called exactly once (for 5060each scanner type) in a threaded program, before any thread calls @code{yylex}. 5061After the tables are loaded, they are never written to, and no thread 5062protection is required thereafter -- until you destroy them. 5063 5064@node Tables File Format, , Loading and Unloading Serialized Tables, Serialized Tables 5065@section Tables File Format 5066@cindex tables, file format 5067@cindex file format, serialized tables 5068 5069This section defines the file format of serialized @code{flex} tables. 5070 5071The tables format allows for one or more sets of tables to be 5072specified, where each set corresponds to a given scanner. Scanners are 5073indexed by name, as described below. The file format is as follows: 5074 5075@example 5076@verbatim 5077 TABLE SET 1 5078 +-------------------------------+ 5079 Header | uint32 th_magic; | 5080 | uint32 th_hsize; | 5081 | uint32 th_ssize; | 5082 | uint16 th_flags; | 5083 | char th_version[]; | 5084 | char th_name[]; | 5085 | uint8 th_pad64[]; | 5086 +-------------------------------+ 5087 Table 1 | uint16 td_id; | 5088 | uint16 td_flags; | 5089 | uint32 td_hilen; | 5090 | uint32 td_lolen; | 5091 | void td_data[]; | 5092 | uint8 td_pad64[]; | 5093 +-------------------------------+ 5094 Table 2 | | 5095 . . . 5096 . . . 5097 . . . 5098 . . . 5099 Table n | | 5100 +-------------------------------+ 5101 TABLE SET 2 5102 . 5103 . 5104 . 5105 TABLE SET N 5106@end verbatim 5107@end example 5108 5109The above diagram shows that a complete set of tables consists of a header 5110followed by multiple individual tables. Furthermore, multiple complete sets may 5111be present in the same file, each set with its own header and tables. 
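
The fixed-size fields at the start of a header can be decoded with a few
lines of C.  The following is a sketch only, not the loader that flex
generates.  It assumes the field widths shown in the diagram above, together
with the magic number and the network byte order specified in this section.
The names @code{tbl_header_fixed}, @code{read_be16}, @code{read_be32}, and
@code{tbl_parse_header} are hypothetical.

@example
@verbatim
```c
#include <stddef.h>
#include <stdint.h>

#define TBL_MAGIC      0xF13C57B1u  /* th_magic value given in this section */
#define TBL_FIXED_SIZE 14           /* 4 + 4 + 4 + 2 bytes, per the diagram */

/* Hypothetical in-memory form of the fixed-size header fields. */
struct tbl_header_fixed {
    uint32_t th_magic;
    uint32_t th_hsize;
    uint32_t th_ssize;
    uint16_t th_flags;
};

/* Decode a big-endian (network byte order) 32-bit integer. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decode a big-endian (network byte order) 16-bit integer. */
static uint16_t read_be16(const unsigned char *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | (unsigned)p[1]);
}

/*
 * Decode the fixed-size header fields from buf.
 * Returns 0 on success, or -1 if the buffer is too short or the
 * magic number does not match.
 */
int tbl_parse_header(const unsigned char *buf, size_t len,
                     struct tbl_header_fixed *out)
{
    if (len < TBL_FIXED_SIZE)
        return -1;
    out->th_magic = read_be32(buf);
    out->th_hsize = read_be32(buf + 4);
    out->th_ssize = read_be32(buf + 8);
    out->th_flags = read_be16(buf + 12);
    return out->th_magic == TBL_MAGIC ? 0 : -1;
}
```
@end verbatim
@end example

The variable-length fields (@code{th_version}, @code{th_name}, and the
padding) follow these 14 bytes.  Because @code{th_hsize} covers the whole
header including padding, a caller can use it to find where the first table
begins.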
The sets
are contiguous in the file. The only way to know if another set follows is to
check the next four bytes for the magic number (or check for EOF). The header
and tables sections are padded to 64-bit boundaries. Below we describe each
field in detail. This format does not specify how the scanner will expand the
given data, e.g., data may be serialized as int8, but expanded to an int32
array at runtime. This is to reduce the size of the serialized data where
possible.  Remember, @emph{all integer values are in network byte order}.

@noindent
Fields of a table header:

@table @code
@item th_magic
Magic number, always 0xF13C57B1.

@item th_hsize
Size of this entire header, in bytes, including all fields plus any padding.

@item th_ssize
Size of this entire set, in bytes, including the header, all tables, plus
any padding.

@item th_flags
Bit flags for this table set. Currently unused.

@item th_version[]
Flex version in NUL-terminated string format, e.g., @samp{2.5.13a}. This is
the version of flex that was used to create the serialized tables.

@item th_name[]
Contains the name of this table set. The default is @samp{yytables},
and is prefixed accordingly, e.g., @samp{footables}. Must be NUL-terminated.

@item th_pad64[]
Zero or more NUL bytes, padding the entire header to the next 64-bit boundary
as calculated from the beginning of the header.
@end table

@noindent
Fields of a table:

@table @code
@item td_id
Specifies the table identifier.
Possible values are: 5156@table @code 5157@item YYTD_ID_ACCEPT (0x01) 5158@code{yy_accept} 5159@item YYTD_ID_BASE (0x02) 5160@code{yy_base} 5161@item YYTD_ID_CHK (0x03) 5162@code{yy_chk} 5163@item YYTD_ID_DEF (0x04) 5164@code{yy_def} 5165@item YYTD_ID_EC (0x05) 5166@code{yy_ec } 5167@item YYTD_ID_META (0x06) 5168@code{yy_meta} 5169@item YYTD_ID_NUL_TRANS (0x07) 5170@code{yy_NUL_trans} 5171@item YYTD_ID_NXT (0x08) 5172@code{yy_nxt}. This array may be two dimensional. See the @code{td_hilen} 5173field below. 5174@item YYTD_ID_RULE_CAN_MATCH_EOL (0x09) 5175@code{yy_rule_can_match_eol} 5176@item YYTD_ID_START_STATE_LIST (0x0A) 5177@code{yy_start_state_list}. This array is handled specially because it is an 5178array of pointers to structs. See the @code{td_flags} field below. 5179@item YYTD_ID_TRANSITION (0x0B) 5180@code{yy_transition}. This array is handled specially because it is an array of 5181structs. See the @code{td_lolen} field below. 5182@item YYTD_ID_ACCLIST (0x0C) 5183@code{yy_acclist} 5184@end table 5185 5186@item td_flags 5187Bit flags describing how to interpret the data in @code{td_data}. 5188The data arrays are one-dimensional by default, but may be 5189two dimensional as specified in the @code{td_hilen} field. 5190 5191@table @code 5192@item YYTD_DATA8 (0x01) 5193The data is serialized as an array of type int8. 5194@item YYTD_DATA16 (0x02) 5195The data is serialized as an array of type int16. 5196@item YYTD_DATA32 (0x04) 5197The data is serialized as an array of type int32. 5198@item YYTD_PTRANS (0x08) 5199The data is a list of indexes of entries in the expanded @code{yy_transition} 5200array. Each index should be expanded to a pointer to the corresponding entry 5201in the @code{yy_transition} array. We count on the fact that the 5202@code{yy_transition} array has already been seen. 5203@item YYTD_STRUCT (0x10) 5204The data is a list of yy_trans_info structs, each of which consists of 5205two integers. 
There is no padding between struct elements or between structs. 5206The type of each member is determined by the @code{YYTD_DATA*} bits. 5207@end table 5208 5209@item td_hilen 5210If @code{td_hilen} is non-zero, then the data is a two-dimensional array. 5211Otherwise, the data is a one-dimensional array. @code{td_hilen} contains the 5212number of elements in the higher dimensional array, and @code{td_lolen} contains 5213the number of elements in the lowest dimension. 5214 5215Conceptually, @code{td_data} is either @code{sometype td_data[td_lolen]}, or 5216@code{sometype td_data[td_hilen][td_lolen]}, where @code{sometype} is specified 5217by the @code{td_flags} field. It is possible for both @code{td_lolen} and 5218@code{td_hilen} to be zero, in which case @code{td_data} is a zero length 5219array, and no data is loaded, i.e., this table is simply skipped. Flex does not 5220currently generate tables of zero length. 5221 5222@item td_lolen 5223Specifies the number of elements in the lowest dimension array. If this is 5224a one-dimensional array, then it is simply the number of elements in this array. 5225The element size is determined by the @code{td_flags} field. 5226 5227@item td_data[] 5228The table data. This array may be a one- or two-dimensional array, of type 5229@code{int8}, @code{int16}, @code{int32}, @code{struct yy_trans_info}, or 5230@code{struct yy_trans_info*}, depending upon the values in the 5231@code{td_flags}, @code{td_hilen}, and @code{td_lolen} fields. 5232 5233@item td_pad64[] 5234Zero or more NULL bytes, padding the entire table to the next 64-bit boundary as 5235calculated from the beginning of this table. 

@end table

@node Diagnostics, Limitations, Serialized Tables, Top
@chapter Diagnostics

@cindex error reporting, diagnostic messages
@cindex warnings, diagnostic messages

The following is a list of @code{flex} diagnostic messages:

@itemize
@item
@samp{warning, rule cannot be matched} indicates that the given rule
cannot be matched because it follows other rules that will always match
the same text as it.  For example, in the following @samp{foo} cannot be
matched because it comes after an identifier ``catch-all'' rule:

@cindex warning, rule cannot be matched
@example
@verbatim
    [a-z]+    got_identifier();
    foo       got_foo();
@end verbatim
@end example

Using @code{REJECT} in a scanner suppresses this warning.

@item
@samp{warning, -s option given but default rule can be matched} means
that it is possible (perhaps only in a particular start condition) that
the default rule (match any single character) is the only one that will
match a particular input.  Since @samp{-s} was given, presumably this is
not intended.

@item
@code{reject_used_but_not_detected undefined} or
@code{yymore_used_but_not_detected undefined}. These errors can occur
at compile time.  They indicate that the scanner uses @code{REJECT} or
@code{yymore()} but that @code{flex} failed to notice the fact, meaning
that @code{flex} scanned the first two sections looking for occurrences
of these actions and failed to find any, but somehow you snuck some in
(via a #include file, for example).  Use @code{%option reject} or
@code{%option yymore} to indicate to @code{flex} that you really do use
these features.

@item
@samp{flex scanner jammed}. A scanner compiled with
@samp{-s} has encountered an input string which wasn't matched by any of
its rules.  This error can also occur due to internal problems.

@item
@samp{token too large, exceeds YYLMAX}. Your scanner uses @code{%array}
and one of its rules matched a string longer than the @code{YYLMAX}
constant (8K bytes by default).  You can increase the value by
#define'ing @code{YYLMAX} in the definitions section of your @code{flex}
input.

@item
@samp{scanner requires -8 flag to use the character 'x'}. Your scanner
specification includes recognizing the 8-bit character @samp{'x'} and
you did not specify the @samp{-8} flag, and your scanner defaulted to 7-bit
because you used the @samp{-Cf} or @samp{-CF} table compression options.
See the discussion of the @samp{-7} flag, @ref{Scanner Options}, for
details.

@item
@samp{flex scanner push-back overflow}. You used @code{unput()} to push
back so much text that the scanner's buffer could not hold both the
pushed-back text and the current token in @code{yytext}.  Ideally the
scanner should dynamically resize the buffer in this case, but at
present it does not.

@item
@samp{input buffer overflow, can't enlarge buffer because scanner uses
REJECT}. The scanner was working on matching an extremely large token
and needed to expand the input buffer.  This doesn't work with scanners
that use @code{REJECT}.

@item
@samp{fatal flex scanner internal error--end of buffer missed}. This can
occur in a scanner which is reentered after a long-jump has jumped out
(or over) the scanner's activation frame.  Before reentering the
scanner, use:
@example
@verbatim
    yyrestart( yyin );
@end verbatim
@end example
or, as noted above, switch to using the C++ scanner class.

@item
@samp{too many start conditions in <> construct!} You listed more start
conditions in a <> construct than exist (so you must have listed at
least one of them twice).
@end itemize

@node Limitations, Bibliography, Diagnostics, Top
@chapter Limitations

@cindex limitations of flex

Some trailing context patterns cannot be properly matched and generate
warning messages (@samp{dangerous trailing context}).  These are
patterns where the ending of the first part of the rule matches the
beginning of the second part, such as @samp{zx*/xy*}, where the @samp{x*}
matches the @samp{x} at the beginning of the trailing context.  (Note that
the POSIX draft states that the text matched by such patterns is
undefined.)  For some trailing context rules, parts which are actually
fixed-length are not recognized as such, leading to the
performance loss incurred by variable trailing context (@pxref{Performance}).
In particular, parts using @samp{|} or @samp{@{n@}}
(such as @samp{foo@{3@}}) are always considered variable-length.
Combining trailing context with the special @samp{|} action can result
in @emph{fixed} trailing context being turned into the more expensive
@emph{variable} trailing context.  This happens, for example, in the following:

@cindex warning, dangerous trailing context
@example
@verbatim
    %%
    abc      |
    xyz/def
@end verbatim
@end example

Use of @code{unput()} invalidates @code{yytext} and @code{yyleng}, unless the
@code{%array} directive or the @samp{-l} option has been used.
Pattern-matching of @code{NUL}s is substantially slower than matching
other characters.  Dynamic resizing of the input buffer is slow, as it
entails rescanning all the text matched so far by the current (generally
huge) token.  Due to both buffering of input and read-ahead, you cannot
intermix calls to @file{<stdio.h>} routines, such as @b{getchar()},
with @code{flex} rules and expect it to work.  Call @code{input()}
instead.  The total number of table entries listed by the @samp{-v} flag excludes
the number of table entries needed to determine what rule has been
matched.
The number of entries is equal to the number of DFA states if 5371the scanner does not use @code{REJECT}, and somewhat greater than the 5372number of states if it does. @code{REJECT} cannot be used with the 5373@samp{-f} or @samp{-F} options. 5374 5375The @code{flex} internal algorithms need documentation. 5376 5377@node Bibliography, FAQ, Limitations, Top 5378@chapter Additional Reading 5379 5380You may wish to read more about the following programs: 5381@itemize 5382@item lex 5383@item yacc 5384@item sed 5385@item awk 5386@end itemize 5387 5388The following books may contain material of interest: 5389 5390John Levine, Tony Mason, and Doug Brown, 5391@emph{Lex & Yacc}, 5392O'Reilly and Associates. Be sure to get the 2nd edition. 5393 5394M. E. Lesk and E. Schmidt, 5395@emph{LEX -- Lexical Analyzer Generator} 5396 5397Alfred Aho, Ravi Sethi and Jeffrey Ullman, @emph{Compilers: Principles, 5398Techniques and Tools}, Addison-Wesley (1986). Describes the 5399pattern-matching techniques used by @code{flex} (deterministic finite 5400automata). 5401 5402@node FAQ, Appendices, Bibliography, Top 5403@unnumbered FAQ 5404 5405From time to time, the @code{flex} maintainer receives certain 5406questions. Rather than repeat answers to well-understood problems, we 5407publish them here. 

@menu
* When was flex born?::
* How do I expand backslash-escape sequences in C-style quoted strings?::
* Why do flex scanners call fileno if it is not ANSI compatible?::
* Does flex support recursive pattern definitions?::
* How do I skip huge chunks of input (tens of megabytes) while using flex?::
* Flex is not matching my patterns in the same order that I defined them.::
* My actions are executing out of order or sometimes not at all.::
* How can I have multiple input sources feed into the same scanner at the same time?::
* Can I build nested parsers that work with the same input file?::
* How can I match text only at the end of a file?::
* How can I make REJECT cascade across start condition boundaries?::
* Why cant I use fast or full tables with interactive mode?::
* How much faster is -F or -f than -C?::
* If I have a simple grammar cant I just parse it with flex?::
* Why doesn't yyrestart() set the start state back to INITIAL?::
* How can I match C-style comments?::
* The period isn't working the way I expected.::
* Can I get the flex manual in another format?::
* Does there exist a "faster" NDFA->DFA algorithm?::
* How does flex compile the DFA so quickly?::
* How can I use more than 8192 rules?::
* How do I abandon a file in the middle of a scan and switch to a new file?::
* How do I execute code only during initialization (only before the first scan)?::
* How do I execute code at termination?::
* Where else can I find help?::
* Can I include comments in the "rules" section of the file?::
* I get an error about undefined yywrap().::
* How can I change the matching pattern at run time?::
* How can I expand macros in the input?::
* How can I build a two-pass scanner?::
* How do I match any string not matched in the preceding rules?::
* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
* Is there a
way to make flex treat NULL like a regular character?::
* Whenever flex can not match the input it says "flex scanner jammed".::
* Why doesn't flex have non-greedy operators like perl does?::
* Memory leak - 16386 bytes allocated by malloc.::
* How do I track the byte offset for lseek()?::
* How do I use my own I/O classes in a C++ scanner?::
* How do I skip as many chars as possible?::
* deleteme00::
* Are certain equivalent patterns faster than others?::
* Is backing up a big deal?::
* Can I fake multi-byte character support?::
* deleteme01::
* Can you discuss some flex internals?::
* unput() messes up yy_at_bol::
* The | operator is not doing what I want::
* Why can't flex understand this variable trailing context pattern?::
* The ^ operator isn't working::
* Trailing context is getting confused with trailing optional patterns::
* Is flex GNU or not?::
* ERASEME53::
* I need to scan if-then-else blocks and while loops::
* ERASEME55::
* ERASEME56::
* ERASEME57::
* Is there a repository for flex scanners?::
* How can I conditionally compile or preprocess my flex input file?::
* Where can I find grammars for lex and yacc?::
* I get an end-of-buffer message for each character scanned.::
* unnamed-faq-62::
* unnamed-faq-63::
* unnamed-faq-64::
* unnamed-faq-65::
* unnamed-faq-66::
* unnamed-faq-67::
* unnamed-faq-68::
* unnamed-faq-69::
* unnamed-faq-70::
* unnamed-faq-71::
* unnamed-faq-72::
* unnamed-faq-73::
* unnamed-faq-74::
* unnamed-faq-75::
* unnamed-faq-76::
* unnamed-faq-77::
* unnamed-faq-78::
* unnamed-faq-79::
* unnamed-faq-80::
* unnamed-faq-81::
* unnamed-faq-82::
* unnamed-faq-83::
* unnamed-faq-84::
* unnamed-faq-85::
* unnamed-faq-86::
* unnamed-faq-87::
* unnamed-faq-88::
* unnamed-faq-90::
* unnamed-faq-91::
*
unnamed-faq-92::
* unnamed-faq-93::
* unnamed-faq-94::
* unnamed-faq-95::
* unnamed-faq-96::
* unnamed-faq-97::
* unnamed-faq-98::
* unnamed-faq-99::
* unnamed-faq-100::
* unnamed-faq-101::
* What is the difference between YYLEX_PARAM and YY_DECL?::
* Why do I get "conflicting types for yylex" error?::
* How do I access the values set in a Flex action from within a Bison action?::
@end menu

@node When was flex born?
@unnumberedsec When was flex born?

Vern Paxson took over
the @cite{Software Tools} lex project from Jef Poskanzer in 1982. At that point it
was written in Ratfor. Around 1987 or so, Paxson translated it into C, and
a legend was born :-).

@node How do I expand backslash-escape sequences in C-style quoted strings?
@unnumberedsec How do I expand backslash-escape sequences in C-style quoted strings?

A key point when scanning quoted strings is that you cannot (easily) write
a single rule that will precisely match the string if you allow things
like embedded escape sequences and newlines. If you try to match strings
with a single rule then you'll wind up having to rescan the string anyway
to find any escape sequences.

Instead you can use exclusive start conditions and a set of rules, one for
matching non-escaped text, one for matching a single escape, one for
matching an embedded newline, and one for recognizing the end of the
string. Each of these rules is then faced with the question of where to
put its intermediary results. The best solution is for the rules to
append their local value of @code{yytext} to the end of a ``string literal''
buffer. A rule like the escape-matcher will append to the buffer the
meaning of the escape sequence rather than the literal text in @code{yytext}.
In this way, @code{yytext} does not need to be modified at all.
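
The approach described above might be sketched as follows. This is only an
illustration under stated assumptions, not code taken from @code{flex}: the
helper names @code{lit_reset()}, @code{lit_append()} and @code{unescape()},
the fixed-size buffer, the start condition name @code{STRING} and the token
name @code{STRING_LITERAL} are all hypothetical. The rules appear in a
comment; the C helpers they call do the buffering.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical rules that would use the helpers below:
 *
 *   %x STRING
 *   %%
 *   \"                   { lit_reset(); BEGIN(STRING); }
 *   <STRING>[^\\\n\"]+   { lit_append(yytext, yyleng); }
 *   <STRING>\\.          { char c = unescape(yytext[1]); lit_append(&c, 1); }
 *   <STRING>\n           { BEGIN(INITIAL); return BAD_STRING; }
 *   <STRING>\"           { BEGIN(INITIAL); return STRING_LITERAL; }
 *
 * STRING_LITERAL and BAD_STRING are assumed token codes.
 */

#define LIT_MAX 1024             /* assumed fixed capacity, for brevity */

static char   lit_buf[LIT_MAX];  /* the "string literal" buffer */
static size_t lit_len = 0;       /* bytes used, not counting the NUL */

/* Start collecting a new literal. */
static void lit_reset(void)
{
    lit_len = 0;
    lit_buf[0] = '\0';
}

/* Append n bytes of matched text.  Returns 0 on success, or -1 if the
 * text (plus the terminating NUL) would not fit in the buffer. */
static int lit_append(const char *text, size_t n)
{
    if (n >= LIT_MAX - lit_len)
        return -1;
    memcpy(lit_buf + lit_len, text, n);
    lit_len += n;
    lit_buf[lit_len] = '\0';
    return 0;
}

/* Translate the character that followed a backslash into its meaning. */
static char unescape(char c)
{
    switch (c) {
    case 'n': return '\n';
    case 't': return '\t';
    case 'r': return '\r';
    default:  return c;          /* e.g. \\ and \" stand for themselves */
    }
}
```

Because each rule appends to the buffer rather than editing @code{yytext},
the scanner never has to rescan or modify the matched text.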

@node Why do flex scanners call fileno if it is not ANSI compatible?
@unnumberedsec Why do flex scanners call fileno if it is not ANSI compatible?

Flex scanners call @code{fileno()} in order to get the file descriptor
corresponding to @code{yyin}. The file descriptor may be passed to
@code{isatty()} or @code{read()}, depending upon which @code{%option}s you specified.
If your system does not support @code{fileno()}, you can avoid the
@code{read()} call by not specifying @code{%option read}. To avoid the @code{isatty()}
call, you must specify one of @code{%option always-interactive} or
@code{%option never-interactive}.

@node Does flex support recursive pattern definitions?
@unnumberedsec Does flex support recursive pattern definitions?

For example:

@example
@verbatim
%%
block   "{"({block}|{statement})*"}"
@end verbatim
@end example

No. You cannot have recursive definitions. The pattern-matching power of
regular expressions in general (and therefore flex scanners, too) is
limited. In particular, regular expressions cannot ``balance'' parentheses
to an arbitrary degree. For example, it's impossible to write a regular
expression that matches all strings containing the same number of @samp{@{}s
as @samp{@}}s. For more powerful pattern matching, you need a parser, such
as @cite{GNU bison}.

@node How do I skip huge chunks of input (tens of megabytes) while using flex?
@unnumberedsec How do I skip huge chunks of input (tens of megabytes) while using flex?

Use @code{fseek()} (or @code{lseek()}) to position @code{yyin}, then call @code{yyrestart()}.

@node Flex is not matching my patterns in the same order that I defined them.
@unnumberedsec Flex is not matching my patterns in the same order that I defined them.

@code{flex} picks the
rule that matches the most text (i.e., the longest possible input string).
This is because @code{flex} uses a matching technique
(``deterministic finite automata'') that actually does all of the matching
simultaneously, in parallel, rather than trying the rules one at a time in
the order you listed them. (Seems impossible, but it's actually a fairly
simple technique once you understand the principles.)

A side-effect of this parallel matching is that when the input matches more
than one rule, @code{flex} scanners pick the rule that matched the @emph{most} text. This
is explained further in @ref{Matching}.

If you want @code{flex} to choose a shorter match, then you can work around this
behavior by expanding your short
rule to match more text, then putting back the extra:

@example
@verbatim
data_.*        yyless( 5 ); BEGIN BLOCKIDSTATE;
@end verbatim
@end example

Another fix would be to make the second rule active only during the
@code{<BLOCKIDSTATE>} start condition, and make that start condition exclusive
by declaring it with @code{%x} instead of @code{%s}.

A final fix is to change the input language so that the ambiguity for
@samp{data_} is removed, by adding characters to it that don't match the
identifier rule, or by removing characters (such as @samp{_}) from the
identifier rule so it no longer matches @samp{data_}. (Of course, you might
also not have the option of changing the input language.)

@node My actions are executing out of order or sometimes not at all.
@unnumberedsec My actions are executing out of order or sometimes not at all.

Most likely, you have (in error) placed the opening @samp{@{} of the action
block on a different line than the rule, e.g.,

@example
@verbatim
^(foo|bar)
{  <<<--- WRONG!

}
@end verbatim
@end example

@code{flex} requires that the opening @samp{@{} of an action associated with a rule
begin on the same line as does the rule.
Write your rules as follows instead:

@example
@verbatim
^(foo|bar) {  // CORRECT!

}
@end verbatim
@end example

@node How can I have multiple input sources feed into the same scanner at the same time?
@unnumberedsec How can I have multiple input sources feed into the same scanner at the same time?

If @dots{}
@itemize
@item
your scanner is free of backtracking (verified using @code{flex}'s @samp{-b} flag),
@item
AND you run your scanner interactively (@samp{-I} option; default unless using special table
compression options),
@item
AND you feed it one character at a time by redefining @code{YY_INPUT} to do so,
@end itemize

then every time it matches a token, it will have exhausted its input
buffer (because the scanner is free of backtracking). This means you
can safely use @code{select()} at that point and only call @code{yylex()} for another
token if @code{select()} indicates there's data available.

That is, move the @code{select()} out from the input function to a point where
it determines whether @code{yylex()} gets called for the next token.

With this approach, you will still have problems if your input can arrive
piecemeal; @code{select()} could inform you that the beginning of a token is
available, you call @code{yylex()} to get it, but it winds up blocking waiting
for the later characters in the token.

Here's another way: Move your input multiplexing inside of @code{YY_INPUT}. That
is, whenever @code{YY_INPUT} is called, it calls @code{select()} to see where input is
available. If input is available for the scanner, it reads and returns the
next byte. If input is available from another source, it calls whatever
function is responsible for reading from that source. (If no input is
available, it blocks until some input is available.)
I've used this technique in an
interpreter I wrote that reads both keyboard input (using a @code{flex} scanner) and
IPC traffic from sockets, and it works fine.

@node Can I build nested parsers that work with the same input file?
@unnumberedsec Can I build nested parsers that work with the same input file?

This is not going to work without some additional effort. The reason is
that @code{flex} block-buffers the input it reads from @code{yyin}. This means that the
``outermost'' @code{yylex()}, when called, will automatically slurp up the first 8K
of input available on @code{yyin}, and subsequent calls to other @code{yylex()} functions won't
see that input. You might be tempted to work around this problem by
redefining @code{YY_INPUT} to only return a small amount of text, but it turns out
that that approach is quite difficult. Instead, the best solution is to
combine all of your scanners into one large scanner, using a different
exclusive start condition for each.

@node How can I match text only at the end of a file?
@unnumberedsec How can I match text only at the end of a file?

There is no way to write a rule which is ``match this text, but only if
it comes at the end of the file''. You can fake it, though, if you happen
to have a character lying around that you don't allow in your input.
Then you redefine @code{YY_INPUT} to call your own routine which, if it sees
an @code{EOF}, returns the magic character first (and remembers to return a
real @code{EOF} next time it's called). Then you could write:

@example
@verbatim
<COMMENT>(.|\n)*{EOF_CHAR}    /* saw comment at EOF */
@end verbatim
@end example

@node How can I make REJECT cascade across start condition boundaries?
@unnumberedsec How can I make REJECT cascade across start condition boundaries?

You can do this as follows.
Suppose you have a start condition @samp{A}, and
after exhausting all of the possible matches in @samp{<A>}, you want to try
matches in @samp{<INITIAL>}. Then you could use the following:

@example
@verbatim
%x A
%%
<A>rule_that_is_long    ...; REJECT;
<A>rule                 ...; REJECT; /* shorter rule */
<A>etc.
...
<A>.|\n  {
        /* Shortest and last rule in <A>, so
         * cascaded REJECTs will eventually
         * wind up matching this rule. We want
         * to now switch to the initial state
         * and try matching from there instead.
         */
        yyless(0);    /* put back matched text */
        BEGIN(INITIAL);
        }
@end verbatim
@end example

@node Why cant I use fast or full tables with interactive mode?
@unnumberedsec Why can't I use fast or full tables with interactive mode?

One of the assumptions
flex makes is that interactive applications are inherently slow (they're
waiting on a human after all).
The restriction has to do with how the scanner detects that it must be finished scanning
a token. For interactive scanners, after scanning each character the current
state is looked up in a table (essentially) to see whether there's a chance
of another input character possibly extending the length of the match. If
not, the scanner halts. For non-interactive scanners, the end-of-token test
is much simpler, basically a compare with 0, which needs no memory bus cycles. Since
the test occurs in the innermost scanning loop, one would like to make it go
as fast as possible.

Still, it seems reasonable to allow the user to choose to trade off a bit
of performance in this area to gain the corresponding flexibility. There
might be another reason, though, why fast scanners don't support the
interactive option.

@node How much faster is -F or -f than -C?
@unnumberedsec How much faster is -F or -f than -C?

Much faster (factor of 2-3).

@node If I have a simple grammar cant I just parse it with flex?
@unnumberedsec If I have a simple grammar can't I just parse it with flex?

Is your grammar recursive? That's almost always a sign that you're
better off using a parser/scanner rather than just trying to use a scanner
alone.

@node Why doesn't yyrestart() set the start state back to INITIAL?
@unnumberedsec Why doesn't yyrestart() set the start state back to INITIAL?

There are two reasons. The first is that there might
be programs that rely on the start state not changing across file changes.
The second is that beginning with @code{flex} version 2.4, use of @code{yyrestart()} is no longer required,
so fixing the problem there doesn't solve the more general problem.

@node How can I match C-style comments?
@unnumberedsec How can I match C-style comments?

You might be tempted to try something like this:

@example
@verbatim
"/*".*"*/"       // WRONG!
@end verbatim
@end example

or, worse, this:

@example
@verbatim
"/*"(.|\n)*"*/"   // WRONG!
@end verbatim
@end example

The above rules will eat too much input, and blow up on things like:

@example
@verbatim
/* a comment */ do_my_thing( "oops */" );
@end verbatim
@end example

Here is one way which allows you to track line information (it assumes that
@code{IN_COMMENT} has been declared as an exclusive start condition with
@code{%x}):

@example
@verbatim
<INITIAL>{
"/*"      BEGIN(IN_COMMENT);
}
<IN_COMMENT>{
"*/"      BEGIN(INITIAL);
[^*\n]+   // eat comment in chunks
"*"       // eat the lone star
\n        yylineno++;
}
@end verbatim
@end example

@node The period isn't working the way I expected.
@unnumberedsec The '.' isn't working the way I expected.

Here are some tips for using @samp{.}:

@itemize
@item
A common mistake is to place the grouping parenthesis AFTER an operator, when
you really meant to place the parenthesis BEFORE the operator, e.g., you
probably want this @code{(foo|bar)+} and NOT this @code{(foo|bar+)}.

The first pattern matches the words @samp{foo} or @samp{bar} one or more
times, e.g., it matches the text @samp{barfoofoobarfoo}. The
second pattern matches a single instance of @code{foo} or a single instance of
@code{ba} followed by one or more @samp{r}s, e.g., it matches the text @code{barrrr}.
@item
A @samp{.} inside @samp{[]}'s just means a literal @samp{.} (period),
and NOT ``any character except newline''.
@item
Remember that @samp{.} matches any character EXCEPT @samp{\n} (and @samp{EOF}).
If you really want to match ANY character, including newlines, then use @code{(.|\n)}.
Beware that the regex @code{(.|\n)+} will match your entire input!
@item
Finally, if you want to match a literal @samp{.} (a period), then use @samp{[.]} or @samp{"."}.
@end itemize

@node Can I get the flex manual in another format?
@unnumberedsec Can I get the flex manual in another format?

The @code{flex} source distribution includes a texinfo manual. You are
free to convert that texinfo into whatever format you desire. The
@code{texinfo} package includes tools for conversion to a number of formats.

@node Does there exist a "faster" NDFA->DFA algorithm?
@unnumberedsec Does there exist a "faster" NDFA->DFA algorithm?

There's no way around the potential exponential running time: it
can take you exponential time just to enumerate all of the DFA states.
In practice, though, the running time is closer to linear, or sometimes
quadratic.

@node How does flex compile the DFA so quickly?
@unnumberedsec How does flex compile the DFA so quickly?

There are two big speed wins that @code{flex} uses:

@enumerate
@item
It analyzes the input rules to construct equivalence classes for those
characters that always make the same transitions. It then rewrites the NFA
using equivalence classes for transitions instead of characters. This cuts
down the NFA->DFA computation time dramatically, to the point where, for
uncompressed DFA tables, the DFA generation is often I/O bound in writing out
the tables.
@item
It maintains hash values for previously computed DFA states, so testing
whether a newly constructed DFA state is equivalent to a previously constructed
state can be done very quickly, by first comparing hash values.
@end enumerate

@node How can I use more than 8192 rules?
@unnumberedsec How can I use more than 8192 rules?

@code{Flex} is compiled with an upper limit of 8192 rules per scanner.
If you need more than 8192 rules in your scanner, you'll have to recompile @code{flex}
with the following changes in @file{flexdef.h}:

@example
@verbatim
<    #define YY_TRAILING_MASK 0x2000
<    #define YY_TRAILING_HEAD_MASK 0x4000
--
>    #define YY_TRAILING_MASK 0x20000000
>    #define YY_TRAILING_HEAD_MASK 0x40000000
@end verbatim
@end example

This should work okay as long as your C compiler uses 32-bit integers.
But you might want to think about whether using such a huge number of rules
is the best way to solve your problem.

The following may also be relevant:

With luck, you should be able to increase the definitions in @file{flexdef.h} for:

@example
@verbatim
#define JAMSTATE -32766 /* marks a reference to the state that always jams */
#define MAXIMUM_MNS 31999
#define BAD_SUBSCRIPT -32767
@end verbatim
@end example

recompile everything, and it'll all work.
Flex only has these 16-bit-like
values built into it because a long time ago it was developed on a machine
with 16-bit ints. I've given this advice to others in the past but haven't
heard back from them whether it worked okay or not...

@node How do I abandon a file in the middle of a scan and switch to a new file?
@unnumberedsec How do I abandon a file in the middle of a scan and switch to a new file?

Just call @code{yyrestart(newfile)}. Be sure to reset the start state if you want a
``fresh start'', since @code{yyrestart} does NOT reset the start state back to @code{INITIAL}.

@node How do I execute code only during initialization (only before the first scan)?
@unnumberedsec How do I execute code only during initialization (only before the first scan)?

You can specify an initial action by defining the macro @code{YY_USER_INIT} (though
note that @code{yyout} may not be available at the time this macro is executed). Or you
can add to the beginning of your rules section:

@example
@verbatim
%%
    /* Must be indented! */
    static int did_init = 0;

    if ( ! did_init ){
        do_my_init();
        did_init = 1;
    }
@end verbatim
@end example

@node How do I execute code at termination?
@unnumberedsec How do I execute code at termination?

You can specify an action for the @code{<<EOF>>} rule.

@node Where else can I find help?
@unnumberedsec Where else can I find help?

You can find the flex homepage on the web at
@uref{http://flex.sourceforge.net/}. See that page for details about flex
mailing lists as well.

@node Can I include comments in the "rules" section of the file?
@unnumberedsec Can I include comments in the "rules" section of the file?

Yes, just about anywhere you want to. @xref{Comments in the Input}, for the specific syntax.

@node I get an error about undefined yywrap().
@unnumberedsec I get an error about undefined yywrap().

You must supply a @code{yywrap()} function of your own, or link to @file{libfl.a}
(which provides one), or use

@example
@verbatim
%option noyywrap
@end verbatim
@end example

in your source to say you don't want a @code{yywrap()} function.

@node How can I change the matching pattern at run time?
@unnumberedsec How can I change the matching pattern at run time?

You can't; the patterns are compiled into a static table when flex builds the scanner.

@node How can I expand macros in the input?
@unnumberedsec How can I expand macros in the input?

The best way to approach this problem is at a higher level, e.g., in the parser.

However, you can do this using multiple input buffers. In the following
sketch, @code{main_buffer} and @code{expansion_buffer} are assumed to be
@code{YY_BUFFER_STATE} variables declared elsewhere, and @code{expand()} a
user-supplied function that returns the expansion text:

@example
@verbatim
%%
macro/[a-z]+	{
    /* Saw the macro "macro" followed by extra stuff. */
    main_buffer = YY_CURRENT_BUFFER;
    expansion_buffer = yy_scan_string(expand(yytext));
    yy_switch_to_buffer(expansion_buffer);
}

<<EOF>>	{
    if ( expansion_buffer )
    {
        // We were doing an expansion, return to where
        // we were.
        yy_switch_to_buffer(main_buffer);
        yy_delete_buffer(expansion_buffer);
        expansion_buffer = 0;
    }
    else
        yyterminate();
}
@end verbatim
@end example

You probably will want a stack of expansion buffers to allow nested macros.
Hopefully the idea is clear from the above, though.

@node How can I build a two-pass scanner?
@unnumberedsec How can I build a two-pass scanner?

One way to do it is to filter the first pass to a temporary file,
then process the temporary file on the second pass. You will probably see a
performance hit, due to all the disk I/O.

When you need to look far ahead like this, it almost always means
that the right solution is to build a parse tree of the entire input, then
walk it after the parse in order to generate the output. In a sense, this
is a two-pass approach, once through the text and once through the parse
tree, but the performance hit for the latter is usually an order of magnitude
smaller, since everything is already classified, in binary format, and
residing in memory.

@node How do I match any string not matched in the preceding rules?
@unnumberedsec How do I match any string not matched in the preceding rules?

One way to assign precedence is to place the more specific rules first. If
two rules would match the same input (same sequence of characters) then the
first rule listed in the @code{flex} input wins, e.g.,

@example
@verbatim
%%
foo[a-zA-Z_]+    return FOO_ID;
bar[a-zA-Z_]+    return BAR_ID;
[a-zA-Z_]+       return GENERIC_ID;
@end verbatim
@end example

Note that the rule @code{[a-zA-Z_]+} must come @emph{after} the others. It will match the
same amount of text as the more specific rules, and in that case the
@code{flex} scanner will pick the first rule listed in your scanner as the
one to match.

@node I am trying to port code from AT&T lex that uses yysptr and yysbuf.
@unnumberedsec I am trying to port code from AT&T lex that uses yysptr and yysbuf.

Those are internal variables pointing into the AT&T scanner's input buffer. I
imagine they're being manipulated in user versions of the @code{input()} and @code{unput()}
functions. If so, what you need to do is analyze those functions to figure out
what they're doing, and then replace @code{input()} with an appropriate definition of
@code{YY_INPUT}. You shouldn't need to (and must not) replace
@code{flex}'s @code{unput()} function.
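
As a rough sketch of what such a replacement might look like, suppose the
old @code{input()} function read characters from an in-memory string. The
names @code{set_source()} and @code{my_read()} below are hypothetical and
stand in for whatever your own code does; only the
@code{YY_INPUT(buf,result,max_size)} macro itself is part of @code{flex}.
In a real scanner the definitions would go in the definitions section of
the input file, inside @samp{%@{} and @samp{%@}}.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical input source: an in-memory string and a read position. */
static const char *src_text = "";
static size_t      src_pos  = 0;

/* Point the reader at a new string (hypothetical helper). */
static void set_source(const char *text)
{
    src_text = text;
    src_pos  = 0;
}

/* Copy up to max_size bytes into buf and return how many were copied.
 * A return value of 0 means end of input. */
static size_t my_read(char *buf, size_t max_size)
{
    size_t left = strlen(src_text) - src_pos;
    size_t n = left < max_size ? left : max_size;

    memcpy(buf, src_text + src_pos, n);
    src_pos += n;
    return n;
}

/* flex calls YY_INPUT to refill its buffer: store the bytes in buf and
 * set result to the number of bytes read, or to 0 at end of input. */
#define YY_INPUT(buf, result, max_size) \
    { \
    (result) = (int) my_read((buf), (size_t) (max_size)); \
    }
```

Note that @code{YY_INPUT} only supplies raw bytes; pushing characters back
is still left to @code{flex}'s own @code{unput()}.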

@node Is there a way to make flex treat NULL like a regular character?
@unnumberedsec Is there a way to make flex treat NULL like a regular character?

Yes, @samp{\0} and @samp{\x00} should both do the trick. Perhaps you have an ancient
version of @code{flex}. The latest release is version @value{VERSION}.

@node Whenever flex can not match the input it says "flex scanner jammed".
@unnumberedsec Whenever flex can not match the input it says "flex scanner jammed".

You need to add a rule that matches the otherwise-unmatched text,
e.g.,

@example
@verbatim
%option yylineno
%%
[[a bunch of rules here]]

.	printf("bad input character '%s' at line %d\n", yytext, yylineno);
@end verbatim
@end example

See @code{%option default} for more information.

@node Why doesn't flex have non-greedy operators like perl does?
@unnumberedsec Why doesn't flex have non-greedy operators like perl does?

A DFA can do a non-greedy match by stopping
the first time it enters an accepting state, instead of consuming input until
it determines that no further matching is possible (a ``jam'' state). This
is actually easier to implement than longest leftmost match (which flex does).

But it's also much less useful than longest leftmost match. In general,
when you find yourself wishing for non-greedy matching, that's usually a
sign that you're trying to make the scanner do some parsing. That's
generally the wrong approach, since the scanner lacks the power to do a decent job.
Better is to either introduce a separate parser, or to split the scanner
into multiple scanners using (exclusive) start conditions.

You might have
a separate start state once you've seen the @samp{BEGIN}.
In that state, you
might then have a regex that will match @samp{END} (to kick you out of the
state), and perhaps @samp{(.|\n)} to get a single character within the chunk.

This approach also has much better error-reporting properties.

@node Memory leak - 16386 bytes allocated by malloc.
@unnumberedsec Memory leak - 16386 bytes allocated by malloc.
@anchor{faq-memory-leak}

UPDATED 2002-07-10: As of @code{flex} version 2.5.9, this leak means that you did not
call @code{yylex_destroy()}. If you are using an earlier version of @code{flex}, then read
on.

The leak is about 16426 bytes. That is, (8192 * 2 + 2) for the read-buffer, and
about 40 for @code{struct yy_buffer_state} (depending upon alignment). The leak is in
the non-reentrant C scanner only (NOT in the reentrant scanner, NOT in the C++
scanner). Since @code{flex} doesn't know when you are done, the buffer is never freed.

However, the leak won't multiply since the buffer is reused no matter how many
times you call @code{yylex()}.

If you want to reclaim the memory when you are completely done scanning, then
you might try this:

@example
@verbatim
/* For non-reentrant C scanner only. */
yy_delete_buffer(YY_CURRENT_BUFFER);
yy_init = 1;
@end verbatim
@end example

Note: @code{yy_init} is an ``internal variable'', and hasn't been tested in this
situation. It is possible that some other globals may need resetting as well.

@node How do I track the byte offset for lseek()?
@unnumberedsec How do I track the byte offset for lseek()?

@example
@verbatim
>   We thought that it would be possible to have this number through the
>   evaluation of the following expression:
>
>   seek_position = (no_buffers)*YY_READ_BUF_SIZE + yy_c_buf_p - YY_CURRENT_BUFFER->yy_ch_buf
@end verbatim
@end example

While this is the right idea, it has two problems. The first is that
it's possible that @code{flex} will request less than @code{YY_READ_BUF_SIZE} during
an invocation of @code{YY_INPUT} (or that your input source will return less
even though @code{YY_READ_BUF_SIZE} bytes were requested). The second problem
is that when refilling its internal buffer, @code{flex} keeps some characters
from the previous buffer (because usually it's in the middle of a match,
and needs those characters to construct @code{yytext} for the match once it's
done). Because of this, @code{yy_c_buf_p - YY_CURRENT_BUFFER->yy_ch_buf} won't
be exactly the number of characters already read from the current buffer.

An alternative solution is to count the number of characters you've matched
since starting to scan. This can be done by using @code{YY_USER_ACTION}. For
example,

@example
@verbatim
#define YY_USER_ACTION num_chars += yyleng;
@end verbatim
@end example

(You need to be careful to update your bookkeeping if you use @code{yymore()},
@code{yyless()}, @code{unput()}, or @code{input()}.)

@node How do I use my own I/O classes in a C++ scanner?
@unnumberedsec How do I use my own I/O classes in a C++ scanner?

When the flex C++ scanning class rewrite finally happens, then this sort of thing should become much easier.

@cindex LexerOutput, overriding
@cindex LexerInput, overriding
@cindex overriding LexerOutput
@cindex overriding LexerInput
@cindex customizing I/O in C++ scanners
@cindex C++ I/O, customizing
You can do this by passing the various functions (such as @code{LexerInput()}
and @code{LexerOutput()}) NULL @code{iostream*}'s, and then
dealing with your own I/O classes surreptitiously (i.e., stashing them in
special member variables).  This works because the only assumption the
lexer makes about the iostream's is that they're
ultimately passed to @code{LexerInput()} and @code{LexerOutput()}, which then do whatever
is necessary with them.

@c faq edit stopped here
@node How do I skip as many chars as possible?
@unnumberedsec How do I skip as many chars as possible?

How do I skip as many chars as possible -- without interfering with the other
patterns?

In the example below, we want to skip over characters until we see the phrase
"endskip". The following will @emph{NOT} work correctly (do you see why not?)

@example
@verbatim
/* INCORRECT SCANNER */
%x SKIP
%%
<INITIAL>startskip   BEGIN(SKIP);
...
<SKIP>"endskip"       BEGIN(INITIAL);
<SKIP>.*             ;
@end verbatim
@end example

The problem is that the pattern @samp{.*} will eat up the word "endskip".
The simplest (but slow) fix is:

@example
@verbatim
<SKIP>"endskip"      BEGIN(INITIAL);
<SKIP>.              ;
@end verbatim
@end example

A faster fix involves making the second rule match more, without
making it match "endskip" plus something else.  So for example:

@example
@verbatim
<SKIP>"endskip"      BEGIN(INITIAL);
<SKIP>[^e]+          ;
<SKIP>.              ;/* so you eat up e's, too */
@end verbatim
@end example

@c TODO: Evaluate this faq.
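
As a rough sketch of the faster fix in the previous FAQ entry, the following
complete scanner skips everything between @samp{startskip} and @samp{endskip}
and echoes the rest of the input.  The @code{%option noyywrap} line and the
@code{main()} function are assumptions added only so that the file builds on
its own; they are not part of the original answer.

@example
@verbatim
/* Illustrative sketch only: skip text between "startskip" and "endskip". */
%option noyywrap
%x SKIP
%%
startskip           BEGIN(SKIP);
<SKIP>"endskip"     BEGIN(INITIAL);
<SKIP>[^e]+         ;  /* consume a whole run of non-'e' characters at once */
<SKIP>.             ;  /* a lone 'e' that does not begin "endskip" */
%%
int main(void)
{
    yylex();    /* text outside SKIP is echoed by the default rule */
    return 0;
}
@end verbatim
@end example

Because @samp{[^e]+} also matches newlines, a long skipped region that
contains no @samp{e} is matched as a single token, which is what makes this
variant faster than the one-character-at-a-time rule.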
@node deleteme00
@unnumberedsec deleteme00
@example
@verbatim
QUESTION:
When was flex born?

Vern Paxson took over
the Software Tools lex project from Jef Poskanzer in 1982.  At that point it
was written in Ratfor.  Around 1987 or so, Paxson translated it into C, and
a legend was born :-).
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Are certain equivalent patterns faster than others?
@unnumberedsec Are certain equivalent patterns faster than others?
@example
@verbatim
To: Adoram Rogel <adoram@orna.hybridge.com>
Subject: Re: Flex 2.5.2 performance questions
In-reply-to: Your message of Wed, 18 Sep 96 11:12:17 EDT.
Date: Wed, 18 Sep 96 10:51:02 PDT
From: Vern Paxson <vern>

[Note, the most recent flex release is 2.5.4, which you can get from
ftp.ee.lbl.gov.  It has bug fixes over 2.5.2 and 2.5.3.]

> 1. Using the pattern
>    ([Ff](oot)?)?[Nn](ote)?(\.)?
>    instead of
>    (((F|f)oot(N|n)ote)|((N|n)ote)|((N|n)\.)|((F|f)(N|n)(\.)))
>    (in a very complicated flex program) caused the program to slow from
>    300K+/min to 100K/min (no other changes were done).

These two are not equivalent.  For example, the first can match "footnote."
but the second can only match "footnote".  This is almost certainly the
cause of the discrepancy - the slower scanner run is matching more tokens,
and/or having to do more backing up.

> 2. Which of these two are better: [Ff]oot or (F|f)oot ?

From a performance point of view, they're equivalent (modulo presumably
minor effects such as memory cache hit rates; and the presence of trailing
context, see below).  From a space point of view, the first is slightly
preferable.

> 3.
> I have a pattern that look like this:
>    pats {p1}|{p2}|{p3}|...|{p50}     (50 patterns ORd)
>
>    running yet another complicated program that includes the following rule:
>    <snext>{and}/{no4}{bb}{pats}
>
>    gets me to "too complicated - over 32,000 states"...

I can't tell from this example whether the trailing context is variable-length
or fixed-length (it could be the latter if {and} is fixed-length).  If it's
variable length, which flex -p will tell you, then this reflects a basic
performance problem, and if you can eliminate it by restructuring your
scanner, you will see significant improvement.

>    so I divided {pats} to {pats1}, {pats2},..., {pats5} each consists of about
>    10 patterns and changed the rule to be 5 rules.
>    This did compile, but what is the rule of thumb here ?

The rule is to avoid trailing context other than fixed-length, in which for
a/b, either the 'a' pattern or the 'b' pattern has a fixed length.  Use
of the '|' operator automatically makes the pattern variable length, so in
this case '[Ff]oot' is preferred to '(F|f)oot'.

> 4. I changed a rule that looked like this:
>    <snext8>{and}{bb}/{ROMAN}[^A-Za-z] { BEGIN...
>
>    to the next 2 rules:
>    <snext8>{and}{bb}/{ROMAN}[A-Za-z] { ECHO;}
>    <snext8>{and}{bb}/{ROMAN}         { BEGIN...
>
>    Again, I understand the using [^...] will cause a great performance loss

Actually, it doesn't cause any sort of performance loss.  It's a surprising
fact about regular expressions that they always match in linear time
regardless of how complex they are.

>    but are there any specific rules about it ?

See the "Performance Considerations" section of the man page, and also
the example in MISC/fastwc/.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Is backing up a big deal?
@unnumberedsec Is backing up a big deal?
@example
@verbatim
To: Adoram Rogel <adoram@hybridge.com>
Subject: Re: Flex 2.5.2 performance questions
In-reply-to: Your message of Thu, 19 Sep 96 10:16:04 EDT.
Date: Thu, 19 Sep 96 09:58:00 PDT
From: Vern Paxson <vern>

> a lot about the backing up problem.
> I believe that there lies my biggest problem, and I'll try to improve
> it.

Since you have variable trailing context, this is a bigger performance
problem.  Fixing it is usually easier than fixing backing up, which in a
complicated scanner (yours seems to fit the bill) can be extremely
difficult to do correctly.

You also don't mention what flags you are using for your scanner.
-f makes a large speed difference, and -Cfe buys you nearly as much
speed but the resulting scanner is considerably smaller.

> I have an | operator in {and} and in {pats} so both of them are variable
> length.

-p should have reported this.

> Is changing one of them to fixed-length is enough ?

Yes.

> Is it possible to change the 32,000 states limit ?

Yes.  I've appended instructions on how.  Before you make this change,
though, you should think about whether there are ways to fundamentally
simplify your scanner - those are certainly preferable!

                Vern

To increase the 32K limit (on a machine with 32 bit integers), you increase
the magnitude of the following in flexdef.h:

#define JAMSTATE -32766 /* marks a reference to the state that always jams */
#define MAXIMUM_MNS 31999
#define BAD_SUBSCRIPT -32767
#define MAX_SHORT 32700

Adding a 0 or two after each should do the trick.
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Can I fake multi-byte character support?
@unnumberedsec Can I fake multi-byte character support?
@example
@verbatim
To: Heeman_Lee@hp.com
Subject: Re: flex - multi-byte support?
In-reply-to: Your message of Thu, 03 Oct 1996 17:24:04 PDT.
Date: Fri, 04 Oct 1996 11:42:18 PDT
From: Vern Paxson <vern>

>      I assume as long as my *.l file defines the
>      range of expected character code values (in octal format), flex will
>      scan the file and read multi-byte characters correctly. But I have no
>      confidence in this assumption.

Your lack of confidence is justified - this won't work.

Flex has in it a widespread assumption that the input is processed
one byte at a time.  Fixing this is on the to-do list, but is involved,
so it won't happen any time soon.  In the interim, the best I can suggest
(unless you want to try fixing it yourself) is to write your rules in
terms of pairs of bytes, using definitions in the first section:

        X       \xfe\xc2
        ...
        %%
        foo{X}bar       found_foo_fe_c2_bar();

etc.  Definitely a pain - sorry about that.

By the way, the email address you used for me is ancient, indicating you
have a very old version of flex.  You can get the most recent, 2.5.4, from
ftp.ee.lbl.gov.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node deleteme01
@unnumberedsec deleteme01
@example
@verbatim
To: moleary@primus.com
Subject: Re: Flex / Unicode compatibility question
In-reply-to: Your message of Tue, 22 Oct 1996 10:15:42 PDT.
Date: Tue, 22 Oct 1996 11:06:13 PDT
From: Vern Paxson <vern>

Unfortunately flex at the moment has a widespread assumption within it
that characters are processed 8 bits at a time.  I don't see any easy
fix for this (other than writing your rules in terms of double characters -
a pain).
I also don't know of a wider lex, though you might try surfing
the Plan 9 stuff because I know it's a Unicode system, and also the PCCT
toolkit (try searching say Alta Vista for "Purdue Compiler Construction
Toolkit").

Fixing flex to handle wider characters is on the long-term to-do list.
But since flex is a strictly spare-time project these days, this probably
won't happen for quite a while, unless someone else does it first.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Can you discuss some flex internals?
@unnumberedsec Can you discuss some flex internals?
@example
@verbatim
To: Johan Linde <jl@theophys.kth.se>
Subject: Re: translation of flex
In-reply-to: Your message of Sun, 10 Nov 1996 09:16:36 PST.
Date: Mon, 11 Nov 1996 10:33:50 PST
From: Vern Paxson <vern>

> I'm working for the Swedish team translating GNU program, and I'm currently
> working with flex. I have a few questions about some of the messages which
> I hope you can answer.

All of the things you're wondering about, by the way, concerning flex
internals - probably the only person who understands what they mean in
English is me!  So I wouldn't worry too much about getting them right.
That said ...

> #: main.c:545
> msgid "  %d protos created\n"
>
> Does proto mean prototype?

Yes - prototypes of state compression tables.

> #: main.c:539
> msgid "  %d/%d (peak %d) template nxt-chk entries created\n"
>
> Here I'm mainly puzzled by 'nxt-chk'. I guess it means 'next-check'. (?)
> However, 'template next-check entries' doesn't make much sense to me. To be
> able to find a good translation I need to know a little bit more about it.

There is a scheme in the Aho/Sethi/Ullman compiler book for compressing
scanner tables.  It involves creating two pairs of tables.
The first has
"base" and "default" entries, the second has "next" and "check" entries.
The "base" entry is indexed by the current state and yields an index into
the next/check table.  The "default" entry gives what to do if the state
transition isn't found in next/check.  The "next" entry gives the next
state to enter, but only if the "check" entry verifies that this entry is
correct for the current state.  Flex creates templates of series of
next/check entries and then encodes differences from these templates as a
way to compress the tables.

> #: main.c:533
> msgid "  %d/%d base-def entries created\n"
>
> The same problem here for 'base-def'.

See above.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unput() messes up yy_at_bol
@unnumberedsec unput() messes up yy_at_bol
@example
@verbatim
To: Xinying Li <xli@npac.syr.edu>
Subject: Re: FLEX ?
In-reply-to: Your message of Wed, 13 Nov 1996 17:28:38 PST.
Date: Wed, 13 Nov 1996 19:51:54 PST
From: Vern Paxson <vern>

> "unput()" them to input flow, question occurs. If I do this after I scan
> a carriage, the variable "YY_CURRENT_BUFFER->yy_at_bol" is changed. That
> means the carriage flag has gone.

You can control this by calling yy_set_bol().  It's described in the manual.

>      And if in pre-reading it goes to the end of file, is anything done
> to control the end of curren buffer and end of file?

No, there's no way to put back an end-of-file.

>      By the way I am using flex 2.5.2 and using the "-l".

The latest release is 2.5.4, by the way.  It fixes some bugs in 2.5.2 and
2.5.3.  You can get it from ftp.ee.lbl.gov.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
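
A minimal sketch of the @code{yy_set_bol()} suggestion in the reply above:
the newline action peeks at the next character with @code{input()}, pushes it
back with @code{unput()}, and then marks the scanner as being at the beginning
of a line again so that @samp{^}-anchored rules still apply.  The patterns,
the @code{%option noyywrap} line and @code{main()} are hypothetical and serve
only the illustration; whether the flag actually needs restoring may depend on
the flex version in use.

@example
@verbatim
/* Illustrative sketch only: restore the beginning-of-line flag by hand. */
%option noyywrap
%%
^#.*    printf("directive: %s\n", yytext);
\n      {
            int c = input();    /* peek at the first character of the next line */
            if (c > 0) {        /* an end-of-file cannot be put back */
                unput(c);
                yy_set_bol(1);  /* the next token starts a line */
            }
        }
.       ;
%%
int main(void)
{
    yylex();
    return 0;
}
@end verbatim
@end example

Setting the flag only after a successful @code{unput()} keeps the sketch in
line with the reply's remark that an end-of-file cannot be pushed back.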
@node The | operator is not doing what I want
@unnumberedsec The | operator is not doing what I want
@example
@verbatim
To: Alain.ISSARD@st.com
Subject: Re: Start condition with FLEX
In-reply-to: Your message of Mon, 18 Nov 1996 09:45:02 PST.
Date: Mon, 18 Nov 1996 10:41:34 PST
From: Vern Paxson <vern>

> I am not able to use the start condition scope and to use the | (OR) with
> rules having start conditions.

The problem is that if you use '|' as a regular expression operator, for
example "a|b" meaning "match either 'a' or 'b'", then it must *not* have
any blanks around it.  If you instead want the special '|' *action* (which
from your scanner appears to be the case), which is a way of giving two
different rules the same action:

        foo     |
        bar     matched_foo_or_bar();

then '|' *must* be separated from the first rule by whitespace and *must*
be followed by a new line.  You *cannot* write it as:

        foo | bar       matched_foo_or_bar();

even though you might think you could because yacc supports this syntax.
The reason for this unfortunate incompatibility is historical, but it's
unlikely to be changed.

Your problems with start condition scope are simply due to syntax errors
from your use of '|' later confusing flex.

Let me know if you still have problems.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Why can't flex understand this variable trailing context pattern?
@unnumberedsec Why can't flex understand this variable trailing context pattern?
@example
@verbatim
To: Gregory Margo <gmargo@newton.vip.best.com>
Subject: Re: flex-2.5.3 bug report
In-reply-to: Your message of Sat, 23 Nov 1996 16:50:09 PST.
Date: Sat, 23 Nov 1996 17:07:32 PST
From: Vern Paxson <vern>

> Enclosed is a lex file that "real" lex will process, but I cannot get
> flex to process it.  Could you try it and maybe point me in the right direction?

Your problem is that some of the definitions in the scanner use the '/'
trailing context operator, and have it enclosed in ()'s.  Flex does not
allow this operator to be enclosed in ()'s because doing so allows undefined
regular expressions such as "(a/b)+".  So the solution is to remove the
parentheses.  Note that you must also be building the scanner with the -l
option for AT&T lex compatibility.  Without this option, flex automatically
encloses the definitions in parentheses.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node The ^ operator isn't working
@unnumberedsec The ^ operator isn't working
@example
@verbatim
To: Thomas Hadig <hadig@toots.physik.rwth-aachen.de>
Subject: Re: Flex Bug ?
In-reply-to: Your message of Tue, 26 Nov 1996 14:35:01 PST.
Date: Tue, 26 Nov 1996 11:15:05 PST
From: Vern Paxson <vern>

> In my lexer code, i have the line :
> ^\*.*          { }
>
> Thus all lines starting with an astrix (*) are comment lines.
> This does not work !

I can't get this problem to reproduce - it works fine for me.  Note
though that if what you have is slightly different:

        COMMENT ^\*.*
        %%
        {COMMENT}       { }

then it won't work, because flex pushes back macro definitions enclosed
in ()'s, so the rule becomes

        (^\*.*)         { }

and now that the '^' operator is not at the immediate beginning of the
line, it's interpreted as just a regular character.  You can avoid this
behavior by using the "-l" lex-compatibility flag, or "%option lex-compat".

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
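
One way to keep the anchor working without the lex-compatibility flag is to
leave @samp{^} out of the definition and write it in the rule itself, where it
is still the first character of the pattern.  The following is a sketch under
that assumption; the name @code{COMMENT_BODY}, the @code{%option noyywrap}
line and @code{main()} are made up for the example.

@example
@verbatim
/* Illustrative sketch only: anchor in the rule, not in the definition. */
%option noyywrap
COMMENT_BODY    \*.*
%%
^{COMMENT_BODY}     { /* discard lines that start with '*' */ }
%%
int main(void)
{
    yylex();    /* everything else is echoed by the default rule */
    return 0;
}
@end verbatim
@end example

Here the definition expands to @samp{(\*.*)}, so the rule reads
@samp{^(\*.*)} and the @samp{^} keeps its beginning-of-line meaning.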
@node Trailing context is getting confused with trailing optional patterns
@unnumberedsec Trailing context is getting confused with trailing optional patterns
@example
@verbatim
To: Adoram Rogel <adoram@hybridge.com>
Subject: Re: Flex 2.5.4 BOF ???
In-reply-to: Your message of Tue, 26 Nov 1996 16:10:41 PST.
Date: Wed, 27 Nov 1996 10:56:25 PST
From: Vern Paxson <vern>

>     Organization(s)?/[a-z]
>
> This matched "Organizations" (looking in debug mode, the trailing s
> was matched with trailing context instead of the optional (s) in the
> end of the word.

That should only happen with lex.  Flex can properly match this pattern.
(That might be what you're saying, I'm just not sure.)

> Is there a way to avoid this dangerous trailing context problem ?

Unfortunately, there's no easy way.  On the other hand, I don't see why
it should be a problem.  Lex's matching is clearly wrong, and I'd hope
that usually the intent remains the same as expressed with the pattern,
so flex's matching will be correct.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Is flex GNU or not?
@unnumberedsec Is flex GNU or not?
@example
@verbatim
To: Cameron MacKinnon <mackin@interlog.com>
Subject: Re: Flex documentation bug
In-reply-to: Your message of Mon, 02 Dec 1996 00:07:08 PST.
Date: Sun, 01 Dec 1996 22:29:39 PST
From: Vern Paxson <vern>

> I'm not sure how or where to submit bug reports (documentation or
> otherwise) for the GNU project stuff ...

Well, strictly speaking flex isn't part of the GNU project.  They just
distribute it because no one's written a decent GPL'd lex replacement.
So you should send bugs directly to me.  Those sent to the GNU folks
sometimes find their way to me, but some may drop between the cracks.

> In GNU Info, under the section 'Start Conditions', and also in the man
> page (mine's dated April '95) is a nice little snippet showing how to
> parse C quoted strings into a buffer, defined to be MAX_STR_CONST in
> size. Unfortunately, no overflow checking is ever done ...

This is already mentioned in the manual:

    Finally, here's an example of how to  match  C-style  quoted
    strings using exclusive start conditions, including expanded
    escape sequences (but not including checking  for  a  string
    that's too long):

The reason for not doing the overflow checking is that it will needlessly
clutter up an example whose main purpose is just to demonstrate how to
use flex.

The latest release is 2.5.4, by the way, available from ftp.ee.lbl.gov.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node ERASEME53
@unnumberedsec ERASEME53
@example
@verbatim
To: tsv@cs.UManitoba.CA
Subject: Re: Flex (reg)..
In-reply-to: Your message of Thu, 06 Mar 1997 23:50:16 PST.
Date: Thu, 06 Mar 1997 15:54:19 PST
From: Vern Paxson <vern>

> [:alpha:] ([:alnum:] | \\_)*

If your rule really has embedded blanks as shown above, then it won't
work, as the first blank delimits the rule from the action.  (It wouldn't
even compile ...)  You need instead:

[:alpha:]([:alnum:]|\\_)*

and that should work fine - there's no restriction on what can go inside
of ()'s except for the trailing context operator, '/'.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node I need to scan if-then-else blocks and while loops
@unnumberedsec I need to scan if-then-else blocks and while loops
@example
@verbatim
To: "Mike Stolnicki" <mstolnic@ford.com>
Subject: Re: FLEX help
In-reply-to: Your message of Fri, 30 May 1997 13:33:27 PDT.
Date: Fri, 30 May 1997 10:46:35 PDT
From: Vern Paxson <vern>

> We'd like to add "if-then-else", "while", and "for" statements to our
> language ...
> We've investigated many possible solutions.  The one solution that seems
> the most reasonable involves knowing the position of a TOKEN in yyin.

I strongly advise you to instead build a parse tree (abstract syntax tree)
and loop over that.  You'll find this has major benefits in keeping
your interpreter simple and extensible.

That said, the functionality you mention for get_position and set_position
have been on the to-do list for a while.  As flex is a purely spare-time
project for me, no guarantees when this will be added (in particular, it
for sure won't be for many months to come).

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node ERASEME55
@unnumberedsec ERASEME55
@example
@verbatim
To: Colin Paul Adams <colin@colina.demon.co.uk>
Subject: Re: Flex C++ classes and Bison
In-reply-to: Your message of 09 Aug 1997 17:11:41 PDT.
Date: Fri, 15 Aug 1997 10:48:19 PDT
From: Vern Paxson <vern>

> #define YY_DECL   int yylex (YYSTYPE *lvalp, struct parser_control
> *parm)
>
> I have been trying  to get this to work as a C++ scanner, but it does
> not appear to be possible (warning that it matches no declarations in
> yyFlexLexer, or something like that).
>
> Is this supposed to be possible, or is it being worked on (I DID
> notice the comment that scanner classes are still experimental, so I'm
> not too hopeful)?

What you need to do is derive a subclass from yyFlexLexer that provides
the above yylex() method, squirrels away lvalp and parm into member
variables, and then invokes yyFlexLexer::yylex() to do the regular scanning.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node ERASEME56
@unnumberedsec ERASEME56
@example
@verbatim
To: Mikael.Latvala@lmf.ericsson.se
Subject: Re: Possible mistake in Flex v2.5 document
In-reply-to: Your message of Fri, 05 Sep 1997 16:07:24 PDT.
Date: Fri, 05 Sep 1997 10:01:54 PDT
From: Vern Paxson <vern>

> In that example you show how to count comment lines when using
> C style /* ... */ comments. My question is, shouldn't you take into
> account a scenario where end of a comment marker occurs inside
> character or string literals?

The scanner certainly needs to also scan character and string literals.
However it does that (there's an example in the man page for strings), the
lexer will recognize the beginning of the literal before it runs across the
embedded "/*".  Consequently, it will finish scanning the literal before it
even considers the possibility of matching "/*".

Example:

        '([^']*|{ESCAPE_SEQUENCE})'

will match all the text between the ''s (inclusive).  So the lexer
considers this as a token beginning at the first ', and doesn't even
attempt to match other tokens inside it.

I think this subtlety is not worth putting in the manual, as I suspect
it would confuse more people than it would enlighten.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node ERASEME57
@unnumberedsec ERASEME57
@example
@verbatim
To: "Marty Leisner" <leisner@sdsp.mc.xerox.com>
Subject: Re: flex limitations
In-reply-to: Your message of Sat, 06 Sep 1997 11:27:21 PDT.
Date: Mon, 08 Sep 1997 11:38:08 PDT
From: Vern Paxson <vern>

> %%
> [a-zA-Z]+       /* skip a line */
>                {  printf("got %s\n", yytext); }
> %%

What version of flex are you using?
If I feed this to 2.5.4, it complains:

        "bug.l", line 5: EOF encountered inside an action
        "bug.l", line 5: unrecognized rule
        "bug.l", line 5: fatal parse error

Not the world's greatest error message, but it manages to flag the problem.

(With the introduction of start condition scopes, flex can't accommodate
an action on a separate line, since it's ambiguous with an indented rule.)

You can get 2.5.4 from ftp.ee.lbl.gov.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Is there a repository for flex scanners?
@unnumberedsec Is there a repository for flex scanners?

Not that we know of.  You might try asking on comp.compilers.

@c TODO: Evaluate this faq.
@node How can I conditionally compile or preprocess my flex input file?
@unnumberedsec How can I conditionally compile or preprocess my flex input file?

Flex doesn't have a preprocessor like C does.  You might try using m4, or the C
preprocessor plus a sed script to clean up the result.

@c TODO: Evaluate this faq.
@node Where can I find grammars for lex and yacc?
@unnumberedsec Where can I find grammars for lex and yacc?

In the sources for flex and bison.

@c TODO: Evaluate this faq.
@node I get an end-of-buffer message for each character scanned.
@unnumberedsec I get an end-of-buffer message for each character scanned.

This will happen if your @code{LexerInput()} function returns only one character
at a time, which can happen either if your scanner is "interactive", or
if the streams library on your platform always returns 1 for @code{yyin->gcount()}.

Solution: override @code{LexerInput()} with a version that returns whole buffers.

@c TODO: Evaluate this faq.
@node unnamed-faq-62
@unnumberedsec unnamed-faq-62
@example
@verbatim
To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE
Subject: Re: Flex maximums
In-reply-to: Your message of Mon, 17 Nov 1997 17:16:06 PST.
Date: Mon, 17 Nov 1997 17:16:15 PST
From: Vern Paxson <vern>

> I took a quick look into the flex-sources and altered some #defines in
> flexdefs.h:
>
>       #define INITIAL_MNS 64000
>       #define MNS_INCREMENT 1024000
>       #define MAXIMUM_MNS 64000

The things to fix are to add a couple of zeroes to:

#define JAMSTATE -32766 /* marks a reference to the state that always jams */
#define MAXIMUM_MNS 31999
#define BAD_SUBSCRIPT -32767
#define MAX_SHORT 32700

and, if you get complaints about too many rules, make the following change too:

        #define YY_TRAILING_MASK 0x200000
        #define YY_TRAILING_HEAD_MASK 0x400000

- Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-63
@unnumberedsec unnamed-faq-63
@example
@verbatim
To: jimmey@lexis-nexis.com (Jimmey Todd)
Subject: Re: FLEX question regarding istream vs ifstream
In-reply-to: Your message of Mon, 08 Dec 1997 15:54:15 PST.
Date: Mon, 15 Dec 1997 13:21:35 PST
From: Vern Paxson <vern>

>         stdin_handle = YY_CURRENT_BUFFER;
>         ifstream fin( "aFile" );
>         yy_switch_to_buffer( yy_create_buffer( fin, YY_BUF_SIZE ) );
>
> What I'm wanting to do, is pass the contents of a file thru one set
> of rules and then pass stdin thru another set... It works great if, I
> don't use the C++ classes. But since everything else that I'm doing is
> in C++, I thought I'd be consistent.
>
> The problem is that 'yy_create_buffer' is expecting an istream* as it's
> first argument (as stated in the man page). However, fin is a ifstream
> object. Any ideas on what I might be doing wrong?
> Any help would be
> appreciated. Thanks!!

You need to pass &fin, to turn it into an ifstream* instead of an ifstream.
Then its type will be compatible with the expected istream*, because ifstream
is derived from istream.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-64
@unnumberedsec unnamed-faq-64
@example
@verbatim
To: Enda Fadian <fadiane@piercom.ie>
Subject: Re: Question related to Flex man page?
In-reply-to: Your message of Tue, 16 Dec 1997 15:17:34 PST.
Date: Tue, 16 Dec 1997 14:17:09 PST
From: Vern Paxson <vern>

> Can you explain to me what is ment by a long-jump in relation to flex?

Using the longjmp() function while inside yylex() or a routine called by it.

> what is the flex activation frame.

Just yylex()'s stack frame.

> As far as I can see yyrestart will bring me back to the sart of the input
> file and using flex++ isnot really an option!

No, yyrestart() doesn't imply a rewind, even though its name might sound
like it does.  It tells the scanner to flush its internal buffers and
start reading from the given file at its present location.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-65
@unnumberedsec unnamed-faq-65
@example
@verbatim
To: hassan@larc.info.uqam.ca (Hassan Alaoui)
Subject: Re: Need urgent Help
In-reply-to: Your message of Sat, 20 Dec 1997 19:38:19 PST.
Date: Sun, 21 Dec 1997 21:30:46 PST
From: Vern Paxson <vern>

> /usr/lib/yaccpar: In function `int yyparse()':
> /usr/lib/yaccpar:184: warning: implicit declaration of function `int yylex(...)'
>
> ld: Undefined symbol
>    _yylex
>    _yyparse
>    _yyin

This is a known problem with Solaris C++ (and/or Solaris yacc).
I believe
the fix is to explicitly insert some 'extern "C"' statements for the
corresponding routines/symbols.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-66
@unnumberedsec unnamed-faq-66
@example
@verbatim
To: mc0307@mclink.it
Cc: gnu@prep.ai.mit.edu
Subject: Re: [mc0307@mclink.it: Help request]
In-reply-to: Your message of Fri, 12 Dec 1997 17:57:29 PST.
Date: Sun, 21 Dec 1997 22:33:37 PST
From: Vern Paxson <vern>

> This is my definition for float and integer types:
> . . .
> NZD          [1-9]
> ...
> I've tested my program on other lex version (on UNIX Sun Solaris an HP
> UNIX) and it work well, so I think that my definitions are correct.
> There are any differences between Lex and Flex?

There are indeed differences, as discussed in the man page.  The one
you are probably running into is that when flex expands a name definition,
it puts parentheses around the expansion, while lex does not.  There's
an example in the man page of how this can lead to different matching.
Flex's behavior complies with the POSIX standard (or at least with the
last POSIX draft I saw).

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-67
@unnumberedsec unnamed-faq-67
@example
@verbatim
To: hassan@larc.info.uqam.ca (Hassan Alaoui)
Subject: Re: Thanks
In-reply-to: Your message of Mon, 22 Dec 1997 16:06:35 PST.
Date: Mon, 22 Dec 1997 14:35:05 PST
From: Vern Paxson <vern>

> Thank you very much for your help. I compile and link well with C++ while
> declaring 'yylex ...' extern, But a little problem remains. I get a
> segmentation default when executing ( I linked with lfl library) while it
> works well when using LEX instead of flex. Do you have some ideas about the
> reason for this ?

The one possible reason for this that comes to mind is if you've defined
yytext as "extern char yytext[]" (which is what lex uses) instead of
"extern char *yytext" (which is what flex uses). If it's not that, then
I'm afraid I don't know what the problem might be.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-68
@unnumberedsec unnamed-faq-68
@example
@verbatim
To: "Bart Niswonger" <NISWONGR@almaden.ibm.com>
Subject: Re: flex 2.5: c++ scanners & start conditions
In-reply-to: Your message of Tue, 06 Jan 1998 10:34:21 PST.
Date: Tue, 06 Jan 1998 19:19:30 PST
From: Vern Paxson <vern>

> The problem is that when I do this (using %option c++) start
> conditions seem to not apply.

The BEGIN macro modifies the yy_start variable. For C scanners, this
is a static with scope visible through the whole file. For C++ scanners,
it's a member variable, so it only has visible scope within a member
function. Your lexbegin() routine is not a member function when you
build a C++ scanner, so it's not modifying the correct yy_start. The
diagnostic that indicates this is that you found you needed to add
a declaration of yy_start in order to get your scanner to compile when
using C++; instead, the correct fix is to make lexbegin() a member
function (by deriving from yyFlexLexer).

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-69
@unnumberedsec unnamed-faq-69
@example
@verbatim
To: "Boris Zinin" <boris@ippe.rssi.ru>
Subject: Re: current position in flex buffer
In-reply-to: Your message of Mon, 12 Jan 1998 18:58:23 PST.
Date: Mon, 12 Jan 1998 12:03:15 PST
From: Vern Paxson <vern>

> The problem is how to determine the current position in flex active
> buffer when a rule is matched....

You will need to keep track of this explicitly, such as by redefining
YY_USER_ACTION to count the number of characters matched.

The latest flex release, by the way, is 2.5.4, available from ftp.ee.lbl.gov.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-70
@unnumberedsec unnamed-faq-70
@example
@verbatim
To: Bik.Dhaliwal@bis.org
Subject: Re: Flex question
In-reply-to: Your message of Mon, 26 Jan 1998 13:05:35 PST.
Date: Tue, 27 Jan 1998 22:41:52 PST
From: Vern Paxson <vern>

> That requirement involves knowing
> the character position at which a particular token was matched
> in the lexer.

The way you have to do this is by explicitly keeping track of where
you are in the file, by counting the number of characters scanned
for each token (available in yyleng). It may prove convenient to
do this by redefining YY_USER_ACTION, as described in the manual.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-71
@unnumberedsec unnamed-faq-71
@example
@verbatim
To: Vladimir Alexiev <vladimir@cs.ualberta.ca>
Subject: Re: flex: how to control start condition from parser?
In-reply-to: Your message of Mon, 26 Jan 1998 05:50:16 PST.
Date: Tue, 27 Jan 1998 22:45:37 PST
From: Vern Paxson <vern>

> It seems useful for the parser to be able to tell the lexer about such
> context dependencies, because then they don't have to be limited to
> local or sequential context.

One way to do this is to have the parser call a stub routine that's
included in the scanner's .l file, and consequently that has access to
BEGIN.
The only ugliness is that the parser can't pass in the state
it wants, because those aren't visible - but if you don't have many
such states, then using a different set of names doesn't seem like
too much of a burden.

While generating a .h file like you suggest is certainly cleaner,
flex development has come to a virtual stand-still :-(, so a workaround
like the above is much more pragmatic than waiting for a new feature.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-72
@unnumberedsec unnamed-faq-72
@example
@verbatim
To: Barbara Denny <denny@3com.com>
Subject: Re: freebsd flex bug?
In-reply-to: Your message of Fri, 30 Jan 1998 12:00:43 PST.
Date: Fri, 30 Jan 1998 12:42:32 PST
From: Vern Paxson <vern>

> lex.yy.c:1996: parse error before `='

This is the key, identifying this error. (It may help to pinpoint
it by using flex -L, so it doesn't generate #line directives in its
output.) I will bet you heavy money that you have a start condition
name that is also a variable name, or something like that; flex spits
out #define's for each start condition name, mapping them to a number,
so you can wind up with:

        %x foo
        %%
        ...
        %%
        void bar()
                {
                int foo = 3;
                }

and the penultimate will turn into "int 1 = 3" after C preprocessing,
since flex will put "#define foo 1" in the generated scanner.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-73
@unnumberedsec unnamed-faq-73
@example
@verbatim
To: Maurice Petrie <mpetrie@infoscigroup.com>
Subject: Re: Lost flex .l file
In-reply-to: Your message of Mon, 02 Feb 1998 14:10:01 PST.
Date: Mon, 02 Feb 1998 11:15:12 PST
From: Vern Paxson <vern>

> I am curious as to
> whether there is a simple way to backtrack from the generated source to
> reproduce the lost list of tokens we are searching on.

In theory, it's straight-forward to go from the DFA representation
back to a regular-expression representation - the two are isomorphic.
In practice, a huge headache, because you have to unpack all the tables
back into a single DFA representation, and then write a program to munch
on that and translate it into an RE.

Sorry for the less-than-happy news ...

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-74
@unnumberedsec unnamed-faq-74
@example
@verbatim
To: jimmey@lexis-nexis.com (Jimmey Todd)
Subject: Re: Flex performance question
In-reply-to: Your message of Thu, 19 Feb 1998 11:01:17 PST.
Date: Thu, 19 Feb 1998 08:48:51 PST
From: Vern Paxson <vern>

> What I have found, is that the smaller the data chunk, the faster the
> program executes. This is the opposite of what I expected. Should this be
> happening this way?

This is exactly what will happen if your input file has embedded NULs.
From the man page:

A final note: flex is slow when matching NUL's, particularly
when a token contains multiple NUL's. It's best to write
rules which match short amounts of text if it's anticipated
that the text will often include NUL's.

So that's the first thing to look for.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-75
@unnumberedsec unnamed-faq-75
@example
@verbatim
To: jimmey@lexis-nexis.com (Jimmey Todd)
Subject: Re: Flex performance question
In-reply-to: Your message of Thu, 19 Feb 1998 11:01:17 PST.
Date: Thu, 19 Feb 1998 15:42:25 PST
From: Vern Paxson <vern>

So there are several problems.

First, to go fast, you want to match as much text as possible, which
your scanners don't in the case that what they're scanning is *not*
a <RN> tag. So you want a rule like:

        [^<]+

Second, C++ scanners are particularly slow if they're interactive,
which they are by default. Using -B speeds it up by a factor of 3-4
on my workstation.

Third, C++ scanners that use the istream interface are slow, because
of how poorly implemented istream's are. I built two versions of
the following scanner:

        %%
        .*\n
        .*
        %%

and the C version inhales a 2.5MB file on my workstation in 0.8 seconds.
The C++ istream version, using -B, takes 3.8 seconds.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-76
@unnumberedsec unnamed-faq-76
@example
@verbatim
To: "Frescatore, David (CRD, TAD)" <frescatore@exc01crdge.crd.ge.com>
Subject: Re: FLEX 2.5 & THE YEAR 2000
In-reply-to: Your message of Wed, 03 Jun 1998 11:26:22 PDT.
Date: Wed, 03 Jun 1998 10:22:26 PDT
From: Vern Paxson <vern>

> I am researching the Y2K problem with General Electric R&D
> and need to know if there are any known issues concerning
> the above mentioned software and Y2K regardless of version.

There shouldn't be, all it ever does with the date is ask the system
for it and then print it out.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-77
@unnumberedsec unnamed-faq-77
@example
@verbatim
To: "Hans Dermot Doran" <htd@ibhdoran.com>
Subject: Re: flex problem
In-reply-to: Your message of Wed, 15 Jul 1998 21:30:13 PDT.
Date: Tue, 21 Jul 1998 14:23:34 PDT
From: Vern Paxson <vern>

> To overcome this, I gets() the stdin into a string and lex the string. The
> string is lexed OK except that the end of string isn't lexed properly
> (yy_scan_string()), that is the lexer doesn't recognise the end of string.

Flex doesn't contain mechanisms for recognizing buffer endpoints. But if
you use fgets instead (which you should anyway, to protect against buffer
overflows), then the final \n will be preserved in the string, and you can
scan that in order to find the end of the string.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-78
@unnumberedsec unnamed-faq-78
@example
@verbatim
To: soumen@almaden.ibm.com
Subject: Re: Flex++ 2.5.3 instance member vs. static member
In-reply-to: Your message of Mon, 27 Jul 1998 02:10:04 PDT.
Date: Tue, 28 Jul 1998 01:10:34 PDT
From: Vern Paxson <vern>

> %{
> int mylineno = 0;
> %}
> ws      [ \t]+
> alpha   [A-Za-z]
> dig     [0-9]
> %%
>
> Now you'd expect mylineno to be a member of each instance of class
> yyFlexLexer, but is this the case? A look at the lex.yy.cc file seems to
> indicate otherwise; unless I am missing something the declaration of
> mylineno seems to be outside any class scope.
>
> How will this work if I want to run a multi-threaded application with each
> thread creating a FlexLexer instance?

Derive your own subclass and make mylineno a member variable of it.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-79
@unnumberedsec unnamed-faq-79
@example
@verbatim
To: Adoram Rogel <adoram@hybridge.com>
Subject: Re: More than 32K states change hangs
In-reply-to: Your message of Tue, 04 Aug 1998 16:55:39 PDT.
Date: Tue, 04 Aug 1998 22:28:45 PDT
From: Vern Paxson <vern>

> Vern Paxson,
>
> I followed your advice, posted on Usenet by you, and emailed to me
> personally by you, on how to overcome the 32K states limit. I'm running
> on Linux machines.
> I took the full source of version 2.5.4 and did the following changes in
> flexdef.h:
> #define JAMSTATE -327660
> #define MAXIMUM_MNS 319990
> #define BAD_SUBSCRIPT -327670
> #define MAX_SHORT 327000
>
> and compiled.
> All looked fine, including check and bigcheck, so I installed.

Hmmm, you shouldn't increase MAX_SHORT, though looking through my email
archives I see that I did indeed recommend doing so. Try setting it back
to 32700; that should suffice so that you no longer need -Ca. If it still
hangs, then the interesting question is - where?

> Compiling the same hanged program with a out-of-the-box (RedHat 4.2
> distribution of Linux)
> flex 2.5.4 binary works.

Since Linux comes with source code, you should diff it against what
you have to see what problems they missed.

> Should I always compile with the -Ca option now ? even short and simple
> filters ?

No, definitely not. It's meant to be for those situations where you
absolutely must squeeze every last cycle out of your scanner.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-80
@unnumberedsec unnamed-faq-80
@example
@verbatim
To: "Schmackpfeffer, Craig" <Craig.Schmackpfeffer@usa.xerox.com>
Subject: Re: flex output for static code portion
In-reply-to: Your message of Tue, 11 Aug 1998 11:55:30 PDT.
Date: Mon, 17 Aug 1998 23:57:42 PDT
From: Vern Paxson <vern>

> I would like to use flex under the hood to generate a binary file
> containing the data structures that control the parse.

This has been on the wish-list for a long time. In principle it's
straight-forward - you redirect mkdata() et al's I/O to another file,
and modify the skeleton to have a start-up function that slurps these
into dynamic arrays. The concerns are (1) the scanner generation code
is hairy and full of corner cases, so it's easy to get surprised when
going down this path :-( ; and (2) being careful about buffering so
that when the tables change you make sure the scanner starts in the
correct state and reading at the right point in the input file.

> I was wondering if you know of anyone who has used flex in this way.

I don't - but it seems like a reasonable project to undertake (unlike
numerous other flex tweaks :-).

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-81
@unnumberedsec unnamed-faq-81
@example
@verbatim
Received: from 131.173.17.11 (131.173.17.11 [131.173.17.11])
        by ee.lbl.gov (8.9.1/8.9.1) with ESMTP id AAA03838
        for <vern@ee.lbl.gov>; Thu, 20 Aug 1998 00:47:57 -0700 (PDT)
Received: from hal.cl-ki.uni-osnabrueck.de (hal.cl-ki.Uni-Osnabrueck.DE [131.173.141.2])
        by deimos.rz.uni-osnabrueck.de (8.8.7/8.8.8) with ESMTP id JAA34694
        for <vern@ee.lbl.gov>; Thu, 20 Aug 1998 09:47:55 +0200
Received: (from georg@localhost) by hal.cl-ki.uni-osnabrueck.de (8.6.12/8.6.12) id JAA34834 for vern@ee.lbl.gov; Thu, 20 Aug 1998 09:47:54 +0200
From: Georg Rehm <georg@hal.cl-ki.uni-osnabrueck.de>
Message-Id: <199808200747.JAA34834@hal.cl-ki.uni-osnabrueck.de>
Subject: "flex scanner push-back overflow"
To: vern@ee.lbl.gov
Date: Thu, 20 Aug 1998 09:47:54 +0200 (MEST)
Reply-To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE
X-NoJunk: Do NOT send commercial mail, spam or ads to this address!
X-URL: http://www.cl-ki.uni-osnabrueck.de/~georg/
X-Mailer: ELM [version 2.4ME+ PL28 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Hi Vern,

Yesterday, I encountered a strange problem: I use the macro processor m4
to include some lengthy lists into a .l file. Following is a flex macro
definition that causes some serious pain in my neck:

AUTHOR ("A. Boucard / L. Boucard"|"A. Dastarac / M. Levent"|"A.Boucaud / L.Boucaud"|"Abderrahim Lamchichi"|"Achmat Dangor"|"Adeline Toullier"|"Adewale Maja-Pearce"|"Ahmed Ziri"|"Akram Ellyas"|"Alain Bihr"|"Alain Gresh"|"Alain Guillemoles"|"Alain Joxe"|"Alain Morice"|"Alain Renon"|"Alain Zecchini"|"Albert Memmi"|"Alberto Manguel"|"Alex De Waal"|"Alfonso Artico"| [...])

The complete list contains about 10kB. When I try to "flex" this file
(on a Solaris 2.6 machine, using a modified flex 2.5.4 (I only increased
some of the predefined values in flexdefs.h) I get the error:

myflex/flex -8 sentag.tmp.l
flex scanner push-back overflow

When I remove the slashes in the macro definition everything works fine.
As I understand it, the double quotes escape the slash-character so it
really means "/" and not "trailing context". Furthermore, I tried to
escape the slashes with backslashes, but with no use, the same error message
appeared when flexing the code.

Do you have an idea what's going on here?

Greetings from Germany,
        Georg
--
Georg Rehm georg@cl-ki.uni-osnabrueck.de
Institute for Semantic Information Processing, University of Osnabrueck, FRG
@end verbatim
@end example

@c TODO: Evaluate this faq.
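
A name definition of this size is not the only way to recognise a fixed
list of names. The following C sketch is an illustration only and is not
part of the original correspondence. It assumes the scanner matches names
with some more general pattern and then checks @code{yytext} against a
sorted table from within the action. The function name
@code{is_known_author} and the shortened table are hypothetical.

@example
@verbatim
```c
#include <stdlib.h>
#include <string.h>

/* Shortened, hypothetical table; it must stay sorted in strcmp() order
   because bsearch() relies on that ordering. */
static const char *const known_authors[] = {
    "Abderrahim Lamchichi",
    "Achmat Dangor",
    "Adeline Toullier",
    "Ahmed Ziri",
    "Alain Gresh",
};

/* bsearch() comparator: the key is a string, each element is a pointer
   to a string. */
static int compare_name(const void *key, const void *element)
{
    const char *name = key;
    const char *const *entry = element;
    return strcmp(name, *entry);
}

/* Returns 1 if text is in the table, 0 otherwise. */
int is_known_author(const char *text)
{
    size_t count = sizeof known_authors / sizeof known_authors[0];
    return bsearch(text, known_authors, count,
                   sizeof known_authors[0], compare_name) != NULL;
}
```
@end verbatim
@end example

An action could then call @code{is_known_author(yytext)} and return a
different token depending on the result, which keeps the list out of the
pattern altogether.
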
@node unnamed-faq-82
@unnumberedsec unnamed-faq-82
@example
@verbatim
To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE
Subject: Re: "flex scanner push-back overflow"
In-reply-to: Your message of Thu, 20 Aug 1998 09:47:54 PDT.
Date: Thu, 20 Aug 1998 07:05:35 PDT
From: Vern Paxson <vern>

> myflex/flex -8 sentag.tmp.l
> flex scanner push-back overflow

Flex itself uses a flex scanner. That scanner is running out of buffer
space when it tries to unput() the humongous macro you've defined. When
you remove the '/'s, you make it small enough so that it fits in the buffer;
removing spaces would do the same thing.

The fix is to either rethink how come you're using such a big macro and
perhaps there's another/better way to do it; or to rebuild flex's own
scan.c with a larger value for

        #define YY_BUF_SIZE 16384

- Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-83
@unnumberedsec unnamed-faq-83
@example
@verbatim
To: Jan Kort <jan@research.techforce.nl>
Subject: Re: Flex
In-reply-to: Your message of Fri, 04 Sep 1998 12:18:43 +0200.
Date: Sat, 05 Sep 1998 00:59:49 PDT
From: Vern Paxson <vern>

> %%
>
> "TEST1\n"       { fprintf(stderr, "TEST1\n"); yyless(5); }
> ^\n             { fprintf(stderr, "empty line\n"); }
> .               { }
> \n              { fprintf(stderr, "new line\n"); }
>
> %%
> -- input ---------------------------------------
> TEST1
> -- output --------------------------------------
> TEST1
> empty line
> ------------------------------------------------

IMHO, it's not clear whether or not this is in fact a bug. It depends
on whether you view yyless() as backing up in the input stream, or as
pushing new characters onto the beginning of the input stream.
Flex interprets it as the latter (for implementation convenience, I'll admit),
and so considers the newline as in fact matching at the beginning of a
line, as after all the last token scanned an entire line and so the
scanner is now at the beginning of a new line.

I agree that this is counter-intuitive for yyless(), given its
functional description (it's less so for unput(), depending on whether
you're unput()'ing new text or scanned text). But I don't plan to
change it any time soon, as it's a pain to do so. Consequently,
you do indeed need to use yy_set_bol() and YY_AT_BOL() to tweak
your scanner into the behavior you desire.

Sorry for the less-than-completely-satisfactory answer.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-84
@unnumberedsec unnamed-faq-84
@example
@verbatim
To: Patrick Krusenotto <krusenot@mac-info-link.de>
Subject: Re: Problems with restarting flex-2.5.2-generated scanner
In-reply-to: Your message of Thu, 24 Sep 1998 10:14:07 PDT.
Date: Thu, 24 Sep 1998 23:28:43 PDT
From: Vern Paxson <vern>

> I am using flex-2.5.2 and bison 1.25 for Solaris and I am desperately
> trying to make my scanner restart with a new file after my parser stops
> with a parse error. When my compiler restarts, the parser always
> receives the token after the token (in the old file!) that caused the
> parser error.

I suspect the problem is that your parser has read ahead in order
to attempt to resolve an ambiguity, and when it's restarted it picks
up with that token rather than reading a fresh one. If you're using
yacc, then the special "error" production can sometimes be used to
consume tokens in an attempt to get the parser into a consistent state.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
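
The read-ahead effect described in the reply above can be modelled with a
toy token stream that buffers one token of lookahead. This is a simplified
C sketch under stated assumptions, not the code that yacc or bison
generates, and every type and function name in it is hypothetical. It
shows why switching the input alone is not enough: the buffered token has
to be discarded as well. In yacc and bison grammars the @code{yyclearin}
macro, used inside an action, plays a comparable role.

@example
@verbatim
```c
#include <stddef.h>

/* Toy model (hypothetical): a token source with one buffered lookahead. */
typedef struct {
    const int *tokens;   /* token codes, terminated by 0 (end of input) */
    size_t pos;          /* index of the next unread token */
    int lookahead;       /* buffered token, valid only if has_lookahead */
    int has_lookahead;
} TokenStream;

void ts_open(TokenStream *ts, const int *tokens)
{
    ts->tokens = tokens;
    ts->pos = 0;
    ts->lookahead = 0;
    ts->has_lookahead = 0;
}

/* Read the next raw token; stays on the terminating 0 once reached. */
static int ts_read(TokenStream *ts)
{
    int token = ts->tokens[ts->pos];
    if (token != 0)
        ts->pos++;
    return token;
}

/* Look at the next token without consuming it (this is the read-ahead). */
int ts_peek(TokenStream *ts)
{
    if (!ts->has_lookahead) {
        ts->lookahead = ts_read(ts);
        ts->has_lookahead = 1;
    }
    return ts->lookahead;
}

/* Consume the next token, using the buffered one if present. */
int ts_next(TokenStream *ts)
{
    if (ts->has_lookahead) {
        ts->has_lookahead = 0;
        return ts->lookahead;
    }
    return ts_read(ts);
}

/* Switches input but keeps the stale lookahead: reproduces the symptom. */
void ts_switch_input_only(TokenStream *ts, const int *tokens)
{
    ts->tokens = tokens;
    ts->pos = 0;
}

/* Switches input and discards the buffered token as well. */
void ts_restart(TokenStream *ts, const int *tokens)
{
    ts_switch_input_only(ts, tokens);
    ts->has_lookahead = 0;
}
```
@end verbatim
@end example

After @code{ts_switch_input_only}, the first token delivered still comes
from the old input, which matches the symptom in the question; after
@code{ts_restart}, the first token comes from the new input.
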
@node unnamed-faq-85
@unnumberedsec unnamed-faq-85
@example
@verbatim
To: Henric Jungheim <junghelh@pe-nelson.com>
Subject: Re: flex 2.5.4a
In-reply-to: Your message of Tue, 27 Oct 1998 16:41:42 PST.
Date: Tue, 27 Oct 1998 16:50:14 PST
From: Vern Paxson <vern>

> This brings up a feature request: How about a command line
> option to specify the filename when reading from stdin? That way one
> doesn't need to create a temporary file in order to get the "#line"
> directives to make sense.

Use -o combined with -t (per the man page description of -o).

> P.S., Is there any simple way to use non-blocking IO to parse multiple
> streams?

Simple, no.

One approach might be to return a magic character on EWOULDBLOCK and
have a rule

        .*<magic-character> // put back .*, eat magic character

This is off the top of my head, not sure it'll work.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-86
@unnumberedsec unnamed-faq-86
@example
@verbatim
To: "Repko, Billy D" <billy.d.repko@intel.com>
Subject: Re: Compiling scanners
In-reply-to: Your message of Wed, 13 Jan 1999 10:52:47 PST.
Date: Thu, 14 Jan 1999 00:25:30 PST
From: Vern Paxson <vern>

> It appears that maybe it cannot find the lfl library.

The Makefile in the distribution builds it, so you should have it.
It's exceedingly trivial, just a main() that calls yylex() and
a yywrap() that always returns 1.

> %%
>       \n      ++num_lines; ++num_chars;
>       .       ++num_chars;

You can't indent your rules like this - that's where the errors are coming
from. Flex copies indented text to the output file, it's how you do things
like

        int num_lines_seen = 0;

to declare local variables.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
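
For reference, the two library routines described in the reply above
amount to roughly the following. This is a C sketch, not the library
source. The stand-in @code{yylex()} is hypothetical and exists only so
that the fragment compiles without a generated scanner, and the driver is
named @code{scanner_main} here, where the library itself provides
@code{main()}.

@example
@verbatim
```c
/* Hypothetical stand-in for the flex-generated scanner: it reports three
   tokens and then end of input (0). In a real build, remove this and
   link the generated lex.yy.c instead. */
static int tokens_left = 3;

int yylex(void)
{
    if (tokens_left > 0)
        return tokens_left--;
    return 0;
}

/* Called by the scanner at end of file; returning 1 means there is no
   further input to switch to. */
int yywrap(void)
{
    return 1;
}

/* Driver loop: keep calling the scanner until it reports end of input.
   The library's version of this routine is main(). */
int scanner_main(void)
{
    while (yylex() != 0)
        ;
    return 0;
}
```
@end verbatim
@end example

A program that defines its own @code{main()} and @code{yywrap()}, or that
uses @code{%option noyywrap}, does not need to link against the library.
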
@node unnamed-faq-87
@unnumberedsec unnamed-faq-87
@example
@verbatim
To: Erick Branderhorst <Erick.Branderhorst@asml.nl>
Subject: Re: flex input buffer
In-reply-to: Your message of Tue, 09 Feb 1999 13:53:46 PST.
Date: Tue, 09 Feb 1999 21:03:37 PST
From: Vern Paxson <vern>

> In the flex.skl file the size of the default input buffers is set. Can you
> explain why this size is set and why it is such a high number.

It's large to optimize performance when scanning large files. You can
safely make it a lot lower if needed.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-88
@unnumberedsec unnamed-faq-88
@example
@verbatim
To: "Guido Minnen" <guidomi@cogs.susx.ac.uk>
Subject: Re: Flex error message
In-reply-to: Your message of Wed, 24 Feb 1999 15:31:46 PST.
Date: Thu, 25 Feb 1999 00:11:31 PST
From: Vern Paxson <vern>

> I'm extending a larger scanner written in Flex and I keep running into
> problems. More specifically, I get the error message:
> "flex: input rules are too complicated (>= 32000 NFA states)"

Increase the definitions in flexdef.h for:

#define JAMSTATE -32766 /* marks a reference to the state that always jams */
#define MAXIMUM_MNS 31999
#define BAD_SUBSCRIPT -32767

recompile everything, and it should all work.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-90
@unnumberedsec unnamed-faq-90
@example
@verbatim
To: "Dmitriy Goldobin" <gold@ems.chel.su>
Subject: Re: FLEX trouble
In-reply-to: Your message of Mon, 31 May 1999 18:44:49 PDT.
Date: Tue, 01 Jun 1999 00:15:07 PDT
From: Vern Paxson <vern>

> I have a trouble with FLEX. Why rule "/*".*"*/" work properly,
> but rule "/*"(.|\n)*"*/" don't work ?

The second of these will have to scan the entire input stream (because
"(.|\n)*" matches an arbitrary amount of any text) in order to see if
it ends with "*/", terminating the comment. That potentially will overflow
the input buffer.

> More complex rule "/*"([^*]|(\*/[^/]))*"*/ give an error
> 'unrecognized rule'.

You can't use the '/' operator inside parentheses. It's not clear
what "(a/b)*" actually means.

> I now use workaround with state <comment>, but single-rule is
> better, i think.

Single-rule is nice but will always have the problem of either setting
restrictions on comments (like not allowing multi-line comments) and/or
running the risk of consuming the entire input stream, as noted above.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-91
@unnumberedsec unnamed-faq-91
@example
@verbatim
Received: from mc-qout4.whowhere.com (mc-qout4.whowhere.com [209.185.123.18])
        by ee.lbl.gov (8.9.3/8.9.3) with SMTP id IAA05100
        for <vern@ee.lbl.gov>; Tue, 15 Jun 1999 08:56:06 -0700 (PDT)
Received: from Unknown/Local ([?.?.?.?]) by my-deja.com; Tue Jun 15 08:55:43 1999
To: vern@ee.lbl.gov
Date: Tue, 15 Jun 1999 08:55:43 -0700
From: "Aki Niimura" <neko@my-deja.com>
Message-ID: <KNONDOHDOBGAEAAA@my-deja.com>
Mime-Version: 1.0
Cc:
X-Sent-Mail: on
Reply-To:
X-Mailer: MailCity Service
Subject: A question on flex C++ scanner
X-Sender-Ip: 12.72.207.61
Organization: My Deja Email (http://www.my-deja.com:80)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Dear Dr. Paxon,

I have been using flex for years.
It works very well on many projects.
Most case, I used it to generate a scanner on C language.
However, one project I needed to generate a scanner
on C++ language.
Thanks to your enhancement, flex did
the job.

Currently, I'm working on enhancing my previous project.
I need to deal with multiple input streams (recursive
inclusion) in this scanner (C++).
I did similar thing for another scanner (C) as you
explained in your documentation.

The generated scanner (C++) has necessary methods:
- switch_to_buffer(struct yy_buffer_state *b)
- yy_create_buffer(istream *is, int sz)
- yy_delete_buffer(struct yy_buffer_state *b)

However, I couldn't figure out how to access current
buffer (yy_current_buffer).

yy_current_buffer is a protected member of yyFlexLexer.
I can't access it directly.
Then, I thought yy_create_buffer() with is = 0 might
return current stream buffer. But it seems not as far
as I checked the source. (flex 2.5.4)

I went through the Web in addition to Flex documentation.
However, it hasn't been successful, so far.

It is not my intention to bother you, but, can you
comment about how to obtain the current stream buffer?

Your response would be highly appreciated.

Best regards,
Aki Niimura

--== Sent via Deja.com http://www.deja.com/ ==--
Share what you know. Learn what you don't.
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-92
@unnumberedsec unnamed-faq-92
@example
@verbatim
To: neko@my-deja.com
Subject: Re: A question on flex C++ scanner
In-reply-to: Your message of Tue, 15 Jun 1999 08:55:43 PDT.
Date: Tue, 15 Jun 1999 09:04:24 PDT
From: Vern Paxson <vern>

> However, I couldn't figure out how to access current
> buffer (yy_current_buffer).

Derive your own subclass from yyFlexLexer.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-93
@unnumberedsec unnamed-faq-93
@example
@verbatim
To: "Stones, Darren" <Darren.Stones@nectech.co.uk>
Subject: Re: You're the man to see?
In-reply-to: Your message of Wed, 23 Jun 1999 11:10:29 PDT.
Date: Wed, 23 Jun 1999 09:01:40 PDT
From: Vern Paxson <vern>

> I hope you can help me. I am using Flex and Bison to produce an interpreted
> language. However all goes well until I try to implement an IF statement or
> a WHILE. I cannot get this to work as the parser parses all the conditions
> eg. the TRUE and FALSE conditions to check for a rule match. So I cannot
> make a decision!!

You need to use the parser to build a parse tree (= abstract syntax tree),
and when that's all done you recursively evaluate the tree, binding variables
to values at that time.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-94
@unnumberedsec unnamed-faq-94
@example
@verbatim
To: Petr Danecek <petr@ics.cas.cz>
Subject: Re: flex - question
In-reply-to: Your message of Mon, 28 Jun 1999 19:21:41 PDT.
Date: Fri, 02 Jul 1999 16:52:13 PDT
From: Vern Paxson <vern>

> file, it takes an enormous amount of time. It is funny, because the
> source code has only 12 rules!!! I think it looks like an exponential
> growth.

Right, that's the problem - some patterns (those with a lot of
ambiguity, which yours has because at any given time the scanner can
be in the middle of all sorts of combinations of the different
rules) blow up exponentially.

For your rules, there is an easy fix. Change the ".*" that comes after
the directory name to "[^ ]*". With that in place, the rules are no
longer nearly so ambiguous, because then once one of the directories
has been matched, no other can be matched (since they all require a
leading blank).

If that's not an acceptable solution, then you can enter a start state
to pick up the .*\n after each directory is matched.

Also note that for speed, you'll want to add a ".*" rule at the end,
otherwise lines that don't match any of the patterns will be matched
very slowly, a character at a time.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-95
@unnumberedsec unnamed-faq-95
@example
@verbatim
To: Tielman Koekemoer <tielman@spi.co.za>
Subject: Re: Please help.
In-reply-to: Your message of Thu, 08 Jul 1999 13:20:37 PDT.
Date: Thu, 08 Jul 1999 08:20:39 PDT
From: Vern Paxson <vern>

> I was hoping you could help me with my problem.
>
> I tried compiling (gnu)flex on a Solaris 2.4 machine
> but when I ran make (after configure) I got an error.
>
> --------------------------------------------------------------
> gcc -c -I. -I. -g -O parse.c
> ./flex -t -p ./scan.l >scan.c
> sh: ./flex: not found
> *** Error code 1
> make: Fatal error: Command failed for target `scan.c'
> -------------------------------------------------------------
>
> What's strange to me is that I'm only
> trying to install flex now. I then edited the Makefile to
> and changed where it says "FLEX = flex" to "FLEX = lex"
> ( lex: the native Solaris one ) but then it complains about
> the "-p" option. Is there any way I can compile flex without
> using flex or lex?
>
> Thanks so much for your time.

You managed to step on the bootstrap sequence, which first copies
initscan.c to scan.c in order to build flex. Try fetching a fresh
distribution from ftp.ee.lbl.gov. (Or you can first try removing
".bootstrap" and doing a make again.)

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-96
@unnumberedsec unnamed-faq-96
@example
@verbatim
To: Tielman Koekemoer <tielman@spi.co.za>
Subject: Re: Please help.
In-reply-to: Your message of Fri, 09 Jul 1999 09:16:14 PDT.
Date: Fri, 09 Jul 1999 00:27:20 PDT
From: Vern Paxson <vern>

> First I removed .bootstrap (and ran make) - no luck. I downloaded the
> software but I still have the same problem. Is there anything else I
> could try.

Try:

        cp initscan.c scan.c
        touch scan.c
        make scan.o

If this last tries to first build scan.c from scan.l using ./flex, then
your "make" is broken, in which case compile scan.c to scan.o by hand.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-97
@unnumberedsec unnamed-faq-97
@example
@verbatim
To: Sumanth Kamenani <skamenan@crl.nmsu.edu>
Subject: Re: Error
In-reply-to: Your message of Mon, 19 Jul 1999 23:08:41 PDT.
Date: Tue, 20 Jul 1999 00:18:26 PDT
From: Vern Paxson <vern>

> I am getting a compilation error. The error is given as "unknown symbol- yylex".

The parser relies on calling yylex(), but you're instead using the C++ scanning
class, so you need to supply a yylex() "glue" function that calls the yylex()
method of a scanner instance (e.g., "scanner->yylex()").

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-98
@unnumberedsec unnamed-faq-98
@example
@verbatim
To: daniel@synchrods.synchrods.COM (Daniel Senderowicz)
Subject: Re: lex
In-reply-to: Your message of Mon, 22 Nov 1999 11:19:04 PST.
Date: Tue, 23 Nov 1999 15:54:30 PST
From: Vern Paxson <vern>

Well, your problem is the

switch (yybgin-yysvec-1) {      /* witchcraft */

at the beginning of lex rules.  "witchcraft" == "non-portable".
It's assuming knowledge of the AT&T lex's internal variables.

For flex, you can probably do the equivalent using a switch on YYSTATE.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-99
@unnumberedsec unnamed-faq-99
@example
@verbatim
To: archow@hss.hns.com
Subject: Re: Regarding distribution of flex and yacc based grammars
In-reply-to: Your message of Sun, 19 Dec 1999 17:50:24 +0530.
Date: Wed, 22 Dec 1999 01:56:24 PST
From: Vern Paxson <vern>

> When we provide the customer with an object code distribution, is it
> necessary for us to provide source
> for the generated C files from flex and bison since they are generated by
> flex and bison ?

For flex, no.  I don't know what the current state of this is for bison.

> Also, is there any requirement for us to necessarily provide source for
> the grammar files which are fed into flex and bison ?

Again, for flex, no.

See the file "COPYING" in the flex distribution for the legalese.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-100
@unnumberedsec unnamed-faq-100
@example
@verbatim
To: Martin Gallwey <gallweym@hyperion.moe.ul.ie>
Subject: Re: Flex, and self referencing rules
In-reply-to: Your message of Sun, 20 Feb 2000 01:01:21 PST.
Date: Sat, 19 Feb 2000 18:33:16 PST
From: Vern Paxson <vern>

> However, I do not use unput anywhere. I do use self-referencing
> rules like this:
>
> UnaryExpr               ({UnionExpr})|("-"{UnaryExpr})

You can't do this - flex is *not* a parser like yacc (which does indeed
allow recursion), it is a scanner that's confined to regular expressions.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
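The recursion that the self-referencing definition above tries to express
belongs in the parser, not in the scanner. The following is only a minimal
sketch in plain C, not anything generated by @code{flex} or @code{yacc}: the
function names are hypothetical, and @code{UnionExpr} is assumed, purely for
illustration, to be a decimal integer. It shows a hand-written
recursive-descent routine handling the rule that a regular expression cannot
refer to itself to describe.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for UnionExpr: a plain decimal integer.
   Advances *p past the digits it consumed. */
static long parse_union_expr(const char **p)
{
    char *end;
    long value = strtol(*p, &end, 10);
    *p = end;
    return value;
}

/* UnaryExpr := UnionExpr | "-" UnaryExpr
   The rule refers to itself, so the function calls itself. */
static long parse_unary_expr(const char **p)
{
    assert(p != NULL && *p != NULL);
    if (**p == '-') {
        ++*p;                         /* consume the "-" */
        return -parse_unary_expr(p);  /* recurse on the rest */
    }
    return parse_union_expr(p);
}
```

In a real scanner and parser pair, the scanner would return the minus sign and
the operand as separate tokens, and a grammar rule in the parser would carry
the recursion.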
@node unnamed-faq-101
@unnumberedsec unnamed-faq-101
@example
@verbatim
To: slg3@lehigh.edu (SAMUEL L. GULDEN)
Subject: Re: Flex problem
In-reply-to: Your message of Thu, 02 Mar 2000 12:29:04 PST.
Date: Thu, 02 Mar 2000 23:00:46 PST
From: Vern Paxson <vern>

If this is exactly your program:

> digit [0-9]
> digits {digit}+
> whitespace [ \t\n]+
>
> %%
> "[" { printf("open_brac\n");}
> "]" { printf("close_brac\n");}
> "+" { printf("addop\n");}
> "*" { printf("multop\n");}
> {digits} { printf("NUMBER = %s\n", yytext);}
> whitespace ;

then the problem is that the last rule needs to be "{whitespace}" !

                Vern
@end verbatim
@end example

@node What is the difference between YYLEX_PARAM and YY_DECL?
@unnumberedsec What is the difference between YYLEX_PARAM and YY_DECL?

YYLEX_PARAM is not a flex symbol. It is for Bison. It tells Bison to pass extra
parameters when it calls yylex() from the parser.

YY_DECL is the Flex declaration of yylex. The default is similar to this:

@example
@verbatim
#define YY_DECL int yylex ()
@end verbatim
@end example


@node Why do I get "conflicting types for yylex" error?
@unnumberedsec Why do I get "conflicting types for yylex" error?

This is a compiler error regarding a generated Bison parser, not a Flex scanner.
It means you need a prototype of yylex() at the top of the Bison file.
Be sure the prototype matches YY_DECL.

@node How do I access the values set in a Flex action from within a Bison action?
@unnumberedsec How do I access the values set in a Flex action from within a Bison action?

With $1, $2, $3, etc. These are called "Semantic Values" in the Bison manual.
See @ref{Top, , , bison, the GNU Bison Manual}.
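The hand-off behind $1, $2, and so on can be sketched in plain C. This is an
illustration under stated assumptions, not code generated by @code{flex} or
@code{bison}: the union members, the token codes, and the functions
@code{demo_yylex} and @code{demo_sum} are all hypothetical. The scanner stores
a value in @code{yylval} before returning a token, and the parser copies that
value onto its own stack, which is what a grammar action later reads.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Stand-in for the type Bison derives from %union (hypothetical members). */
typedef union {
    int   num;
    char *str;
} YYSTYPE;

enum { END = 0, NUMBER = 258 };   /* hypothetical token codes */

YYSTYPE yylval;                   /* shared between scanner and parser */

/* Hand-written stand-in for a generated yylex(): scans one decimal
   number from *input, stores its value in yylval, returns NUMBER. */
static int demo_yylex(const char **input)
{
    char *end;

    assert(input != NULL && *input != NULL);
    while (isspace((unsigned char)**input))
        ++*input;
    if (!isdigit((unsigned char)**input))
        return END;
    yylval.num = (int)strtol(*input, &end, 10);
    *input = end;
    return NUMBER;
}

/* Plays the role of the parser: each token's yylval is copied onto a
   value stack, and the "action" then reads the copies (the $n values). */
static int demo_sum(const char *text)
{
    YYSTYPE value_stack[16];
    int depth = 0, sum = 0, i;

    while (depth < 16 && demo_yylex(&text) == NUMBER)
        value_stack[depth++] = yylval;
    for (i = 0; i < depth; i++)
        sum += value_stack[i].num;
    return sum;
}
```

Because the value is copied at the time the token is returned, a later call to
the scanner may overwrite @code{yylval} without disturbing values that earlier
tokens already handed to the parser.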

@node Appendices, Indices, FAQ, Top
@appendix Appendices

@menu
* Makefiles and Flex::
* Bison Bridge::
* M4 Dependency::
* Common Patterns::
@end menu

@node Makefiles and Flex, Bison Bridge, Appendices, Appendices
@appendixsec Makefiles and Flex

@cindex Makefile, syntax

In this appendix, we provide tips for writing Makefiles to build your scanners.

In a traditional build environment, we say that the @file{.c} files are the
sources, and the @file{.o} files are the intermediate files. When using
@code{flex}, however, the @file{.l} files are the sources, and the generated
@file{.c} files (along with the @file{.o} files) are the intermediate files.
This requires you to carefully plan your Makefile.

Modern @command{make} programs understand that @file{foo.l} is intended to
generate @file{lex.yy.c} or @file{foo.c}, and will behave
accordingly@footnote{GNU @command{make} and GNU @command{automake} are two such
programs that provide implicit rules for flex-generated scanners.}@footnote{GNU @command{automake}
may generate code to execute flex in lex-compatible mode, or to stdout. If this is not what you want,
then you should provide an explicit rule in your Makefile.am.}.  The
following Makefile does not explicitly instruct @command{make} how to build
@file{scan.c} from @file{scan.l}. Instead, it relies on the implicit rules of the
@command{make} program to build the intermediate file, @file{scan.c}:

@cindex Makefile, example of implicit rules
@example
@verbatim
    # Basic Makefile -- relies on implicit rules
    # Creates "myprogram" from "scan.l" and "myprogram.c"
    #
    LEX=flex
    myprogram: scan.o myprogram.o
    scan.o: scan.l

@end verbatim
@end example


For simple cases, the above may be sufficient.
For other cases,
you may have to explicitly instruct @command{make} how to build your scanner.
The following is an example of a Makefile containing explicit rules:

@cindex Makefile, explicit example
@example
@verbatim
    # Basic Makefile -- provides explicit rules
    # Creates "myprogram" from "scan.l" and "myprogram.c"
    #
    LEX=flex
    myprogram: scan.o myprogram.o
            $(CC) -o $@  $(LDFLAGS) $^

    myprogram.o: myprogram.c
            $(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^

    scan.o: scan.c
            $(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^

    scan.c: scan.l
            $(LEX) $(LFLAGS) -o $@ $^

    clean:
            $(RM) *.o scan.c

@end verbatim
@end example

Notice in the above example that @file{scan.c} is in the @code{clean} target.
This is because we consider the file @file{scan.c} to be an intermediate file.

Finally, we provide a realistic example of a @code{flex} scanner used with a
@code{bison} parser@footnote{This example also applies to yacc parsers.}.
There is a tricky problem we have to deal with. Since a @code{flex} scanner
will typically include a header file (e.g., @file{y.tab.h}) generated by the
parser, we need to be sure that the header file is generated BEFORE the scanner
is compiled. We handle this case in the following example:

@example
@verbatim
    # Makefile example -- scanner and parser.
    # Creates "myprogram" from "scan.l", "parse.y", and "myprogram.c"
    #
    LEX     = flex
    YACC    = bison -y
    YFLAGS  = -d
    objects = scan.o parse.o myprogram.o

    myprogram: $(objects)
    scan.o: scan.l parse.c
    parse.o: parse.y
    myprogram.o: myprogram.c

@end verbatim
@end example

In the above example, notice the line

@example
@verbatim
    scan.o: scan.l parse.c
@end verbatim
@end example

which lists the file @file{parse.c} (the generated parser) as a dependency of
@file{scan.o}. We want to ensure that the parser is created before the scanner
is compiled, and listing @file{parse.c} as a prerequisite is intended to do
just that. Feel free to experiment with your specific implementation of
@command{make}.


For more details on writing Makefiles, see @ref{Top, , , make, The
GNU Make Manual}.

@node Bison Bridge, M4 Dependency, Makefiles and Flex, Appendices
@section C Scanners with Bison Parsers

@cindex bison, bridging with flex
@vindex yylval
@vindex yylloc
@tindex YYLTYPE
@tindex YYSTYPE

This section describes the @code{flex} features useful when integrating
@code{flex} with @code{GNU bison}@footnote{The features described here are
purely optional, and are by no means the only way to use flex with bison.
We merely provide some glue to ease development of your parser-scanner pair.}.
Skip this section if you are not using
@code{bison} with your scanner.  Here we discuss only the @code{flex}
half of the @code{flex} and @code{bison} pair.  We do not discuss
@code{bison} in any detail.  For more information about generating
@code{bison} parsers, see @ref{Top, , , bison, the GNU Bison Manual}.

A compatible @code{bison} scanner is generated by declaring @samp{%option
bison-bridge} or by supplying @samp{--bison-bridge} when invoking @code{flex}
from the command line.
This instructs @code{flex} that the macro
@code{yylval} may be used. The data type for
@code{yylval}, @code{YYSTYPE},
is typically defined in a header file, included in section 1 of the
@code{flex} input file.  For a list of functions and macros
available, see @ref{bison-functions}.

The declaration of yylex becomes,

@findex yylex (reentrant version)
@example
@verbatim
      int yylex ( YYSTYPE * lvalp, yyscan_t scanner );
@end verbatim
@end example

If @code{%option bison-locations} is specified, then the declaration
becomes,

@findex yylex (reentrant version)
@example
@verbatim
      int yylex ( YYSTYPE * lvalp, YYLTYPE * llocp, yyscan_t scanner );
@end verbatim
@end example

Note that the macros @code{yylval} and @code{yylloc} evaluate to pointers.
Support for @code{yylloc} is optional in @code{bison}, so it is optional in
@code{flex} as well. The following is an example of a @code{flex} scanner that
is compatible with @code{bison}.

@cindex bison, scanner to be called from bison
@example
@verbatim
    /* Scanner for "C" assignment statements... sort of. */
    %{
    #include <stdlib.h>  /* atoi */
    #include <string.h>  /* strdup */
    #include "y.tab.h"  /* Generated by bison. */
    %}

    %option bison-bridge bison-locations
    %%

    [[:digit:]]+  { yylval->num = atoi(yytext);   return NUMBER;}
    [[:alnum:]]+  { yylval->str = strdup(yytext); return STRING;}
    "="|";"       { return yytext[0];}
    .  {}
    %%
@end verbatim
@end example

As you can see, there really is no magic here. We just use
@code{yylval} as we would any other variable. The data type of
@code{yylval} is generated by @code{bison}, and included in the file
@file{y.tab.h}. Here is the corresponding @code{bison} parser:

@cindex bison, parser
@example
@verbatim
    /* Parser to convert "C" assignments to lisp. */
    %{
    /* Pass the argument to yyparse through to yylex. */
    #define YYPARSE_PARAM scanner
    #define YYLEX_PARAM   scanner
    %}
    %locations
    %pure_parser
    %union {
        int num;
        char* str;
    }
    %token <str> STRING
    %token <num> NUMBER
    %%
    assignment:
        STRING '=' NUMBER ';' {
            printf( "(setf %s %d)", $1, $3 );
       }
    ;
@end verbatim
@end example

@node M4 Dependency, Common Patterns, Bison Bridge, Appendices
@section M4 Dependency
@cindex m4
The macro processor @code{m4}@footnote{The use of m4 is subject to change in
future revisions of flex. It is not part of the public API of flex. Do not depend on it.}
must be installed wherever flex is installed.
@code{flex} invokes @samp{m4}, found by searching the directories in the
@code{PATH} environment variable. Any code you place in section 1 or in the
actions will be sent through m4. Please follow these rules to protect your
code from unwanted @code{m4} processing.

@itemize

@item Do not use symbols that begin with @samp{m4_}, such as @samp{m4_define}
or @samp{m4_include}, since those are reserved for @code{m4} macro names. If for
some reason you need @samp{m4_} as a prefix, use a preprocessor @code{#define} to get your
symbol past m4 unmangled.

@item Do not use the strings @samp{[[} or @samp{]]} anywhere in your code. The
former is not valid in C, except within comments and strings, but the latter is valid in
code such as @code{x[y[z]]}. The solution is simple. To get the literal string
@code{"]]"}, use @code{"]""]"}. To get the array notation @code{x[y[z]]},
use @code{x[y[z] ]}. Flex will attempt to detect these sequences in user code, and
escape them. However, it's best to avoid this complexity where possible, by
removing such sequences from your code.

@end itemize

@code{m4} is only required at the time you run @code{flex}.
The generated
scanner is ordinary C or C++, and does @emph{not} require @code{m4}.

@node Common Patterns, ,M4 Dependency, Appendices
@section Common Patterns
@cindex patterns, common

This appendix provides examples of common regular expressions you might use
in your scanner.

@menu
* Numbers::
* Identifiers::
* Quoted Constructs::
* Addresses::
@end menu


@node Numbers, Identifiers, ,Common Patterns
@subsection Numbers

@table @asis

@item C99 decimal constant
@code{([[:digit:]]@{-@}[0])[[:digit:]]*}

@item C99 hexadecimal constant
@code{0[xX][[:xdigit:]]+}

@item C99 octal constant
@code{0[01234567]*}

@item C99 floating point constant
@verbatim
 {dseq}      ([[:digit:]]+)
 {dseq_opt}  ([[:digit:]]*)
 {frac}      (({dseq_opt}"."{dseq})|{dseq}".")
 {exp}       ([eE][+-]?{dseq})
 {exp_opt}   ({exp}?)
 {fsuff}     [flFL]
 {fsuff_opt} ({fsuff}?)
 {hpref}     (0[xX])
 {hdseq}     ([[:xdigit:]]+)
 {hdseq_opt} ([[:xdigit:]]*)
 {hfrac}     (({hdseq_opt}"."{hdseq})|({hdseq}"."))
 {bexp}      ([pP][+-]?{dseq})
 {dfc}       (({frac}{exp_opt}{fsuff_opt})|({dseq}{exp}{fsuff_opt}))
 {hfc}       (({hpref}{hfrac}{bexp}{fsuff_opt})|({hpref}{hdseq}{bexp}{fsuff_opt}))

 {c99_floating_point_constant}  ({dfc}|{hfc})
@end verbatim

See C99 section 6.4.4.2 for the gory details.

@end table

@node Identifiers, Quoted Constructs, Numbers, Common Patterns
@subsection Identifiers

@table @asis

@item C99 Identifier
@verbatim
ucn        ((\\u([[:xdigit:]]{4}))|(\\U([[:xdigit:]]{8})))
nondigit    [_[:alpha:]]
c99_id     ([_[:alpha:]]|{ucn})([_[:alnum:]]|{ucn})*
@end verbatim

Technically, the above pattern does not encompass all possible C99 identifiers, since C99 allows for
"implementation-defined" characters.
In practice, C compilers follow the above pattern, with the
addition of the @samp{$} character.

@item UTF-8 Encoded Unicode Code Point
@verbatim
[\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF]([\x80-\xBF]{2})|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF]([\x80-\xBF]{2})|[\xF1-\xF3]([\x80-\xBF]{3})|\xF4[\x80-\x8F]([\x80-\xBF]{2})
@end verbatim

@end table

@node Quoted Constructs, Addresses, Identifiers, Common Patterns
@subsection Quoted Constructs

@table @asis
@item C99 String Literal
@code{L?\"([^\"\\\n]|(\\['\"?\\abfnrtv])|(\\([01234567]@{1,3@}))|(\\x[[:xdigit:]]+)|(\\u([[:xdigit:]]@{4@}))|(\\U([[:xdigit:]]@{8@})))*\"}

@item C99 Comment
@code{("/*"([^*]|"*"[^/])*"*/")|("/"(\\\n)*"/"[^\n]*)}

Note that in C99, a @samp{//}-style comment may be split across lines, and, contrary to popular belief,
does not include the trailing @samp{\n} character.

A better way to scan @samp{/* */} comments is by line, rather than matching
possibly huge comments all at once. This will allow you to scan comments of
unlimited length, as long as line breaks appear at sane intervals. This is also
more efficient when used with automatic line number processing. @xref{option-yylineno}.

@verbatim
<INITIAL>{
    "/*"      BEGIN(COMMENT);
}
<COMMENT>{
    "*/"      BEGIN(0);
    [^*\n]+   ;
    "*"       ;
    \n        ;
}
@end verbatim

@end table

@node Addresses, ,Quoted Constructs, Common Patterns
@subsection Addresses

@table @asis

@item IPv4 Address
@verbatim
dec-octet     [0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]
IPv4address   {dec-octet}\.{dec-octet}\.{dec-octet}\.{dec-octet}
@end verbatim

@item IPv6 Address
@verbatim
h16           [0-9A-Fa-f]{1,4}
ls32          {h16}:{h16}|{IPv4address}
IPv6address   ({h16}:){6}{ls32}|
              ::({h16}:){5}{ls32}|
              ({h16})?::({h16}:){4}{ls32}|
              (({h16}:){0,1}{h16})?::({h16}:){3}{ls32}|
              (({h16}:){0,2}{h16})?::({h16}:){2}{ls32}|
              (({h16}:){0,3}{h16})?::{h16}:{ls32}|
              (({h16}:){0,4}{h16})?::{ls32}|
              (({h16}:){0,5}{h16})?::{h16}|
              (({h16}:){0,6}{h16})?::
@end verbatim

See @uref{http://www.ietf.org/rfc/rfc2373.txt, RFC 2373} for details.
Note that you have to fold the definition of @code{IPv6address} into one
line and that it also matches the ``unspecified address'' ``::''.

@item URI
@code{(([^:/?#]+):)?("//"([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?}

This pattern is nearly useless, since it allows just about any character
to appear in a URI, including spaces and control characters. See
@uref{http://www.ietf.org/rfc/rfc2396.txt, RFC 2396} for details.

@end table


@node Indices, , Appendices, Top
@unnumbered Indices

@menu
* Concept Index::
* Index of Functions and Macros::
* Index of Variables::
* Index of Data Types::
* Index of Hooks::
* Index of Scanner Options::
@end menu

@node Concept Index, Index of Functions and Macros, Indices, Indices
@unnumberedsec Concept Index

@printindex cp

@node Index of Functions and Macros, Index of Variables, Concept Index, Indices
@unnumberedsec Index of Functions and Macros

This is an index of functions and preprocessor macros that look like functions.
For macros that expand to variables or constants, see @ref{Index of Variables}.

@printindex fn

@node Index of Variables, Index of Data Types, Index of Functions and Macros, Indices
@unnumberedsec Index of Variables

This is an index of variables, constants, and preprocessor macros
that expand to variables or constants.

@printindex vr

@node Index of Data Types, Index of Hooks, Index of Variables, Indices
@unnumberedsec Index of Data Types
@printindex tp

@node Index of Hooks, Index of Scanner Options, Index of Data Types, Indices
@unnumberedsec Index of Hooks

This is an index of "hooks" that the user may define. These hooks typically correspond
to specific locations in the generated scanner, and may be used to insert arbitrary code.

@printindex hk

@node Index of Scanner Options, , Index of Hooks, Indices
@unnumberedsec Index of Scanner Options

@printindex op

@c A vim script to name the faq entries. delete this when faqs are no longer
@c named "unnamed-faq-XXX".
@c
@c fu! Faq2 () range abort
@c     let @r=input("Rename to: ")
@c     exe "%s/" . @w . "/" . @r . "/g"
@c     normal 'f
@c endf
@c nnoremap <F5> 1G/@node\s\+unnamed-faq-\d\+<cr>mfww"wy5ezt:call Faq2()<cr>

@bye