\input texinfo.tex @c -*-texinfo-*-
@c $NetBSD: flex.texi,v 1.1 2009/10/26 00:27:40 christos Exp $
@c %**start of header
@setfilename flex.info
@settitle Lexical Analysis With Flex
@include version.texi
@set authors Vern Paxson, Will Estes and John Millaway
@c "Macro Hooks" index
@defindex hk
@c "Options" index
@defindex op
@dircategory Programming
@direntry
* flex: (flex).      Fast lexical analyzer generator (lex replacement).
@end direntry
@c %**end of header

@copying

The flex manual is placed under the same licensing conditions as the
rest of flex:

Copyright @copyright{} 2001, 2002, 2003, 2004, 2005, 2006, 2007 The Flex
Project.

Copyright @copyright{} 1990, 1997 The Regents of the University of California.
All rights reserved.

This code is derived from software contributed to Berkeley by
Vern Paxson.

The United States Government has rights in this work pursuant
to contract no. DE-AC03-76SF00098 between the United States
Department of Energy and the University of California.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

@enumerate
@item
 Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

@item
Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
@end enumerate

Neither the name of the University nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.

THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE.
@end copying

@titlepage
@title @value{title}
@subtitle Edition @value{EDITION}, @value{UPDATED}
@author @value{authors}
@page
@vskip 0pt plus 1filll
@insertcopying
@end titlepage
@contents
@ifnottex
@node Top, Copyright, (dir), (dir)
@top flex

This manual describes @code{flex}, a tool for generating programs that
perform pattern-matching on text. The manual includes both tutorial and
reference sections.

This edition of @cite{The flex Manual} documents @code{flex} version
@value{VERSION}. It was last updated on @value{UPDATED}.

This manual was written by @value{authors}.

@menu
* Copyright::
* Reporting Bugs::
* Introduction::
* Simple Examples::
* Format::
* Patterns::
* Matching::
* Actions::
* Generated Scanner::
* Start Conditions::
* Multiple Input Buffers::
* EOF::
* Misc Macros::
* User Values::
* Yacc::
* Scanner Options::
* Performance::
* Cxx::
* Reentrant::
* Lex and Posix::
* Memory Management::
* Serialized Tables::
* Diagnostics::
* Limitations::
* Bibliography::
* FAQ::
* Appendices::
* Indices::

@detailmenu
 --- The Detailed Node Listing ---

Format of the Input File

* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::

Scanner Options

* Options for Specifying Filenames::
* Options Affecting Scanner Behavior::
* Code-Level And API Options::
* Options for Scanner Speed and Size::
* Debugging Options::
* Miscellaneous Options::

Reentrant C Scanners

* Reentrant Uses::
* Reentrant Overview::
* Reentrant Example::
* Reentrant Detail::
* Reentrant Functions::

The Reentrant API in Detail

* Specify Reentrant::
* Extra Reentrant Argument::
* Global Replacement::
* Init and Destroy Functions::
* Accessor Methods::
* Extra Data::
* About yyscan_t::

Memory Management

* The Default Memory Management::
* Overriding The Default Memory Management::
* A Note About yytext And Memory::

Serialized Tables

* Creating Serialized Tables::
* Loading and Unloading Serialized Tables::
* Tables File Format::

FAQ

* When was flex born?::
* How do I expand backslash-escape sequences in C-style quoted strings?::
* Why do flex scanners call fileno if it is not ANSI compatible?::
* Does flex support recursive pattern definitions?::
* How do I skip huge chunks of input (tens of megabytes) while using flex?::
* Flex is not matching my patterns in the same order that I defined them.::
* My actions are executing out of order or sometimes not at all.::
* How can I have multiple input sources feed into the same scanner at the same time?::
* Can I build nested parsers that work with the same input file?::
* How can I match text only at the end of a file?::
* How can I make REJECT cascade across start condition boundaries?::
* Why cant I use fast or full tables with interactive mode?::
* How much faster is -F or -f than -C?::
* If I have a simple grammar cant I just parse it with flex?::
* Why doesn't yyrestart() set the start state back to INITIAL?::
* How can I match C-style comments?::
* The period isn't working the way I expected.::
* Can I get the flex manual in another format?::
* Does there exist a "faster" NDFA->DFA algorithm?::
* How does flex compile the DFA so quickly?::
* How can I use more than 8192 rules?::
* How do I abandon a file in the middle of a scan and switch to a new file?::
* How do I execute code only during initialization (only before the first scan)?::
* How do I execute code at termination?::
* Where else can I find help?::
* Can I include comments in the "rules" section of the file?::
* I get an error about undefined yywrap().::
* How can I change the matching pattern at run time?::
* How can I expand macros in the input?::
* How can I build a two-pass scanner?::
* How do I match any string not matched in the preceding rules?::
* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
* Is there a way to make flex treat NULL like a regular character?::
* Whenever flex can not match the input it says "flex scanner jammed".::
* Why doesn't flex have non-greedy operators like perl does?::
* Memory leak - 16386 bytes allocated by malloc.::
* How do I track the byte offset for lseek()?::
* How do I use my own I/O classes in a C++ scanner?::
* How do I skip as many chars as possible?::
* deleteme00::
* Are certain equivalent patterns faster than others?::
* Is backing up a big deal?::
* Can I fake multi-byte character support?::
* deleteme01::
* Can you discuss some flex internals?::
* unput() messes up yy_at_bol::
* The | operator is not doing what I want::
* Why can't flex understand this variable trailing context pattern?::
* The ^ operator isn't working::
* Trailing context is getting confused with trailing optional patterns::
* Is flex GNU or not?::
* ERASEME53::
* I need to scan if-then-else blocks and while loops::
* ERASEME55::
* ERASEME56::
* ERASEME57::
* Is there a repository for flex scanners?::
* How can I conditionally compile or preprocess my flex input file?::
* Where can I find grammars for lex and yacc?::
* I get an end-of-buffer message for each character scanned.::
* unnamed-faq-62::
* unnamed-faq-63::
* unnamed-faq-64::
* unnamed-faq-65::
* unnamed-faq-66::
* unnamed-faq-67::
* unnamed-faq-68::
* unnamed-faq-69::
* unnamed-faq-70::
* unnamed-faq-71::
* unnamed-faq-72::
* unnamed-faq-73::
* unnamed-faq-74::
* unnamed-faq-75::
* unnamed-faq-76::
* unnamed-faq-77::
* unnamed-faq-78::
* unnamed-faq-79::
* unnamed-faq-80::
* unnamed-faq-81::
* unnamed-faq-82::
* unnamed-faq-83::
* unnamed-faq-84::
* unnamed-faq-85::
* unnamed-faq-86::
* unnamed-faq-87::
* unnamed-faq-88::
* unnamed-faq-90::
* unnamed-faq-91::
* unnamed-faq-92::
* unnamed-faq-93::
* unnamed-faq-94::
* unnamed-faq-95::
* unnamed-faq-96::
* unnamed-faq-97::
* unnamed-faq-98::
* unnamed-faq-99::
* unnamed-faq-100::
* unnamed-faq-101::
* What is the difference between YYLEX_PARAM and YY_DECL?::
* Why do I get "conflicting types for yylex" error?::
* How do I access the values set in a Flex action from within a Bison action?::

Appendices

* Makefiles and Flex::
* Bison Bridge::
* M4 Dependency::
* Common Patterns::

Indices

* Concept Index::
* Index of Functions and Macros::
* Index of Variables::
* Index of Data Types::
* Index of Hooks::
* Index of Scanner Options::

@end detailmenu
@end menu
@end ifnottex
@node Copyright, Reporting Bugs, Top, Top
@chapter Copyright

@cindex copyright of flex
@cindex distributing flex
@insertcopying

@node Reporting Bugs, Introduction, Copyright, Top
@chapter Reporting Bugs

@cindex bugs, reporting
@cindex reporting bugs

If you find a bug in @code{flex}, please report it using
the SourceForge Bug Tracking facilities which can be found on
@url{http://sourceforge.net/projects/flex,flex's SourceForge Page}.

@node Introduction, Simple Examples, Reporting Bugs, Top
@chapter Introduction

@cindex scanner, definition of
@code{flex} is a tool for generating @dfn{scanners}. A scanner is a
program which recognizes lexical patterns in text. The @code{flex}
program reads the given input files, or its standard input if no file
names are given, for a description of a scanner to generate. The
description is in the form of pairs of regular expressions and C code,
called @dfn{rules}.
@code{flex} generates as output a C source file,
@file{lex.yy.c} by default, which defines a routine @code{yylex()}.
This file can be compiled and linked with the flex runtime library to
produce an executable. When the executable is run, it analyzes its
input for occurrences of the regular expressions. Whenever it finds
one, it executes the corresponding C code.

@node Simple Examples, Format, Introduction, Top
@chapter Some Simple Examples

First, some simple examples to give the flavor of how one uses
@code{flex}.

@cindex username expansion
The following @code{flex} input specifies a scanner which, when it
encounters the string @samp{username}, will replace it with the user's
login name:

@example
@verbatim
    %%
    username    printf( "%s", getlogin() );
@end verbatim
@end example

@cindex default rule
@cindex rules, default
By default, any text not matched by a @code{flex} scanner is copied to
the output, so the net effect of this scanner is to copy its input file
to its output with each occurrence of @samp{username} expanded. In this
input, there is just one rule. @samp{username} is the @dfn{pattern} and
the @samp{printf} is the @dfn{action}. The @samp{%%} symbol marks the
beginning of the rules.

Here's another simple example:

@cindex counting characters and lines
@example
@verbatim
            int num_lines = 0, num_chars = 0;

    %%
    \n      ++num_lines; ++num_chars;
    .       ++num_chars;

    %%
    int main( void )
            {
            yylex();
            printf( "# of lines = %d, # of chars = %d\n",
                    num_lines, num_chars );
            return 0;
            }
@end verbatim
@end example

This scanner counts the number of characters and the number of lines in
its input. It produces no output other than the final report on the
character and line counts.
The first line declares two globals,
@code{num_lines} and @code{num_chars}, which are accessible both inside
@code{yylex()} and in the @code{main()} routine declared after the
second @samp{%%}. There are two rules, one which matches a newline
(@samp{\n}) and increments both the line count and the character count,
and one which matches any character other than a newline (indicated by
the @samp{.} regular expression).

A somewhat more complicated example:

@cindex Pascal-like language
@example
@verbatim
    /* scanner for a toy Pascal-like language */

    %{
    /* need this for the calls to atoi() and atof() below */
    #include <stdlib.h>
    %}

    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*

    %%

    {DIGIT}+    {
                printf( "An integer: %s (%d)\n", yytext,
                        atoi( yytext ) );
                }

    {DIGIT}+"."{DIGIT}*        {
                printf( "A float: %s (%g)\n", yytext,
                        atof( yytext ) );
                }

    if|then|begin|end|procedure|function        {
                printf( "A keyword: %s\n", yytext );
                }

    {ID}        printf( "An identifier: %s\n", yytext );

    "+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );

    "{"[^}\n]*"}"     /* eat up one-line comments */

    [ \t\n]+          /* eat up whitespace */

    .           printf( "Unrecognized character: %s\n", yytext );

    %%

    int main( int argc, char **argv )
        {
        ++argv, --argc;  /* skip over program name */
        if ( argc > 0 )
                yyin = fopen( argv[0], "r" );
        else
                yyin = stdin;

        yylex();
        return 0;
        }
@end verbatim
@end example

This is the beginning of a simple scanner for a language like Pascal.
It identifies different types of @dfn{tokens} and reports on what it has
seen.

The details of this example will be explained in the following
sections.
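
To make the line-and-character counter concrete, the following
hand-written C function computes the same two totals as that scanner.
This is only an illustrative sketch, not the code that @code{flex}
generates: the names @code{struct counts} and @code{count_input} are
hypothetical, and the input is assumed to be a NUL-terminated string
rather than a stream read through @code{yyin}.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch; not part of flex. */
struct counts {
    int lines;
    int chars;
};

/* Mirrors the two rules of the counting scanner:
 * "\n" bumps both counters, "." bumps only the character counter. */
struct counts count_input(const char *text)
{
    struct counts c = { 0, 0 };

    for (const char *p = text; *p != '\0'; ++p) {
        if (*p == '\n')
            ++c.lines;   /* the "\n" rule */
        ++c.chars;       /* both rules count the byte */
    }
    return c;
}
```

Every byte before the terminating NUL is counted exactly once, because
between them the two patterns match any single byte.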

@node Format, Patterns, Simple Examples, Top
@chapter Format of the Input File


@cindex format of flex input
@cindex input, format of
@cindex file format
@cindex sections of flex input

The @code{flex} input file consists of three sections, separated by a
line containing only @samp{%%}.

@cindex format of input file
@example
@verbatim
    definitions
    %%
    rules
    %%
    user code
@end verbatim
@end example

@menu
* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::
@end menu

@node Definitions Section, Rules Section, Format, Format
@section Format of the Definitions Section

@cindex input file, Definitions section
@cindex Definitions, in flex input
The @dfn{definitions section} contains declarations of simple @dfn{name}
definitions to simplify the scanner specification, and declarations of
@dfn{start conditions}, which are explained in a later section.

@cindex aliases, how to define
@cindex pattern aliases, how to define
Name definitions have the form:

@example
@verbatim
    name definition
@end verbatim
@end example

The @samp{name} is a word beginning with a letter or an underscore
(@samp{_}) followed by zero or more letters, digits, @samp{_}, or
@samp{-} (dash). The definition is taken to begin at the first
non-whitespace character following the name and to continue to the end of
the line. The definition can subsequently be referred to using
@samp{@{name@}}, which will expand to @samp{(definition)}.
For example,

@cindex pattern aliases, defining
@cindex defining pattern aliases
@example
@verbatim
    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*
@end verbatim
@end example

defines @samp{DIGIT} to be a regular expression which matches a single
digit, and @samp{ID} to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits. A subsequent reference to

@cindex pattern aliases, use of
@example
@verbatim
    {DIGIT}+"."{DIGIT}*
@end verbatim
@end example

is identical to

@example
@verbatim
    ([0-9])+"."([0-9])*
@end verbatim
@end example

and matches one-or-more digits followed by a @samp{.} followed by
zero-or-more digits.

@cindex comments in flex input
An unindented comment (i.e., a line
beginning with @samp{/*}) is copied verbatim to the output up
to the next @samp{*/}.

@cindex %@{ and %@}, in Definitions Section
@cindex embedding C code in flex input
@cindex C code in flex input
Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}}
is also copied verbatim to the output (with the %@{ and %@} symbols
removed). The %@{ and %@} symbols must appear unindented on lines by
themselves.

@cindex %top

A @code{%top} block is similar to a @samp{%@{} ... @samp{%@}} block, except
that the code in a @code{%top} block is relocated to the @emph{top} of the
generated file, before any flex definitions @footnote{Actually,
@code{yyIN_HEADER} is defined before the @samp{%top} block.}.
The @code{%top} block is useful when you want certain preprocessor macros to be
defined or certain files to be included before the generated code.
The single characters @samp{@{} and @samp{@}} are used to delimit the
@code{%top} block, as shown in the example below:

@example
@verbatim
    %top{
        /* This code goes at the "top" of the generated file. */
        #include <stdint.h>
        #include <inttypes.h>
    }
@end verbatim
@end example

Multiple @code{%top} blocks are allowed, and their order is preserved.

@node Rules Section, User Code Section, Definitions Section, Format
@section Format of the Rules Section

@cindex input file, Rules Section
@cindex rules, in flex input
The @dfn{rules} section of the @code{flex} input contains a series of
rules of the form:

@example
@verbatim
    pattern   action
@end verbatim
@end example

where the pattern must be unindented and the action must begin
on the same line.
@xref{Patterns}, for a further description of patterns and actions.

In the rules section, any indented or %@{ %@} enclosed text appearing
before the first rule may be used to declare variables which are local
to the scanning routine and (after the declarations) code which is to be
executed whenever the scanning routine is entered. Other indented or
%@{ %@} text in the rule section is still copied to the output, but its
meaning is not well-defined and it may well cause compile-time errors
(this feature is present for @acronym{POSIX} compliance. @xref{Lex and
Posix}, for other such features).

Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}}
is copied verbatim to the output (with the %@{ and %@} symbols removed).
The %@{ and %@} symbols must appear unindented on lines by themselves.

@node User Code Section, Comments in the Input, Rules Section, Format
@section Format of the User Code Section

@cindex input file, user code Section
@cindex user code, in flex input
The user code section is simply copied to @file{lex.yy.c} verbatim. It
is used for companion routines which call or are called by the scanner.
The presence of this section is optional; if it is missing, the second
@samp{%%} in the input file may be skipped, too.

@node Comments in the Input, , User Code Section, Format
@section Comments in the Input

@cindex comments, syntax of
Flex supports C-style comments, that is, anything between @samp{/*} and
@samp{*/} is
considered a comment. Whenever flex encounters a comment, it copies the
entire comment verbatim to the generated source code. Comments may
appear just about anywhere, but with the following exceptions:

@itemize
@cindex comments, in rules section
@item
Comments may not appear in the Rules Section wherever flex is expecting
a regular expression. This means comments may not appear at the
beginning of a line, or immediately following a list of scanner states.
@item
Comments may not appear on an @samp{%option} line in the Definitions
Section.
@end itemize

If you want to follow a simple rule, then always begin a comment on a
new line, with one or more whitespace characters before the initial
@samp{/*}. This rule will work anywhere in the input file.

All the comments in the following example are valid:

@cindex comments, valid uses of
@cindex comments in the input
@example
@verbatim
%{
/* code block */
%}

/* Definitions Section */
%x STATE_X

%%
    /* Rules Section */
ruleA   /* after regex */ { /* code block */ } /* after code block */
        /* Rules Section (indented) */
<STATE_X>{
ruleC   ECHO;
ruleD   ECHO;
%{
/* code block */
%}
}
%%
/* User Code Section */

@end verbatim
@end example

@node Patterns, Matching, Format, Top
@chapter Patterns

@cindex patterns, in rules section
@cindex regular expressions, in patterns
The patterns in the input (see @ref{Rules Section}) are written using an
extended set of regular expressions. These are:

@cindex patterns, syntax
@table @samp
@item x
match the character 'x'

@item .
any character (byte) except newline

@cindex [] in patterns
@cindex character classes in patterns, syntax of
@cindex POSIX, character classes in patterns, syntax of
@item [xyz]
a @dfn{character class}; in this case, the pattern
matches either an 'x', a 'y', or a 'z'

@cindex ranges in patterns
@item [abj-oZ]
a "character class" with a range in it; matches
an 'a', a 'b', any letter from 'j' through 'o',
or a 'Z'

@cindex ranges in patterns, negating
@cindex negating ranges in patterns
@item [^A-Z]
a "negated character class", i.e., any character
but those in the class. In this case, any
character EXCEPT an uppercase letter.

@item [^A-Z\n]
any character EXCEPT an uppercase letter or
a newline

@item [a-z]@{-@}[aeiou]
the lowercase consonants

@item r*
zero or more r's, where r is any regular expression

@item r+
one or more r's

@item r?
zero or one r's (that is, ``an optional r'')

@cindex braces in patterns
@item r@{2,5@}
anywhere from two to five r's

@item r@{2,@}
two or more r's

@item r@{4@}
exactly 4 r's

@cindex pattern aliases, expansion of
@item @{name@}
the expansion of the @samp{name} definition
(@pxref{Format}).

@cindex literal text in patterns, syntax of
@cindex verbatim text in patterns, syntax of
@item "[xyz]\"foo"
the literal string: @samp{[xyz]"foo}

@cindex escape sequences in patterns, syntax of
@item \X
if X is @samp{a}, @samp{b}, @samp{f}, @samp{n}, @samp{r}, @samp{t}, or
@samp{v}, then the ANSI-C interpretation of @samp{\x}.
Otherwise, a
literal @samp{X} (used to escape operators such as @samp{*})

@cindex NULL character in patterns, syntax of
@item \0
a NUL character (ASCII code 0)

@cindex octal characters in patterns
@item \123
the character with octal value 123

@item \x2a
the character with hexadecimal value 2a

@item (r)
match an @samp{r}; parentheses are used to override precedence (see below)

@item (?r-s:pattern)
apply option @samp{r} and omit option @samp{s} while interpreting pattern.
Options may be zero or more of the characters @samp{i}, @samp{s}, or @samp{x}.

@samp{i} means case-insensitive. @samp{-i} means case-sensitive.

@samp{s} alters the meaning of the @samp{.} syntax to match any single byte whatsoever.
@samp{-s} alters the meaning of @samp{.} to match any byte except @samp{\n}.

@samp{x} ignores comments and whitespace in patterns. Whitespace is ignored unless
it is backslash-escaped, contained within @samp{""}s, or appears inside a
character class.

The following are all valid:

@verbatim
(?:foo)         same as  (foo)
(?i:ab7)        same as  ([aA][bB]7)
(?-i:ab)        same as  (ab)
(?s:.)          same as  [\x00-\xFF]
(?-s:.)         same as  [^\n]
(?ix-s: a . b)  same as  ([Aa][^\n][bB])
(?x:a  b)       same as  ("ab")
(?x:a\ b)       same as  ("a b")
(?x:a" "b)      same as  ("a b")
(?x:a[ ]b)      same as  ("a b")
(?x:a
    /* comment */
    b
    c)          same as  (abc)
@end verbatim

@item (?# comment )
omit everything within @samp{()}. The first @samp{)}
character encountered ends the pattern. It is not possible for the comment
to contain a @samp{)} character. The comment may span lines.

@cindex concatenation, in patterns
@item rs
the regular expression @samp{r} followed by the regular expression @samp{s}; called
@dfn{concatenation}

@item r|s
either an @samp{r} or an @samp{s}

@cindex trailing context, in patterns
@item r/s
an @samp{r} but only if it is followed by an @samp{s}. The text matched by @samp{s} is
included when determining whether this rule is the longest match, but is
then returned to the input before the action is executed. So the action
only sees the text matched by @samp{r}. This type of pattern is called
@dfn{trailing context}. (There are some combinations of @samp{r/s} that flex
cannot match correctly. @xref{Limitations}, regarding dangerous trailing
context.)

@cindex beginning of line, in patterns
@cindex BOL, in patterns
@item ^r
an @samp{r}, but only at the beginning of a line (i.e.,
when just starting to scan, or right after a
newline has been scanned).

@cindex end of line, in patterns
@cindex EOL, in patterns
@item r$
an @samp{r}, but only at the end of a line (i.e., just before a
newline). Equivalent to @samp{r/\n}.

@cindex newline, matching in patterns
Note that @code{flex}'s notion of ``newline'' is exactly
whatever the C compiler used to compile @code{flex}
interprets @samp{\n} as; in particular, on some DOS
systems you must either filter out @samp{\r}s in the
input yourself, or explicitly use @samp{r/\r\n} for @samp{r$}.

@cindex start conditions, in patterns
@item <s>r
an @samp{r}, but only in start condition @code{s} (see @ref{Start
Conditions} for discussion of start conditions).

@item <s1,s2,s3>r
same, but in any of start conditions @code{s1}, @code{s2}, or @code{s3}.

@item <*>r
an @samp{r} in any start condition, even an exclusive one.

@cindex end of file, in patterns
@cindex EOF in patterns, syntax of
@item <<EOF>>
an end-of-file.

@item <s1,s2><<EOF>>
an end-of-file when in start condition @code{s1} or @code{s2}
@end table

Note that inside of a character class, all regular expression operators
lose their special meaning except escape (@samp{\}) and the character class
operators, @samp{-}, @samp{]}, and, at the beginning of the class, @samp{^}.

@cindex patterns, precedence of operators
The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence (see special note on the
precedence of the repeat operator, @samp{@{@}}, under the documentation
for the @samp{--posix} POSIX compliance option). For example,

@cindex patterns, grouping and precedence
@example
@verbatim
    foo|bar*
@end verbatim
@end example

is the same as

@example
@verbatim
    (foo)|(ba(r*))
@end verbatim
@end example

since the @samp{*} operator has higher precedence than concatenation,
and concatenation higher than alternation (@samp{|}). This pattern
therefore matches @emph{either} the string @samp{foo} @emph{or} the
string @samp{ba} followed by zero-or-more @samp{r}'s. To match
@samp{foo} or zero-or-more repetitions of the string @samp{bar}, use:

@example
@verbatim
    foo|(bar)*
@end verbatim
@end example

And to match a sequence of zero or more repetitions of @samp{foo} and
@samp{bar}:

@cindex patterns, repetitions with grouping
@example
@verbatim
    (foo|bar)*
@end verbatim
@end example

@cindex character classes in patterns
In addition to characters and ranges of characters, character classes
can also contain @dfn{character class expressions}. These are
expressions enclosed inside @samp{[:} and @samp{:]} delimiters (which
themselves must appear between the @samp{[} and @samp{]} of the
character class.
Other elements may occur inside the character class,
too). The valid expressions are:

@cindex patterns, valid character classes
@example
@verbatim
    [:alnum:] [:alpha:] [:blank:]
    [:cntrl:] [:digit:] [:graph:]
    [:lower:] [:print:] [:punct:]
    [:space:] [:upper:] [:xdigit:]
@end verbatim
@end example

These expressions all designate a set of characters equivalent to the
corresponding standard C @code{isXXX} function. For example,
@samp{[:alnum:]} designates those characters for which @code{isalnum()}
returns true - i.e., any alphabetic or numeric character. Some systems
don't provide @code{isblank()}, so flex defines @samp{[:blank:]} as a
blank or a tab.

For example, the following character classes are all equivalent:

@cindex character classes, equivalence of
@cindex patterns, character class equivalence
@example
@verbatim
    [[:alnum:]]
    [[:alpha:][:digit:]]
    [[:alpha:][0-9]]
    [a-zA-Z0-9]
@end verbatim
@end example

A word of caution. Character classes are expanded immediately when seen in the @code{flex} input.
This means the character classes are sensitive to the locale in which @code{flex}
is executed, and the resulting scanner will not be sensitive to the runtime locale.
This may or may not be desirable.


@itemize
@cindex case-insensitive, effect on character classes
@item If your scanner is case-insensitive (the @samp{-i} flag), then
@samp{[:upper:]} and @samp{[:lower:]} are equivalent to
@samp{[:alpha:]}.

@anchor{case and character ranges}
@item Character classes with ranges, such as @samp{[a-Z]}, should be used with
caution in a case-insensitive scanner if the range spans upper or lowercase
characters. Flex does not know if you want to fold all upper and lowercase
characters together, or if you want the literal numeric range specified (with
no case folding).
When in doubt, flex will assume that you meant the literal
numeric range, and will issue a warning. The exception to this rule is a
character range such as @samp{[a-z]} or @samp{[S-W]} where it is obvious that you
want case-folding to occur. Here are some examples with the @samp{-i} flag
enabled:

@multitable {@samp{[a-zA-Z]}} {ambiguous} {@samp{[A-Z\[\\\]_`a-t]}} {@samp{[@@A-Z\[\\\]_`abc]}}
@item Range @tab Result @tab Literal Range @tab Alternate Range
@item @samp{[a-t]} @tab ok @tab @samp{[a-tA-T]} @tab
@item @samp{[A-T]} @tab ok @tab @samp{[a-tA-T]} @tab
@item @samp{[A-t]} @tab ambiguous @tab @samp{[A-Z\[\\\]_`a-t]} @tab @samp{[a-tA-T]}
@item @samp{[_-@{]} @tab ambiguous @tab @samp{[_`a-z@{]} @tab @samp{[_`a-zA-Z@{]}
@item @samp{[@@-C]} @tab ambiguous @tab @samp{[@@ABC]} @tab @samp{[@@A-Z\[\\\]_`abc]}
@end multitable

@cindex end of line, in negated character classes
@cindex EOL, in negated character classes
@item
A negated character class such as the example @samp{[^A-Z]} above
@emph{will} match a newline unless @samp{\n} (or an equivalent escape
sequence) is one of the characters explicitly present in the negated
character class (e.g., @samp{[^A-Z\n]}). This is unlike how many other
regular expression tools treat negated character classes, but
unfortunately the inconsistency is historically entrenched. Matching
newlines means that a pattern like @samp{[^"]*} can match the entire
input unless there's another quote in the input.

Flex allows negation of character class expressions by prepending @samp{^} to
the POSIX character class name.

@example
@verbatim
    [:^alnum:] [:^alpha:] [:^blank:]
    [:^cntrl:] [:^digit:] [:^graph:]
    [:^lower:] [:^print:] [:^punct:]
    [:^space:] [:^upper:] [:^xdigit:]
@end verbatim
@end example

Flex will issue a warning if the expressions @samp{[:^upper:]} and
@samp{[:^lower:]} appear in a case-insensitive scanner, since their meaning is
unclear. The current behavior is to skip them entirely, but this may change
without notice in future revisions of flex.

@item

The @samp{@{-@}} operator computes the difference of two character classes. For
example, @samp{[a-c]@{-@}[b-z]} represents all the characters in the class
@samp{[a-c]} that are not in the class @samp{[b-z]} (which, in this case, is
just the single character @samp{a}). The @samp{@{-@}} operator is left
associative, so @samp{[abc]@{-@}[b]@{-@}[c]} is the same as @samp{[a]}. Be careful
not to accidentally create an empty set, which will never match.

@item

The @samp{@{+@}} operator computes the union of two character classes. For
example, @samp{[a-z]@{+@}[0-9]} is the same as @samp{[a-z0-9]}. This operator
is useful when preceded by the result of a difference operation, as in
@samp{[[:alpha:]]@{-@}[[:lower:]]@{+@}[q]}, which is equivalent to
@samp{[A-Zq]} in the "C" locale.

@cindex trailing context, limits of
@cindex ^ as non-special character in patterns
@cindex $ as normal character in patterns
@item
A rule can have at most one instance of trailing context (the @samp{/} operator
or the @samp{$} operator). The start condition, @samp{^}, and @samp{<<EOF>>} patterns
can only occur at the beginning of a pattern and, like @samp{/} and @samp{$},
cannot be grouped inside parentheses. A @samp{^} which does not occur at
the beginning of a rule or a @samp{$} which does not occur at the end of
a rule loses its special properties and is treated as a normal character.

@item
The following are invalid:

@cindex patterns, invalid trailing context
@example
@verbatim
    foo/bar$
    <sc1>foo<sc2>bar
@end verbatim
@end example

Note that the first of these can be written @samp{foo/bar\n}.

@item
The following will result in @samp{$} or @samp{^} being treated as a normal character:

@cindex patterns, special characters treated as non-special
@example
@verbatim
    foo|(bar$)
    foo|^bar
@end verbatim
@end example

If the desired meaning is a @samp{foo} or a
@samp{bar}-followed-by-a-newline, the following could be used (the
special @code{|} action is explained below, @pxref{Actions}):

@cindex patterns, end of line
@example
@verbatim
    foo      |
    bar$     /* action goes here */
@end verbatim
@end example

A similar trick will work for matching a @samp{foo} or a
@samp{bar}-at-the-beginning-of-a-line.
@end itemize

@node Matching, Actions, Patterns, Top
@chapter How the Input Is Matched

@cindex patterns, matching
@cindex input, matching
@cindex trailing context, matching
@cindex matching, and trailing context
@cindex matching, length of
@cindex matching, multiple matches
When the generated scanner is run, it analyzes its input looking for
strings which match any of its patterns. If it finds more than one
match, it takes the one matching the most text (for trailing context
rules, this includes the length of the trailing part, even though it
will then be returned to the input). If it finds two or more matches of
the same length, the rule listed first in the @code{flex} input file is
chosen.
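
As a concrete illustration of the two rules above, here is a minimal
sketch of a scanner with overlapping patterns. It is an added example;
the patterns and messages are chosen only for illustration, and
@code{%option noyywrap} is assumed so that no @code{yywrap()} needs to
be supplied:

@example
@verbatim
%option noyywrap
%%
if          printf( "keyword: %s\n", yytext );
[a-z]+      printf( "identifier: %s\n", yytext );
[ \t\n]+    /* skip whitespace */
%%
int main( void )
{
    yylex();
    return 0;
}
@end verbatim
@end example

Given the input @samp{if iffy}, both rules match the two characters
@samp{if}, so the rule listed first wins and the word is reported as a
keyword. For @samp{iffy}, the second rule matches four characters
against two for the first, so the longer match is taken and the word is
reported as an identifier.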

@cindex token
@cindex yytext
@cindex yyleng
Once the match is determined, the text corresponding to the match
(called the @dfn{token}) is made available in the global character
pointer @code{yytext}, and its length in the global integer
@code{yyleng}. The @dfn{action} corresponding to the matched pattern is
then executed (@pxref{Actions}), and then the remaining input is scanned
for another match.

@cindex default rule
If no match is found, then the @dfn{default rule} is executed: the next
character in the input is considered matched and copied to the standard
output. Thus, the simplest valid @code{flex} input is:

@cindex minimal scanner
@example
@verbatim
    %%
@end verbatim
@end example

which generates a scanner that simply copies its input (one character at
a time) to its output.

@cindex yytext, two types of
@cindex %array, use of
@cindex %pointer, use of
@vindex yytext
Note that @code{yytext} can be defined in two different ways: either as
a character @emph{pointer} or as a character @emph{array}. You can
control which definition @code{flex} uses by including one of the
special directives @code{%pointer} or @code{%array} in the first
(definitions) section of your flex input. The default is
@code{%pointer}, unless you use the @samp{-l} lex compatibility option,
in which case @code{yytext} will be an array. The advantage of using
@code{%pointer} is substantially faster scanning and no buffer overflow
when matching very large tokens (unless you run out of dynamic memory).
The disadvantage is that you are restricted in how your actions can
modify @code{yytext} (@pxref{Actions}), and calls to the @code{unput()}
function destroy the present contents of @code{yytext}, which can be a
considerable porting headache when moving between different @code{lex}
versions.

@cindex %array, advantages of
The advantage of @code{%array} is that you can then modify @code{yytext}
to your heart's content, and calls to @code{unput()} do not destroy
@code{yytext} (@pxref{Actions}). Furthermore, existing @code{lex}
programs sometimes access @code{yytext} externally using declarations of
the form:

@example
@verbatim
    extern char yytext[];
@end verbatim
@end example

This declaration is erroneous when used with @code{%pointer}, but correct
for @code{%array}.

The @code{%array} declaration defines @code{yytext} to be an array of
@code{YYLMAX} characters, which defaults to a fairly large value. You
can change the size by simply @code{#define}'ing @code{YYLMAX} to a different
value in the first section of your @code{flex} input. As mentioned
above, with @code{%pointer}, @code{yytext} grows dynamically to accommodate
large tokens. While this means your @code{%pointer} scanner can
accommodate very large tokens (such as matching entire blocks of
comments), bear in mind that each time the scanner must resize
@code{yytext} it also must rescan the entire token from the beginning,
so matching such tokens can prove slow. @code{yytext} presently does
@emph{not} dynamically grow if a call to @code{unput()} results in too
much text being pushed back; instead, a run-time error results.

@cindex %array, with C++
Also note that you cannot use @code{%array} with C++ scanner classes
(@pxref{Cxx}).

@node Actions, Generated Scanner, Matching, Top
@chapter Actions

@cindex actions
Each pattern in a rule has a corresponding @dfn{action}, which can be
any arbitrary C statement. The pattern ends at the first non-escaped
whitespace character; the remainder of the line is its action. If the
action is empty, then when the pattern is matched the input token is
simply discarded.
For example, here is the specification for a program
which deletes all occurrences of @samp{zap me} from its input:

@cindex deleting lines from input
@example
@verbatim
    %%
    "zap me"
@end verbatim
@end example

This example will copy all other characters in the input to the output
since they will be matched by the default rule.

Here is a program which compresses multiple blanks and tabs down to a
single blank, and throws away whitespace found at the end of a line:

@cindex whitespace, compressing
@cindex compressing whitespace
@example
@verbatim
    %%
    [ \t]+        putchar( ' ' );
    [ \t]+$       /* ignore this token */
@end verbatim
@end example

@cindex %@{ and %@}, in Rules Section
@cindex actions, use of @{ and @}
@cindex actions, embedded C strings
@cindex C-strings, in actions
@cindex comments, in actions
If the action contains a @samp{@{}, then the action extends until the
balancing @samp{@}} is found, and the action may cross multiple lines.
@code{flex} knows about C strings and comments and won't be fooled by
braces found within them, but also allows actions to begin with
@samp{%@{} and will consider the action to be all the text up to the
next @samp{%@}} (regardless of ordinary braces inside the action).

@cindex |, in actions
An action consisting solely of a vertical bar (@samp{|}) means ``same as the
action for the next rule''. See below for an illustration.

Actions can include arbitrary C code, including @code{return} statements
to return a value to whatever routine called @code{yylex()}. Each time
@code{yylex()} is called it continues processing tokens from where it
last left off until it either reaches the end of the file or executes a
return.
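
The way @code{return} interacts with repeated calls to @code{yylex()}
can be sketched as follows. The token codes @code{TOK_NUMBER} and
@code{TOK_WORD} are hypothetical names invented for this example, and
the driver relies on @code{yylex()} returning 0 at end-of-file:

@example
@verbatim
%option noyywrap
%{
#include <stdio.h>
/* Hypothetical token codes, chosen only for this sketch.
 * They are nonzero so they cannot be confused with end-of-file. */
enum { TOK_NUMBER = 1, TOK_WORD = 2 };
%}
%%
[0-9]+      return TOK_NUMBER;
[a-z]+      return TOK_WORD;
.|\n        /* discard anything else */
%%
int main( void )
{
    int tok;

    /* Each call resumes scanning where the previous one stopped. */
    while ( (tok = yylex()) != 0 )
        printf( "token %d: %s\n", tok, yytext );
    return 0;
}
@end verbatim
@end example

Each action hands one token back to the loop in @code{main()}; the
empty action on the last rule never returns, so the scanner simply
moves on to the next match.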

@cindex yytext, modification of
Actions are free to modify @code{yytext} except for lengthening it
(adding characters to its end; these will overwrite later characters in
the input stream). This restriction, however, does not apply when using
@code{%array} (@pxref{Matching}). In that case, @code{yytext} may be
freely modified in any way.

@cindex yyleng, modification of
@cindex yymore, and yyleng
Actions are free to modify @code{yyleng} except they should not do so if
the action also includes use of @code{yymore()} (see below).

@cindex preprocessor macros, for use in actions
There are a number of special directives which can be included within an
action:

@table @code
@item ECHO
@cindex ECHO
copies @code{yytext} to the scanner's output.

@item BEGIN
@cindex BEGIN
followed by the name of a start condition places the scanner in the
corresponding start condition (see below).

@item REJECT
@cindex REJECT
directs the scanner to proceed on to the ``second best'' rule which
matched the input (or a prefix of the input). The rule is chosen as
described above in @ref{Matching}, and @code{yytext} and @code{yyleng}
are set up appropriately. It may either be one which matched as much text
as the originally chosen rule but came later in the @code{flex} input
file, or one which matched less text. For example, the following will
both count the words in the input and call the routine @code{special()}
whenever @samp{frob} is seen:

@example
@verbatim
            int word_count = 0;
    %%

    frob        special(); REJECT;
    [^ \t\n]+   ++word_count;
@end verbatim
@end example

Without the @code{REJECT}, any occurrences of @samp{frob} in the input
would not be counted as words, since the scanner normally executes only
one action per token.
Multiple uses of @code{REJECT} are allowed, each
one finding the next best choice to the currently active rule. For
example, when the following scanner scans the token @samp{abcd}, it will
write @samp{abcdabcaba} to the output:

@cindex REJECT, calling multiple times
@cindex |, use of
@example
@verbatim
    %%
    a        |
    ab       |
    abc      |
    abcd     ECHO; REJECT;
    .|\n     /* eat up any unmatched character */
@end verbatim
@end example

The first three rules share the fourth's action since they use the
special @samp{|} action.

@code{REJECT} is a particularly expensive feature in terms of scanner
performance; if it is used in @emph{any} of the scanner's actions it
will slow down @emph{all} of the scanner's matching. Furthermore,
@code{REJECT} cannot be used with the @samp{-Cf} or @samp{-CF} options
(@pxref{Scanner Options}).

Note also that unlike the other special actions, @code{REJECT} is a
@emph{branch}. Code immediately following it in the action will
@emph{not} be executed.

@item yymore()
@cindex yymore()
tells the scanner that the next time it matches a rule, the
corresponding token should be @emph{appended} onto the current value of
@code{yytext} rather than replacing it. For example, given the input
@samp{mega-kludge} the following will write @samp{mega-mega-kludge} to
the output:

@cindex yymore(), mega-kludge
@cindex yymore() to append token to previous token
@example
@verbatim
    %%
    mega-    ECHO; yymore();
    kludge   ECHO;
@end verbatim
@end example

First @samp{mega-} is matched and echoed to the output. Then @samp{kludge}
is matched, but the previous @samp{mega-} is still hanging around at the
beginning of @code{yytext} so the @code{ECHO} for the @samp{kludge} rule
will actually write @samp{mega-kludge}.
@end table

@cindex yymore, performance penalty of
Two notes regarding use of @code{yymore()}. First, @code{yymore()}
depends on the value of @code{yyleng} correctly reflecting the size of
the current token, so you must not modify @code{yyleng} if you are using
@code{yymore()}. Second, the presence of @code{yymore()} in the
scanner's action entails a minor performance penalty in the scanner's
matching speed.

@cindex yyless()
@code{yyless(n)} returns all but the first @code{n} characters of the
current token back to the input stream, where they will be rescanned
when the scanner looks for the next match. @code{yytext} and
@code{yyleng} are adjusted appropriately (e.g., @code{yyleng} will now
be equal to @code{n}). For example, on the input @samp{foobar} the
following will write out @samp{foobarbar}:

@cindex yyless(), pushing back characters
@cindex pushing back characters with yyless
@example
@verbatim
    %%
    foobar    ECHO; yyless(3);
    [a-z]+    ECHO;
@end verbatim
@end example

An argument of 0 to @code{yyless()} will cause the entire current input
string to be scanned again. Unless you've changed how the scanner will
subsequently process its input (using @code{BEGIN}, for example), this
will result in an endless loop.

Note that @code{yyless()} is a macro and can only be used in the flex
input file, not from other source files.

@cindex unput()
@cindex pushing back characters with unput
@code{unput(c)} puts the character @code{c} back onto the input stream.
It will be the next character scanned. The following action will take
the current token and cause it to be rescanned enclosed in parentheses.

@cindex unput(), pushing back characters
@cindex pushing back characters with unput()
@example
@verbatim
    {
    int i;
    /* Copy yytext because unput() trashes yytext */
    char *yycopy = strdup( yytext );
    unput( ')' );
    for ( i = yyleng - 1; i >= 0; --i )
        unput( yycopy[i] );
    unput( '(' );
    free( yycopy );
    }
@end verbatim
@end example

Note that since each @code{unput()} puts the given character back at the
@emph{beginning} of the input stream, pushing back strings must be done
back-to-front.

@cindex %pointer, and unput()
@cindex unput(), and %pointer
An important potential problem when using @code{unput()} is that if you
are using @code{%pointer} (the default), a call to @code{unput()}
@emph{destroys} the contents of @code{yytext}, starting with its
rightmost character and devouring one character to the left with each
call. If you need the value of @code{yytext} preserved after a call to
@code{unput()} (as in the above example), you must either first copy it
elsewhere, or build your scanner using @code{%array} instead
(@pxref{Matching}).

@cindex pushing back EOF
@cindex EOF, pushing back
Finally, note that you cannot put back @samp{EOF} to attempt to mark the
input stream with an end-of-file.

@cindex input()
@code{input()} reads the next character from the input stream.
For example, the following is one way to eat up C comments:

@cindex comments, discarding
@cindex discarding C comments
@example
@verbatim
    %%
    "/*"        {
                register int c;

                for ( ; ; )
                    {
                    while ( (c = input()) != '*' &&
                            c != EOF )
                        ;    /* eat up text of comment */

                    if ( c == '*' )
                        {
                        while ( (c = input()) == '*' )
                            ;
                        if ( c == '/' )
                            break;    /* found the end */
                        }

                    if ( c == EOF )
                        {
                        error( "EOF in comment" );
                        break;
                        }
                    }
                }
@end verbatim
@end example

@cindex input(), and C++
@cindex yyinput()
(Note that if the scanner is compiled using @code{C++}, then
@code{input()} is instead referred to as @b{yyinput()}, in order to
avoid a name clash with the @code{C++} stream by the name of
@code{input}.)

@cindex flushing the internal buffer
@cindex YY_FLUSH_BUFFER
@code{YY_FLUSH_BUFFER} flushes the scanner's internal buffer so that
the next time the scanner attempts to match a token, it will first
refill the buffer using @code{YY_INPUT()} (@pxref{Generated Scanner}).
This action is a special case of the more general
@code{yy_flush_buffer()} function, described below (@pxref{Multiple
Input Buffers}).

@cindex yyterminate()
@cindex terminating with yyterminate()
@cindex exiting with yyterminate()
@cindex halting with yyterminate()
@code{yyterminate()} can be used in lieu of a return statement in an
action. It terminates the scanner and returns a 0 to the scanner's
caller, indicating ``all done''. By default, @code{yyterminate()} is
also called when an end-of-file is encountered. It is a macro and may
be redefined.
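
Because @code{yyterminate()} is a macro, a definition placed in the
first section of the input replaces the default one. The following is a
sketch only; it assumes a hypothetical token code @code{TOK_EOF}, so
that the caller sees an explicit end-of-input token instead of 0:

@example
@verbatim
%option noyywrap
%{
#include <stdio.h>
/* Hypothetical token codes for this sketch. */
enum { TOK_WORD = 1, TOK_EOF = 2 };
/* Replace the default definition, which returns 0. */
#define yyterminate() return TOK_EOF
%}
%%
[a-z]+      return TOK_WORD;
.|\n        /* skip everything else */
%%
int main( void )
{
    int tok;

    while ( (tok = yylex()) != TOK_EOF )
        printf( "word: %s\n", yytext );
    return 0;
}
@end verbatim
@end example

A caller written this way must test for @code{TOK_EOF} rather than 0,
since the redefined macro is also what runs when end-of-file is
reached.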

@node Generated Scanner, Start Conditions, Actions, Top
@chapter The Generated Scanner

@cindex yylex(), in generated scanner
The output of @code{flex} is the file @file{lex.yy.c}, which contains
the scanning routine @code{yylex()}, a number of tables used by it for
matching tokens, and a number of auxiliary routines and macros. By
default, @code{yylex()} is declared as follows:

@example
@verbatim
    int yylex()
        {
        ... various definitions and the actions in here ...
        }
@end verbatim
@end example

@cindex yylex(), overriding
(If your environment supports function prototypes, then it will be
@code{int yylex( void )}.) This definition may be changed by defining
the @code{YY_DECL} macro. For example, you could use:

@cindex yylex, overriding the prototype of
@example
@verbatim
    #define YY_DECL float lexscan( a, b ) float a, b;
@end verbatim
@end example

to give the scanning routine the name @code{lexscan}, returning a float,
and taking two floats as arguments. Note that if you give arguments to
the scanning routine using a K&R-style/non-prototyped function
declaration, you must terminate the definition with a semicolon (;).

@code{flex} generates @samp{C99} function definitions by
default. However, flex does have the ability to generate obsolete, er,
@samp{traditional}, function definitions. This is to support
bootstrapping gcc on old systems. Unfortunately, traditional
definitions prevent us from using any standard data types smaller than
int (such as short, char, or bool) as function arguments. For this
reason, future versions of @code{flex} may generate standard C99 code
only, leaving K&R-style functions to the historians. Currently, if you
do @strong{not} want @samp{C99} definitions, then you must use
@code{%option noansi-definitions}.
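
With the default @samp{C99} definitions, @code{YY_DECL} can also be
written in prototype style, in which case no trailing semicolon is
needed. The following is a sketch only; the name @code{lexscan} is
borrowed from the example above, and the @code{count} parameter is made
up for illustration:

@example
@verbatim
%option noyywrap
%{
#include <stdio.h>
/* Prototype-style replacement for the default "int yylex( void )".
 * The parameter is hypothetical: a counter the actions may update. */
#define YY_DECL int lexscan( int *count )
%}
%%
[a-z]+      { ++*count; return 1; }
.|\n        /* ignore everything else */
%%
int main( void )
{
    int words = 0;

    /* lexscan() returns 0 at end-of-file, like the default yylex(). */
    while ( lexscan( &words ) != 0 )
        ;
    printf( "%d words\n", words );
    return 0;
}
@end verbatim
@end example

Since the actions are compiled inside the scanning routine, they can
refer to its parameters directly, as the first rule does with
@code{count}.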

@cindex stdin, default for yyin
@cindex yyin
Whenever @code{yylex()} is called, it scans tokens from the global input
file @file{yyin} (which defaults to stdin). It continues until it
either reaches an end-of-file (at which point it returns the value 0) or
one of its actions executes a @code{return} statement.

@cindex EOF and yyrestart()
@cindex end-of-file, and yyrestart()
@cindex yyrestart()
If the scanner reaches an end-of-file, the behavior of subsequent calls
is undefined unless either @file{yyin} is pointed at a new input file
(in which case scanning continues from that file), or @code{yyrestart()}
is called. @code{yyrestart()} takes one argument, a @code{FILE *} pointer
(which can be NULL, if you've set up @code{YY_INPUT} to scan from a source
other than @code{yyin}), and initializes @file{yyin} for scanning from that
file. Essentially, there is no difference between just assigning
@file{yyin} to a new input file or using @code{yyrestart()} to do so;
the latter is available for compatibility with previous versions of
@code{flex}, and because it can be used to switch input files in the
middle of scanning. It can also be used to throw away the current input
buffer, by calling it with an argument of @file{yyin}; but it would be
better to use @code{YY_FLUSH_BUFFER} (@pxref{Actions}). Note that
@code{yyrestart()} does @emph{not} reset the start condition to
@code{INITIAL} (@pxref{Start Conditions}).

@cindex RETURN, within actions
If @code{yylex()} stops scanning due to executing a @code{return}
statement in one of the actions, the scanner may then be called again
and it will resume scanning where it left off.

@cindex YY_INPUT
By default (and for purposes of efficiency), the scanner uses
block-reads rather than simple @code{getc()} calls to read characters
from @file{yyin}.
The nature of how it gets its input can be controlled
by defining the @code{YY_INPUT} macro. The calling sequence for
@code{YY_INPUT()} is @code{YY_INPUT(buf,result,max_size)}. Its action
is to place up to @code{max_size} characters in the character array
@code{buf} and return in the integer variable @code{result} either the
number of characters read or the constant @code{YY_NULL} (0 on Unix
systems) to indicate @samp{EOF}. The default @code{YY_INPUT} reads from
the global file-pointer @file{yyin}.

@cindex YY_INPUT, overriding
Here is a sample definition of @code{YY_INPUT} (in the definitions
section of the input file):

@example
@verbatim
    %{
    #define YY_INPUT(buf,result,max_size) \
        { \
        int c = getchar(); \
        result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
        }
    %}
@end verbatim
@end example

This definition will change the input processing to occur one character
at a time.

@cindex yywrap()
When the scanner receives an end-of-file indication from @code{YY_INPUT}, it
then checks the @code{yywrap()} function. If @code{yywrap()} returns
false (zero), then it is assumed that the function has gone ahead and
set up @file{yyin} to point to another input file, and scanning
continues. If it returns true (non-zero), then the scanner terminates,
returning 0 to its caller. Note that in either case, the start
condition remains unchanged; it does @emph{not} revert to
@code{INITIAL}.

@cindex yywrap, default for
@cindex noyywrap, %option
@cindex %option noyywrap
If you do not supply your own version of @code{yywrap()}, then you must
either use @code{%option noyywrap} (in which case the scanner behaves as
though @code{yywrap()} returned 1), or you must link with @samp{-lfl} to
obtain the default version of the routine, which always returns 1.
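
A user-supplied @code{yywrap()} that moves on to the next input file
might look like the following sketch. The variable @code{next_file} and
the convention of taking file names from the command line are
assumptions made for this example, not something @code{flex} provides:

@example
@verbatim
%{
#include <stdio.h>
/* Hypothetical bookkeeping for this sketch: the file names still
 * to be scanned, terminated by a NULL pointer. */
static char **next_file;
%}
%%
.|\n        ECHO;
%%
/* Return 0 after pointing yyin at the next readable file,
 * or 1 when no files remain. */
int yywrap( void )
{
    while ( *next_file != NULL ) {
        FILE *fp = fopen( *next_file++, "r" );

        if ( fp != NULL ) {
            if ( yyin != NULL && yyin != stdin )
                fclose( yyin );
            yyin = fp;
            return 0;
        }
    }
    return 1;
}

int main( int argc, char **argv )
{
    (void) argc;
    next_file = argv + 1;    /* argv[argc] is NULL, so the list ends */
    if ( yywrap() )          /* open the first file; stop if none */
        return 0;
    yylex();
    return 0;
}
@end verbatim
@end example

The effect is that of a simple file concatenator: files that cannot be
opened are skipped, and the scanner terminates once the list is
exhausted.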

For scanning from in-memory buffers (e.g., scanning strings), see
@ref{Scanning Strings}. @xref{Multiple Input Buffers}.

@cindex ECHO, and yyout
@cindex yyout
@cindex stdout, as default for yyout
The scanner writes its @code{ECHO} output to the @file{yyout} global
(default, @file{stdout}), which may be redefined by the user simply by
assigning it to some other @code{FILE} pointer.

@node Start Conditions, Multiple Input Buffers, Generated Scanner, Top
@chapter Start Conditions

@cindex start conditions
@code{flex} provides a mechanism for conditionally activating rules.
Any rule whose pattern is prefixed with @samp{<sc>} will only be active
when the scanner is in the @dfn{start condition} named @code{sc}. For
example,

@c proofread edit stopped here
@example
@verbatim
    <STRING>[^"]*        { /* eat up the string body ... */
                ...
                }
@end verbatim
@end example

will be active only when the scanner is in the @code{STRING} start
condition, and

@cindex start conditions, multiple
@example
@verbatim
    <INITIAL,STRING,QUOTE>\.        { /* handle an escape ... */
                ...
                }
@end verbatim
@end example

will be active only when the current start condition is either
@code{INITIAL}, @code{STRING}, or @code{QUOTE}.

@cindex start conditions, inclusive vs.@: exclusive
Start conditions are declared in the definitions (first) section of the
input using unindented lines beginning with either @samp{%s} or
@samp{%x} followed by a list of names. The former declares
@dfn{inclusive} start conditions, the latter @dfn{exclusive} start
conditions. A start condition is activated using the @code{BEGIN}
action. Until the next @code{BEGIN} action is executed, rules with the
given start condition will be active and rules with other start
conditions will be inactive.
If the start condition is inclusive, then
rules with no start conditions at all will also be active. If it is
exclusive, then @emph{only} rules qualified with the start condition
will be active. A set of rules contingent on the same exclusive start
condition describes a scanner which is independent of any of the other
rules in the @code{flex} input. Because of this, exclusive start
conditions make it easy to specify ``mini-scanners'' which scan portions
of the input that are syntactically different from the rest (e.g.,
comments).

If the distinction between inclusive and exclusive start conditions
is still a little vague, here's a simple example illustrating the
connection between the two. The set of rules:

@cindex start conditions, inclusive
@example
@verbatim
    %s example
    %%

    <example>foo   do_something();

    bar            something_else();
@end verbatim
@end example

is equivalent to

@cindex start conditions, exclusive
@example
@verbatim
    %x example
    %%

    <example>foo   do_something();

    <INITIAL,example>bar    something_else();
@end verbatim
@end example

Without the @code{<INITIAL,example>} qualifier, the @code{bar} pattern in
the second example wouldn't be active (i.e., couldn't match) when in
start condition @code{example}. If we just used @code{<example>} to
qualify @code{bar}, though, then it would only be active in
@code{example} and not in @code{INITIAL}, while in the first example
it's active in both, because in the first example the @code{example}
start condition is an inclusive (@code{%s}) start condition.

@cindex start conditions, special wildcard condition
Also note that the special start-condition specifier
@code{<*>}
matches every start condition.
Thus, the above example could also
have been written:

@cindex start conditions, use of wildcard condition (<*>)
@example
@verbatim
    %x example
    %%

    <example>foo   do_something();

    <*>bar    something_else();
@end verbatim
@end example

The default rule (to @code{ECHO} any unmatched character) remains active
in start conditions. It is equivalent to:

@cindex start conditions, behavior of default rule
@example
@verbatim
    <*>.|\n     ECHO;
@end verbatim
@end example

@cindex BEGIN, explanation
@findex BEGIN
@vindex INITIAL
@code{BEGIN(0)} returns to the original state where only the rules with
no start conditions are active. This state can also be referred to as
the start-condition @code{INITIAL}, so @code{BEGIN(INITIAL)} is
equivalent to @code{BEGIN(0)}. (The parentheses around the start
condition name are not required but are considered good style.)

@code{BEGIN} actions can also be given as indented code at the beginning
of the rules section. For example, the following will cause the scanner
to enter the @code{SPECIAL} start condition whenever @code{yylex()} is
called and the global variable @code{enter_special} is true:

@cindex start conditions, using BEGIN
@example
@verbatim
            int enter_special;

    %x SPECIAL
    %%
            if ( enter_special )
                BEGIN(SPECIAL);

    <SPECIAL>blahblahblah
    ...more rules follow...
@end verbatim
@end example

To illustrate the uses of start conditions, here is a scanner which
provides two different interpretations of a string like @samp{123.456}.
By default it will treat it as three tokens, the integer @samp{123}, a
dot (@samp{.}), and the integer @samp{456}.
But if the string is
preceded earlier in the line by the string @samp{expect-floats} it will
treat it as a single token, the floating-point number @samp{123.456}:

@cindex start conditions, for different interpretations of same input
@example
@verbatim
    %{
    #include <stdlib.h>    /* atof(), atoi() */
    %}
    %s expect

    %%
    expect-floats        BEGIN(expect);

    <expect>[0-9]+"."[0-9]+      {
                printf( "found a float, = %f\n",
                        atof( yytext ) );
                }
    <expect>\n           {
                /* that's the end of the line, so
                 * we need another "expect-floats"
                 * before we'll recognize any more
                 * numbers
                 */
                BEGIN(INITIAL);
                }

    [0-9]+      {
                printf( "found an integer, = %d\n",
                        atoi( yytext ) );
                }

    "."         printf( "found a dot\n" );
@end verbatim
@end example

@cindex comments, example of scanning C comments
Here is a scanner which recognizes (and discards) C comments while
maintaining a count of the current input line.

@cindex recognizing C comments
@example
@verbatim
    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\n]*        /* eat anything that's not a '*' */
    <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
    <comment>\n             ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);
@end verbatim
@end example

This scanner goes to a bit of trouble to match as much
text as possible with each rule. In general, when attempting to write
a high-speed scanner try to match as much as possible in each rule, as
it's a big win.

Note that start condition names are really integer values and
can be stored as such.
Thus, the above could be extended in the
following fashion:

@cindex start conditions, integer values
@cindex using integer values of start condition names
@example
@verbatim
    %x comment foo
    %%
            int line_num = 1;
            int comment_caller;

    "/*"         {
                 comment_caller = INITIAL;
                 BEGIN(comment);
                 }

    ...

    <foo>"/*"    {
                 comment_caller = foo;
                 BEGIN(comment);
                 }

    <comment>[^*\n]*        /* eat anything that's not a '*' */
    <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
    <comment>\n             ++line_num;
    <comment>"*"+"/"        BEGIN(comment_caller);
@end verbatim
@end example

@cindex YY_START, example
Furthermore, you can access the current start condition using the
integer-valued @code{YY_START} macro. For example, the above
assignments to @code{comment_caller} could instead be written

@cindex getting current start state with YY_START
@example
@verbatim
    comment_caller = YY_START;
@end verbatim
@end example

@vindex YY_START
Flex provides @code{YYSTATE} as an alias for @code{YY_START} (since that
is what's used by AT&T @code{lex}).

For historical reasons, start conditions do not have their own
name-space within the generated scanner. The start condition names are
unmodified in the generated scanner and generated header.
@xref{option-header}. @xref{option-prefix}.

Finally, here's an example of how to match C-style quoted strings using
exclusive start conditions, including expanded escape sequences (but
not including checking for a string that's too long):

@cindex matching C-style double-quoted strings
@example
@verbatim
    %x str

    %%
            char string_buf[MAX_STR_CONST];
            char *string_buf_ptr;


    \"      string_buf_ptr = string_buf; BEGIN(str);

    <str>\"        { /* saw closing quote - all done */
            BEGIN(INITIAL);
            *string_buf_ptr = '\0';
            /* return string constant token type and
             * value to parser
             */
            }

    <str>\n        {
            /* error - unterminated string constant */
            /* generate error message */
            }

    <str>\\[0-7]{1,3} {
            /* octal escape sequence */
            unsigned int result;    /* "%o" expects an unsigned int */

            (void) sscanf( yytext + 1, "%o", &result );

            if ( result > 0xff )
                    {
                    /* error, constant is out-of-bounds */
                    }
            else
                    *string_buf_ptr++ = result;
            }

    <str>\\[0-9]+ {
            /* generate error - bad escape sequence; something
             * like '\48' or '\0777777'
             */
            }

    <str>\\n  *string_buf_ptr++ = '\n';
    <str>\\t  *string_buf_ptr++ = '\t';
    <str>\\r  *string_buf_ptr++ = '\r';
    <str>\\b  *string_buf_ptr++ = '\b';
    <str>\\f  *string_buf_ptr++ = '\f';

    <str>\\(.|\n)  *string_buf_ptr++ = yytext[1];

    <str>[^\\\n\"]+        {
            char *yptr = yytext;

            while ( *yptr )
                    *string_buf_ptr++ = *yptr++;
            }
@end verbatim
@end example

@cindex start condition, applying to multiple patterns
Often, such as in some of the examples above, you wind up writing a
whole bunch of rules all preceded by the same start condition(s). Flex
makes this a little easier and cleaner by introducing a notion of start
condition @dfn{scope}.
A start condition scope is begun with:

@example
@verbatim
    <SCs>{
@end verbatim
@end example

where @code{SCs} is a list of one or more start conditions. Inside the
start condition scope, every rule automatically has the prefix
@code{<SCs>} applied to it, until a @samp{@}} which matches the initial
@samp{@{}. So, for example,

@cindex extended scope of start conditions
@example
@verbatim
    <ESC>{
        "\\n"   return '\n';
        "\\r"   return '\r';
        "\\f"   return '\f';
        "\\0"   return '\0';
    }
@end verbatim
@end example

is equivalent to:

@example
@verbatim
    <ESC>"\\n"  return '\n';
    <ESC>"\\r"  return '\r';
    <ESC>"\\f"  return '\f';
    <ESC>"\\0"  return '\0';
@end verbatim
@end example

Start condition scopes may be nested.

@cindex stacks, routines for manipulating
@cindex start conditions, use of a stack

The following routines are available for manipulating stacks of start conditions:

@deftypefun void yy_push_state ( int @code{new_state} )
pushes the current start condition onto the top of the start condition
stack and switches to
@code{new_state}
as though you had used
@code{BEGIN new_state}
(recall that start condition names are also integers).
@end deftypefun

@deftypefun void yy_pop_state ()
pops the top of the stack and switches to it via
@code{BEGIN}.
@end deftypefun

@deftypefun int yy_top_state ()
returns the top of the stack without altering the stack's contents.
@end deftypefun

@cindex memory, for start condition stacks
The start condition stack grows dynamically and so has no built-in size
limitation. If memory is exhausted, program execution aborts.

To use start condition stacks, your scanner must include a
@code{%option stack} directive (@pxref{Scanner Options}).
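
The stack routines above can replace a hand-maintained ``caller''
variable such as @code{comment_caller}. The following is a minimal
sketch rather than a complete scanner: the start condition names
@code{expr} and @code{comment} are made up for illustration, and the
rules assume comments may begin either at top level or inside a
parenthesized expression.

@example
@verbatim
    %option stack
    %x expr comment

    %%
    "("                 yy_push_state(expr);
    <expr>")"           yy_pop_state();

    <INITIAL,expr>"/*"  yy_push_state(comment);
    <comment>"*/"       yy_pop_state();  /* back to INITIAL or expr */
    <comment>[^*\n]+    /* eat comment text */
    <comment>"*"        /* eat a lone '*' */
    <comment>\n         /* eat newlines inside the comment */
@end verbatim
@end example

Since each @code{yy_pop_state()} returns to whichever condition was
active when the matching @code{yy_push_state()} ran, the comment rules
need not know where the comment started.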

@node Multiple Input Buffers, EOF, Start Conditions, Top
@chapter Multiple Input Buffers

@cindex multiple input streams
Some scanners (such as those which support ``include'' files) require
reading from several input streams. As @code{flex} scanners do a large
amount of buffering, one cannot control where the next input will be
read from by simply writing a @code{YY_INPUT()} which is sensitive to
the scanning context. @code{YY_INPUT()} is only called when the scanner
reaches the end of its buffer, which may be a long time after scanning a
statement such as an @code{include} statement which requires switching
the input source.

To negotiate these sorts of problems, @code{flex} provides a mechanism
for creating and switching between multiple input buffers. An input
buffer is created by using:

@cindex memory, allocating input buffers
@deftypefun YY_BUFFER_STATE yy_create_buffer ( FILE *file, int size )
@end deftypefun

which takes a @code{FILE} pointer and a size and creates a buffer
associated with the given file and large enough to hold @code{size}
characters (when in doubt, use @code{YY_BUF_SIZE} for the size). It
returns a @code{YY_BUFFER_STATE} handle, which may then be passed to
other routines (see below).
@tindex YY_BUFFER_STATE
The @code{YY_BUFFER_STATE} type is a
pointer to an opaque @code{struct yy_buffer_state} structure, so you may
safely initialize @code{YY_BUFFER_STATE} variables to
@code{((YY_BUFFER_STATE) 0)} if you wish, and also refer to the opaque
structure in order to
correctly declare input buffers in source files other than that of your
scanner. Note that the @code{FILE} pointer in the call to
@code{yy_create_buffer} is only used as the value of @file{yyin} seen by
@code{YY_INPUT}.
If you redefine @code{YY_INPUT()} so it no longer uses
@file{yyin}, then you can safely pass a NULL @code{FILE} pointer to
@code{yy_create_buffer}. You select a particular buffer to scan from
using:

@deftypefun void yy_switch_to_buffer ( YY_BUFFER_STATE new_buffer )
@end deftypefun

The above function switches the scanner's input buffer so subsequent tokens
will come from @code{new_buffer}. Note that @code{yy_switch_to_buffer()} may
be used by @code{yywrap()} to set things up for continued scanning, instead of
opening a new file and pointing @file{yyin} at it. If you are looking for a
stack of input buffers, then you want to use @code{yypush_buffer_state()}
instead of this function. Note also that switching input sources via either
@code{yy_switch_to_buffer()} or @code{yywrap()} does @emph{not} change the
start condition.

@cindex memory, deleting input buffers
@deftypefun void yy_delete_buffer ( YY_BUFFER_STATE buffer )
@end deftypefun

is used to reclaim the storage associated with a buffer. (@code{buffer}
can be NULL, in which case the routine does nothing.) To switch to a
new buffer while keeping the current one for later, use:

@cindex pushing an input buffer
@cindex stack, input buffer push
@deftypefun void yypush_buffer_state ( YY_BUFFER_STATE buffer )
@end deftypefun

This function pushes the new buffer state onto an internal stack. The pushed
state becomes the new current state. The stack is maintained by flex and will
grow as required. This function is intended to be used instead of
@code{yy_switch_to_buffer}, when you want to change states, but preserve the
current state for later use.

@cindex popping an input buffer
@cindex stack, input buffer pop
@deftypefun void yypop_buffer_state ( )
@end deftypefun

This function removes the current state from the top of the stack, and deletes
it by calling @code{yy_delete_buffer}. The next state on the stack, if any,
becomes the new current state.

@cindex clearing an input buffer
@cindex flushing an input buffer
@deftypefun void yy_flush_buffer ( YY_BUFFER_STATE buffer )
@end deftypefun

This function discards the buffer's contents,
so the next time the scanner attempts to match a token from the
buffer, it will first fill the buffer anew using
@code{YY_INPUT()}.

@deftypefun YY_BUFFER_STATE yy_new_buffer ( FILE *file, int size )
@end deftypefun

is an alias for @code{yy_create_buffer()},
provided for compatibility with the C++ use of @code{new} and
@code{delete} for creating and destroying dynamic objects.

@cindex YY_CURRENT_BUFFER, and multiple buffers
Finally, the
@code{YY_CURRENT_BUFFER} macro returns a @code{YY_BUFFER_STATE} handle to the
current buffer. It should not be used as an lvalue.

@cindex EOF, example using multiple input buffers
Here are two examples of using these features for writing a scanner
which expands include files (the
@code{<<EOF>>}
feature is discussed below).

This first example uses @code{yypush_buffer_state} and
@code{yypop_buffer_state}. Flex maintains the stack internally.

@cindex handling include files with multiple input buffers
@example
@verbatim
    /* the "incl" state is used for picking up the name
     * of an include file
     */
    %x incl
    %%
    include             BEGIN(incl);

    [a-z]+              ECHO;
    [^a-z\n]*\n?        ECHO;

    <incl>[ \t]*      /* eat the whitespace */
    <incl>[^ \t\n]+   { /* got the include file name */
            yyin = fopen( yytext, "r" );

            if ( ! yyin )
                error( ... );

            yypush_buffer_state(yy_create_buffer( yyin, YY_BUF_SIZE ));

            BEGIN(INITIAL);
            }

    <<EOF>> {
            yypop_buffer_state();

            if ( !YY_CURRENT_BUFFER )
                {
                yyterminate();
                }
            }
@end verbatim
@end example

The second example, below, does the same thing as the previous example did, but
manages its own input buffer stack manually (instead of letting flex do it).

@cindex handling include files with multiple input buffers
@example
@verbatim
    /* the "incl" state is used for picking up the name
     * of an include file
     */
    %x incl

    %{
    #define MAX_INCLUDE_DEPTH 10
    YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
    int include_stack_ptr = 0;
    %}

    %%
    include             BEGIN(incl);

    [a-z]+              ECHO;
    [^a-z\n]*\n?        ECHO;

    <incl>[ \t]*      /* eat the whitespace */
    <incl>[^ \t\n]+   { /* got the include file name */
            if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
                {
                fprintf( stderr, "Includes nested too deeply" );
                exit( 1 );
                }

            include_stack[include_stack_ptr++] =
                YY_CURRENT_BUFFER;

            yyin = fopen( yytext, "r" );

            if ( ! yyin )
                error( ... );

            yy_switch_to_buffer(
                yy_create_buffer( yyin, YY_BUF_SIZE ) );

            BEGIN(INITIAL);
            }

    <<EOF>> {
            if ( --include_stack_ptr < 0 )
                {
                yyterminate();
                }

            else
                {
                yy_delete_buffer( YY_CURRENT_BUFFER );
                yy_switch_to_buffer(
                     include_stack[include_stack_ptr] );
                }
            }
@end verbatim
@end example

@anchor{Scanning Strings}
@cindex strings, scanning strings instead of files
The following routines are available for setting up input buffers for
scanning in-memory strings instead of files.
All of them create a new
input buffer for scanning the string, and return a corresponding
@code{YY_BUFFER_STATE} handle (which you should delete with
@code{yy_delete_buffer()} when done with it). They also switch to the
new buffer using @code{yy_switch_to_buffer()}, so the next call to
@code{yylex()} will start scanning the string.

@deftypefun YY_BUFFER_STATE yy_scan_string ( const char *str )
scans a NUL-terminated string.
@end deftypefun

@deftypefun YY_BUFFER_STATE yy_scan_bytes ( const char *bytes, int len )
scans @code{len} bytes (including possibly @code{NUL}s) starting at location
@code{bytes}.
@end deftypefun

Note that both of these functions create and scan a @emph{copy} of the
string or bytes. (This may be desirable, since @code{yylex()} modifies
the contents of the buffer it is scanning.) You can avoid the copy by
using:

@vindex YY_END_OF_BUFFER_CHAR
@deftypefun YY_BUFFER_STATE yy_scan_buffer (char *base, yy_size_t size)
which scans in place the buffer starting at @code{base}, consisting of
@code{size} bytes, the last two bytes of which @emph{must} be
@code{YY_END_OF_BUFFER_CHAR} (ASCII NUL). These last two bytes are not
scanned; thus, scanning consists of @code{base[0]} through
@code{base[size-2]}, inclusive.
@end deftypefun

If you fail to set up @code{base} in this manner (i.e., forget the final
two @code{YY_END_OF_BUFFER_CHAR} bytes), then @code{yy_scan_buffer()}
returns a NULL pointer instead of creating a new input buffer.

@deftp {Data type} yy_size_t
is an integral type to which you can cast an integer expression
reflecting the size of the buffer.
@end deftp

@node EOF, Misc Macros, Multiple Input Buffers, Top
@chapter End-of-File Rules

@cindex EOF, explanation
The special rule @code{<<EOF>>} indicates
actions which are to be taken when an end-of-file is
encountered and @code{yywrap()} returns non-zero (i.e., indicates
no further files to process). The action must finish
by doing one of the following things:

@itemize
@item
@findex YY_NEW_FILE (now obsolete)
assigning @file{yyin} to a new input file (in previous versions of
@code{flex}, after doing the assignment you had to call the special
action @code{YY_NEW_FILE}; this is no longer necessary);

@item
executing a @code{return} statement;

@item
executing the special @code{yyterminate()} action;

@item
or, switching to a new buffer using @code{yy_switch_to_buffer()} as
shown in the example above.
@end itemize

@code{<<EOF>>} rules may not be used with other patterns; they may only be
qualified with a list of start conditions. If an unqualified @code{<<EOF>>}
rule is given, it applies to @emph{all} start conditions which do not
already have @code{<<EOF>>} actions. To specify an @code{<<EOF>>} rule for
only the initial start condition, use:

@example
@verbatim
    <INITIAL><<EOF>>
@end verbatim
@end example

These rules are useful for catching things like unclosed comments. An
example:

@cindex <<EOF>>, use of
@example
@verbatim
    %x quote
    %%

    ...other rules for dealing with quotes...

    <quote><<EOF>>   {
             error( "unterminated quote" );
             yyterminate();
             }
    <<EOF>>  {
             if ( *++filelist )
                 yyin = fopen( *filelist, "r" );
             else
                 yyterminate();
             }
@end verbatim
@end example

@node Misc Macros, User Values, EOF, Top
@chapter Miscellaneous Macros

@hkindex YY_USER_ACTION
The macro @code{YY_USER_ACTION} can be defined to provide an action
which is always executed prior to the matched rule's action. For
example, it could be @code{#define}d to call a routine to convert
@code{yytext} to
lower-case. When @code{YY_USER_ACTION} is invoked, the variable
@code{yy_act} gives the number of the matched rule (rules are numbered
starting with 1). Suppose you want to profile how often each of your
rules is matched. The following would do the trick:

@cindex YY_USER_ACTION to track each time a rule is matched
@example
@verbatim
    #define YY_USER_ACTION ++ctr[yy_act]
@end verbatim
@end example

@vindex YY_NUM_RULES
where @code{ctr} is an array to hold the counts for the different rules.
Note that the macro @code{YY_NUM_RULES} gives the total number of rules
(including the default rule), even if you use @samp{-s}, so a correct
declaration for @code{ctr} is:

@example
@verbatim
    int ctr[YY_NUM_RULES];
@end verbatim
@end example

@hkindex YY_USER_INIT
The macro @code{YY_USER_INIT} may be defined to provide an action which
is always executed before the first scan (and before the scanner's
internal initializations are done). For example, it could be used to
call a routine to read in a data table or open a logging file.

@findex yy_set_interactive
The macro @code{yy_set_interactive(is_interactive)} can be used to
control whether the current buffer is considered @dfn{interactive}.
An
interactive buffer is processed more slowly, but must be used when the
scanner's input source is indeed interactive to avoid problems due to
waiting to fill buffers (see the discussion of the @samp{-I} flag in
@ref{Scanner Options}). A non-zero value in the macro invocation marks
the buffer as interactive, a zero value as non-interactive. Note that
use of this macro overrides @code{%option always-interactive} or
@code{%option never-interactive} (@pxref{Scanner Options}).
@code{yy_set_interactive()} must be invoked prior to beginning to scan
the buffer that is (or is not) to be considered interactive.

@cindex BOL, setting it
@findex yy_set_bol
The macro @code{yy_set_bol(at_bol)} can be used to control whether the
current buffer's scanning context for the next token match is done as
though at the beginning of a line. A non-zero macro argument makes
rules anchored with @samp{^} active, while a zero argument makes
@samp{^} rules inactive.

@cindex BOL, checking the BOL flag
@findex YY_AT_BOL
The macro @code{YY_AT_BOL()} returns true if the next token scanned from
the current buffer will have @samp{^} rules active, false otherwise.

@cindex actions, redefining YY_BREAK
@hkindex YY_BREAK
In the generated scanner, the actions are all gathered in one large
switch statement and separated using @code{YY_BREAK}, which may be
redefined. By default, it is simply a @code{break}, to separate each
rule's action from the following rule's. Redefining @code{YY_BREAK}
allows, for example, C++ users to @code{#define YY_BREAK} to do nothing
(while being very careful that every rule ends with a @code{break} or a
@code{return}!) to avoid unreachable-statement warnings: when a rule's
action ends with @code{return}, the @code{YY_BREAK} that follows it can
never be reached.
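
As a small illustration of @code{YY_USER_ACTION} running before every
rule's action, the sketch below keeps a running total of the characters
matched so far. The counter name @code{chars_seen} and the two rules are
assumptions chosen for the example; only @code{YY_USER_ACTION},
@code{yyleng}, and @code{yylex()} come from @code{flex} itself.

@example
@verbatim
    %option noyywrap

    %{
    #include <stdio.h>

    static long chars_seen = 0;  /* hypothetical running total */

    /* executed before the action of every matched rule */
    #define YY_USER_ACTION chars_seen += yyleng;
    %}

    %%
    [a-z]+    /* a word; no action needed */
    .|\n      /* anything else */
    %%

    int main( void )
        {
        yylex();
        printf( "%ld characters scanned\n", chars_seen );
        return 0;
        }
@end verbatim
@end example

Because the macro is expanded inside each rule, the individual rule
actions stay free of bookkeeping code.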

@node User Values, Yacc, Misc Macros, Top
@chapter Values Available To the User

This chapter summarizes the various values available to the user in the
rule actions.

@table @code
@vindex yytext
@item char *yytext
holds the text of the current token. It may be modified but not
lengthened (you cannot append characters to the end).

@cindex yytext, default array size
@cindex array, default size for yytext
@vindex YYLMAX
If the special directive @code{%array} appears in the first section of
the scanner description, then @code{yytext} is instead declared
@code{char yytext[YYLMAX]}, where @code{YYLMAX} is a macro definition
that you can redefine in the first section if you don't like the default
value (generally 8KB). Using @code{%array} results in somewhat slower
scanners, but the value of @code{yytext} becomes immune to calls to
@code{unput()}, which potentially destroy its value when @code{yytext} is
a character pointer. The opposite of @code{%array} is @code{%pointer},
which is the default.

@cindex C++ and %array
You cannot use @code{%array} when generating C++ scanner classes (the
@samp{-+} flag).

@vindex yyleng
@item int yyleng
holds the length of the current token.

@vindex yyin
@item FILE *yyin
is the file which by default @code{flex} reads from. It may be
redefined but doing so only makes sense before scanning begins or after
an EOF has been encountered. Changing it in the midst of scanning will
have unexpected results since @code{flex} buffers its input; use
@code{yyrestart()} instead. Once scanning terminates because an
end-of-file has been seen, you can point @file{yyin} at the new input
file and then call the scanner again to continue scanning.

@findex yyrestart
@item void yyrestart( FILE *new_file )
may be called to point @file{yyin} at the new input file.
The
switch-over to the new file is immediate (any previously buffered-up
input is lost). Note that calling @code{yyrestart()} with @file{yyin}
as an argument thus throws away the current input buffer and continues
scanning the same input file.

@vindex yyout
@item FILE *yyout
is the file to which @code{ECHO} actions are done. It can be reassigned
by the user.

@vindex YY_CURRENT_BUFFER
@item YY_CURRENT_BUFFER
returns a @code{YY_BUFFER_STATE} handle to the current buffer.

@vindex YY_START
@item YY_START
returns an integer value corresponding to the current start condition.
You can subsequently use this value with @code{BEGIN} to return to that
start condition.
@end table

@node Yacc, Scanner Options, User Values, Top
@chapter Interfacing with Yacc

@cindex yacc, interface

@vindex yylval, with yacc
One of the main uses of @code{flex} is as a companion to the @code{yacc}
parser-generator. @code{yacc} parsers expect to call a routine named
@code{yylex()} to find the next input token. The routine is supposed to
return the type of the next token as well as putting any associated
value in the global @code{yylval}. To use @code{flex} with @code{yacc},
one specifies the @samp{-d} option to @code{yacc} to instruct it to
generate the file @file{y.tab.h} containing definitions of all the
@code{%token}s appearing in the @code{yacc} input. This file is then
included in the @code{flex} scanner.
For example, if one of the tokens
is @code{TOK_NUMBER}, part of the scanner might look like:

@cindex yacc interface
@example
@verbatim
    %{
    #include "y.tab.h"
    %}

    %%

    [0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;
@end verbatim
@end example

@node Scanner Options, Performance, Yacc, Top
@chapter Scanner Options

@cindex command-line options
@cindex options, command-line
@cindex arguments, command-line

The various @code{flex} options are categorized by function in the following
menu. If you want to look up a particular option by name, see
@ref{Index of Scanner Options}.

@menu
* Options for Specifying Filenames::
* Options Affecting Scanner Behavior::
* Code-Level And API Options::
* Options for Scanner Speed and Size::
* Debugging Options::
* Miscellaneous Options::
@end menu

Even though there are many scanner options, a typical scanner might only
specify the following options:

@example
@verbatim
%option   8bit reentrant bison-bridge
%option   warn nodefault
%option   yylineno
%option   outfile="scanner.c" header-file="scanner.h"
@end verbatim
@end example

The first line specifies the general type of scanner we want. The second line
specifies that we are being careful. The third line asks flex to track line
numbers. The last line tells flex what to name the files. (The options can be
specified in any order. We just divided them.)

@code{flex} also provides a mechanism for controlling options within the
scanner specification itself, rather than from the flex command-line.
This is done by including @code{%option} directives in the first section
of the scanner specification. You can specify multiple options with a
single @code{%option} directive, and multiple directives in the first
section of your flex input file.

Most options are given simply as names, optionally preceded by the
word @samp{no} (with no intervening whitespace) to negate their meaning.
The names are the same as their long-option equivalents (but without the
leading @samp{--}).

@code{flex} scans your rule actions to determine whether you use the
@code{REJECT} or @code{yymore()} features. The @code{reject} and
@code{yymore} options are available to override its decision as to
whether you use these features, either by setting them (e.g.,
@code{%option reject}) to indicate the feature is indeed used, or
unsetting them to indicate it actually is not used (e.g.,
@code{%option noyymore}).


A number of options are available for lint purists who want to suppress
the appearance of unneeded routines in the generated scanner. Each of
the following, if unset (e.g., @code{%option nounput}), results in the
corresponding routine not appearing in the generated scanner:

@example
@verbatim
    input, unput
    yy_push_state, yy_pop_state, yy_top_state
    yy_scan_buffer, yy_scan_bytes, yy_scan_string

    yyget_extra, yyset_extra, yyget_leng, yyget_text,
    yyget_lineno, yyset_lineno, yyget_in, yyset_in,
    yyget_out, yyset_out, yyget_lval, yyset_lval,
    yyget_lloc, yyset_lloc, yyget_debug, yyset_debug
@end verbatim
@end example

(though @code{yy_push_state()} and friends won't appear anyway unless
you use @code{%option stack}).

@node Options for Specifying Filenames, Options Affecting Scanner Behavior, Scanner Options, Scanner Options
@section Options for Specifying Filenames

@table @samp

@anchor{option-header}
@opindex ---header-file
@opindex header-file
@item --header-file=FILE, @code{%option header-file="FILE"}
instructs flex to write a C header to @file{FILE}. This file contains
function prototypes, extern variables, and types used by the scanner.
Only the external API is exported by the header file. Many macros that
are usable from within scanner actions are not exported to the header
file. This is due to namespace problems and the goal of a clean
external API.

While in the header, the macro @code{yyIN_HEADER} is defined, where @samp{yy}
is substituted with the appropriate prefix.

The @samp{--header-file} option is not compatible with the @samp{--c++} option,
since the C++ scanner provides its own header in @file{yyFlexLexer.h}.



@anchor{option-outfile}
@opindex -o
@opindex ---outfile
@opindex outfile
@item -oFILE, --outfile=FILE, @code{%option outfile="FILE"}
directs flex to write the scanner to the file @file{FILE} instead of
@file{lex.yy.c}. If you combine @samp{--outfile} with the @samp{--stdout} option,
then the scanner is written to @file{stdout} but its @code{#line}
directives (see the @samp{-L} option) refer to the file
@file{FILE}.



@anchor{option-stdout}
@opindex -t
@opindex ---stdout
@opindex stdout
@item -t, --stdout, @code{%option stdout}
instructs @code{flex} to write the scanner it generates to standard
output instead of @file{lex.yy.c}.



@opindex ---skel
@item -SFILE, --skel=FILE
overrides the default skeleton file from which
@code{flex}
constructs its scanners. You'll never need this option unless you are doing
@code{flex}
maintenance or development.

@opindex ---tables-file
@opindex tables-file
@item --tables-file=FILE
Write serialized scanner DFA tables to FILE. The generated scanner will not
contain the tables, and requires them to be loaded at runtime.
@xref{serialization}.

@opindex ---tables-verify
@opindex tables-verify
@item --tables-verify
This option is for flex development.
We document it here in case you stumble
upon it by accident or in case you suspect some inconsistency in the serialized
tables. Flex will serialize the scanner DFA tables but will also generate the
in-code tables as it normally does. At runtime, the scanner will verify that
the serialized tables match the in-code tables, instead of loading them.

@end table

@node Options Affecting Scanner Behavior, Code-Level And API Options, Options for Specifying Filenames, Scanner Options
@section Options Affecting Scanner Behavior

@table @samp
@anchor{option-case-insensitive}
@opindex -i
@opindex ---case-insensitive
@opindex case-insensitive
@item -i, --case-insensitive, @code{%option case-insensitive}
instructs @code{flex} to generate a @dfn{case-insensitive} scanner. The
case of letters given in the @code{flex} input patterns will be ignored,
and tokens in the input will be matched regardless of case. The matched
text given in @code{yytext} will have the preserved case (i.e., it will
not be folded). For tricky behavior, see @ref{case and character ranges}.



@anchor{option-lex-compat}
@opindex -l
@opindex ---lex-compat
@opindex lex-compat
@item -l, --lex-compat, @code{%option lex-compat}
turns on maximum compatibility with the original AT&T @code{lex}
implementation. Note that this does not mean @emph{full} compatibility.
Use of this option costs a considerable amount of performance, and it
cannot be used with the @samp{--c++}, @samp{--full}, @samp{--fast}, @samp{-Cf}, or
@samp{-CF} options. For details on the compatibilities it provides, see
@ref{Lex and Posix}. This option also results in the name
@code{YY_FLEX_LEX_COMPAT} being @code{#define}'d in the generated scanner.



@anchor{option-batch}
@opindex -B
@opindex ---batch
@opindex batch
@item -B, --batch, @code{%option batch}
instructs @code{flex} to generate a @dfn{batch} scanner, the opposite of
@emph{interactive} scanners generated by @samp{--interactive} (see below). In
general, you use @samp{-B} when you are @emph{certain} that your scanner
will never be used interactively, and you want to squeeze a
@emph{little} more performance out of it. If your goal is instead to
squeeze out a @emph{lot} more performance, you should be using the
@samp{-Cf} or @samp{-CF} options, which turn on @samp{--batch} automatically
anyway.



@anchor{option-interactive}
@opindex -I
@opindex ---interactive
@opindex interactive
@item -I, --interactive, @code{%option interactive}
instructs @code{flex} to generate an @i{interactive} scanner. An
interactive scanner is one that only looks ahead to decide what token
has been matched if it absolutely must. It turns out that always
looking one extra character ahead, even if the scanner has already seen
enough text to disambiguate the current token, is a bit faster than only
looking ahead when necessary. But scanners that always look ahead give
dreadful interactive performance; for example, when a user types a
newline, it is not recognized as a newline token until they enter
@emph{another} token, which often means typing in another whole line.

@code{flex} scanners default to @code{interactive} unless you use the
@samp{-Cf} or @samp{-CF} table-compression options
(@pxref{Performance}). That's because if you're looking for
high-performance you should be using one of these options, so if you
didn't, @code{flex} assumes you'd rather trade off a bit of run-time
performance for intuitive interactive behavior.
Note also that you
@emph{cannot} use @samp{--interactive} in conjunction with @samp{-Cf} or
@samp{-CF}. Thus, this option is not really needed; it is on by default
for all those cases in which it is allowed.

You can force a scanner to
@emph{not}
be interactive by using
@samp{--batch}.



@anchor{option-7bit}
@opindex -7
@opindex ---7bit
@opindex 7bit
@item -7, --7bit, @code{%option 7bit}
instructs @code{flex} to generate a 7-bit scanner, i.e., one which can
only recognize 7-bit characters in its input. The advantage of using
@samp{--7bit} is that the scanner's tables can be up to half the size of
those generated using the @samp{--8bit} option. The disadvantage is that such
scanners often hang or crash if their input contains an 8-bit character.

Note, however, that unless you generate your scanner using the
@samp{-Cf} or @samp{-CF} table compression options, use of @samp{--7bit}
will save only a small amount of table space, and make your scanner
considerably less portable. @code{Flex}'s default behavior is to
generate an 8-bit scanner unless you use the @samp{-Cf} or @samp{-CF}
options, in which case @code{flex} defaults to generating 7-bit scanners
unless your site was always configured to generate 8-bit scanners (as will
often be the case with non-USA sites). You can tell whether flex
generated a 7-bit or an 8-bit scanner by inspecting the flag summary in
the @samp{--verbose} output (@pxref{Debugging Options}).

Note that if you use @samp{-Cfe} or @samp{-CFe}, @code{flex} still
defaults to generating an 8-bit scanner, since usually with these
compression options full 8-bit tables are not much more expensive than
7-bit tables.



@anchor{option-8bit}
@opindex -8
@opindex ---8bit
@opindex 8bit
@item -8, --8bit, @code{%option 8bit}
instructs @code{flex} to generate an 8-bit scanner, i.e., one which can
recognize 8-bit characters. This flag is only needed for scanners
generated using @samp{-Cf} or @samp{-CF}, as otherwise flex defaults to
generating an 8-bit scanner anyway.

See the discussion of
@samp{--7bit}
above for @code{flex}'s default behavior and the tradeoffs between 7-bit
and 8-bit scanners.



@anchor{option-default}
@opindex ---default
@opindex default
@item --default, @code{%option default}
generates the default rule.



@anchor{option-always-interactive}
@opindex ---always-interactive
@opindex always-interactive
@item --always-interactive, @code{%option always-interactive}
instructs flex to generate a scanner which always considers its input
@emph{interactive}. Normally, on each new input file the scanner calls
@code{isatty()} in an attempt to determine whether the scanner's input
source is interactive and thus should be read a character at a time.
When this option is used, however, no such call is made.



@opindex ---never-interactive
@item --never-interactive, @code{%option never-interactive}
instructs flex to generate a scanner which never considers its input
interactive. This is the opposite of @code{always-interactive}.


@anchor{option-posix}
@opindex -X
@opindex ---posix
@opindex posix
@item -X, --posix, @code{%option posix}
turns on maximum compatibility with the POSIX 1003.2-1992 definition of
@code{lex}. Since @code{flex} was originally designed to implement the
POSIX definition of @code{lex} this generally involves very few changes
in behavior.
At the time of writing, the known differences between
@code{flex} and the POSIX standard are:

@itemize
@item
In POSIX and AT&T @code{lex}, the repeat operator, @samp{@{@}}, has lower
precedence than concatenation (thus @samp{ab@{3@}} yields @samp{ababab}).
Most POSIX utilities use an Extended Regular Expression (ERE) precedence
that has the precedence of the repeat operator higher than concatenation
(which causes @samp{ab@{3@}} to yield @samp{abbb}).  By default, @code{flex}
places the precedence of the repeat operator higher than concatenation,
which matches the ERE processing of other POSIX utilities.  When either
@samp{--posix} or @samp{-l} is specified, @code{flex} will use the
traditional AT&T and POSIX-compliant precedence for the repeat operator
where concatenation has higher precedence than the repeat operator.
@end itemize


@anchor{option-stack}
@opindex ---stack
@opindex stack
@item --stack, @code{%option stack}
enables the use of
start condition stacks (@pxref{Start Conditions}).



@anchor{option-stdinit}
@opindex ---stdinit
@opindex stdinit
@item --stdinit, @code{%option stdinit}
if set (i.e., @code{%option stdinit}), initializes @code{yyin} and
@code{yyout} to @file{stdin} and @file{stdout}, instead of the default of
@file{NULL}.  Some existing @code{lex} programs depend on this behavior,
even though it is not compliant with ANSI C, which does not require
@file{stdin} and @file{stdout} to be compile-time constant.  In a
reentrant scanner, however, this is not a problem since initialization
is performed in @code{yylex_init} at runtime.
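
For a non-reentrant scanner, one way to avoid depending on
@code{%option stdinit} is to assign @code{yyin} and @code{yyout} at run
time, before the first call to @code{yylex()}.  The following is only a
sketch: the rules and the @code{main()} function are hypothetical and
stand in for whatever the real scanner does.

@example
@verbatim
%option noyywrap
%%
[a-z]+    fprintf(yyout, "word: %s\n", yytext);
.|\n      /* discard everything else */
%%
int main(int argc, char **argv)
{
    /* Assign the streams at run time instead of relying on
       compile-time initialization via %option stdinit. */
    yyin = (argc > 1) ? fopen(argv[1], "r") : stdin;
    if (yyin == NULL) {
        perror(argv[1]);
        return 1;
    }
    yyout = stdout;
    return yylex();
}
@end verbatim
@end example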

@anchor{option-yylineno}
@opindex ---yylineno
@opindex yylineno
@item --yylineno, @code{%option yylineno}
directs @code{flex} to generate a scanner
that maintains the number of the current line read from its input in the
global variable @code{yylineno}.  This option is implied by
@code{%option lex-compat}.  In a reentrant C scanner, the macro
@code{yylineno} is accessible regardless of the value of
@code{%option yylineno}; however, its value is not modified by @code{flex}
unless @code{%option yylineno} is enabled.



@anchor{option-yywrap}
@opindex ---yywrap
@opindex yywrap
@item --yywrap, @code{%option yywrap}
if unset (i.e., @code{--noyywrap}), makes the scanner not call
@code{yywrap()} upon an end-of-file, but simply assume that there are no
more files to scan (until the user points @file{yyin} at a new file and
calls @code{yylex()} again).

@end table

@node Code-Level And API Options, Options for Scanner Speed and Size, Options Affecting Scanner Behavior, Scanner Options
@section Code-Level And API Options

@table @samp

@anchor{option-ansi-definitions}
@opindex ---option-ansi-definitions
@opindex ansi-definitions
@item --ansi-definitions, @code{%option ansi-definitions}
instructs flex to generate ANSI C99 definitions for functions.
This option is enabled by default.
If @code{%option noansi-definitions} is specified, then the obsolete style
is generated.

@anchor{option-ansi-prototypes}
@opindex ---option-ansi-prototypes
@opindex ansi-prototypes
@item --ansi-prototypes, @code{%option ansi-prototypes}
instructs flex to generate ANSI C99 prototypes for functions.
This option is enabled by default.
If @code{%option noansi-prototypes} is specified, then
prototypes will have empty parameter lists.

@anchor{option-bison-bridge}
@opindex ---bison-bridge
@opindex bison-bridge
@item --bison-bridge, @code{%option bison-bridge}
instructs flex to generate a C scanner that is meant to be called by a
@code{GNU bison} parser.  The scanner has minor API changes for
@code{bison} compatibility.  In particular, the declaration of
@code{yylex} is modified to take an additional parameter, @code{yylval}.
@xref{Bison Bridge}.

@anchor{option-bison-locations}
@opindex ---bison-locations
@opindex bison-locations
@item --bison-locations, @code{%option bison-locations}
instructs flex that
@code{GNU bison} @code{%locations} are being used.
This means @code{yylex} will be passed
an additional parameter, @code{yylloc}.  This option
implies @code{%option bison-bridge}.
@xref{Bison Bridge}.

@anchor{option-noline}
@opindex -L
@opindex ---noline
@opindex noline
@item -L, --noline, @code{%option noline}
instructs @code{flex} not to generate @code{#line} directives.  Without
this option, @code{flex} peppers the generated scanner with @code{#line}
directives so error messages in the actions will be correctly located
with respect to either the original @code{flex} input file (if the errors
are due to code in the input file), or @file{lex.yy.c} (if the errors are
@code{flex}'s fault; you should report these sorts of errors to the email
address given in @ref{Reporting Bugs}).



@anchor{option-reentrant}
@opindex -R
@opindex ---reentrant
@opindex reentrant
@item -R, --reentrant, @code{%option reentrant}
instructs flex to generate a reentrant C scanner.  The generated scanner
may safely be used in a multi-threaded environment.  The API for a
reentrant scanner is different from that of a non-reentrant scanner
(@pxref{Reentrant}).
Because of the API difference between
reentrant and non-reentrant @code{flex} scanners, non-reentrant flex
code must be modified before it is suitable for use with this option.
This option is not compatible with the @samp{--c++} option.

The option @samp{--reentrant} does not affect the performance of
the scanner.



@anchor{option-c++}
@opindex -+
@opindex ---c++
@opindex c++
@item -+, --c++, @code{%option c++}
specifies that you want flex to generate a C++
scanner class.  @xref{Cxx}, for
details.



@anchor{option-array}
@opindex ---array
@opindex array
@item --array, @code{%option array}
specifies that you want @code{yytext} to be an array instead of a
@code{char *}.



@anchor{option-pointer}
@opindex ---pointer
@opindex pointer
@item --pointer, @code{%option pointer}
specifies that @code{yytext} should be a @code{char *}, not an array.
The default is @code{char *}.



@anchor{option-prefix}
@opindex -P
@opindex ---prefix
@opindex prefix
@item -PPREFIX, --prefix=PREFIX, @code{%option prefix="PREFIX"}
changes the default @samp{yy} prefix used by @code{flex} for all
globally-visible variable and function names to instead be
@samp{PREFIX}.  For example, @samp{--prefix=foo} changes the name of
@code{yytext} to @code{footext}.  It also changes the name of the default
output file from @file{lex.yy.c} to @file{lex.foo.c}.
Here is a partial
list of the names affected:

@example
@verbatim
    yy_create_buffer
    yy_delete_buffer
    yy_flex_debug
    yy_init_buffer
    yy_flush_buffer
    yy_load_buffer_state
    yy_switch_to_buffer
    yyin
    yyleng
    yylex
    yylineno
    yyout
    yyrestart
    yytext
    yywrap
    yyalloc
    yyrealloc
    yyfree
@end verbatim
@end example

(If you are using a C++ scanner, then only @code{yywrap} and
@code{yyFlexLexer} are affected.)  Within your scanner itself, you can
still refer to the global variables and functions using either version
of their name; but externally, they have the modified name.

This option lets you easily link together multiple @code{flex}
programs into the same executable.  Note, though, that using this
option also renames @code{yywrap()}, so you now @emph{must} either
provide your own (appropriately-named) version of the routine for your
scanner, or use @code{%option noyywrap}, as linking with @samp{-lfl}
no longer provides one for you by default.



@anchor{option-main}
@opindex ---main
@opindex main
@item --main, @code{%option main}
directs flex to provide a default @code{main()} program for the
scanner, which simply calls @code{yylex()}.  This option implies
@code{noyywrap} (@pxref{option-yywrap}).



@anchor{option-nounistd}
@opindex ---nounistd
@opindex nounistd
@item --nounistd, @code{%option nounistd}
suppresses inclusion of the non-ANSI header file @file{unistd.h}.  This option
is meant to target environments in which @file{unistd.h} does not exist.  Be aware
that certain options may cause flex to generate code that relies on functions
normally found in @file{unistd.h} (e.g., @code{isatty()}, @code{read()}).
If you wish to use these functions, you will have to inform your compiler where
to find them.
@xref{option-always-interactive}. @xref{option-read}.



@anchor{option-yyclass}
@opindex ---yyclass
@opindex yyclass
@item --yyclass=NAME, @code{%option yyclass="NAME"}
only applies when generating a C++ scanner (the @samp{--c++} option).  It
informs @code{flex} that you have derived @code{NAME} as a subclass of
@code{yyFlexLexer}, so @code{flex} will place your actions in the member
function @code{NAME::yylex()} instead of @code{yyFlexLexer::yylex()}.  It
also generates a @code{yyFlexLexer::yylex()} member function that emits
a run-time error (by invoking @code{yyFlexLexer::LexerError()}) if
called.  @xref{Cxx}.

@end table

@node Options for Scanner Speed and Size, Debugging Options, Code-Level And API Options, Scanner Options
@section Options for Scanner Speed and Size

@table @samp

@item -C[aefFmr]
controls the degree of table compression and, more generally, trade-offs
between small scanners and fast scanners.

@table @samp
@opindex -C
@item -C
A lone @samp{-C} specifies that the scanner tables should be compressed
but neither equivalence classes nor meta-equivalence classes should be
used.

@anchor{option-align}
@opindex -Ca
@opindex ---align
@opindex align
@item -Ca, --align, @code{%option align}
(``align'') instructs flex to accept larger tables in the
generated scanner in exchange for faster performance, because the
elements of the tables are better aligned for memory access and
computation.  On some RISC architectures, fetching and manipulating
longwords is more efficient than with smaller-sized units such as
shortwords.  This option can quadruple the size of the tables used by
your scanner.

@anchor{option-ecs}
@opindex -Ce
@opindex ---ecs
@opindex ecs
@item -Ce, --ecs, @code{%option ecs}
directs @code{flex} to construct @dfn{equivalence classes}, i.e., sets
of characters which have identical lexical properties (for example, if
the only appearance of digits in the @code{flex} input is in the
character class ``[0-9]'' then the digits '0', '1', ..., '9' will all be
put in the same equivalence class).  Equivalence classes usually give
dramatic reductions in the final table/object file sizes (typically a
factor of 2-5) and are pretty cheap performance-wise (one array look-up
per character scanned).

@opindex -Cf
@item -Cf
specifies that the @dfn{full} scanner tables should be generated, i.e.,
@code{flex} should not compress the tables by taking advantage of
similar transition functions for different states.

@opindex -CF
@item -CF
specifies that the alternate fast scanner representation (described
under the @samp{--fast} flag, @pxref{option-fast}) should be used.  This
option cannot be used with @samp{--c++}.

@anchor{option-meta-ecs}
@opindex -Cm
@opindex ---meta-ecs
@opindex meta-ecs
@item -Cm, --meta-ecs, @code{%option meta-ecs}
directs @code{flex} to construct @dfn{meta-equivalence classes},
which are sets of equivalence classes (or characters, if equivalence
classes are not being used) that are commonly used together.  Meta-equivalence
classes are often a big win when using compressed tables, but they
have a moderate performance impact (one or two @code{if} tests and one
array look-up per character scanned).

@anchor{option-read}
@opindex -Cr
@opindex ---read
@opindex read
@item -Cr, --read, @code{%option read}
causes the generated scanner to @emph{bypass} use of the standard I/O
library (@code{stdio}) for input.
Instead of calling @code{fread()} or
@code{getc()}, the scanner will use the @code{read()} system call,
resulting in a performance gain which varies from system to system, but
in general is probably negligible unless you are also using @samp{-Cf}
or @samp{-CF}.  Using @samp{-Cr} can cause strange behavior if, for
example, you read from @file{yyin} using @code{stdio} prior to calling
the scanner (because the scanner will miss whatever text your previous
reads left in the @code{stdio} input buffer).  @samp{-Cr} has no effect
if you define @code{YY_INPUT()} (@pxref{Generated Scanner}).
@end table

The options @samp{-Cf} or @samp{-CF} and @samp{-Cm} do not make sense
together, since there is no opportunity for meta-equivalence classes if
the table is not being compressed.  Otherwise the options may be freely
mixed, and are cumulative.

The default setting is @samp{-Cem}, which specifies that @code{flex}
should generate equivalence classes and meta-equivalence classes.  This
setting provides the highest degree of table compression.  You can
obtain faster-executing scanners at the cost of larger tables, with the
following generally being true:

@example
@verbatim
    slowest & smallest
          -Cem
          -Cm
          -Ce
          -C
          -C{f,F}e
          -C{f,F}
          -C{f,F}a
    fastest & largest
@end verbatim
@end example

Note that scanners with the smallest tables are usually generated and
compiled the quickest, so during development you will usually want to
use the default, maximal compression.

@samp{-Cfe} is often a good compromise between speed and size for
production scanners.

@anchor{option-full}
@opindex -f
@opindex ---full
@opindex full
@item -f, --full, @code{%option full}
specifies a @dfn{fast scanner}.
No table compression is done and @code{stdio} is bypassed.
The result is large but fast.
This option is equivalent to
@samp{-Cfr}.


@anchor{option-fast}
@opindex -F
@opindex ---fast
@opindex fast
@item -F, --fast, @code{%option fast}
specifies that the @emph{fast} scanner table representation should be
used (and @code{stdio} bypassed).  This representation is about as fast
as the full table representation (@samp{--full}), and for some sets of
patterns will be considerably smaller (and for others, larger).  In
general, if the pattern set contains both @emph{keywords} and a
catch-all, @emph{identifier} rule, such as in the set:

@example
@verbatim
    "case"    return TOK_CASE;
    "switch"  return TOK_SWITCH;
    ...
    "default" return TOK_DEFAULT;
    [a-z]+    return TOK_ID;
@end verbatim
@end example

then you're better off using the full table representation.  If only
the @emph{identifier} rule is present and you then use a hash table or some such
to detect the keywords, you're better off using
@samp{--fast}.

This option is equivalent to @samp{-CFr}.  It cannot be used
with @samp{--c++}.

@end table

@node Debugging Options, Miscellaneous Options, Options for Scanner Speed and Size, Scanner Options
@section Debugging Options

@table @samp

@anchor{option-backup}
@opindex -b
@opindex ---backup
@opindex backup
@item -b, --backup, @code{%option backup}
generates backing-up information to @file{lex.backup}.  This is a list of
scanner states which require backing up and the input characters on
which they do so.  By adding rules one can remove backing-up states.  If
@emph{all} backing-up states are eliminated and @samp{-Cf} or @samp{-CF}
is used, the generated scanner will run faster (see the @samp{--perf-report} flag).
Only users who wish to squeeze every last cycle out of their scanners
need worry about this option (@pxref{Performance}).

@anchor{option-debug}
@opindex -d
@opindex ---debug
@opindex debug
@item -d, --debug, @code{%option debug}
makes the generated scanner run in @dfn{debug} mode.  Whenever a pattern
is recognized and the global variable @code{yy_flex_debug} is non-zero
(which is the default), the scanner will write to @file{stderr} a line
of the form:

@example
@verbatim
    --accepting rule at line 53 ("the matched text")
@end verbatim
@end example

The line number refers to the location of the rule in the file defining
the scanner (i.e., the file that was fed to flex).  Messages are also
generated when the scanner backs up, accepts the default rule, reaches
the end of its input buffer (or encounters a NUL; at this point, the two
look the same as far as the scanner's concerned), or reaches an
end-of-file.



@anchor{option-perf-report}
@opindex -p
@opindex ---perf-report
@opindex perf-report
@item -p, --perf-report, @code{%option perf-report}
generates a performance report to @file{stderr}.  The report consists of
comments regarding features of the @code{flex} input file which will
cause a serious loss of performance in the resulting scanner.  If you
give the flag twice, you will also get comments regarding features that
lead to minor performance losses.

Note that the use of @code{REJECT} and
variable trailing context (@pxref{Limitations}) entails a substantial
performance penalty; use of @code{yymore()}, the @samp{^} operator, and
the @samp{--interactive} flag entails minor performance penalties.



@anchor{option-nodefault}
@opindex -s
@opindex ---nodefault
@opindex nodefault
@item -s, --nodefault, @code{%option nodefault}
causes the @emph{default rule} (that unmatched scanner input is echoed
to @file{stdout}) to be suppressed.
If the scanner encounters input
that does not match any of its rules, it aborts with an error.  This
option is useful for finding holes in a scanner's rule set.



@anchor{option-trace}
@opindex -T
@opindex ---trace
@opindex trace
@item -T, --trace, @code{%option trace}
makes @code{flex} run in @dfn{trace} mode.  It will generate a lot of
messages to @file{stderr} concerning the form of the input and the
resultant non-deterministic and deterministic finite automata.  This
option is mostly for use in maintaining @code{flex}.



@anchor{option-nowarn}
@opindex -w
@opindex ---nowarn
@opindex nowarn
@item -w, --nowarn, @code{%option nowarn}
suppresses warning messages.



@anchor{option-verbose}
@opindex -v
@opindex ---verbose
@opindex verbose
@item -v, --verbose, @code{%option verbose}
specifies that @code{flex} should write to @file{stderr} a summary of
statistics regarding the scanner it generates.  Most of the statistics
are meaningless to the casual @code{flex} user, but the first line
identifies the version of @code{flex} (same as reported by @samp{--version}),
and the next line the flags used when generating the scanner, including
those that are on by default.



@anchor{option-warn}
@opindex ---warn
@opindex warn
@item --warn, @code{%option warn}
warns about certain things.  In particular, if the default rule can be
matched but no default rule has been given, @code{flex} will warn you.
We recommend using this option always.

@end table

@node Miscellaneous Options, , Debugging Options, Scanner Options
@section Miscellaneous Options

@table @samp
@opindex -c
@item -c
A do-nothing option included for POSIX compliance.

@opindex -h
@opindex ---help
@item -h, -?, --help
generates a ``help'' summary of @code{flex}'s options to @file{stdout}
and then exits.

@opindex -n
@item -n
Another do-nothing option included for
POSIX compliance.

@opindex -V
@opindex ---version
@item -V, --version
prints the version number to @file{stdout} and exits.

@end table


@node Performance, Cxx, Scanner Options, Top
@chapter Performance Considerations

@cindex performance, considerations
The main design goal of @code{flex} is that it generate high-performance
scanners.  It has been optimized for dealing well with large sets of
rules.  Aside from the effects on scanner speed of the table compression
@samp{-C} options outlined above, there are a number of options/actions
which degrade performance.  These are, from most expensive to least:

@cindex REJECT, performance costs
@cindex yylineno, performance costs
@cindex trailing context, performance costs
@example
@verbatim
    REJECT
    arbitrary trailing context

    pattern sets that require backing up
    %option yylineno
    %array

    %option interactive
    %option always-interactive

    ^ beginning-of-line operator
    yymore()
@end verbatim
@end example

with the first two being quite expensive and the last two being
quite cheap.  Note also that @code{unput()} is implemented as a routine
call that potentially does quite a bit of work, while @code{yyless()} is
a quite cheap macro.  So if you are just putting back some excess text
you scanned, use @code{yyless()}.

@code{REJECT} should be avoided at all costs when performance is
important.  It is a particularly expensive option.

There is one case when @code{%option yylineno} can be expensive.
That is when
your patterns match long tokens that could @emph{possibly} contain a newline
character.  There is no performance penalty for rules that cannot possibly
match newlines, since flex does not need to check them for newlines.  In
general, you should avoid rules such as @code{[^f]+}, which match very long
tokens, including newlines, and may possibly match your entire file!  A better
approach is to separate @code{[^f]+} into two rules:

@example
@verbatim
%option yylineno
%%
[^f\n]+
\n+
@end verbatim
@end example

The above scanner does not incur a performance penalty.

@cindex patterns, tuning for performance
@cindex performance, backing up
@cindex backing up, example of eliminating
Getting rid of backing up is messy and often may be an enormous amount
of work for a complicated scanner.  In principle, one begins by using
the @samp{-b} flag to generate a @file{lex.backup} file.  For example,
on the input:

@cindex backing up, eliminating
@example
@verbatim
    %%
    foo        return TOK_KEYWORD;
    foobar     return TOK_KEYWORD;
@end verbatim
@end example

the file looks like:

@example
@verbatim
    State #6 is non-accepting -
     associated rule line numbers:
           2       3
     out-transitions: [ o ]
     jam-transitions: EOF [ \001-n  p-\177 ]

    State #8 is non-accepting -
     associated rule line numbers:
           3
     out-transitions: [ a ]
     jam-transitions: EOF [ \001-`  b-\177 ]

    State #9 is non-accepting -
     associated rule line numbers:
           3
     out-transitions: [ r ]
     jam-transitions: EOF [ \001-q  s-\177 ]

    Compressed tables always back up.
@end verbatim
@end example

The first few lines tell us that there's a scanner state in which it can
make a transition on an 'o' but not on any other character, and that in
that state the currently scanned text does not match any rule.  The
state occurs when trying to match the rules found at lines 2 and 3 in
the input file.  If the scanner is in that state and then reads
something other than an 'o', it will have to back up to find a rule
which is matched.  With a bit of headscratching one can see that this
must be the state it's in when it has seen @samp{fo}.  When this has
happened, if anything other than another @samp{o} is seen, the scanner
will have to back up to simply match the @samp{f} (by the default rule).

The comment regarding State #8 indicates there's a problem when
@samp{foob} has been scanned.  Indeed, on any character other than an
@samp{a}, the scanner will have to back up to accept @samp{foo}.  Similarly,
the comment for State #9 concerns when @samp{fooba} has been scanned and
an @samp{r} does not follow.

The final comment reminds us that there's no point going to all the
trouble of removing backing up from the rules unless we're using
@samp{-Cf} or @samp{-CF}, since there's no performance gain doing so
with compressed scanners.

@cindex error rules, to eliminate backing up
The way to remove the backing up is to add ``error'' rules:

@cindex backing up, eliminating by adding error rules
@example
@verbatim
    %%
    foo         return TOK_KEYWORD;
    foobar      return TOK_KEYWORD;

    fooba       |
    foob        |
    fo          {
                /* false alarm, not really a keyword */
                return TOK_ID;
                }
@end verbatim
@end example

Eliminating backing up among a list of keywords can also be done using a
``catch-all'' rule:

@cindex backing up, eliminating with catch-all rule
@example
@verbatim
    %%
    foo         return TOK_KEYWORD;
    foobar      return TOK_KEYWORD;

    [a-z]+      return TOK_ID;
@end verbatim
@end example

This is usually the best solution when appropriate.

Backing up messages tend to cascade.  With a complicated set of rules
it's not uncommon to get hundreds of messages.  If one can decipher
them, though, it often only takes a dozen or so rules to eliminate the
backing up (though it's easy to make a mistake and have an error rule
accidentally match a valid token.  A possible future @code{flex} feature
will be to automatically add rules to eliminate backing up).

It's important to keep in mind that you gain the benefits of eliminating
backing up only if you eliminate @emph{every} instance of backing up.
Leaving just one means you gain nothing.

@emph{Variable} trailing context (where both the leading and trailing
parts do not have a fixed length) entails almost the same performance
loss as @code{REJECT} (i.e., substantial).
So when possible a rule
like:

@cindex trailing context, variable length
@example
@verbatim
    %%
    mouse|rat/(cat|dog)   run();
@end verbatim
@end example

is better written:

@example
@verbatim
    %%
    mouse/cat|dog         run();
    rat/cat|dog           run();
@end verbatim
@end example

or as

@example
@verbatim
    %%
    mouse|rat/cat         run();
    mouse|rat/dog         run();
@end verbatim
@end example

Note that here the special '|' action does @emph{not} provide any
savings, and can even make things worse (@pxref{Limitations}).

Another area where the user can increase a scanner's performance (and
one that's easier to implement) arises from the fact that the longer the
tokens matched, the faster the scanner will run.  This is because with
long tokens the processing of most input characters takes place in the
(short) inner scanning loop, and does not often have to go through the
additional work of setting up the scanning environment (e.g.,
@code{yytext}) for the action.
Recall the scanner for C comments:

@cindex performance optimization, matching longer tokens
@example
@verbatim
    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\n]*
    <comment>"*"+[^*/\n]*
    <comment>\n             ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);
@end verbatim
@end example

This could be sped up by writing it as:

@example
@verbatim
    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\n]*
    <comment>[^*\n]*\n      ++line_num;
    <comment>"*"+[^*/\n]*
    <comment>"*"+[^*/\n]*\n ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);
@end verbatim
@end example

Now instead of each newline requiring the processing of another action,
recognizing the newlines is distributed over the other rules to keep the
matched text as long as possible.  Note that @emph{adding} rules does
@emph{not} slow down the scanner!  The speed of the scanner is
independent of the number of rules or (modulo the considerations given
at the beginning of this section) how complicated the rules are with
regard to operators such as @samp{*} and @samp{|}.

@cindex keywords, for performance
@cindex performance, using keywords
A final example in speeding up a scanner: suppose you want to scan
through a file containing identifiers and keywords, one per line
and with no other extraneous characters, and recognize all the
keywords.  A natural first approach is:

@cindex performance optimization, recognizing keywords
@example
@verbatim
    %%
    asm      |
    auto     |
    break    |
    ... etc ...
    volatile |
    while    /* it's a keyword */

    .|\n     /* it's not a keyword */
@end verbatim
@end example

To eliminate the back-tracking, introduce a catch-all rule:

@example
@verbatim
    %%
    asm      |
    auto     |
    break    |
    ... etc ...
    volatile |
    while    /* it's a keyword */

    [a-z]+   |
    .|\n     /* it's not a keyword */
@end verbatim
@end example

Now, if it's guaranteed that there's exactly one word per line, then we
can reduce the total number of matches by half by merging in the
recognition of newlines with that of the other tokens:

@example
@verbatim
    %%
    asm\n      |
    auto\n     |
    break\n    |
    ... etc ...
    volatile\n |
    while\n    /* it's a keyword */

    [a-z]+\n   |
    .|\n       /* it's not a keyword */
@end verbatim
@end example

One has to be careful here, as we have now reintroduced backing up
into the scanner.  In particular, while @emph{we} know that there will
never be any characters in the input stream other than letters or
newlines, @code{flex} can't figure this out, and it will plan for
possibly needing to back up when it has scanned a token like
@samp{auto} and then the next character is something other than a
newline or a letter.  Previously it would then just match the
@samp{auto} rule and be done, but now it has no @samp{auto} rule, only
an @samp{auto\n} rule.  To eliminate the possibility of backing up, we
could either duplicate all rules but without final newlines, or, since
we never expect to encounter such an input and therefore don't care how
it's classified, we can introduce one more catch-all rule, this one
which doesn't include a newline:

@example
@verbatim
    %%
    asm\n      |
    auto\n     |
    break\n    |
    ... etc ...
    volatile\n |
    while\n    /* it's a keyword */

    [a-z]+\n   |
    [a-z]+     |
    .|\n       /* it's not a keyword */
@end verbatim
@end example

Compiled with @samp{-Cf}, this is about as fast as one can get a
@code{flex} scanner to go for this particular problem.
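
For experimentation, the final rule set can be wrapped into a complete
input file.  The following is only a sketch under stated assumptions:
the keyword list is cut down to four entries, and the two counters and
the @code{main()} function are hypothetical additions that simply make
the result observable.

@example
@verbatim
%option noyywrap
%{
/* Hypothetical counters, not part of the example above. */
static long keywords = 0, others = 0;
%}
%%
asm\n      |
auto\n     |
break\n    |
while\n    ++keywords;

[a-z]+\n   |
[a-z]+     |
.|\n       ++others;
%%
int main(void)
{
    yylex();
    printf("%ld keyword(s), %ld other token(s)\n", keywords, others);
    return 0;
}
@end verbatim
@end example

Generating this file with @samp{-Cf}, and again with the default
compression, should give a rough feel for the difference on a large
one-word-per-line input.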

A final note: @code{flex} is slow when matching @code{NUL}s,
particularly when a token contains multiple @code{NUL}s. It's best to
write rules which match @emph{short} amounts of text if it's anticipated
that the text will often include @code{NUL}s.

Another final note regarding performance: as mentioned in
@ref{Matching}, dynamically resizing @code{yytext} to accommodate huge
tokens is a slow process because it presently requires that the (huge)
token be rescanned from the beginning. Thus if performance is vital,
you should attempt to match ``large'' quantities of text but not
``huge'' quantities, where the cutoff between the two is at about 8K
characters per token.

@node Cxx, Reentrant, Performance, Top
@chapter Generating C++ Scanners

@cindex c++, experimental form of scanner class
@cindex experimental form of c++ scanner class
@strong{IMPORTANT}: the present form of the scanning class is @emph{experimental}
and may change considerably between major releases.

@cindex C++
@cindex member functions, C++
@cindex methods, c++
@code{flex} provides two different ways to generate scanners for use
with C++. The first way is to simply compile a scanner generated by
@code{flex} using a C++ compiler instead of a C compiler. You should
not encounter any compilation errors (@pxref{Reporting Bugs}). You can
then use C++ code in your rule actions instead of C code. Note that the
default input source for your scanner remains @file{yyin}, and default
echoing is still done to @file{yyout}. Both of these remain
@code{FILE *} variables and not C++ @emph{streams}.

You can also use @code{flex} to generate a C++ scanner class, using the
@samp{-+} option (or, equivalently, @code{%option c++}), which is
automatically specified if the name of the @code{flex} executable ends
in a @samp{+}, such as @code{flex++}.
When using this option, @code{flex}
defaults to generating the scanner to the file @file{lex.yy.cc} instead
of @file{lex.yy.c}. The generated scanner includes the header file
@file{FlexLexer.h}, which defines the interface to two C++ classes.

The first class,
@code{FlexLexer},
provides an abstract base class defining the general scanner class
interface. It provides the following member functions:

@table @code
@findex YYText (C++ only)
@item const char* YYText()
returns the text of the most recently matched token, the equivalent of
@code{yytext}.

@findex YYLeng (C++ only)
@item int YYLeng()
returns the length of the most recently matched token, the equivalent of
@code{yyleng}.

@findex lineno (C++ only)
@item int lineno() const
returns the current input line number (see @code{%option yylineno}), or
@code{1} if @code{%option yylineno} was not used.

@findex set_debug (C++ only)
@item void set_debug( int flag )
sets the debugging flag for the scanner, equivalent to assigning to
@code{yy_flex_debug} (@pxref{Scanner Options}). Note that you must build
the scanner using @code{%option debug} to include debugging information
in it.

@findex debug (C++ only)
@item int debug() const
returns the current setting of the debugging flag.
@end table

Also provided are member functions equivalent to
@code{yy_switch_to_buffer()}, @code{yy_create_buffer()} (though the
first argument is an @code{istream*} object pointer and not a
@code{FILE*}), @code{yy_flush_buffer()}, @code{yy_delete_buffer()}, and
@code{yyrestart()} (again, the first argument is an @code{istream*}
object pointer).

@tindex yyFlexLexer (C++ only)
@tindex FlexLexer (C++ only)
The second class defined in @file{FlexLexer.h} is @code{yyFlexLexer},
which is derived from @code{FlexLexer}.
It defines the following
additional member functions:

@table @code
@findex yyFlexLexer constructor (C++ only)
@item yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )
constructs a @code{yyFlexLexer} object using the given streams for input
and output. If not specified, the streams default to @code{cin} and
@code{cout}, respectively.

@findex yylex (C++ version)
@item virtual int yylex()
performs the same role as @code{yylex()} does for ordinary @code{flex}
scanners: it scans the input stream, consuming tokens, until a rule's
action returns a value. If you derive a subclass @code{S} from
@code{yyFlexLexer} and want to access the member functions and variables
of @code{S} inside @code{yylex()}, then you need to use
@code{%option yyclass="S"} to inform @code{flex} that you will be using
that subclass instead of @code{yyFlexLexer}. In this case, rather than
generating @code{yyFlexLexer::yylex()}, @code{flex} generates
@code{S::yylex()} (and also generates a dummy
@code{yyFlexLexer::yylex()} that calls
@code{yyFlexLexer::LexerError()} if called).

@findex switch_streams (C++ only)
@item virtual void switch_streams(istream* new_in = 0, ostream* new_out = 0)
reassigns @code{yyin} to @code{new_in} (if non-null) and @code{yyout} to
@code{new_out} (if non-null), deleting the previous input buffer if
@code{yyin} is reassigned.

@item int yylex( istream* new_in, ostream* new_out = 0 )
first switches the input and output streams via
@code{switch_streams( new_in, new_out )} and then returns the value of
@code{yylex()}.
@end table

In addition, @code{yyFlexLexer} defines the following protected virtual
functions which you can redefine in derived classes to tailor the
scanner:

@table @code
@findex LexerInput (C++ only)
@item virtual int LexerInput( char* buf, int max_size )
reads up to @code{max_size} characters into @code{buf} and returns the
number of characters read. To indicate end-of-input, return 0
characters. Note that @code{interactive} scanners (see the @samp{-B}
and @samp{-I} flags in @ref{Scanner Options}) define the macro
@code{YY_INTERACTIVE}. If you redefine @code{LexerInput()} and need to
take different actions depending on whether or not the scanner might be
scanning an interactive input source, you can test for the presence of
this name via @code{#ifdef} statements.

@findex LexerOutput (C++ only)
@item virtual void LexerOutput( const char* buf, int size )
writes out @code{size} characters from the buffer @code{buf}, which, while
@code{NUL}-terminated, may also contain internal @code{NUL}s if the
scanner's rules can match text with @code{NUL}s in them.

@cindex error reporting, in C++
@findex LexerError (C++ only)
@item virtual void LexerError( const char* msg )
reports a fatal error message. The default version of this function
writes the message to the stream @code{cerr} and exits.
@end table

Note that a @code{yyFlexLexer} object contains its @emph{entire}
scanning state. Thus you can use such objects to create reentrant
scanners, but see also @ref{Reentrant}. You can instantiate multiple
instances of the same @code{yyFlexLexer} class, and you can also combine
multiple C++ scanner classes together in the same program using the
@samp{-P} option discussed above.

Finally, note that the @code{%array} feature is not available to C++
scanner classes; you must use @code{%pointer} (the default).
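
As a minimal sketch of the @code{yyFlexLexer} constructor described
above (an illustration, not a definitive pattern), the following driver
hands the scanner a file stream instead of the default @code{cin}. It
is meant for the user code section of a scanner generated with
@samp{-+}, where @file{FlexLexer.h} has already been included; the
usage message and file handling are assumptions made for the example.

@example
@verbatim
    // Sketch only: a driver for the user code section (after the
    // second %%) of a scanner generated with -+.
    #include <fstream>
    #include <iostream>

    int main( int argc, char** argv )
        {
        if( argc < 2 )
            {
            std::cerr << "usage: scanner FILE\n";
            return 1;
            }

        std::ifstream in( argv[1] );
        if( !in )
            {
            std::cerr << "cannot open " << argv[1] << '\n';
            return 1;
            }

        // Input comes from the file; output still defaults to cout.
        yyFlexLexer lexer( &in );
        while( lexer.yylex() != 0 )
            ;
        return 0;
        }
@end verbatim
@end example

The lexer object lives on the stack here, so its scanning state is
released when @code{main} returns.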

Here is an example of a simple C++ scanner:

@cindex C++ scanners, use of
@example
@verbatim
        // An example of using the flex C++ scanner class.

    %{
    int mylineno = 0;
    %}

    string  \"[^\n"]+\"

    ws      [ \t]+

    alpha   [A-Za-z]
    dig     [0-9]
    name    ({alpha}|{dig}|\$)({alpha}|{dig}|[_.\-/$])*
    num1    [-+]?{dig}+\.?([eE][-+]?{dig}+)?
    num2    [-+]?{dig}*\.{dig}+([eE][-+]?{dig}+)?
    number  {num1}|{num2}

    %%

    {ws}    /* skip blanks and tabs */

    "/*"    {
            int c;

            while((c = yyinput()) != 0)
                {
                if(c == '\n')
                    ++mylineno;

                else if(c == '*')
                    {
                    if((c = yyinput()) == '/')
                        break;
                    else
                        unput(c);
                    }
                }
            }

    {number}  cout << "number " << YYText() << '\n';

    \n        mylineno++;

    {name}    cout << "name " << YYText() << '\n';

    {string}  cout << "string " << YYText() << '\n';

    %%

    int main( int /* argc */, char** /* argv */ )
        {
        FlexLexer* lexer = new yyFlexLexer;
        while(lexer->yylex() != 0)
            ;
        return 0;
        }
@end verbatim
@end example

@cindex C++, multiple different scanners
If you want to create multiple (different) lexer classes, you use the
@samp{-P} flag (or the @code{prefix=} option) to rename each
@code{yyFlexLexer} to some other @samp{xxFlexLexer}.
You can then
include @file{<FlexLexer.h>} in your other sources once per lexer class,
first renaming @code{yyFlexLexer} as follows:

@cindex include files, with C++
@cindex header files, with C++
@cindex C++ scanners, including multiple scanners
@example
@verbatim
    #undef yyFlexLexer
    #define yyFlexLexer xxFlexLexer
    #include <FlexLexer.h>

    #undef yyFlexLexer
    #define yyFlexLexer zzFlexLexer
    #include <FlexLexer.h>
@end verbatim
@end example

if, for example, you used @code{%option prefix="xx"} for one of your
scanners and @code{%option prefix="zz"} for the other.

@node Reentrant, Lex and Posix, Cxx, Top
@chapter Reentrant C Scanners

@cindex reentrant, explanation
@code{flex} has the ability to generate a reentrant C scanner. This is
accomplished by specifying @code{%option reentrant} (@samp{-R}). The generated
scanner is both portable and safe to use in one or more separate threads of
control. The most common use for reentrant scanners is from within
multi-threaded applications. Any thread may create and execute a reentrant
@code{flex} scanner without the need for synchronization with other threads.

@menu
* Reentrant Uses::
* Reentrant Overview::
* Reentrant Example::
* Reentrant Detail::
* Reentrant Functions::
@end menu

@node Reentrant Uses, Reentrant Overview, Reentrant, Reentrant
@section Uses for Reentrant Scanners

There are other uses for a reentrant scanner besides multi-threading.
For example, you could scan two or more files simultaneously to
implement a @code{diff} at the token level (i.e., instead of at the
character level):

@cindex reentrant scanners, multiple interleaved scanners
@example
@verbatim
    /* Example of maintaining more than one active scanner. */

    int tok1, tok2;

    do {
        tok1 = yylex( scanner_1 );
        tok2 = yylex( scanner_2 );

        if( tok1 != tok2 )
            printf("Files are different.");

    } while ( tok1 && tok2 );
@end verbatim
@end example

Another use for a reentrant scanner is recursion.
(Note that a recursive scanner can also be created using a non-reentrant scanner and
buffer states. @xref{Multiple Input Buffers}.)

The following crude scanner supports the @samp{eval} command by invoking
another instance of itself.

@cindex reentrant scanners, recursive invocation
@example
@verbatim
    /* Example of recursive invocation. */

    %option reentrant

    %%
    "eval(".+")"  {
                      yyscan_t scanner;
                      YY_BUFFER_STATE buf;

                      yylex_init( &scanner );
                      yytext[yyleng-1] = ' ';

                      buf = yy_scan_string( yytext + 5, scanner );
                      yylex( scanner );

                      yy_delete_buffer(buf,scanner);
                      yylex_destroy( scanner );
                 }
    ...
    %%
@end verbatim
@end example

@node Reentrant Overview, Reentrant Example, Reentrant Uses, Reentrant
@section An Overview of the Reentrant API

@cindex reentrant, API explanation
The API for reentrant scanners is different from that for non-reentrant
scanners. Here is a quick overview of the API:

@itemize
@item
@code{%option reentrant} must be specified.

@item
All functions take one additional argument: @code{yyscanner}

@item
All global variables are replaced by their macro equivalents.
(We tell you this because it may be important to you during debugging.)

@item
@code{yylex_init} and @code{yylex_destroy} must be called before and
after @code{yylex}, respectively.

@item
Accessor methods (get/set functions) provide access to common
@code{flex} variables.

@item
User-specific data can be stored in @code{yyextra}.
@end itemize

@node Reentrant Example, Reentrant Detail, Reentrant Overview, Reentrant
@section Reentrant Example

First, an example of a reentrant scanner:
@cindex reentrant, example of
@example
@verbatim
    /* This scanner prints "//" comments. */

    %option reentrant stack noyywrap
    %x COMMENT

    %%

    "//"                 yy_push_state( COMMENT, yyscanner);
    .|\n

    <COMMENT>\n          yy_pop_state( yyscanner );
    <COMMENT>[^\n]+      fprintf( yyout, "%s\n", yytext);

    %%

    int main ( int argc, char * argv[] )
    {
        yyscan_t scanner;

        yylex_init ( &scanner );
        yylex ( scanner );
        yylex_destroy ( scanner );
        return 0;
    }
@end verbatim
@end example

@node Reentrant Detail, Reentrant Functions, Reentrant Example, Reentrant
@section The Reentrant API in Detail

Here are the things you need to do or know to use the reentrant C API of
@code{flex}.

@menu
* Specify Reentrant::
* Extra Reentrant Argument::
* Global Replacement::
* Init and Destroy Functions::
* Accessor Methods::
* Extra Data::
* About yyscan_t::
@end menu

@node Specify Reentrant, Extra Reentrant Argument, Reentrant Detail, Reentrant Detail
@subsection Declaring a Scanner As Reentrant

@code{%option reentrant} (@samp{--reentrant}) must be specified.

Notice that @code{%option reentrant} is specified in the above example
(@pxref{Reentrant Example}). Had this option not been specified,
@code{flex} would have happily generated a non-reentrant scanner without
complaining. You may explicitly specify @code{%option noreentrant}, if
you do @emph{not} want a reentrant scanner, although it is not
necessary. The default is to generate a non-reentrant scanner.

@node Extra Reentrant Argument, Global Replacement, Specify Reentrant, Reentrant Detail
@subsection The Extra Argument

@cindex reentrant, calling functions
@vindex yyscanner (reentrant only)
All functions take one additional argument: @code{yyscanner}.

Notice that the calls to @code{yy_push_state} and @code{yy_pop_state}
both have an argument, @code{yyscanner}, that is not present in a
non-reentrant scanner. Here are the declarations of
@code{yy_push_state} and @code{yy_pop_state} in the reentrant scanner:

@example
@verbatim
    static void yy_push_state  ( int new_state , yyscan_t yyscanner ) ;
    static void yy_pop_state  ( yyscan_t yyscanner  ) ;
@end verbatim
@end example

Notice that the argument @code{yyscanner} appears in the declaration of
both functions. In fact, all @code{flex} functions in a reentrant
scanner have this additional argument. It is always the last argument
in the argument list, it is always of type @code{yyscan_t} (which is
typedef'd to @code{void *}) and it is
always named @code{yyscanner}. As you may have guessed,
@code{yyscanner} is a pointer to an opaque data structure encapsulating
the current state of the scanner. For a list of function declarations,
see @ref{Reentrant Functions}. Note that preprocessor macros, such as
@code{BEGIN}, @code{ECHO}, and @code{REJECT}, do not take this
additional argument.

@node Global Replacement, Init and Destroy Functions, Extra Reentrant Argument, Reentrant Detail
@subsection Global Variables Replaced By Macros

@cindex reentrant, accessing flex variables
All global variables in traditional flex have been replaced by macro equivalents.

Note that in the above example, @code{yyout} and @code{yytext} are
not plain variables. These are macros that will expand to their equivalent lvalue.
All of the familiar @code{flex} globals have been replaced by their macro
equivalents. In particular, @code{yytext}, @code{yyleng}, @code{yylineno},
@code{yyin}, @code{yyout}, @code{yyextra}, @code{yylval}, and @code{yylloc}
are macros. You may safely use these macros in actions as if they were plain
variables. We only tell you this so you don't expect to link to these variables
externally. Currently, each macro expands to a member of an internal struct, e.g.,

@example
@verbatim
#define yytext (((struct yyguts_t*)yyscanner)->yytext_r)
@end verbatim
@end example

One important thing to remember about
@code{yytext}
and friends is that, because
@code{yytext}
is not a global variable in a reentrant
scanner, you cannot access it directly from outside an action or from
other functions. You must use an accessor method, e.g.,
@code{yyget_text},
to accomplish this (@pxref{Accessor Methods}).

@node Init and Destroy Functions, Accessor Methods, Global Replacement, Reentrant Detail
@subsection Init and Destroy Functions

@cindex memory, considerations for reentrant scanners
@cindex reentrant, initialization
@findex yylex_init
@findex yylex_destroy

@code{yylex_init} and @code{yylex_destroy} must be called before and
after @code{yylex}, respectively.

@example
@verbatim
    int yylex_init ( yyscan_t * ptr_yy_globals ) ;
    int yylex_init_extra ( YY_EXTRA_TYPE user_defined, yyscan_t * ptr_yy_globals ) ;
    int yylex ( yyscan_t yyscanner ) ;
    int yylex_destroy ( yyscan_t yyscanner ) ;
@end verbatim
@end example

The function @code{yylex_init} must be called before calling any other
function. The argument to @code{yylex_init} is the address of an
uninitialized pointer to be filled in by @code{yylex_init}, overwriting
any previous contents.
The function @code{yylex_init_extra} may be used
instead, taking as its first argument a variable of type @code{YY_EXTRA_TYPE}.
@xref{Extra Data}, for more details.

The value stored in @code{ptr_yy_globals} should
thereafter be passed to @code{yylex} and @code{yylex_destroy}. Flex
does not save the argument passed to @code{yylex_init}, so it is safe to
pass the address of a local pointer to @code{yylex_init} so long as it remains
in scope for the duration of all calls to the scanner, up to and including
the call to @code{yylex_destroy}.

The function
@code{yylex} should be familiar to you by now. The reentrant version
takes one argument, which is the value returned (via an argument) by
@code{yylex_init}. Otherwise, it behaves the same as the non-reentrant
version of @code{yylex}.

Both @code{yylex_init} and @code{yylex_init_extra} return 0 (zero) on success,
or non-zero on failure, in which case @code{errno} is set to one of the
following values:

@itemize
@item ENOMEM
Memory allocation error. @xref{memory-management}.
@item EINVAL
Invalid argument.
@end itemize

The function @code{yylex_destroy} should be
called to free resources used by the scanner. After @code{yylex_destroy}
is called, the contents of @code{yyscanner} should not be used. Of
course, there is no need to destroy a scanner if you plan to reuse it.
A @code{flex} scanner (both reentrant and non-reentrant) may be
restarted by calling @code{yyrestart}.
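
The return-value convention described above can be sketched as a small,
complete @code{flex} input. This is an illustration under stated
assumptions rather than a recommended layout: the single rule and the
token code @code{1} are hypothetical and exist only to give
@code{yylex} something to return.

@example
@verbatim
    /* Sketch: report a failed yylex_init() using errno. */
    %option reentrant noyywrap
    %{
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    %}
    %%
    [a-z]+    return 1;   /* hypothetical token code */
    .|\n      /* ignore everything else */
    %%
    int main( void )
        {
        yyscan_t scanner;
        int words = 0;

        /* Non-zero means failure; errno is ENOMEM or EINVAL. */
        if( yylex_init( &scanner ) != 0 )
            {
            fprintf( stderr, "yylex_init: %s\n", strerror( errno ) );
            return 1;
            }

        while( yylex( scanner ) > 0 )
            ++words;

        printf( "%d words\n", words );
        yylex_destroy( scanner );
        return 0;
        }
@end verbatim
@end example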

Below is an example of a program that creates a scanner, uses it, then destroys
it when done:

@example
@verbatim
    int main ()
    {
        yyscan_t scanner;
        int tok;

        yylex_init(&scanner);

        while ((tok=yylex(scanner)) > 0)
            printf("tok=%d  yytext=%s\n", tok, yyget_text(scanner));

        yylex_destroy(scanner);
        return 0;
    }
@end verbatim
@end example

@node Accessor Methods, Extra Data, Init and Destroy Functions, Reentrant Detail
@subsection Accessing Variables with Reentrant Scanners

@cindex reentrant, accessor functions
Accessor methods (get/set functions) provide access to common
@code{flex} variables.

Many scanners that you build will be part of a larger project. Portions
of your project will need access to @code{flex} values, such as
@code{yytext}. In a non-reentrant scanner, these values are global, so
there is no problem accessing them. However, in a reentrant scanner, there are no
global @code{flex} values. You cannot access them directly. Instead,
you must access @code{flex} values using accessor methods (get/set
functions). Each accessor method is named @code{yyget_NAME} or
@code{yyset_NAME}, where @code{NAME} is the name of the @code{flex}
variable you want. For example:

@cindex accessor functions, use of
@example
@verbatim
    /* Set the last character of yytext to NUL. */
    void chop ( yyscan_t scanner )
    {
        int len = yyget_leng( scanner );
        yyget_text( scanner )[len - 1] = '\0';
    }
@end verbatim
@end example

The above code may be called from within an action like this:

@example
@verbatim
    %%
    .+\n    { chop( yyscanner );}
@end verbatim
@end example

You may find that @code{%option header-file} is particularly useful for generating
prototypes of all the accessor functions. @xref{option-header}.
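
As a further sketch of the accessor style, the fragment below drives a
reentrant scanner entirely through get/set functions from ordinary C
code. It assumes it is placed in the user code section of a reentrant
scanner (or compiled against a header produced with
@code{%option header-file}) and that @file{stdio.h} has been included;
the function name @code{scan_named_file} is hypothetical. The reported
line number is only meaningful if @code{%option yylineno} is in effect.

@example
@verbatim
    /* Sketch: scan_named_file is a hypothetical helper. */
    int scan_named_file( const char *path )
        {
        yyscan_t scanner;
        int tok;
        FILE *fp = fopen( path, "r" );

        if( fp == NULL )
            return -1;

        if( yylex_init( &scanner ) != 0 )
            {
            fclose( fp );
            return -1;
            }

        /* Equivalent to assigning yyin in a non-reentrant scanner. */
        yyset_in( fp, scanner );

        while( ( tok = yylex( scanner ) ) > 0 )
            printf( "line %d: token %d, text \"%s\" (%d chars)\n",
                    yyget_lineno( scanner ), tok,
                    yyget_text( scanner ), yyget_leng( scanner ) );

        yylex_destroy( scanner );
        fclose( fp );
        return 0;
        }
@end verbatim
@end example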

@node Extra Data, About yyscan_t, Accessor Methods, Reentrant Detail
@subsection Extra Data

@cindex reentrant, extra data
@vindex yyextra
User-specific data can be stored in @code{yyextra}.

In a reentrant scanner, it is unwise to use global variables to
communicate with or maintain state between different pieces of your program.
However, you may need access to external data or invoke external functions
from within the scanner actions.
Likewise, you may need to pass information to your scanner
(e.g., open file descriptors, or database connections).
In a non-reentrant scanner, the only way to do this would be through the
use of global variables.
@code{flex} allows you to store arbitrary, ``extra'' data in a scanner.
This data is accessible through the accessor methods
@code{yyget_extra} and @code{yyset_extra}
from outside the scanner, and through the shortcut macro
@code{yyextra}
from within the scanner itself. They are defined as follows:

@tindex YY_EXTRA_TYPE (reentrant only)
@findex yyget_extra
@findex yyset_extra
@example
@verbatim
    #define YY_EXTRA_TYPE  void*
    YY_EXTRA_TYPE  yyget_extra ( yyscan_t scanner );
    void           yyset_extra ( YY_EXTRA_TYPE arbitrary_data , yyscan_t scanner);
@end verbatim
@end example

In addition, an extra form of @code{yylex_init} is provided,
@code{yylex_init_extra}. This function is provided so that the
@code{yyextra} value can be accessed from within the very first
@code{yyalloc}, used to allocate the scanner itself.

By default, @code{YY_EXTRA_TYPE} is defined as type @code{void *}. You
may redefine this type using @code{%option extra-type="your_type"} in
the scanner:

@cindex YY_EXTRA_TYPE, defining your own type
@example
@verbatim
    /* An example of overriding YY_EXTRA_TYPE. */
    %{
    #include <sys/stat.h>
    #include <unistd.h>
    %}
    %option reentrant
    %option extra-type="struct stat *"
    %%

    __filesize__     printf( "%ld", (long) yyextra->st_size  );
    __lastmod__      printf( "%ld", (long) yyextra->st_mtime );
    %%
    void scan_file( char* filename )
    {
        yyscan_t scanner;
        struct stat buf;
        FILE *in;

        in = fopen( filename, "r" );
        stat( filename, &buf );

        /* yyextra is a struct stat *, so pass the address of buf. */
        yylex_init_extra( &buf, &scanner );
        yyset_in( in, scanner );
        yylex( scanner );
        yylex_destroy( scanner );

        fclose( in );
    }
@end verbatim
@end example

@node About yyscan_t, , Extra Data, Reentrant Detail
@subsection About yyscan_t

@tindex yyscan_t (reentrant only)
@code{yyscan_t} is defined as:

@example
@verbatim
     typedef void* yyscan_t;
@end verbatim
@end example

It is initialized by @code{yylex_init()} to point to
an internal structure. You should never access this value
directly. In particular, you should never attempt to free it
(use @code{yylex_destroy()} instead).
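
Because each @code{yyscan_t} value refers to its own internal structure,
two scanners can be active at once without sharing any state. The
following fragment is a rough sketch of that idea, intended for the
user code section of a reentrant scanner whose rules return token
codes. The function name @code{compare_strings} is hypothetical, and
the comparison looks only at token codes, not at the matched text.

@example
@verbatim
    /* Sketch: two independent scanner instances, each with its
       own buffer.  compare_strings is a hypothetical helper. */
    void compare_strings( const char *s1, const char *s2 )
        {
        yyscan_t a, b;
        YY_BUFFER_STATE buf_a, buf_b;
        int tok_a, tok_b;

        yylex_init( &a );
        yylex_init( &b );
        buf_a = yy_scan_string( s1, a );
        buf_b = yy_scan_string( s2, b );

        /* Stop at the first mismatch or when either input ends. */
        do  {
            tok_a = yylex( a );
            tok_b = yylex( b );
            } while( tok_a == tok_b && tok_a > 0 );

        printf( "%s\n", tok_a == tok_b ? "same" : "different" );

        yy_delete_buffer( buf_a, a );
        yy_delete_buffer( buf_b, b );
        yylex_destroy( a );
        yylex_destroy( b );
        }
@end verbatim
@end example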

@node Reentrant Functions, , Reentrant Detail, Reentrant
@section Functions and Macros Available in Reentrant C Scanners

The following functions are available in a reentrant scanner:

@findex yyget_text
@findex yyget_leng
@findex yyget_in
@findex yyget_out
@findex yyget_lineno
@findex yyset_in
@findex yyset_out
@findex yyset_lineno
@findex yyget_debug
@findex yyset_debug
@findex yyget_extra
@findex yyset_extra

@example
@verbatim
    char *yyget_text ( yyscan_t scanner );
    int yyget_leng ( yyscan_t scanner );
    FILE *yyget_in ( yyscan_t scanner );
    FILE *yyget_out ( yyscan_t scanner );
    int yyget_lineno ( yyscan_t scanner );
    YY_EXTRA_TYPE yyget_extra ( yyscan_t scanner );
    int  yyget_debug ( yyscan_t scanner );

    void yyset_debug ( int flag, yyscan_t scanner );
    void yyset_in  ( FILE * in_str , yyscan_t scanner );
    void yyset_out  ( FILE * out_str , yyscan_t scanner );
    void yyset_lineno ( int line_number , yyscan_t scanner );
    void yyset_extra ( YY_EXTRA_TYPE user_defined , yyscan_t scanner );
@end verbatim
@end example

There are no ``set'' functions for @code{yytext} and @code{yyleng}.
This is intentional.

The following macro shortcuts are available in actions in a reentrant
scanner:

@example
@verbatim
    yytext
    yyleng
    yyin
    yyout
    yylineno
    yyextra
    yy_flex_debug
@end verbatim
@end example

@cindex yylineno, in a reentrant scanner
In a reentrant C scanner, support for @code{yylineno} is always present
(i.e., you may access @code{yylineno}), but the value is never modified by
@code{flex} unless @code{%option yylineno} is enabled. This is to allow
the user to maintain the line count independently of @code{flex}.

@anchor{bison-functions}
The following functions and macros are made available when
@code{%option bison-bridge} (@samp{--bison-bridge}) is specified:

@example
@verbatim
    YYSTYPE * yyget_lval ( yyscan_t scanner );
    void yyset_lval ( YYSTYPE * yylvalp , yyscan_t scanner );
    yylval
@end verbatim
@end example

The following functions and macros are made available
when @code{%option bison-locations} (@samp{--bison-locations}) is specified:

@example
@verbatim
    YYLTYPE *yyget_lloc ( yyscan_t scanner );
    void yyset_lloc ( YYLTYPE * yyllocp , yyscan_t scanner );
    yylloc
@end verbatim
@end example

Support for @code{yylval} assumes that @code{YYSTYPE} is a valid type.
Support for @code{yylloc} assumes that @code{YYLTYPE} is a valid type.
Typically, these types are generated by @code{bison}, and are included
in section 1 of the @code{flex} input.

@node Lex and Posix, Memory Management, Reentrant, Top
@chapter Incompatibilities with Lex and Posix

@cindex POSIX and lex
@cindex lex (traditional) and POSIX

@code{flex} is a rewrite of the AT&T Unix @emph{lex} tool (the two
implementations do not share any code, though), with some extensions and
incompatibilities, both of which are of concern to those who wish to
write scanners acceptable to both implementations. @code{flex} is fully
compliant with the POSIX @code{lex} specification, except that when
using @code{%pointer} (the default), a call to @code{unput()} destroys
the contents of @code{yytext}, which is counter to the POSIX
specification. In this section we discuss all of the known areas of
incompatibility between @code{flex}, AT&T @code{lex}, and the POSIX
specification. @code{flex}'s @samp{-l} option turns on maximum
compatibility with the original AT&T @code{lex} implementation, at the
cost of a major loss in the generated scanner's performance. We note
We note 4524below which incompatibilities can be overcome using the @samp{-l} 4525option. @code{flex} is fully compatible with @code{lex} with the 4526following exceptions: 4527 4528@itemize 4529@item 4530The undocumented @code{lex} scanner internal variable @code{yylineno} is 4531not supported unless @samp{-l} or @code{%option yylineno} is used. 4532 4533@item 4534@code{yylineno} should be maintained on a per-buffer basis, rather than 4535a per-scanner (single global variable) basis. 4536 4537@item 4538@code{yylineno} is not part of the POSIX specification. 4539 4540@item 4541The @code{input()} routine is not redefinable, though it may be called 4542to read characters following whatever has been matched by a rule. If 4543@code{input()} encounters an end-of-file the normal @code{yywrap()} 4544processing is done. A ``real'' end-of-file is returned by 4545@code{input()} as @code{EOF}. 4546 4547@item 4548Input is instead controlled by defining the @code{YY_INPUT()} macro. 4549 4550@item 4551The @code{flex} restriction that @code{input()} cannot be redefined is 4552in accordance with the POSIX specification, which simply does not 4553specify any way of controlling the scanner's input other than by making 4554an initial assignment to @file{yyin}. 4555 4556@item 4557The @code{unput()} routine is not redefinable. This restriction is in 4558accordance with POSIX. 4559 4560@item 4561@code{flex} scanners are not as reentrant as @code{lex} scanners. 
In 4562particular, if you have an interactive scanner and an interrupt handler 4563which long-jumps out of the scanner, and the scanner is subsequently 4564called again, you may get the following message: 4565 4566@cindex error messages, end of buffer missed 4567@example 4568@verbatim 4569 fatal @code{flex} scanner internal error--end of buffer missed 4570@end verbatim 4571@end example 4572 4573To reenter the scanner, first use: 4574 4575@cindex restarting the scanner 4576@example 4577@verbatim 4578 yyrestart( yyin ); 4579@end verbatim 4580@end example 4581 4582Note that this call will throw away any buffered input; usually this 4583isn't a problem with an interactive scanner. @xref{Reentrant}, for 4584@code{flex}'s reentrant API. 4585 4586@item 4587Also note that @code{flex} C++ scanner classes 4588@emph{are} 4589reentrant, so if using C++ is an option for you, you should use 4590them instead. @xref{Cxx}, and @ref{Reentrant} for details. 4591 4592@item 4593@code{output()} is not supported. Output from the @b{ECHO} macro is 4594done to the file-pointer @code{yyout} (default @file{stdout)}. 4595 4596@item 4597@code{output()} is not part of the POSIX specification. 4598 4599@item 4600@code{lex} does not support exclusive start conditions (%x), though they 4601are in the POSIX specification. 4602 4603@item 4604When definitions are expanded, @code{flex} encloses them in parentheses. 4605With @code{lex}, the following: 4606 4607@cindex name definitions, not POSIX 4608@example 4609@verbatim 4610 NAME [A-Z][A-Z0-9]* 4611 %% 4612 foo{NAME}? printf( "Found it\n" ); 4613 %% 4614@end verbatim 4615@end example 4616 4617will not match the string @samp{foo} because when the macro is expanded 4618the rule is equivalent to @samp{foo[A-Z][A-Z0-9]*?} and the precedence 4619is such that the @samp{?} is associated with @samp{[A-Z0-9]*}. With 4620@code{flex}, the rule will be expanded to @samp{foo([A-Z][A-Z0-9]*)?} 4621and so the string @samp{foo} will match. 

@item
Note that if the definition begins with @samp{^} or ends with @samp{$}
then it is @emph{not} expanded with parentheses, to allow these
operators to appear in definitions without losing their special
meanings.  But the @samp{<s>}, @samp{/}, and @code{<<EOF>>} operators
cannot be used in a @code{flex} definition.

@item
Using @samp{-l} results in the @code{lex} behavior of no parentheses
around the definition.

@item
The POSIX specification is that the definition be enclosed in parentheses.

@item
Some implementations of @code{lex} allow a rule's action to begin on a
separate line, if the rule's pattern has trailing whitespace:

@cindex patterns and actions on different lines
@example
@verbatim
    %%
    foo|bar<space here>
      { foobar_action();}
@end verbatim
@end example

@code{flex} does not support this feature.

@item
The @code{lex} @code{%r} (generate a Ratfor scanner) option is not
supported.  It is not part of the POSIX specification.

@item
After a call to @code{unput()}, @code{yytext} is undefined until the
next token is matched, unless the scanner was built using @code{%array}.
This is not the case with @code{lex} or the POSIX specification.  The
@samp{-l} option does away with this incompatibility.

@item
The precedence of the @samp{@{,@}} (numeric range) operator is
different.  The AT&T and POSIX specifications of @code{lex}
interpret @samp{abc@{1,3@}} as ``match one, two,
or three occurrences of @samp{abc}'', whereas @code{flex} interprets it
as ``match @samp{ab} followed by one, two, or three occurrences of
@samp{c}''.  The @samp{-l} and @samp{--posix} options do away with this
incompatibility.

@item
The precedence of the @samp{^} operator is different.
@code{lex}
interprets @samp{^foo|bar} as ``match either @samp{foo} at the beginning of a
line, or @samp{bar} anywhere'', whereas @code{flex} interprets it as ``match
either @samp{foo} or @samp{bar} if they come at the beginning of a
line''.  The latter is in agreement with the POSIX specification.

@item
The special table-size declarations such as @code{%a} supported by
@code{lex} are not required by @code{flex} scanners.  @code{flex}
ignores them.

@item
The name @code{FLEX_SCANNER} is @code{#define}'d so scanners may be
written for use with either @code{flex} or @code{lex}.  Scanners also
include @code{YY_FLEX_MAJOR_VERSION}, @code{YY_FLEX_MINOR_VERSION}
and @code{YY_FLEX_SUBMINOR_VERSION}
indicating which version of @code{flex} generated the scanner.  For
example, for the 2.5.22 release, these defines would be 2, 5 and 22
respectively.  If the version of @code{flex} being used is a beta
version, then the symbol @code{FLEX_BETA} is defined.

@item
The symbols @samp{[[} and @samp{]]} in the code sections of the input
may conflict with the m4 delimiters.  @xref{M4 Dependency}.


@end itemize

@cindex POSIX compliance
@cindex non-POSIX features of flex
The following @code{flex} features are not included in @code{lex} or the
POSIX specification:

@itemize
@item
C++ scanners
@item
%option
@item
start condition scopes
@item
start condition stacks
@item
interactive/non-interactive scanners
@item
yy_scan_string() and friends
@item
yyterminate()
@item
yy_set_interactive()
@item
yy_set_bol()
@item
YY_AT_BOL()
@item
<<EOF>>
@item
<*>
@item
YY_DECL
@item
YY_START
@item
YY_USER_ACTION
@item
YY_USER_INIT
@item
#line directives
@item
%@{@}'s around actions
@item
reentrant C API
@item
multiple actions on a line
@item
almost all of the @code{flex} command-line options
@end itemize

The feature ``multiple actions on a line''
refers to the fact that with @code{flex} you can put multiple actions on
the same line, separated with semicolons, while with @code{lex}, the
following:

@example
@verbatim
    foo    handle_foo(); ++num_foos_seen;
@end verbatim
@end example

is (rather surprisingly) truncated to

@example
@verbatim
    foo    handle_foo();
@end verbatim
@end example

@code{flex} does not truncate the action.  Actions that are not enclosed
in braces are simply terminated at the end of the line.

@node Memory Management, Serialized Tables, Lex and Posix, Top
@chapter Memory Management

@cindex memory management
@anchor{memory-management}
This chapter describes how flex handles dynamic memory, and how you can
override the default behavior.
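
As a rough illustration of what overriding can look like, the following
sketch wraps the standard C allocator and counts live blocks. It assumes a
non-reentrant scanner built with @code{%option noyyalloc noyyrealloc noyyfree},
and it uses the non-reentrant prototypes listed in
@ref{Overriding The Default Memory Management}. The counter @code{live_blocks}
and the helper @code{yy_live_blocks} are hypothetical names introduced only for
this example; they are not part of flex.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical bookkeeping: number of blocks handed out and not yet freed. */
static long live_blocks = 0;

/* Replacement for the default yyalloc (non-reentrant signature). */
void *yyalloc(size_t bytes)
{
    void *p = malloc(bytes);
    if (p != NULL)
        live_blocks++;
    return p;
}

/* Replacement for the default yyrealloc. Resizing an existing block does
   not change the count; realloc(NULL, n) behaves like malloc(n). */
void *yyrealloc(void *ptr, size_t bytes)
{
    void *p = realloc(ptr, bytes);
    if (ptr == NULL && p != NULL)
        live_blocks++;
    return p;
}

/* Replacement for the default yyfree. Freeing NULL is a no-op. */
void yyfree(void *ptr)
{
    if (ptr != NULL) {
        assert(live_blocks > 0);
        live_blocks--;
        free(ptr);
    }
}

/* Hypothetical accessor for the counter. */
long yy_live_blocks(void)
{
    return live_blocks;
}
```

With wrappers like these in place, checking that @code{yy_live_blocks()}
returns zero after cleanup would be one way to look for leaked scanner memory.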

@menu
* The Default Memory Management::
* Overriding The Default Memory Management::
* A Note About yytext And Memory::
@end menu

@node The Default Memory Management, Overriding The Default Memory Management, Memory Management, Memory Management
@section The Default Memory Management

Flex allocates dynamic memory during initialization, and once in a while from
within a call to yylex(). Initialization takes place during the first call to
yylex(). Thereafter, flex may reallocate more memory if it needs to enlarge a
buffer. As of version 2.5.9 Flex will clean up all memory when you call @code{yylex_destroy}.
@xref{faq-memory-leak}.

Flex allocates dynamic memory for five purposes, listed below.@footnote{The
quantities given here are approximate, and may vary due to host architecture,
compiler configuration, or due to future enhancements to flex.}

@table @asis

@item 16kB for the input buffer.
Flex allocates memory for the character buffer used to perform pattern
matching.  Flex must read ahead from the input stream and store it in a large
character buffer.  This buffer is typically the largest chunk of dynamic memory
flex consumes. This buffer will grow if necessary, doubling the size each time.
Flex frees this memory when you call yylex_destroy().  The default size of this
buffer (16384 bytes) is almost always too large.  The ideal size for this
buffer is the length of the longest token expected, in bytes, plus a little more.  Flex will allocate a few
extra bytes for housekeeping. Currently, to override the size of the input buffer
you must @code{#define YY_BUF_SIZE} to whatever number of bytes you want. We don't plan
to change this in the near future, but we reserve the right to do so if we ever add a more robust memory management
API.

@item 64kB for the REJECT state. This will only be allocated if you use REJECT.
The size is large enough to hold the same number of states as characters in the input buffer. If you override the size of the
input buffer (via @code{YY_BUF_SIZE}), then you automatically override the size of this buffer as well.

@item 100 bytes for the start condition stack.
Flex allocates memory for the start condition stack. This is the stack used
for pushing start states, i.e., with yy_push_state(). It will grow if
necessary. Since the states are simply integers, this stack doesn't consume
much memory. This stack is not present if @code{%option stack} is not
specified. You will rarely need to tune this buffer. The ideal size for this
stack is the maximum depth expected.  The memory for this stack is
automatically destroyed when you call yylex_destroy(). @xref{option-stack}.

@item 40 bytes for each YY_BUFFER_STATE.
Flex allocates memory for each YY_BUFFER_STATE. The buffer state itself
is about 40 bytes, plus an additional large character buffer (described above).
The initial buffer state is created during initialization, and another is
created with each call to yy_create_buffer(). You can't tune the size of this, but you can tune the
character buffer as described above. Any buffer state that you explicitly
create by calling yy_create_buffer() is @emph{NOT} destroyed automatically. You
must call yy_delete_buffer() to free the memory. The exception to this rule is
that flex will delete the current buffer automatically when you call
yylex_destroy(). If you delete the current buffer, be sure to set it to NULL.
That way, flex will not try to delete the buffer a second time (possibly
crashing your program!). At the time of this writing, flex does not provide a
growable stack for the buffer states.  You have to manage that yourself.
@xref{Multiple Input Buffers}.

@item 84 bytes for the reentrant scanner guts.
Flex allocates about 84 bytes for the reentrant scanner structure when
you call yylex_init(). It is destroyed when the user calls yylex_destroy().

@end table


@node Overriding The Default Memory Management, A Note About yytext And Memory, The Default Memory Management, Memory Management
@section Overriding The Default Memory Management

@cindex yyalloc, overriding
@cindex yyrealloc, overriding
@cindex yyfree, overriding

Flex calls the functions @code{yyalloc}, @code{yyrealloc}, and @code{yyfree}
when it needs to allocate or free memory. By default, these functions are
wrappers around the standard C functions, @code{malloc}, @code{realloc}, and
@code{free}, respectively. You can override the default implementations by telling
flex that you will provide your own implementations.

To override the default implementations, you must do two things:

@enumerate

@item Suppress the default implementations by specifying one or more of the
following options:

@itemize
@opindex noyyalloc
@item @code{%option noyyalloc}
@item @code{%option noyyrealloc}
@item @code{%option noyyfree}
@end itemize

@item Provide your own implementation of the following functions: @footnote{It
is not necessary to override all (or any) of the memory management routines.
You may, for example, override @code{yyrealloc}, but not @code{yyfree} or
@code{yyalloc}.}

@example
@verbatim
// For a non-reentrant scanner
void * yyalloc (size_t bytes);
void * yyrealloc (void * ptr, size_t bytes);
void   yyfree (void * ptr);

// For a reentrant scanner
void * yyalloc (size_t bytes, void * yyscanner);
void * yyrealloc (void * ptr, size_t bytes, void * yyscanner);
void   yyfree (void * ptr, void * yyscanner);
@end verbatim
@end example

@end enumerate

In the following example, we will override all three memory routines. We assume
that there is a custom allocator with garbage collection. In order to make this
example interesting, we will use a reentrant scanner, passing a pointer to the
custom allocator through @code{yyextra}.

@cindex overriding the memory routines
@example
@verbatim
%{
#include "some_allocator.h"

/* Initialize the allocator. */
#define YY_EXTRA_TYPE  struct allocator*
#define YY_USER_INIT   yyextra = allocator_create();
%}

/* Suppress the default implementations. */
%option noyyalloc noyyrealloc noyyfree
%option reentrant

%%
.|\n   ;
%%

/* Provide our own implementations. */
void * yyalloc (size_t bytes, void* yyscanner) {
    return allocator_alloc (yyget_extra (yyscanner), bytes);
}

void * yyrealloc (void * ptr, size_t bytes, void* yyscanner) {
    /* The old block must be passed along so it can be resized. */
    return allocator_realloc (yyget_extra (yyscanner), ptr, bytes);
}

void yyfree (void * ptr, void * yyscanner) {
    /* Do nothing -- we leave it to the garbage collector. */
}

@end verbatim
@end example


@node A Note About yytext And Memory, , Overriding The Default Memory Management, Memory Management
@section A Note About yytext And Memory

@cindex yytext, memory considerations

When flex finds a match, @code{yytext} points to the first character of the
match in the input buffer. The string itself is part of the input buffer, and
is @emph{NOT} allocated separately. The value of yytext will be overwritten the next
time yylex() is called. In short, the value of yytext is only valid from within
the matched rule's action.

Often, you want the value of yytext to persist for later processing, e.g., by a
parser with non-zero lookahead. In order to preserve yytext, you will have to
copy it with strdup() or a similar function. But this introduces some headache
because your parser is now responsible for freeing the copy of yytext. If you
use a yacc or bison parser (commonly used with flex), you will discover that
the error recovery mechanisms can cause memory to be leaked.

To prevent memory leaks from strdup'd yytext, you will have to track the memory
somehow. Our experience has shown that a garbage collection mechanism or a
pooled memory mechanism will save you a lot of grief when writing parsers.

@node Serialized Tables, Diagnostics, Memory Management, Top
@chapter Serialized Tables
@cindex serialization
@cindex memory, serialized tables

@anchor{serialization}
A @code{flex} scanner has the ability to save the DFA tables to a file, and
load them at runtime when needed.  The motivation for this feature is to reduce
the runtime memory footprint.  Traditionally, these tables have been compiled into
the scanner as C arrays, and are sometimes quite large.  Since the tables are
compiled into the scanner, the memory used by the tables can never be freed.
This is a waste of memory, especially if an application uses several scanners,
but none of them at the same time.

The serialization feature allows the tables to be loaded at runtime, before
scanning begins. The tables may be discarded when scanning is finished.

@menu
* Creating Serialized Tables::
* Loading and Unloading Serialized Tables::
* Tables File Format::
@end menu

@node Creating Serialized Tables, Loading and Unloading Serialized Tables, Serialized Tables, Serialized Tables
@section Creating Serialized Tables
@cindex tables, creating serialized
@cindex serialization of tables

You may create a scanner with serialized tables by specifying:

@example
@verbatim
    %option tables-file=FILE
or
    --tables-file=FILE
@end verbatim
@end example

These options instruct flex to save the DFA tables to the file @var{FILE}. The tables
will @emph{not} be embedded in the generated scanner, so the scanner will not
function on its own: it depends upon the serialized tables. You must
load the tables from this file at runtime before you can scan anything.

If you do not specify a filename to @code{--tables-file}, the tables will be
saved to @file{lex.yy.tables}, where @samp{yy} is the appropriate prefix.

If your project uses several different scanners, you can concatenate the
serialized tables into one file, and flex will find the correct set of tables,
using the scanner prefix as part of the lookup key. An example follows:

@cindex serialized tables, multiple scanners
@example
@verbatim
$ flex --tables-file --prefix=cpp cpp.l
$ flex --tables-file --prefix=c c.l
$ cat lex.cpp.tables lex.c.tables  >  all.tables
@end verbatim
@end example

The above example created two scanners, @samp{cpp}, and @samp{c}.
Since we did
not specify a filename, the tables were serialized to @file{lex.cpp.tables} and
@file{lex.c.tables}, respectively. Then, we concatenated the two files
together into @file{all.tables}, which we will distribute with our project. At
runtime, we will open the file and tell flex to load the tables from it.  Flex
will find the correct tables automatically. (See the next section.)

@node Loading and Unloading Serialized Tables, Tables File Format, Creating Serialized Tables, Serialized Tables
@section Loading and Unloading Serialized Tables
@cindex tables, loading and unloading
@cindex loading tables at runtime
@cindex tables, freeing
@cindex freeing tables
@cindex memory, serialized tables

If you've built your scanner with @code{%option tables-file}, then you must
load the scanner tables at runtime. This can be accomplished with the following
function:

@deftypefun int yytables_fload (FILE* @var{fp} [, yyscan_t @var{scanner}])
Locates scanner tables in the stream pointed to by @var{fp} and loads them.
Memory for the tables is allocated via @code{yyalloc}.  You must call this
function before the first call to @code{yylex}. The argument @var{scanner}
only appears in the reentrant scanner.
This function returns @samp{0} (zero) on success, or non-zero on error.
@end deftypefun

The loaded tables are @strong{not} automatically destroyed (unloaded) when you
call @code{yylex_destroy}. The reason is that you may create several scanners
of the same type (in a reentrant scanner), each of which needs access to these
tables.  To avoid a nasty memory leak, you must call the following function:

@deftypefun int yytables_destroy ([yyscan_t @var{scanner}])
Unloads the scanner tables. The tables must be loaded again before you can scan
any more data.  The argument @var{scanner} only appears in the reentrant
scanner.
This function returns @samp{0} (zero) on success, or non-zero on
error.
@end deftypefun

@strong{The functions @code{yytables_fload} and @code{yytables_destroy} are not
thread-safe.} You must ensure that these functions are called exactly once (for
each scanner type) in a threaded program, before any thread calls @code{yylex}.
After the tables are loaded, they are never written to, and no thread
protection is required thereafter, until you destroy them.

@node Tables File Format, , Loading and Unloading Serialized Tables, Serialized Tables
@section Tables File Format
@cindex tables, file format
@cindex file format, serialized tables

This section defines the file format of serialized @code{flex} tables.

The tables format allows for one or more sets of tables to be
specified, where each set corresponds to a given scanner. Scanners are
indexed by name, as described below. The file format is as follows:

@example
@verbatim
                 TABLE SET 1
            +-------------------------------+
    Header  |  uint32  th_magic;            |
            |  uint32  th_hsize;            |
            |  uint32  th_ssize;            |
            |  uint16  th_flags;            |
            |  char    th_version[];        |
            |  char    th_name[];           |
            |  uint8   th_pad64[];          |
            +-------------------------------+
    Table 1 |  uint16  td_id;               |
            |  uint16  td_flags;            |
            |  uint32  td_lolen;            |
            |  uint32  td_hilen;            |
            |  void    td_data[];           |
            |  uint8   td_pad64[];          |
            +-------------------------------+
    Table 2 |                               |
       .    .                               .
       .    .                               .
       .    .                               .
       .    .                               .
    Table n |                               |
            +-------------------------------+
                 TABLE SET 2
                      .
                      .
                      .
                 TABLE SET N
@end verbatim
@end example

The above diagram shows that a complete set of tables consists of a header
followed by multiple individual tables. Furthermore, multiple complete sets may
be present in the same file, each set with its own header and tables.
The sets
are contiguous in the file. The only way to know if another set follows is to
check the next four bytes for the magic number (or check for EOF). The header
and tables sections are padded to 64-bit boundaries. Below we describe each
field in detail. This format does not specify how the scanner will expand the
given data, e.g., data may be serialized as int8, but expanded to an int32
array at runtime. This is to reduce the size of the serialized data where
possible.  Remember, @emph{all integer values are in network byte order}.

@noindent
Fields of a table header:

@table @code
@item th_magic
Magic number, always 0xF13C57B1.

@item th_hsize
Size of this entire header, in bytes, including all fields plus any padding.

@item th_ssize
Size of this entire set, in bytes, including the header, all tables, plus
any padding.

@item th_flags
Bit flags for this table set. Currently unused.

@item th_version[]
Flex version in NUL-terminated string format, e.g., @samp{2.5.13a}. This is
the version of flex that was used to create the serialized tables.

@item th_name[]
Contains the name of this table set. The default is @samp{yytables},
and is prefixed accordingly, e.g., @samp{footables}. Must be NUL-terminated.

@item th_pad64[]
Zero or more NUL bytes, padding the entire header to the next 64-bit boundary
as calculated from the beginning of the header.
@end table

@noindent
Fields of a table:

@table @code
@item td_id
Specifies the table identifier.
Possible values are:
@table @code
@item YYTD_ID_ACCEPT (0x01)
@code{yy_accept}
@item YYTD_ID_BASE (0x02)
@code{yy_base}
@item YYTD_ID_CHK (0x03)
@code{yy_chk}
@item YYTD_ID_DEF (0x04)
@code{yy_def}
@item YYTD_ID_EC (0x05)
@code{yy_ec}
@item YYTD_ID_META (0x06)
@code{yy_meta}
@item YYTD_ID_NUL_TRANS (0x07)
@code{yy_NUL_trans}
@item YYTD_ID_NXT (0x08)
@code{yy_nxt}. This array may be two dimensional. See the @code{td_hilen}
field below.
@item YYTD_ID_RULE_CAN_MATCH_EOL (0x09)
@code{yy_rule_can_match_eol}
@item YYTD_ID_START_STATE_LIST (0x0A)
@code{yy_start_state_list}. This array is handled specially because it is an
array of pointers to structs. See the @code{td_flags} field below.
@item YYTD_ID_TRANSITION (0x0B)
@code{yy_transition}. This array is handled specially because it is an array of
structs. See the @code{td_lolen} field below.
@item YYTD_ID_ACCLIST (0x0C)
@code{yy_acclist}
@end table

@item td_flags
Bit flags describing how to interpret the data in @code{td_data}.
The data arrays are one-dimensional by default, but may be
two-dimensional as specified in the @code{td_hilen} field.

@table @code
@item YYTD_DATA8 (0x01)
The data is serialized as an array of type int8.
@item YYTD_DATA16 (0x02)
The data is serialized as an array of type int16.
@item YYTD_DATA32 (0x04)
The data is serialized as an array of type int32.
@item YYTD_PTRANS (0x08)
The data is a list of indexes of entries in the expanded @code{yy_transition}
array.  Each index should be expanded to a pointer to the corresponding entry
in the @code{yy_transition} array. We count on the fact that the
@code{yy_transition} array has already been seen.
@item YYTD_STRUCT (0x10)
The data is a list of yy_trans_info structs, each of which consists of
two integers.
There is no padding between struct elements or between structs.
The type of each member is determined by the @code{YYTD_DATA*} bits.
@end table

@item td_lolen
Specifies the number of elements in the lowest dimension array. If this is
a one-dimensional array, then it is simply the number of elements in this array.
The element size is determined by the @code{td_flags} field.

@item td_hilen
If @code{td_hilen} is non-zero, then the data is a two-dimensional array.
Otherwise, the data is a one-dimensional array. @code{td_hilen} contains the
number of elements in the higher dimensional array, and @code{td_lolen} contains
the number of elements in the lowest dimension.

Conceptually, @code{td_data} is either @code{sometype td_data[td_lolen]}, or
@code{sometype td_data[td_hilen][td_lolen]}, where @code{sometype} is specified
by the @code{td_flags} field.  It is possible for both @code{td_lolen} and
@code{td_hilen} to be zero, in which case @code{td_data} is a zero length
array, and no data is loaded, i.e., this table is simply skipped. Flex does not
currently generate tables of zero length.

@item td_data[]
The table data. This array may be a one- or two-dimensional array, of type
@code{int8}, @code{int16}, @code{int32}, @code{struct yy_trans_info}, or
@code{struct yy_trans_info*},  depending upon the values in the
@code{td_flags}, @code{td_lolen}, and @code{td_hilen} fields.

@item td_pad64[]
Zero or more NUL bytes, padding the entire table to the next 64-bit boundary as
calculated from the beginning of this table.
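
To make the table layout and this padding rule concrete, here is a minimal
sketch in C that decodes the four fixed fields of one table from a byte buffer
(all integers in network byte order) and computes the size of @code{td_data}
and the length of @code{td_pad64}. The struct and function names
(@code{td_header}, @code{td_decode}, @code{td_data_bytes}, @code{td_pad64_len})
are hypothetical and not part of the flex API. Doubling the element count for
@code{YYTD_STRUCT} is our reading of ``each of which consists of two
integers'', and has not been checked against the flex sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as listed under td_flags. */
enum { YYTD_DATA8 = 0x01, YYTD_DATA16 = 0x02, YYTD_DATA32 = 0x04,
       YYTD_STRUCT = 0x10 };

/* Size in bytes of td_id + td_flags + td_lolen + td_hilen. */
enum { TD_FIXED_BYTES = 12 };

/* Hypothetical in-memory view of the fixed part of one table. */
struct td_header {
    uint16_t td_id;
    uint16_t td_flags;
    uint32_t td_lolen;
    uint32_t td_hilen;
};

/* Read big-endian (network byte order) integers from a byte buffer. */
uint16_t read_be16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decode the 12 fixed bytes at the start of a table. */
struct td_header td_decode(const unsigned char *buf)
{
    struct td_header h;
    assert(buf != NULL);
    h.td_id    = read_be16(buf);
    h.td_flags = read_be16(buf + 2);
    h.td_lolen = read_be32(buf + 4);
    h.td_hilen = read_be32(buf + 8);
    return h;
}

/* Number of bytes occupied by td_data, derived from the flags and lengths. */
size_t td_data_bytes(const struct td_header *h)
{
    size_t elem = (h->td_flags & YYTD_DATA8)  ? 1 :
                  (h->td_flags & YYTD_DATA16) ? 2 :
                  (h->td_flags & YYTD_DATA32) ? 4 : 0;
    size_t n = h->td_lolen;
    if (h->td_hilen != 0)
        n *= h->td_hilen;          /* two-dimensional array */
    if (h->td_flags & YYTD_STRUCT)
        n *= 2;                    /* assumption: two integers per struct */
    return n * elem;
}

/* Number of NUL bytes needed to reach the next 64-bit (8-byte) boundary,
   given the unpadded size measured from the beginning of the table. */
size_t td_pad64_len(size_t unpadded)
{
    return (8 - unpadded % 8) % 8;
}
```

For a table of five int16 elements, for instance, the unpadded size is
@code{TD_FIXED_BYTES + 10}, i.e., 22 bytes, so two bytes of padding bring the
table to 24 bytes.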
@end table

@node Diagnostics, Limitations, Serialized Tables, Top
@chapter Diagnostics

@cindex error reporting, diagnostic messages
@cindex warnings, diagnostic messages

The following is a list of @code{flex} diagnostic messages:

@itemize
@item
@samp{warning, rule cannot be matched} indicates that the given rule
cannot be matched because it follows other rules that will always match
the same text as it.  For example, in the following @samp{foo} cannot be
matched because it comes after an identifier ``catch-all'' rule:

@cindex warning, rule cannot be matched
@example
@verbatim
    [a-z]+    got_identifier();
    foo       got_foo();
@end verbatim
@end example

Using @code{REJECT} in a scanner suppresses this warning.

@item
@samp{warning, -s option given but default rule can be matched} means
that it is possible (perhaps only in a particular start condition) that
the default rule (match any single character) is the only one that will
match a particular input.  Since @samp{-s} was given, presumably this is
not intended.

@item
@code{reject_used_but_not_detected undefined} or
@code{yymore_used_but_not_detected undefined}. These errors can occur
at compile time.  They indicate that the scanner uses @code{REJECT} or
@code{yymore()} but that @code{flex} failed to notice the fact, meaning
that @code{flex} scanned the first two sections looking for occurrences
of these actions and failed to find any, but somehow you snuck some in
(via a #include file, for example).  Use @code{%option reject} or
@code{%option yymore} to indicate to @code{flex} that you really do use
these features.

@item
@samp{flex scanner jammed}. A scanner compiled with
@samp{-s} has encountered an input string which wasn't matched by any of
its rules.  This error can also occur due to internal problems.

@item
@samp{token too large, exceeds YYLMAX}. Your scanner uses @code{%array}
and one of its rules matched a string longer than the @code{YYLMAX}
constant (8K bytes by default).  You can increase the value by
#define'ing @code{YYLMAX} in the definitions section of your @code{flex}
input.

@item
@samp{scanner requires -8 flag to use the character 'x'}. Your scanner
specification includes recognizing the 8-bit character @samp{'x'} and
you did not specify the -8 flag, and your scanner defaulted to 7-bit
because you used the @samp{-Cf} or @samp{-CF} table compression options.
See the discussion of the @samp{-7} flag, @ref{Scanner Options}, for
details.

@item
@samp{flex scanner push-back overflow}. You used @code{unput()} to push
back so much text that the scanner's buffer could not hold both the
pushed-back text and the current token in @code{yytext}.  Ideally the
scanner should dynamically resize the buffer in this case, but at
present it does not.

@item
@samp{input buffer overflow, can't enlarge buffer because scanner uses
REJECT}.  The scanner was working on matching an extremely large token
and needed to expand the input buffer.  This doesn't work with scanners
that use @code{REJECT}.

@item
@samp{fatal flex scanner internal error--end of buffer missed}. This can
occur in a scanner which is reentered after a long-jump has jumped out
(or over) the scanner's activation frame.  Before reentering the
scanner, use:
@example
@verbatim
    yyrestart( yyin );
@end verbatim
@end example
or, as noted above, switch to using the C++ scanner class.

@item
@samp{too many start conditions in <> construct!} You listed more start
conditions in a <> construct than exist (so you must have listed at
least one of them twice).
@end itemize

@node Limitations, Bibliography, Diagnostics, Top
@chapter Limitations

@cindex limitations of flex

Some trailing context patterns cannot be properly matched and generate
warning messages (@samp{dangerous trailing context}).  These are
patterns where the ending of the first part of the rule matches the
beginning of the second part, such as @samp{zx*/xy*}, where the @samp{x*}
matches the @samp{x} at the beginning of the trailing context.  (Note that
the POSIX draft states that the text matched by such patterns is
undefined.)

For some trailing context rules, parts which are actually
fixed-length are not recognized as such, leading to the abovementioned
performance loss.  In particular, parts using @samp{|} or @samp{@{n@}}
(such as @samp{foo@{3@}}) are always considered variable-length.

Combining trailing context with the special @samp{|} action can result
in @emph{fixed} trailing context being turned into the more expensive
@emph{variable} trailing context.  For example, in the following:

@cindex warning, dangerous trailing context
@example
@verbatim
    %%
    abc      |
    xyz/def
@end verbatim
@end example

Use of @code{unput()} invalidates yytext and yyleng, unless the
@code{%array} directive or the @samp{-l} option has been used.

Pattern-matching of @code{NUL}s is substantially slower than matching
other characters.

Dynamic resizing of the input buffer is slow, as it
entails rescanning all the text matched so far by the current (generally
huge) token.

Due to both buffering of input and read-ahead, you cannot
intermix calls to @file{<stdio.h>} routines, such as @b{getchar()},
with @code{flex} rules and expect it to work.  Call @code{input()}
instead.

The total number of table entries listed by the @samp{-v} flag excludes
the table entries needed to determine what rule has been
matched.
The number of entries is equal to the number of DFA states if
the scanner does not use @code{REJECT}, and somewhat greater than the
number of states if it does.

@code{REJECT} cannot be used with the
@samp{-f} or @samp{-F} options.

The @code{flex} internal algorithms need documentation.

@node Bibliography, FAQ, Limitations, Top
@chapter Additional Reading

You may wish to read more about the following programs:
@itemize
@item lex
@item yacc
@item sed
@item awk
@end itemize

The following books may contain material of interest:

John Levine, Tony Mason, and Doug Brown,
@emph{Lex & Yacc},
O'Reilly and Associates.  Be sure to get the 2nd edition.

M. E. Lesk and E. Schmidt,
@emph{LEX -- Lexical Analyzer Generator}.

Alfred Aho, Ravi Sethi and Jeffrey Ullman, @emph{Compilers: Principles,
Techniques and Tools}, Addison-Wesley (1986).  Describes the
pattern-matching techniques used by @code{flex} (deterministic finite
automata).

@node FAQ, Appendices, Bibliography, Top
@unnumbered FAQ

From time to time, the @code{flex} maintainer receives certain
questions. Rather than repeat answers to well-understood problems, we
publish them here.

@menu
* When was flex born?::
* How do I expand backslash-escape sequences in C-style quoted strings?::
* Why do flex scanners call fileno if it is not ANSI compatible?::
* Does flex support recursive pattern definitions?::
* How do I skip huge chunks of input (tens of megabytes) while using flex?::
* Flex is not matching my patterns in the same order that I defined them.::
* My actions are executing out of order or sometimes not at all.::
* How can I have multiple input sources feed into the same scanner at the same time?::
* Can I build nested parsers that work with the same input file?::
* How can I match text only at the end of a file?::
* How can I make REJECT cascade across start condition boundaries?::
* Why cant I use fast or full tables with interactive mode?::
* How much faster is -F or -f than -C?::
* If I have a simple grammar cant I just parse it with flex?::
* Why doesn't yyrestart() set the start state back to INITIAL?::
* How can I match C-style comments?::
* The period isn't working the way I expected.::
* Can I get the flex manual in another format?::
* Does there exist a "faster" NDFA->DFA algorithm?::
* How does flex compile the DFA so quickly?::
* How can I use more than 8192 rules?::
* How do I abandon a file in the middle of a scan and switch to a new file?::
* How do I execute code only during initialization (only before the first scan)?::
* How do I execute code at termination?::
* Where else can I find help?::
* Can I include comments in the "rules" section of the file?::
* I get an error about undefined yywrap().::
* How can I change the matching pattern at run time?::
* How can I expand macros in the input?::
* How can I build a two-pass scanner?::
* How do I match any string not matched in the preceding rules?::
* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
* Is there a way to make flex treat NULL like a regular character?::
* Whenever flex can not match the input it says "flex scanner jammed".::
* Why doesn't flex have non-greedy operators like perl does?::
* Memory leak - 16386 bytes allocated by malloc.::
* How do I track the byte offset for lseek()?::
* How do I use my own I/O classes in a C++ scanner?::
* How do I skip as many chars as possible?::
* deleteme00::
* Are certain equivalent patterns faster than others?::
* Is backing up a big deal?::
* Can I fake multi-byte character support?::
* deleteme01::
* Can you discuss some flex internals?::
* unput() messes up yy_at_bol::
* The | operator is not doing what I want::
* Why can't flex understand this variable trailing context pattern?::
* The ^ operator isn't working::
* Trailing context is getting confused with trailing optional patterns::
* Is flex GNU or not?::
* ERASEME53::
* I need to scan if-then-else blocks and while loops::
* ERASEME55::
* ERASEME56::
* ERASEME57::
* Is there a repository for flex scanners?::
* How can I conditionally compile or preprocess my flex input file?::
* Where can I find grammars for lex and yacc?::
* I get an end-of-buffer message for each character scanned.::
* unnamed-faq-62::
* unnamed-faq-63::
* unnamed-faq-64::
* unnamed-faq-65::
* unnamed-faq-66::
* unnamed-faq-67::
* unnamed-faq-68::
* unnamed-faq-69::
* unnamed-faq-70::
* unnamed-faq-71::
* unnamed-faq-72::
* unnamed-faq-73::
* unnamed-faq-74::
* unnamed-faq-75::
* unnamed-faq-76::
* unnamed-faq-77::
* unnamed-faq-78::
* unnamed-faq-79::
* unnamed-faq-80::
* unnamed-faq-81::
* unnamed-faq-82::
* unnamed-faq-83::
* unnamed-faq-84::
* unnamed-faq-85::
* unnamed-faq-86::
* unnamed-faq-87::
* unnamed-faq-88::
* unnamed-faq-90::
* unnamed-faq-91::
* unnamed-faq-92::
* unnamed-faq-93::
* unnamed-faq-94::
* unnamed-faq-95::
* unnamed-faq-96::
* unnamed-faq-97::
* unnamed-faq-98::
* unnamed-faq-99::
* unnamed-faq-100::
* unnamed-faq-101::
* What is the difference between YYLEX_PARAM and YY_DECL?::
* Why do I get "conflicting types for yylex" error?::
* How do I access the values set in a Flex action from within a Bison action?::
@end menu

@node When was flex born?
@unnumberedsec When was flex born?

Vern Paxson took over
the @cite{Software Tools} lex project from Jef Poskanzer in 1982. At that point it
was written in Ratfor. Around 1987 or so, Paxson translated it into C, and
a legend was born :-).

@node How do I expand backslash-escape sequences in C-style quoted strings?
@unnumberedsec How do I expand backslash-escape sequences in C-style quoted strings?

A key point when scanning quoted strings is that you cannot (easily) write
a single rule that will precisely match the string if you allow things
like embedded escape sequences and newlines. If you try to match strings
with a single rule then you'll wind up having to rescan the string anyway
to find any escape sequences.

Instead you can use exclusive start conditions and a set of rules, one for
matching non-escaped text, one for matching a single escape, one for
matching an embedded newline, and one for recognizing the end of the
string. Each of these rules is then faced with the question of where to
put its intermediary results. The best solution is for the rules to
append their local value of @code{yytext} to the end of a ``string literal''
buffer. A rule like the escape-matcher will append to the buffer the
meaning of the escape sequence rather than the literal text in @code{yytext}.
In this way, @code{yytext} does not need to be modified at all.
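
The buffer-appending approach described above can be sketched in C as
follows. This is only a minimal illustration under stated assumptions, not
code taken from @code{flex}: the helper names @code{lit_reset()},
@code{lit_append()} and @code{escape_value()}, the fixed @code{LIT_MAX}
capacity, and the @code{STR} start condition shown in the comment are all
hypothetical names chosen for the example.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical rules that would call the helpers below (flex syntax,
 * shown here only as a comment; STR is an assumed exclusive start
 * condition):
 *
 *   %x STR
 *   %%
 *   \"               { lit_reset(); BEGIN(STR); }
 *   <STR>[^\\\n\"]+  { lit_append(yytext, yyleng); }
 *   <STR>\\.         { char c = escape_value(yytext[1]); lit_append(&c, 1); }
 *   <STR>\n          { report an unterminated string }
 *   <STR>\"          { BEGIN(INITIAL); the literal is now in lit_buf }
 */

#define LIT_MAX 1024            /* assumed fixed capacity, for brevity */

char   lit_buf[LIT_MAX];        /* the "string literal" buffer */
size_t lit_len;                 /* bytes currently stored in lit_buf */

/* Start a new literal; call when the opening quote is matched. */
void lit_reset(void)
{
    lit_len = 0;
    lit_buf[0] = '\0';
}

/* Append len bytes of matched text. Returns 0 on success, -1 if full. */
int lit_append(const char *text, size_t len)
{
    if (len >= LIT_MAX - lit_len)   /* keep one byte for the final NUL */
        return -1;
    memcpy(lit_buf + lit_len, text, len);
    lit_len += len;
    lit_buf[lit_len] = '\0';
    return 0;
}

/* Translate the character following a backslash into its meaning. */
char escape_value(char c)
{
    switch (c) {
    case 'n': return '\n';
    case 't': return '\t';
    case 'r': return '\r';
    default:  return c;     /* \" and \\ (and unknown escapes) map to c */
    }
}
```

Because each action only appends to @code{lit_buf}, none of them writes into
@code{yytext}. A real scanner would probably grow the buffer dynamically and
handle octal and hexadecimal escapes as well.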

@node Why do flex scanners call fileno if it is not ANSI compatible?
@unnumberedsec Why do flex scanners call fileno if it is not ANSI compatible?

Flex scanners call @code{fileno()} in order to get the file descriptor
corresponding to @code{yyin}. The file descriptor may be passed to
@code{isatty()} or @code{read()}, depending upon which @code{%options} you specified.
If your system does not have @code{fileno()} support, to get rid of the
@code{read()} call, do not specify @code{%option read}. To get rid of the @code{isatty()}
call, you must specify one of @code{%option always-interactive} or
@code{%option never-interactive}.

@node Does flex support recursive pattern definitions?
@unnumberedsec Does flex support recursive pattern definitions?

e.g.,

@example
@verbatim
%%
block   "{"({block}|{statement})*"}"
@end verbatim
@end example

No. You cannot have recursive definitions. The pattern-matching power of
regular expressions in general (and therefore flex scanners, too) is
limited. In particular, regular expressions cannot ``balance'' parentheses
to an arbitrary degree. For example, it's impossible to write a regular
expression that matches all strings containing the same number of @samp{@{}s
as @samp{@}}s. For more powerful pattern matching, you need a parser, such
as @cite{GNU bison}.

@node How do I skip huge chunks of input (tens of megabytes) while using flex?
@unnumberedsec How do I skip huge chunks of input (tens of megabytes) while using flex?

Use @code{fseek()} (or @code{lseek()}) to position @code{yyin}, then call @code{yyrestart()}.

@node Flex is not matching my patterns in the same order that I defined them.
@unnumberedsec Flex is not matching my patterns in the same order that I defined them.

@code{flex} picks the
rule that matches the most text (i.e., the longest possible input string).
This is because @code{flex} uses a matching technique
(``deterministic finite automata'') that actually does all of the matching
simultaneously, in parallel. (Seems impossible, but it's actually a fairly
simple technique once you understand the principles.)

A side-effect of this parallel matching is that when the input matches more
than one rule, @code{flex} scanners pick the rule that matched the @emph{most} text. This
is explained further in the manual; see @ref{Matching}.

If you want @code{flex} to choose a shorter match, then you can work around this
behavior by expanding your short
rule to match more text, then put back the extra:

@example
@verbatim
data_.*        yyless( 5 ); BEGIN BLOCKIDSTATE;
@end verbatim
@end example

Another fix would be to make the second rule active only during the
@code{<BLOCKIDSTATE>} start condition, and make that start condition exclusive
by declaring it with @code{%x} instead of @code{%s}.

A final fix is to change the input language so that the ambiguity for
@samp{data_} is removed, by adding characters to it that don't match the
identifier rule, or by removing characters (such as @samp{_}) from the
identifier rule so it no longer matches @samp{data_}. (Of course, you might
also not have the option of changing the input language.)

@node My actions are executing out of order or sometimes not at all.
@unnumberedsec My actions are executing out of order or sometimes not at all.

Most likely, you have (in error) placed the opening @samp{@{} of the action
block on a different line than the rule, e.g.,

@example
@verbatim
^(foo|bar)
{  <<<--- WRONG!

}
@end verbatim
@end example

@code{flex} requires that the opening @samp{@{} of an action associated with a rule
begin on the same line as does the rule.
You need instead to write your rules
as follows:

@example
@verbatim
^(foo|bar)  {  // CORRECT!

}
@end verbatim
@end example

@node How can I have multiple input sources feed into the same scanner at the same time?
@unnumberedsec How can I have multiple input sources feed into the same scanner at the same time?

If @dots{}
@itemize
@item
your scanner is free of backtracking (verified using @code{flex}'s @samp{-b} flag),
@item
AND you run your scanner interactively (@samp{-I} option; default unless using special table
compression options),
@item
AND you feed it one character at a time by redefining @code{YY_INPUT} to do so,
@end itemize

then every time it matches a token, it will have exhausted its input
buffer (because the scanner is free of backtracking). This means you
can safely use @code{select()} at that point and only call @code{yylex()} for another
token if @code{select()} indicates there's data available.

That is, move the @code{select()} out from the input function to a point where
it determines whether @code{yylex()} gets called for the next token.

With this approach, you will still have problems if your input can arrive
piecemeal; @code{select()} could inform you that the beginning of a token is
available, you call @code{yylex()} to get it, but it winds up blocking waiting
for the later characters in the token.

Here's another way: Move your input multiplexing inside of @code{YY_INPUT}. That
is, whenever @code{YY_INPUT} is called, it @code{select()}'s to see where input is
available. If input is available for the scanner, it reads and returns the
next byte. If input is available from another source, it calls whatever
function is responsible for reading from that source. (If no input is
available, it blocks until some input is available.)
I've used this technique in an
interpreter I wrote that both reads keyboard input using a @code{flex} scanner and
IPC traffic from sockets, and it works fine.

@node Can I build nested parsers that work with the same input file?
@unnumberedsec Can I build nested parsers that work with the same input file?

This is not going to work without some additional effort. The reason is
that @code{flex} block-buffers the input it reads from @code{yyin}. This means that the
``outermost'' @code{yylex()}, when called, will automatically slurp up the first 8K
of input available on @code{yyin}, and subsequent calls to other @code{yylex()}'s won't
see that input. You might be tempted to work around this problem by
redefining @code{YY_INPUT} to only return a small amount of text, but it turns out
that that approach is quite difficult. Instead, the best solution is to
combine all of your scanners into one large scanner, using a different
exclusive start condition for each.

@node How can I match text only at the end of a file?
@unnumberedsec How can I match text only at the end of a file?

There is no way to write a rule which is ``match this text, but only if
it comes at the end of the file''. You can fake it, though, if you happen
to have a character lying around that you don't allow in your input.
Then you redefine @code{YY_INPUT} to call your own routine which, if it sees
an @samp{EOF}, returns the magic character first (and remembers to return a
real @code{EOF} next time it's called). Then you could write:

@example
@verbatim
<COMMENT>(.|\n)*{EOF_CHAR}    /* saw comment at EOF */
@end verbatim
@end example

@node How can I make REJECT cascade across start condition boundaries?
@unnumberedsec How can I make REJECT cascade across start condition boundaries?

You can do this as follows.
Suppose you have a start condition @samp{A}, and
after exhausting all of the possible matches in @samp{<A>}, you want to try
matches in @samp{<INITIAL>}. Then you could use the following:

@example
@verbatim
%x A
%%
<A>rule_that_is_long    ...; REJECT;
<A>rule                 ...; REJECT; /* shorter rule */
<A>etc.
...
<A>.|\n  {
    /* Shortest and last rule in <A>, so
     * cascaded REJECTs will eventually
     * wind up matching this rule. We want
     * to now switch to the initial state
     * and try matching from there instead.
     */
    yyless(0);    /* put back matched text */
    BEGIN(INITIAL);
}
@end verbatim
@end example

@node Why cant I use fast or full tables with interactive mode?
@unnumberedsec Why can't I use fast or full tables with interactive mode?

One of the assumptions
flex makes is that interactive applications are inherently slow (they're
waiting on a human after all).
It has to do with how the scanner detects that it must be finished scanning
a token. For interactive scanners, after scanning each character the current
state is looked up in a table (essentially) to see whether there's a chance
of another input character possibly extending the length of the match. If
not, the scanner halts. For non-interactive scanners, the end-of-token test
is much simpler, basically a compare with 0, so no memory bus cycles. Since
the test occurs in the innermost scanning loop, one would like to make it go
as fast as possible.

Still, it seems reasonable to allow the user to choose to trade off a bit
of performance in this area to gain the corresponding flexibility. There
might be another reason, though, why fast scanners don't support the
interactive option.

@node How much faster is -F or -f than -C?
@unnumberedsec How much faster is -F or -f than -C?

Much faster (factor of 2-3).

@node If I have a simple grammar cant I just parse it with flex?
@unnumberedsec If I have a simple grammar can't I just parse it with flex?

Is your grammar recursive? That's almost always a sign that you're
better off using a parser/scanner rather than just trying to use a scanner
alone.

@node Why doesn't yyrestart() set the start state back to INITIAL?
@unnumberedsec Why doesn't yyrestart() set the start state back to INITIAL?

There are two reasons. The first is that there might
be programs that rely on the start state not changing across file changes.
The second is that beginning with @code{flex} version 2.4, use of @code{yyrestart()} is no longer required,
so fixing the problem there doesn't solve the more general problem.

@node How can I match C-style comments?
@unnumberedsec How can I match C-style comments?

You might be tempted to try something like this:

@example
@verbatim
"/*".*"*/"       // WRONG!
@end verbatim
@end example

or, worse, this:

@example
@verbatim
"/*"(.|\n)*"*/"   // WRONG!
@end verbatim
@end example

The above rules will eat too much input, and blow up on things like:

@example
@verbatim
/* a comment */ do_my_thing( "oops */" );
@end verbatim
@end example

Here is one way which allows you to track line information
(it assumes @code{IN_COMMENT} is declared as an exclusive start condition):

@example
@verbatim
%x IN_COMMENT
%%
<INITIAL>{
"/*"      BEGIN(IN_COMMENT);
}
<IN_COMMENT>{
"*/"      BEGIN(INITIAL);
[^*\n]+   // eat comment in chunks
"*"       // eat the lone star
\n        yylineno++;
}
@end verbatim
@end example

@node The period isn't working the way I expected.
@unnumberedsec The '.' isn't working the way I expected.

Here are some tips for using @samp{.}:

@itemize
@item
A common mistake is to place the grouping parenthesis AFTER an operator, when
you really meant to place the parenthesis BEFORE the operator, e.g., you
probably want this @code{(foo|bar)+} and NOT this @code{(foo|bar+)}.

The first pattern matches the words @samp{foo} or @samp{bar} any number of
times, e.g., it matches the text @samp{barfoofoobarfoo}. The
second pattern matches a single instance of @code{foo} or a single instance of
@code{bar} followed by one or more @samp{r}s, e.g., it matches the text @code{barrrr}.
@item
A @samp{.} inside @samp{[]}'s just means a literal @samp{.} (period),
and NOT ``any character except newline''.
@item
Remember that @samp{.} matches any character EXCEPT @samp{\n} (and @samp{EOF}).
If you really want to match ANY character, including newlines, then use @code{(.|\n)}.
Beware that the regex @code{(.|\n)+} will match your entire input!
@item
Finally, if you want to match a literal @samp{.} (a period), then use @samp{[.]} or @samp{"."}.
@end itemize

@node Can I get the flex manual in another format?
@unnumberedsec Can I get the flex manual in another format?

The @code{flex} source distribution includes a texinfo manual. You are
free to convert that texinfo into whatever format you desire. The
@code{texinfo} package includes tools for conversion to a number of formats.

@node Does there exist a "faster" NDFA->DFA algorithm?
@unnumberedsec Does there exist a "faster" NDFA->DFA algorithm?

There's no way around the potential exponential running time - it
can take you exponential time just to enumerate all of the DFA states.
In practice, though, the running time is closer to linear, or sometimes
quadratic.

@node How does flex compile the DFA so quickly?
@unnumberedsec How does flex compile the DFA so quickly?

There are two big speed wins that @code{flex} uses:

@enumerate
@item
It analyzes the input rules to construct equivalence classes for those
characters that always make the same transitions. It then rewrites the NFA
using equivalence classes for transitions instead of characters. This cuts
down the NFA->DFA computation time dramatically, to the point where, for
uncompressed DFA tables, the DFA generation is often I/O bound in writing out
the tables.
@item
It maintains hash values for previously computed DFA states, so testing
whether a newly constructed DFA state is equivalent to a previously constructed
state can be done very quickly, by first comparing hash values.
@end enumerate

@node How can I use more than 8192 rules?
@unnumberedsec How can I use more than 8192 rules?

@code{Flex} is compiled with an upper limit of 8192 rules per scanner.
If you need more than 8192 rules in your scanner, you'll have to recompile @code{flex}
with the following changes in @file{flexdef.h}:

@example
@verbatim
<    #define YY_TRAILING_MASK 0x2000
<    #define YY_TRAILING_HEAD_MASK 0x4000
--
>    #define YY_TRAILING_MASK 0x20000000
>    #define YY_TRAILING_HEAD_MASK 0x40000000
@end verbatim
@end example

This should work okay as long as your C compiler uses 32-bit integers.
But you might want to think about whether using such a huge number of rules
is the best way to solve your problem.

The following may also be relevant:

With luck, you should be able to increase the definitions in @file{flexdef.h} for:

@example
@verbatim
#define JAMSTATE -32766 /* marks a reference to the state that always jams */
#define MAXIMUM_MNS 31999
#define BAD_SUBSCRIPT -32767
@end verbatim
@end example

recompile everything, and it'll all work.
Flex only has these 16-bit-like
values built into it because a long time ago it was developed on a machine
with 16-bit ints. I've given this advice to others in the past but haven't
heard back from them whether it worked okay or not...

@node How do I abandon a file in the middle of a scan and switch to a new file?
@unnumberedsec How do I abandon a file in the middle of a scan and switch to a new file?

Just call @code{yyrestart(newfile)}. Be sure to reset the start state if you want a
``fresh start'', since @code{yyrestart()} does NOT reset the start state back to @code{INITIAL}.

@node How do I execute code only during initialization (only before the first scan)?
@unnumberedsec How do I execute code only during initialization (only before the first scan)?

You can specify an initial action by defining the macro @code{YY_USER_INIT} (though
note that @code{yyout} may not be available at the time this macro is executed). Or you
can add to the beginning of your rules section:

@example
@verbatim
%%
    /* Must be indented! */
    static int did_init = 0;

    if ( ! did_init ){
        do_my_init();
        did_init = 1;
    }
@end verbatim
@end example

@node How do I execute code at termination?
@unnumberedsec How do I execute code at termination?

You can specify an action for the @code{<<EOF>>} rule.

@node Where else can I find help?
@unnumberedsec Where else can I find help?

You can find the flex homepage on the web at
@uref{http://flex.sourceforge.net/}. See that page for details about flex
mailing lists as well.

@node Can I include comments in the "rules" section of the file?
@unnumberedsec Can I include comments in the "rules" section of the file?

Yes, just about anywhere you want to. See the manual for the specific syntax.

@node I get an error about undefined yywrap().
@unnumberedsec I get an error about undefined yywrap().

You must supply a @code{yywrap()} function of your own, or link to @file{libfl.a}
(which provides one), or use

@example
@verbatim
%option noyywrap
@end verbatim
@end example

in your source to say you don't want a @code{yywrap()} function.

@node How can I change the matching pattern at run time?
@unnumberedsec How can I change the matching pattern at run time?

You can't; it's compiled into a static table when flex builds the scanner.

@node How can I expand macros in the input?
@unnumberedsec How can I expand macros in the input?

The best way to approach this problem is at a higher level, e.g., in the parser.

However, you can do this using multiple input buffers.

@example
@verbatim
%%
macro/[a-z]+    {
    /* Saw the macro "macro" followed by extra stuff. */
    main_buffer = YY_CURRENT_BUFFER;
    expansion_buffer = yy_scan_string(expand(yytext));
    yy_switch_to_buffer(expansion_buffer);
}

<<EOF>> {
    if ( expansion_buffer )
    {
        // We were doing an expansion, return to where
        // we were.
        yy_switch_to_buffer(main_buffer);
        yy_delete_buffer(expansion_buffer);
        expansion_buffer = 0;
    }
    else
        yyterminate();
}
@end verbatim
@end example

You probably will want a stack of expansion buffers to allow nested macros.
Hopefully the idea is clear from the above, though.

@node How can I build a two-pass scanner?
@unnumberedsec How can I build a two-pass scanner?

One way to do it is to filter the first pass to a temporary file,
then process the temporary file on the second pass. You will probably see a
performance hit, due to all the disk I/O.

When you need to look ahead far forward like this, it almost always means
that the right solution is to build a parse tree of the entire input, then
walk it after the parse in order to generate the output. In a sense, this
is a two-pass approach, once through the text and once through the parse
tree, but the performance hit for the latter is usually an order of magnitude
smaller, since everything is already classified, in binary format, and
residing in memory.

@node How do I match any string not matched in the preceding rules?
@unnumberedsec How do I match any string not matched in the preceding rules?

One way to assign precedence is to place the more specific rules first. If
two rules would match the same input (same sequence of characters) then the
first rule listed in the @code{flex} input wins, e.g.,

@example
@verbatim
%%
foo[a-zA-Z_]+    return FOO_ID;
bar[a-zA-Z_]+    return BAR_ID;
[a-zA-Z_]+       return GENERIC_ID;
@end verbatim
@end example

Note that the rule @code{[a-zA-Z_]+} must come @emph{after} the others. It will match the
same amount of text as the more specific rules, and in that case the
@code{flex} scanner will pick the first rule listed in your scanner as the
one to match.

@node I am trying to port code from AT&T lex that uses yysptr and yysbuf.
@unnumberedsec I am trying to port code from AT&T lex that uses yysptr and yysbuf.

Those are internal variables pointing into the AT&T scanner's input buffer. I
imagine they're being manipulated in user versions of the @code{input()} and @code{unput()}
functions. If so, what you need to do is analyze those functions to figure out
what they're doing, and then replace @code{input()} with an appropriate definition of
@code{YY_INPUT}. You shouldn't need to (and must not) replace
@code{flex}'s @code{unput()} function.
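
As a rough sketch of what such a @code{YY_INPUT} replacement might look like,
the fragment below reads from an in-memory string. The input source and the
names @code{src_text}, @code{src_pos} and @code{my_read()} are assumptions
made for illustration only; in a real port they would be replaced by whatever
the old @code{input()} function actually read from. In a scanner, the
definitions would go in the @code{%@{ ... %@}} block of the definitions section.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical input source standing in for whatever the old
 * input() routine read from: here, a NUL-terminated string. */
const char *src_text = "";
size_t      src_pos;

/* Copy up to max_size bytes into buf.
 * Returns the number of bytes copied, or 0 at end of input. */
size_t my_read(char *buf, size_t max_size)
{
    size_t left = strlen(src_text) - src_pos;
    size_t n = left < max_size ? left : max_size;

    memcpy(buf, src_text + src_pos, n);
    src_pos += n;
    return n;
}

/* The scanner invokes YY_INPUT(buf, result, max_size) to refill its
 * buffer. result must be set to the number of bytes placed in buf,
 * or to 0 (YY_NULL) to signal end of file. */
#define YY_INPUT(buf, result, max_size) \
    do { (result) = (int) my_read((buf), (size_t) (max_size)); } while (0)
```

Pushback is left entirely to the scanner's own @code{unput()}, so nothing
corresponding to the old @code{yysptr}/@code{yysbuf} bookkeeping is needed here.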

@node Is there a way to make flex treat NULL like a regular character?
@unnumberedsec Is there a way to make flex treat NULL like a regular character?

Yes, @samp{\0} and @samp{\x00} should both do the trick. Perhaps you have an ancient
version of @code{flex}. The latest release is version @value{VERSION}.

@node Whenever flex can not match the input it says "flex scanner jammed".
@unnumberedsec Whenever flex can not match the input it says "flex scanner jammed".

You need to add a rule that matches the otherwise-unmatched text,
e.g.,

@example
@verbatim
%option yylineno
%%
[[a bunch of rules here]]

.       printf("bad input character '%s' at line %d\n", yytext, yylineno);
@end verbatim
@end example

See @code{%option default} for more information.

@node Why doesn't flex have non-greedy operators like perl does?
@unnumberedsec Why doesn't flex have non-greedy operators like perl does?

A DFA can do a non-greedy match by stopping
the first time it enters an accepting state, instead of consuming input until
it determines that no further matching is possible (a ``jam'' state). This
is actually easier to implement than longest leftmost match (which flex does).

But it's also much less useful than longest leftmost match. In general,
when you find yourself wishing for non-greedy matching, that's usually a
sign that you're trying to make the scanner do some parsing. That's
generally the wrong approach, since the scanner lacks the power to do a decent job.
Better is to either introduce a separate parser, or to split the scanner
into multiple scanners using (exclusive) start conditions.

You might have
a separate start state once you've seen the @samp{BEGIN}.
In that state, you
might then have a regex that will match @samp{END} (to kick you out of the
state), and perhaps @samp{(.|\n)} to get a single character within the chunk ...

This approach also has much better error-reporting properties.

@node Memory leak - 16386 bytes allocated by malloc.
@unnumberedsec Memory leak - 16386 bytes allocated by malloc.
@anchor{faq-memory-leak}

UPDATED 2002-07-10: As of @code{flex} version 2.5.9, this leak means that you did not
call @code{yylex_destroy()}. If you are using an earlier version of @code{flex}, then read
on.

The leak is about 16426 bytes. That is, (8192 * 2 + 2) for the read-buffer, and
about 40 for @code{struct yy_buffer_state} (depending upon alignment). The leak is in
the non-reentrant C scanner only (NOT in the reentrant scanner, NOT in the C++
scanner). Since @code{flex} doesn't know when you are done, the buffer is never freed.

However, the leak won't multiply since the buffer is reused no matter how many
times you call @code{yylex()}.

If you want to reclaim the memory when you are completely done scanning, then
you might try this:

@example
@verbatim
/* For non-reentrant C scanner only. */
yy_delete_buffer(YY_CURRENT_BUFFER);
yy_init = 1;
@end verbatim
@end example

Note: @code{yy_init} is an ``internal variable'', and hasn't been tested in this
situation. It is possible that some other globals may need resetting as well.

@node How do I track the byte offset for lseek()?
@unnumberedsec How do I track the byte offset for lseek()?

@example
@verbatim
>   We thought that it would be possible to have this number through the
>   evaluation of the following expression:
>
>   seek_position = (no_buffers)*YY_READ_BUF_SIZE + yy_c_buf_p - YY_CURRENT_BUFFER->yy_ch_buf
@end verbatim
@end example

While this is the right idea, it has two problems. The first is that
it's possible that @code{flex} will request less than @code{YY_READ_BUF_SIZE} during
an invocation of @code{YY_INPUT} (or that your input source will return less
even though @code{YY_READ_BUF_SIZE} bytes were requested). The second problem
is that when refilling its internal buffer, @code{flex} keeps some characters
from the previous buffer (because usually it's in the middle of a match,
and needs those characters to construct @code{yytext} for the match once it's
done). Because of this, @code{yy_c_buf_p - YY_CURRENT_BUFFER->yy_ch_buf} won't
be exactly the number of characters already read from the current buffer.

An alternative solution is to count the number of characters you've matched
since starting to scan. This can be done by using @code{YY_USER_ACTION}. For
example,

@example
@verbatim
#define YY_USER_ACTION num_chars += yyleng;
@end verbatim
@end example

(You need to be careful to update your bookkeeping if you use @code{yymore()},
@code{yyless()}, @code{unput()}, or @code{input()}.)

@node How do I use my own I/O classes in a C++ scanner?
@unnumberedsec How do I use my own I/O classes in a C++ scanner?

When the flex C++ scanning class rewrite finally happens, then this sort of thing should become much easier.

@cindex LexerOutput, overriding
@cindex LexerInput, overriding
@cindex overriding LexerOutput
@cindex overriding LexerInput
@cindex customizing I/O in C++ scanners
@cindex C++ I/O, customizing
You can do this by passing the various functions (such as @code{LexerInput()}
and @code{LexerOutput()}) NULL @code{iostream*}'s, and then
dealing with your own I/O classes surreptitiously (i.e., stashing them in
special member variables).  This works because the only assumption the
lexer makes about the iostreams is that they are
ultimately passed to @code{LexerInput()} and @code{LexerOutput()}, which then do whatever
is necessary with them.

@c faq edit stopped here
@node How do I skip as many chars as possible?
@unnumberedsec How do I skip as many chars as possible?

How do I skip as many chars as possible -- without interfering with the other
patterns?

In the example below, we want to skip over characters until we see the phrase
"endskip". The following will @emph{NOT} work correctly (do you see why not?)

@example
@verbatim
/* INCORRECT SCANNER */
%x SKIP
%%
<INITIAL>startskip   BEGIN(SKIP);
...
<SKIP>"endskip"       BEGIN(INITIAL);
<SKIP>.*             ;
@end verbatim
@end example

The problem is that the pattern @samp{.*} matches the rest of the line, so it
will eat up the word "endskip" along with everything around it.
The simplest (but slow) fix is:

@example
@verbatim
<SKIP>"endskip"      BEGIN(INITIAL);
<SKIP>.              ;
@end verbatim
@end example

A faster fix involves making the second rule match more, without
making it match "endskip" plus something else.  So for example:

@example
@verbatim
<SKIP>"endskip"     BEGIN(INITIAL);
<SKIP>[^e]+         ;
<SKIP>.             ;/* so you eat up e's, too */
@end verbatim
@end example

@c TODO: Evaluate this faq.
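
The reason the first scanner fails and the last one works follows from the
longest-match rule: among the rules active in a start condition, the one that
matches the most text wins, and a tie goes to the rule listed first.  The
following is a minimal C sketch of that rule for the @code{<SKIP>} patterns
above.  It is a toy model, not code generated by @code{flex}; the names
@code{match_endskip}, @code{match_dot_star}, @code{match_not_e_plus},
@code{match_dot}, @code{matcher} and @code{winning_rule} are hypothetical and
exist only for this illustration.

@example
@verbatim
#include <stddef.h>
#include <string.h>

/* Each function returns how many characters one <SKIP> pattern would
   match at the start of s; 0 means the pattern does not match there. */
size_t match_endskip(const char *s)      /* "endskip" */
{
    return strncmp(s, "endskip", 7) == 0 ? 7 : 0;
}

size_t match_dot_star(const char *s)     /* .*  (never crosses a newline) */
{
    return strcspn(s, "\n");
}

size_t match_not_e_plus(const char *s)   /* [^e]+ */
{
    return strcspn(s, "e");
}

size_t match_dot(const char *s)          /* .  */
{
    return (*s != '\0' && *s != '\n') ? 1 : 0;
}

typedef size_t (*matcher)(const char *);

/* Return the 1-based index of the winning rule, or 0 if none matches.
   The longest match wins; on a tie the rule listed first wins. */
int winning_rule(const matcher *rules, int nrules, const char *s)
{
    size_t best_len = 0;
    int best = 0;
    int i;

    for (i = 0; i < nrules; i++) {
        size_t len = rules[i](s);
        if (len > best_len) {   /* strict '>' keeps the earlier rule on ties */
            best_len = len;
            best = i + 1;
        }
    }
    return best;
}
@end verbatim
@end example

With the rule list @code{@{match_endskip, match_dot_star@}} and the input
@samp{endskip tail}, the second rule wins because it covers the whole line.
With @code{@{match_endskip, match_not_e_plus, match_dot@}} the first rule wins
on the same input, since neither of the other two can match more than one
character starting at an @samp{e}.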
@node deleteme00
@unnumberedsec deleteme00
@example
@verbatim
QUESTION:
When was flex born?

Vern Paxson took over
the Software Tools lex project from Jef Poskanzer in 1982.  At that point it
was written in Ratfor.  Around 1987 or so, Paxson translated it into C, and
a legend was born :-).
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Are certain equivalent patterns faster than others?
@unnumberedsec Are certain equivalent patterns faster than others?
@example
@verbatim
To: Adoram Rogel <adoram@orna.hybridge.com>
Subject: Re: Flex 2.5.2 performance questions
In-reply-to: Your message of Wed, 18 Sep 96 11:12:17 EDT.
Date: Wed, 18 Sep 96 10:51:02 PDT
From: Vern Paxson <vern>

[Note, the most recent flex release is 2.5.4, which you can get from
ftp.ee.lbl.gov.  It has bug fixes over 2.5.2 and 2.5.3.]

> 1. Using the pattern
>    ([Ff](oot)?)?[Nn](ote)?(\.)?
>    instead of
>    (((F|f)oot(N|n)ote)|((N|n)ote)|((N|n)\.)|((F|f)(N|n)(\.)))
>    (in a very complicated flex program) caused the program to slow from
>    300K+/min to 100K/min (no other changes were done).

These two are not equivalent.  For example, the first can match "footnote."
but the second can only match "footnote".  This is almost certainly the
cause of the discrepancy - the slower scanner run is matching more tokens,
and/or having to do more backing up.

> 2. Which of these two are better: [Ff]oot or (F|f)oot ?

From a performance point of view, they're equivalent (modulo presumably
minor effects such as memory cache hit rates; and the presence of trailing
context, see below).  From a space point of view, the first is slightly
preferable.

> 3. I have a pattern that look like this:
>    pats {p1}|{p2}|{p3}|...|{p50}     (50 patterns ORd)
>
>    running yet another complicated program that includes the following rule:
>    <snext>{and}/{no4}{bb}{pats}
>
>    gets me to "too complicated - over 32,000 states"...

I can't tell from this example whether the trailing context is variable-length
or fixed-length (it could be the latter if {and} is fixed-length).  If it's
variable length, which flex -p will tell you, then this reflects a basic
performance problem, and if you can eliminate it by restructuring your
scanner, you will see significant improvement.

>    so I divided {pats} to {pats1}, {pats2},..., {pats5} each consists of about
>    10 patterns and changed the rule to be 5 rules.
>    This did compile, but what is the rule of thumb here ?

The rule is to avoid trailing context other than fixed-length, in which for
a/b, either the 'a' pattern or the 'b' pattern has a fixed length.  Use
of the '|' operator automatically makes the pattern variable length, so in
this case '[Ff]oot' is preferred to '(F|f)oot'.

> 4. I changed a rule that looked like this:
>    <snext8>{and}{bb}/{ROMAN}[^A-Za-z] { BEGIN...
>
>    to the next 2 rules:
>    <snext8>{and}{bb}/{ROMAN}[A-Za-z] { ECHO;}
>    <snext8>{and}{bb}/{ROMAN}         { BEGIN...
>
>    Again, I understand the using [^...] will cause a great performance loss

Actually, it doesn't cause any sort of performance loss.  It's a surprising
fact about regular expressions that they always match in linear time
regardless of how complex they are.

>    but are there any specific rules about it ?

See the "Performance Considerations" section of the man page, and also
the example in MISC/fastwc/.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Is backing up a big deal?
@unnumberedsec Is backing up a big deal?
@example
@verbatim
To: Adoram Rogel <adoram@hybridge.com>
Subject: Re: Flex 2.5.2 performance questions
In-reply-to: Your message of Thu, 19 Sep 96 10:16:04 EDT.
Date: Thu, 19 Sep 96 09:58:00 PDT
From: Vern Paxson <vern>

> a lot about the backing up problem.
> I believe that there lies my biggest problem, and I'll try to improve
> it.

Since you have variable trailing context, this is a bigger performance
problem.  Fixing it is usually easier than fixing backing up, which in a
complicated scanner (yours seems to fit the bill) can be extremely
difficult to do correctly.

You also don't mention what flags you are using for your scanner.
-f makes a large speed difference, and -Cfe buys you nearly as much
speed but the resulting scanner is considerably smaller.

> I have an | operator in {and} and in {pats} so both of them are variable
> length.

-p should have reported this.

> Is changing one of them to fixed-length is enough ?

Yes.

> Is it possible to change the 32,000 states limit ?

Yes.  I've appended instructions on how.  Before you make this change,
though, you should think about whether there are ways to fundamentally
simplify your scanner - those are certainly preferable!

                Vern

To increase the 32K limit (on a machine with 32 bit integers), you increase
the magnitude of the following in flexdef.h:

#define JAMSTATE -32766 /* marks a reference to the state that always jams */
#define MAXIMUM_MNS 31999
#define BAD_SUBSCRIPT -32767
#define MAX_SHORT 32700

Adding a 0 or two after each should do the trick.
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Can I fake multi-byte character support?
@unnumberedsec Can I fake multi-byte character support?
@example
@verbatim
To: Heeman_Lee@hp.com
Subject: Re: flex - multi-byte support?
In-reply-to: Your message of Thu, 03 Oct 1996 17:24:04 PDT.
Date: Fri, 04 Oct 1996 11:42:18 PDT
From: Vern Paxson <vern>

> I assume as long as my *.l file defines the
> range of expected character code values (in octal format), flex will
> scan the file and read multi-byte characters correctly. But I have no
> confidence in this assumption.

Your lack of confidence is justified - this won't work.

Flex has in it a widespread assumption that the input is processed
one byte at a time.  Fixing this is on the to-do list, but is involved,
so it won't happen any time soon.  In the interim, the best I can suggest
(unless you want to try fixing it yourself) is to write your rules in
terms of pairs of bytes, using definitions in the first section:

        X       \xfe\xc2
        ...
        %%
        foo{X}bar       found_foo_fe_c2_bar();

etc.  Definitely a pain - sorry about that.

By the way, the email address you used for me is ancient, indicating you
have a very old version of flex.  You can get the most recent, 2.5.4, from
ftp.ee.lbl.gov.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node deleteme01
@unnumberedsec deleteme01
@example
@verbatim
To: moleary@primus.com
Subject: Re: Flex / Unicode compatibility question
In-reply-to: Your message of Tue, 22 Oct 1996 10:15:42 PDT.
Date: Tue, 22 Oct 1996 11:06:13 PDT
From: Vern Paxson <vern>

Unfortunately flex at the moment has a widespread assumption within it
that characters are processed 8 bits at a time.  I don't see any easy
fix for this (other than writing your rules in terms of double characters -
a pain).  I also don't know of a wider lex, though you might try surfing
the Plan 9 stuff because I know it's a Unicode system, and also the PCCT
toolkit (try searching say Alta Vista for "Purdue Compiler Construction
Toolkit").

Fixing flex to handle wider characters is on the long-term to-do list.
But since flex is a strictly spare-time project these days, this probably
won't happen for quite a while, unless someone else does it first.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Can you discuss some flex internals?
@unnumberedsec Can you discuss some flex internals?
@example
@verbatim
To: Johan Linde <jl@theophys.kth.se>
Subject: Re: translation of flex
In-reply-to: Your message of Sun, 10 Nov 1996 09:16:36 PST.
Date: Mon, 11 Nov 1996 10:33:50 PST
From: Vern Paxson <vern>

> I'm working for the Swedish team translating GNU program, and I'm currently
> working with flex. I have a few questions about some of the messages which
> I hope you can answer.

All of the things you're wondering about, by the way, concerning flex
internals - probably the only person who understands what they mean in
English is me!  So I wouldn't worry too much about getting them right.
That said ...

> #: main.c:545
> msgid "  %d protos created\n"
>
> Does proto mean prototype?

Yes - prototypes of state compression tables.

> #: main.c:539
> msgid "  %d/%d (peak %d) template nxt-chk entries created\n"
>
> Here I'm mainly puzzled by 'nxt-chk'. I guess it means 'next-check'. (?)
> However, 'template next-check entries' doesn't make much sense to me. To be
> able to find a good translation I need to know a little bit more about it.

There is a scheme in the Aho/Sethi/Ullman compiler book for compressing
scanner tables.  It involves creating two pairs of tables.  The first has
"base" and "default" entries, the second has "next" and "check" entries.
The "base" entry is indexed by the current state and yields an index into
the next/check table.  The "default" entry gives what to do if the state
transition isn't found in next/check.  The "next" entry gives the next
state to enter, but only if the "check" entry verifies that this entry is
correct for the current state.  Flex creates templates of series of
next/check entries and then encodes differences from these templates as a
way to compress the tables.

> #: main.c:533
> msgid "  %d/%d base-def entries created\n"
>
> The same problem here for 'base-def'.

See above.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unput() messes up yy_at_bol
@unnumberedsec unput() messes up yy_at_bol
@example
@verbatim
To: Xinying Li <xli@npac.syr.edu>
Subject: Re: FLEX ?
In-reply-to: Your message of Wed, 13 Nov 1996 17:28:38 PST.
Date: Wed, 13 Nov 1996 19:51:54 PST
From: Vern Paxson <vern>

> "unput()" them to input flow, question occurs. If I do this after I scan
> a carriage, the variable "YY_CURRENT_BUFFER->yy_at_bol" is changed. That
> means the carriage flag has gone.

You can control this by calling yy_set_bol().  It's described in the manual.

> And if in pre-reading it goes to the end of file, is anything done
> to control the end of curren buffer and end of file?

No, there's no way to put back an end-of-file.

> By the way I am using flex 2.5.2 and using the "-l".

The latest release is 2.5.4, by the way.  It fixes some bugs in 2.5.2 and
2.5.3.  You can get it from ftp.ee.lbl.gov.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node The | operator is not doing what I want
@unnumberedsec The | operator is not doing what I want
@example
@verbatim
To: Alain.ISSARD@st.com
Subject: Re: Start condition with FLEX
In-reply-to: Your message of Mon, 18 Nov 1996 09:45:02 PST.
Date: Mon, 18 Nov 1996 10:41:34 PST
From: Vern Paxson <vern>

> I am not able to use the start condition scope and to use the | (OR) with
> rules having start conditions.

The problem is that if you use '|' as a regular expression operator, for
example "a|b" meaning "match either 'a' or 'b'", then it must *not* have
any blanks around it.  If you instead want the special '|' *action* (which
from your scanner appears to be the case), which is a way of giving two
different rules the same action:

        foo     |
        bar     matched_foo_or_bar();

then '|' *must* be separated from the first rule by whitespace and *must*
be followed by a new line.  You *cannot* write it as:

        foo | bar       matched_foo_or_bar();

even though you might think you could because yacc supports this syntax.
The reason for this unfortunate incompatibility is historical, but it's
unlikely to be changed.

Your problems with start condition scope are simply due to syntax errors
from your use of '|' later confusing flex.

Let me know if you still have problems.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Why can't flex understand this variable trailing context pattern?
@unnumberedsec Why can't flex understand this variable trailing context pattern?
@example
@verbatim
To: Gregory Margo <gmargo@newton.vip.best.com>
Subject: Re: flex-2.5.3 bug report
In-reply-to: Your message of Sat, 23 Nov 1996 16:50:09 PST.
Date: Sat, 23 Nov 1996 17:07:32 PST
From: Vern Paxson <vern>

> Enclosed is a lex file that "real" lex will process, but I cannot get
> flex to process it.  Could you try it and maybe point me in the right direction?

Your problem is that some of the definitions in the scanner use the '/'
trailing context operator, and have it enclosed in ()'s.  Flex does not
allow this operator to be enclosed in ()'s because doing so allows undefined
regular expressions such as "(a/b)+".  So the solution is to remove the
parentheses.  Note that you must also be building the scanner with the -l
option for AT&T lex compatibility.  Without this option, flex automatically
encloses the definitions in parentheses.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node The ^ operator isn't working
@unnumberedsec The ^ operator isn't working
@example
@verbatim
To: Thomas Hadig <hadig@toots.physik.rwth-aachen.de>
Subject: Re: Flex Bug ?
In-reply-to: Your message of Tue, 26 Nov 1996 14:35:01 PST.
Date: Tue, 26 Nov 1996 11:15:05 PST
From: Vern Paxson <vern>

> In my lexer code, i have the line :
> ^\*.*          { }
>
> Thus all lines starting with an astrix (*) are comment lines.
> This does not work !

I can't get this problem to reproduce - it works fine for me.  Note
though that if what you have is slightly different:

        COMMENT ^\*.*
        %%
        {COMMENT}       { }

then it won't work, because flex pushes back macro definitions enclosed
in ()'s, so the rule becomes

        (^\*.*)         { }

and now that the '^' operator is not at the immediate beginning of the
line, it's interpreted as just a regular character.  You can avoid this
behavior by using the "-l" lex-compatibility flag, or "%option lex-compat".

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
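
The effect described in the message above can be pictured with a small C
sketch.  It is only a toy model of the textual expansion as that message
describes it, not the code @code{flex} uses, and the helper names
@code{expand_definition} and @code{is_bol_anchored} are hypothetical.  The
@code{lex_compat} argument stands in for the @samp{-l} flag; the manual's
discussion of incompatibilities with @code{lex} remains the authority on when
parentheses are actually added.

@example
@verbatim
#include <stdio.h>

/* Toy model: write into out the pattern text that results from using a
   definition body as an entire rule.  Without lex compatibility the
   body is wrapped in parentheses; with it, the body is copied as is.
   Returns the value of snprintf(). */
int expand_definition(char *out, size_t outsize,
                      const char *body, int lex_compat)
{
    return snprintf(out, outsize, lex_compat ? "%s" : "(%s)", body);
}

/* In this model '^' anchors to the beginning of a line only when it is
   the very first character of the rule's pattern. */
int is_bol_anchored(const char *pattern)
{
    return pattern[0] == '^';
}
@end verbatim
@end example

Expanding the body @samp{^\*.*} with @code{lex_compat} set to 0 yields
@samp{(^\*.*)}, which @code{is_bol_anchored} rejects because the pattern now
starts with a parenthesis; with @code{lex_compat} set to 1 the body is
unchanged and the anchor is kept.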
@node Trailing context is getting confused with trailing optional patterns
@unnumberedsec Trailing context is getting confused with trailing optional patterns
@example
@verbatim
To: Adoram Rogel <adoram@hybridge.com>
Subject: Re: Flex 2.5.4 BOF ???
In-reply-to: Your message of Tue, 26 Nov 1996 16:10:41 PST.
Date: Wed, 27 Nov 1996 10:56:25 PST
From: Vern Paxson <vern>

>     Organization(s)?/[a-z]
>
> This matched "Organizations" (looking in debug mode, the trailing s
> was matched with trailing context instead of the optional (s) in the
> end of the word.

That should only happen with lex.  Flex can properly match this pattern.
(That might be what you're saying, I'm just not sure.)

> Is there a way to avoid this dangerous trailing context problem ?

Unfortunately, there's no easy way.  On the other hand, I don't see why
it should be a problem.  Lex's matching is clearly wrong, and I'd hope
that usually the intent remains the same as expressed with the pattern,
so flex's matching will be correct.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Is flex GNU or not?
@unnumberedsec Is flex GNU or not?
@example
@verbatim
To: Cameron MacKinnon <mackin@interlog.com>
Subject: Re: Flex documentation bug
In-reply-to: Your message of Mon, 02 Dec 1996 00:07:08 PST.
Date: Sun, 01 Dec 1996 22:29:39 PST
From: Vern Paxson <vern>

> I'm not sure how or where to submit bug reports (documentation or
> otherwise) for the GNU project stuff ...

Well, strictly speaking flex isn't part of the GNU project.  They just
distribute it because no one's written a decent GPL'd lex replacement.
So you should send bugs directly to me.  Those sent to the GNU folks
sometimes find their way to me, but some may drop between the cracks.

> In GNU Info, under the section 'Start Conditions', and also in the man
> page (mine's dated April '95) is a nice little snippet showing how to
> parse C quoted strings into a buffer, defined to be MAX_STR_CONST in
> size. Unfortunately, no overflow checking is ever done ...

This is already mentioned in the manual:

Finally, here's an example of how to  match  C-style  quoted
strings using exclusive start conditions, including expanded
escape sequences (but not including checking  for  a  string
that's too long):

The reason for not doing the overflow checking is that it will needlessly
clutter up an example whose main purpose is just to demonstrate how to
use flex.

The latest release is 2.5.4, by the way, available from ftp.ee.lbl.gov.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node ERASEME53
@unnumberedsec ERASEME53
@example
@verbatim
To: tsv@cs.UManitoba.CA
Subject: Re: Flex (reg)..
In-reply-to: Your message of Thu, 06 Mar 1997 23:50:16 PST.
Date: Thu, 06 Mar 1997 15:54:19 PST
From: Vern Paxson <vern>

> [:alpha:] ([:alnum:] | \\_)*

If your rule really has embedded blanks as shown above, then it won't
work, as the first blank delimits the rule from the action.  (It wouldn't
even compile ...)  You need instead:

[:alpha:]([:alnum:]|\\_)*

and that should work fine - there's no restriction on what can go inside
of ()'s except for the trailing context operator, '/'.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node I need to scan if-then-else blocks and while loops
@unnumberedsec I need to scan if-then-else blocks and while loops
@example
@verbatim
To: "Mike Stolnicki" <mstolnic@ford.com>
Subject: Re: FLEX help
In-reply-to: Your message of Fri, 30 May 1997 13:33:27 PDT.
Date: Fri, 30 May 1997 10:46:35 PDT
From: Vern Paxson <vern>

> We'd like to add "if-then-else", "while", and "for" statements to our
> language ...
> We've investigated many possible solutions.  The one solution that seems
> the most reasonable involves knowing the position of a TOKEN in yyin.

I strongly advise you to instead build a parse tree (abstract syntax tree)
and loop over that instead.  You'll find this has major benefits in keeping
your interpreter simple and extensible.

That said, the functionality you mention for get_position and set_position
have been on the to-do list for a while.  As flex is a purely spare-time
project for me, no guarantees when this will be added (in particular, it
for sure won't be for many months to come).

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node ERASEME55
@unnumberedsec ERASEME55
@example
@verbatim
To: Colin Paul Adams <colin@colina.demon.co.uk>
Subject: Re: Flex C++ classes and Bison
In-reply-to: Your message of 09 Aug 1997 17:11:41 PDT.
Date: Fri, 15 Aug 1997 10:48:19 PDT
From: Vern Paxson <vern>

> #define YY_DECL   int yylex (YYSTYPE *lvalp, struct parser_control
> *parm)
>
> I have been trying  to get this to work as a C++ scanner, but it does
> not appear to be possible (warning that it matches no declarations in
> yyFlexLexer, or something like that).
>
> Is this supposed to be possible, or is it being worked on (I DID
> notice the comment that scanner classes are still experimental, so I'm
> not too hopeful)?

What you need to do is derive a subclass from yyFlexLexer that provides
the above yylex() method, squirrels away lvalp and parm into member
variables, and then invokes yyFlexLexer::yylex() to do the regular scanning.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node ERASEME56
@unnumberedsec ERASEME56
@example
@verbatim
To: Mikael.Latvala@lmf.ericsson.se
Subject: Re: Possible mistake in Flex v2.5 document
In-reply-to: Your message of Fri, 05 Sep 1997 16:07:24 PDT.
Date: Fri, 05 Sep 1997 10:01:54 PDT
From: Vern Paxson <vern>

> In that example you show how to count comment lines when using
> C style /* ... */ comments. My question is, shouldn't you take into
> account a scenario where end of a comment marker occurs inside
> character or string literals?

The scanner certainly needs to also scan character and string literals.
However it does that (there's an example in the man page for strings), the
lexer will recognize the beginning of the literal before it runs across the
embedded "/*".  Consequently, it will finish scanning the literal before it
even considers the possibility of matching "/*".

Example:

        '([^']*|{ESCAPE_SEQUENCE})'

will match all the text between the ''s (inclusive).  So the lexer
considers this as a token beginning at the first ', and doesn't even
attempt to match other tokens inside it.

I think this subtlety is not worth putting in the manual, as I suspect
it would confuse more people than it would enlighten.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node ERASEME57
@unnumberedsec ERASEME57
@example
@verbatim
To: "Marty Leisner" <leisner@sdsp.mc.xerox.com>
Subject: Re: flex limitations
In-reply-to: Your message of Sat, 06 Sep 1997 11:27:21 PDT.
Date: Mon, 08 Sep 1997 11:38:08 PDT
From: Vern Paxson <vern>

> %%
> [a-zA-Z]+       /* skip a line */
>                {  printf("got %s\n", yytext); }
> %%

What version of flex are you using?  If I feed this to 2.5.4, it complains:

        "bug.l", line 5: EOF encountered inside an action
        "bug.l", line 5: unrecognized rule
        "bug.l", line 5: fatal parse error

Not the world's greatest error message, but it manages to flag the problem.

(With the introduction of start condition scopes, flex can't accommodate
an action on a separate line, since it's ambiguous with an indented rule.)

You can get 2.5.4 from ftp.ee.lbl.gov.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node Is there a repository for flex scanners?
@unnumberedsec Is there a repository for flex scanners?

Not that we know of.  You might try asking on comp.compilers.

@c TODO: Evaluate this faq.
@node How can I conditionally compile or preprocess my flex input file?
@unnumberedsec How can I conditionally compile or preprocess my flex input file?

Flex doesn't have a preprocessor like C does.  You might try using m4, or the C
preprocessor plus a sed script to clean up the result.

@c TODO: Evaluate this faq.
@node Where can I find grammars for lex and yacc?
@unnumberedsec Where can I find grammars for lex and yacc?

In the sources for flex and bison.

@c TODO: Evaluate this faq.
@node I get an end-of-buffer message for each character scanned.
@unnumberedsec I get an end-of-buffer message for each character scanned.

This will happen if your LexerInput() function returns only one character
at a time, which can happen either if your scanner is "interactive", or
if the streams library on your platform always returns 1 for yyin->gcount().

Solution: override LexerInput() with a version that returns whole buffers.

@c TODO: Evaluate this faq.
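
The difference between the two reading strategies can be sketched in plain C.
This is an illustration under stated assumptions, not a replacement for
@code{LexerInput()}, which is a C++ member function: the helpers
@code{read_whole_buffer} and @code{read_one_char} are hypothetical, they read
from a @code{FILE *} rather than an @code{istream}, and they only borrow the
buffer-plus-maximum-size shape of the call.

@example
@verbatim
#include <stdio.h>

/* Hypothetical helper: fill buf with up to max_size bytes from fp in a
   single call and return the number of bytes stored (0 at end of file). */
int read_whole_buffer(FILE *fp, char *buf, int max_size)
{
    if (max_size <= 0)
        return 0;
    return (int) fread(buf, 1, (size_t) max_size, fp);
}

/* The one-character-at-a-time behavior described above: every call
   hands back at most one byte, however much room buf has. */
int read_one_char(FILE *fp, char *buf, int max_size)
{
    int c;

    if (max_size <= 0 || (c = getc(fp)) == EOF)
        return 0;
    buf[0] = (char) c;
    return 1;
}
@end verbatim
@end example

A scanner fed by something like @code{read_one_char} reaches the end of its
buffer after every character, which is what produces one end-of-buffer message
per character in debugging output; a reader shaped like
@code{read_whole_buffer} reaches it once per block instead.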
@node unnamed-faq-62
@unnumberedsec unnamed-faq-62
@example
@verbatim
To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE
Subject: Re: Flex maximums
In-reply-to: Your message of Mon, 17 Nov 1997 17:16:06 PST.
Date: Mon, 17 Nov 1997 17:16:15 PST
From: Vern Paxson <vern>

> I took a quick look into the flex-sources and altered some #defines in
> flexdefs.h:
>
>       #define INITIAL_MNS 64000
>       #define MNS_INCREMENT 1024000
>       #define MAXIMUM_MNS 64000

The things to fix are to add a couple of zeroes to:

#define JAMSTATE -32766 /* marks a reference to the state that always jams */
#define MAXIMUM_MNS 31999
#define BAD_SUBSCRIPT -32767
#define MAX_SHORT 32700

and, if you get complaints about too many rules, make the following change too:

        #define YY_TRAILING_MASK 0x200000
        #define YY_TRAILING_HEAD_MASK 0x400000

- Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-63
@unnumberedsec unnamed-faq-63
@example
@verbatim
To: jimmey@lexis-nexis.com (Jimmey Todd)
Subject: Re: FLEX question regarding istream vs ifstream
In-reply-to: Your message of Mon, 08 Dec 1997 15:54:15 PST.
Date: Mon, 15 Dec 1997 13:21:35 PST
From: Vern Paxson <vern>

>         stdin_handle = YY_CURRENT_BUFFER;
>         ifstream fin( "aFile" );
>         yy_switch_to_buffer( yy_create_buffer( fin, YY_BUF_SIZE ) );
>
> What I'm wanting to do, is pass the contents of a file thru one set
> of rules and then pass stdin thru another set... It works great if, I
> don't use the C++ classes. But since everything else that I'm doing is
> in C++, I thought I'd be consistent.
>
> The problem is that 'yy_create_buffer' is expecting an istream* as it's
> first argument (as stated in the man page). However, fin is a ifstream
> object. Any ideas on what I might be doing wrong? Any help would be
> appreciated. Thanks!!

You need to pass &fin, to turn it into an ifstream* instead of an ifstream.
Then its type will be compatible with the expected istream*, because ifstream
is derived from istream.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-64
@unnumberedsec unnamed-faq-64
@example
@verbatim
To: Enda Fadian <fadiane@piercom.ie>
Subject: Re: Question related to Flex man page?
In-reply-to: Your message of Tue, 16 Dec 1997 15:17:34 PST.
Date: Tue, 16 Dec 1997 14:17:09 PST
From: Vern Paxson <vern>

> Can you explain to me what is ment by a long-jump in relation to flex?

Using the longjmp() function while inside yylex() or a routine called by it.

> what is the flex activation frame.

Just yylex()'s stack frame.

> As far as I can see yyrestart will bring me back to the sart of the input
> file and using flex++ isnot really an option!

No, yyrestart() doesn't imply a rewind, even though its name might sound
like it does.  It tells the scanner to flush its internal buffers and
start reading from the given file at its present location.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-65
@unnumberedsec unnamed-faq-65
@example
@verbatim
To: hassan@larc.info.uqam.ca (Hassan Alaoui)
Subject: Re: Need urgent Help
In-reply-to: Your message of Sat, 20 Dec 1997 19:38:19 PST.
Date: Sun, 21 Dec 1997 21:30:46 PST
From: Vern Paxson <vern>

> /usr/lib/yaccpar: In function `int yyparse()':
> /usr/lib/yaccpar:184: warning: implicit declaration of function `int yylex(...)'
>
> ld: Undefined symbol
>    _yylex
>    _yyparse
>    _yyin

This is a known problem with Solaris C++ (and/or Solaris yacc).  I believe
the fix is to explicitly insert some 'extern "C"' statements for the
corresponding routines/symbols.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-66
@unnumberedsec unnamed-faq-66
@example
@verbatim
To: mc0307@mclink.it
Cc: gnu@prep.ai.mit.edu
Subject: Re: [mc0307@mclink.it: Help request]
In-reply-to: Your message of Fri, 12 Dec 1997 17:57:29 PST.
Date: Sun, 21 Dec 1997 22:33:37 PST
From: Vern Paxson <vern>

> This is my definition for float and integer types:
> . . .
> NZD          [1-9]
> ...
> I've tested my program on other lex version (on UNIX Sun Solaris an HP
> UNIX) and it work well, so I think that my definitions are correct.
> There are any differences between Lex and Flex?

There are indeed differences, as discussed in the man page.  The one
you are probably running into is that when flex expands a name definition,
it puts parentheses around the expansion, while lex does not.  There's
an example in the man page of how this can lead to different matching.
Flex's behavior complies with the POSIX standard (or at least with the
last POSIX draft I saw).

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-67
@unnumberedsec unnamed-faq-67
@example
@verbatim
To: hassan@larc.info.uqam.ca (Hassan Alaoui)
Subject: Re: Thanks
In-reply-to: Your message of Mon, 22 Dec 1997 16:06:35 PST.
Date: Mon, 22 Dec 1997 14:35:05 PST
From: Vern Paxson <vern>

> Thank you very much for your help. I compile and link well with C++ while
> declaring 'yylex ...' extern, But a little problem remains. I get a
> segmentation default when executing ( I linked with lfl library) while it
> works well when using LEX instead of flex. Do you have some ideas about the
> reason for this ?

The one possible reason for this that comes to mind is if you've defined
yytext as "extern char yytext[]" (which is what lex uses) instead of
"extern char *yytext" (which is what flex uses). If it's not that, then
I'm afraid I don't know what the problem might be.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-68
@unnumberedsec unnamed-faq-68
@example
@verbatim
To: "Bart Niswonger" <NISWONGR@almaden.ibm.com>
Subject: Re: flex 2.5: c++ scanners & start conditions
In-reply-to: Your message of Tue, 06 Jan 1998 10:34:21 PST.
Date: Tue, 06 Jan 1998 19:19:30 PST
From: Vern Paxson <vern>

> The problem is that when I do this (using %option c++) start
> conditions seem to not apply.

The BEGIN macro modifies the yy_start variable. For C scanners, this
is a static with scope visible through the whole file. For C++ scanners,
it's a member variable, so it only has visible scope within a member
function. Your lexbegin() routine is not a member function when you
build a C++ scanner, so it's not modifying the correct yy_start. The
diagnostic that indicates this is that you found you needed to add
a declaration of yy_start in order to get your scanner to compile when
using C++; instead, the correct fix is to make lexbegin() a member
function (by deriving from yyFlexLexer).

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-69
@unnumberedsec unnamed-faq-69
@example
@verbatim
To: "Boris Zinin" <boris@ippe.rssi.ru>
Subject: Re: current position in flex buffer
In-reply-to: Your message of Mon, 12 Jan 1998 18:58:23 PST.
Date: Mon, 12 Jan 1998 12:03:15 PST
From: Vern Paxson <vern>

> The problem is how to determine the current position in flex active
> buffer when a rule is matched....

You will need to keep track of this explicitly, such as by redefining
YY_USER_ACTION to count the number of characters matched.

The latest flex release, by the way, is 2.5.4, available from ftp.ee.lbl.gov.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-70
@unnumberedsec unnamed-faq-70
@example
@verbatim
To: Bik.Dhaliwal@bis.org
Subject: Re: Flex question
In-reply-to: Your message of Mon, 26 Jan 1998 13:05:35 PST.
Date: Tue, 27 Jan 1998 22:41:52 PST
From: Vern Paxson <vern>

> That requirement involves knowing
> the character position at which a particular token was matched
> in the lexer.

The way you have to do this is by explicitly keeping track of where
you are in the file, by counting the number of characters scanned
for each token (available in yyleng). It may prove convenient to
do this by redefining YY_USER_ACTION, as described in the manual.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-71
@unnumberedsec unnamed-faq-71
@example
@verbatim
To: Vladimir Alexiev <vladimir@cs.ualberta.ca>
Subject: Re: flex: how to control start condition from parser?
In-reply-to: Your message of Mon, 26 Jan 1998 05:50:16 PST.
Date: Tue, 27 Jan 1998 22:45:37 PST
From: Vern Paxson <vern>

> It seems useful for the parser to be able to tell the lexer about such
> context dependencies, because then they don't have to be limited to
> local or sequential context.

One way to do this is to have the parser call a stub routine that's
included in the scanner's .l file, and consequently that has access to
BEGIN.
The only ugliness is that the parser can't pass in the state
it wants, because those aren't visible - but if you don't have many
such states, then using a different set of names doesn't seem like
too much of a burden.

While generating a .h file like you suggest is certainly cleaner,
flex development has come to a virtual stand-still :-(, so a workaround
like the above is much more pragmatic than waiting for a new feature.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-72
@unnumberedsec unnamed-faq-72
@example
@verbatim
To: Barbara Denny <denny@3com.com>
Subject: Re: freebsd flex bug?
In-reply-to: Your message of Fri, 30 Jan 1998 12:00:43 PST.
Date: Fri, 30 Jan 1998 12:42:32 PST
From: Vern Paxson <vern>

> lex.yy.c:1996: parse error before `='

This is the key, identifying this error. (It may help to pinpoint
it by using flex -L, so it doesn't generate #line directives in its
output.) I will bet you heavy money that you have a start condition
name that is also a variable name, or something like that; flex spits
out #define's for each start condition name, mapping them to a number,
so you can wind up with:

    %x foo
    %%
    ...
    %%
    void bar()
        {
        int foo = 3;
        }

and the penultimate line will turn into "int 1 = 3" after C preprocessing,
since flex will put "#define foo 1" in the generated scanner.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-73
@unnumberedsec unnamed-faq-73
@example
@verbatim
To: Maurice Petrie <mpetrie@infoscigroup.com>
Subject: Re: Lost flex .l file
In-reply-to: Your message of Mon, 02 Feb 1998 14:10:01 PST.
Date: Mon, 02 Feb 1998 11:15:12 PST
From: Vern Paxson <vern>

> I am curious as to
> whether there is a simple way to backtrack from the generated source to
> reproduce the lost list of tokens we are searching on.

In theory, it's straight-forward to go from the DFA representation
back to a regular-expression representation - the two are isomorphic.
In practice, a huge headache, because you have to unpack all the tables
back into a single DFA representation, and then write a program to munch
on that and translate it into an RE.

Sorry for the less-than-happy news ...

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-74
@unnumberedsec unnamed-faq-74
@example
@verbatim
To: jimmey@lexis-nexis.com (Jimmey Todd)
Subject: Re: Flex performance question
In-reply-to: Your message of Thu, 19 Feb 1998 11:01:17 PST.
Date: Thu, 19 Feb 1998 08:48:51 PST
From: Vern Paxson <vern>

> What I have found, is that the smaller the data chunk, the faster the
> program executes. This is the opposite of what I expected. Should this be
> happening this way?

This is exactly what will happen if your input file has embedded NULs.
From the man page:

    A final note: flex is slow when matching NUL's, particularly
    when a token contains multiple NUL's. It's best to write
    rules which match short amounts of text if it's anticipated
    that the text will often include NUL's.

So that's the first thing to look for.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-75
@unnumberedsec unnamed-faq-75
@example
@verbatim
To: jimmey@lexis-nexis.com (Jimmey Todd)
Subject: Re: Flex performance question
In-reply-to: Your message of Thu, 19 Feb 1998 11:01:17 PST.
Date: Thu, 19 Feb 1998 15:42:25 PST
From: Vern Paxson <vern>

So there are several problems.

First, to go fast, you want to match as much text as possible, which
your scanners don't in the case that what they're scanning is *not*
a <RN> tag. So you want a rule like:

    [^<]+

Second, C++ scanners are particularly slow if they're interactive,
which they are by default. Using -B speeds it up by a factor of 3-4
on my workstation.

Third, C++ scanners that use the istream interface are slow, because
of how poorly implemented istream's are. I built two versions of
the following scanner:

    %%
    .*\n
    .*
    %%

and the C version inhales a 2.5MB file on my workstation in 0.8 seconds.
The C++ istream version, using -B, takes 3.8 seconds.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-76
@unnumberedsec unnamed-faq-76
@example
@verbatim
To: "Frescatore, David (CRD, TAD)" <frescatore@exc01crdge.crd.ge.com>
Subject: Re: FLEX 2.5 & THE YEAR 2000
In-reply-to: Your message of Wed, 03 Jun 1998 11:26:22 PDT.
Date: Wed, 03 Jun 1998 10:22:26 PDT
From: Vern Paxson <vern>

> I am researching the Y2K problem with General Electric R&D
> and need to know if there are any known issues concerning
> the above mentioned software and Y2K regardless of version.

There shouldn't be, all it ever does with the date is ask the system
for it and then print it out.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-77
@unnumberedsec unnamed-faq-77
@example
@verbatim
To: "Hans Dermot Doran" <htd@ibhdoran.com>
Subject: Re: flex problem
In-reply-to: Your message of Wed, 15 Jul 1998 21:30:13 PDT.
Date: Tue, 21 Jul 1998 14:23:34 PDT
From: Vern Paxson <vern>

> To overcome this, I gets() the stdin into a string and lex the string. The
> string is lexed OK except that the end of string isn't lexed properly
> (yy_scan_string()), that is the lexer doesn't recognise the end of string.

Flex doesn't contain mechanisms for recognizing buffer endpoints. But if
you use fgets instead (which you should anyway, to protect against buffer
overflows), then the final \n will be preserved in the string, and you can
scan that in order to find the end of the string.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-78
@unnumberedsec unnamed-faq-78
@example
@verbatim
To: soumen@almaden.ibm.com
Subject: Re: Flex++ 2.5.3 instance member vs. static member
In-reply-to: Your message of Mon, 27 Jul 1998 02:10:04 PDT.
Date: Tue, 28 Jul 1998 01:10:34 PDT
From: Vern Paxson <vern>

> %{
> int mylineno = 0;
> %}
> ws      [ \t]+
> alpha   [A-Za-z]
> dig     [0-9]
> %%
>
> Now you'd expect mylineno to be a member of each instance of class
> yyFlexLexer, but is this the case? A look at the lex.yy.cc file seems to
> indicate otherwise; unless I am missing something the declaration of
> mylineno seems to be outside any class scope.
>
> How will this work if I want to run a multi-threaded application with each
> thread creating a FlexLexer instance?

Derive your own subclass and make mylineno a member variable of it.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-79
@unnumberedsec unnamed-faq-79
@example
@verbatim
To: Adoram Rogel <adoram@hybridge.com>
Subject: Re: More than 32K states change hangs
In-reply-to: Your message of Tue, 04 Aug 1998 16:55:39 PDT.
Date: Tue, 04 Aug 1998 22:28:45 PDT
From: Vern Paxson <vern>

> Vern Paxson,
>
> I followed your advice, posted on Usenet by you, and emailed to me
> personally by you, on how to overcome the 32K states limit. I'm running
> on Linux machines.
> I took the full source of version 2.5.4 and did the following changes in
> flexdef.h:
> #define JAMSTATE -327660
> #define MAXIMUM_MNS 319990
> #define BAD_SUBSCRIPT -327670
> #define MAX_SHORT 327000
>
> and compiled.
> All looked fine, including check and bigcheck, so I installed.

Hmmm, you shouldn't increase MAX_SHORT, though looking through my email
archives I see that I did indeed recommend doing so. Try setting it back
to 32700; that should suffice so that you no longer need -Ca. If it still
hangs, then the interesting question is - where?

> Compiling the same hanged program with a out-of-the-box (RedHat 4.2
> distribution of Linux)
> flex 2.5.4 binary works.

Since Linux comes with source code, you should diff it against what
you have to see what problems they missed.

> Should I always compile with the -Ca option now ? even short and simple
> filters ?

No, definitely not. It's meant to be for those situations where you
absolutely must squeeze every last cycle out of your scanner.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-80
@unnumberedsec unnamed-faq-80
@example
@verbatim
To: "Schmackpfeffer, Craig" <Craig.Schmackpfeffer@usa.xerox.com>
Subject: Re: flex output for static code portion
In-reply-to: Your message of Tue, 11 Aug 1998 11:55:30 PDT.
Date: Mon, 17 Aug 1998 23:57:42 PDT
From: Vern Paxson <vern>

> I would like to use flex under the hood to generate a binary file
> containing the data structures that control the parse.

This has been on the wish-list for a long time. In principle it's
straight-forward - you redirect mkdata() et al's I/O to another file,
and modify the skeleton to have a start-up function that slurps these
into dynamic arrays. The concerns are (1) the scanner generation code
is hairy and full of corner cases, so it's easy to get surprised when
going down this path :-( ; and (2) being careful about buffering so
that when the tables change you make sure the scanner starts in the
correct state and reading at the right point in the input file.

> I was wondering if you know of anyone who has used flex in this way.

I don't - but it seems like a reasonable project to undertake (unlike
numerous other flex tweaks :-).

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-81
@unnumberedsec unnamed-faq-81
@example
@verbatim
Received: from 131.173.17.11 (131.173.17.11 [131.173.17.11])
        by ee.lbl.gov (8.9.1/8.9.1) with ESMTP id AAA03838
        for <vern@ee.lbl.gov>; Thu, 20 Aug 1998 00:47:57 -0700 (PDT)
Received: from hal.cl-ki.uni-osnabrueck.de (hal.cl-ki.Uni-Osnabrueck.DE [131.173.141.2])
        by deimos.rz.uni-osnabrueck.de (8.8.7/8.8.8) with ESMTP id JAA34694
        for <vern@ee.lbl.gov>; Thu, 20 Aug 1998 09:47:55 +0200
Received: (from georg@localhost) by hal.cl-ki.uni-osnabrueck.de (8.6.12/8.6.12) id JAA34834 for vern@ee.lbl.gov; Thu, 20 Aug 1998 09:47:54 +0200
From: Georg Rehm <georg@hal.cl-ki.uni-osnabrueck.de>
Message-Id: <199808200747.JAA34834@hal.cl-ki.uni-osnabrueck.de>
Subject: "flex scanner push-back overflow"
To: vern@ee.lbl.gov
Date: Thu, 20 Aug 1998 09:47:54 +0200 (MEST)
Reply-To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE
X-NoJunk: Do NOT send commercial mail, spam or ads to this address!
X-URL: http://www.cl-ki.uni-osnabrueck.de/~georg/
X-Mailer: ELM [version 2.4ME+ PL28 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Hi Vern,

Yesterday, I encountered a strange problem: I use the macro processor m4
to include some lengthy lists into a .l file. Following is a flex macro
definition that causes some serious pain in my neck:

AUTHOR           ("A. Boucard / L. Boucard"|"A. Dastarac / M. Levent"|"A.Boucaud / L.Boucaud"|"Abderrahim Lamchichi"|"Achmat Dangor"|"Adeline Toullier"|"Adewale Maja-Pearce"|"Ahmed Ziri"|"Akram Ellyas"|"Alain Bihr"|"Alain Gresh"|"Alain Guillemoles"|"Alain Joxe"|"Alain Morice"|"Alain Renon"|"Alain Zecchini"|"Albert Memmi"|"Alberto Manguel"|"Alex De Waal"|"Alfonso Artico"| [...])

The complete list contains about 10kB. When I try to "flex" this file
(on a Solaris 2.6 machine, using a modified flex 2.5.4 (I only increased
some of the predefined values in flexdefs.h) I get the error:

myflex/flex -8 sentag.tmp.l
flex scanner push-back overflow

When I remove the slashes in the macro definition everything works fine.
As I understand it, the double quotes escape the slash-character so it
really means "/" and not "trailing context". Furthermore, I tried to
escape the slashes with backslashes, but with no use, the same error message
appeared when flexing the code.

Do you have an idea what's going on here?

Greetings from Germany,
        Georg
--
Georg Rehm                                     georg@cl-ki.uni-osnabrueck.de
Institute for Semantic Information Processing, University of Osnabrueck, FRG
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-82
@unnumberedsec unnamed-faq-82
@example
@verbatim
To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE
Subject: Re: "flex scanner push-back overflow"
In-reply-to: Your message of Thu, 20 Aug 1998 09:47:54 PDT.
Date: Thu, 20 Aug 1998 07:05:35 PDT
From: Vern Paxson <vern>

> myflex/flex -8 sentag.tmp.l
> flex scanner push-back overflow

Flex itself uses a flex scanner. That scanner is running out of buffer
space when it tries to unput() the humongous macro you've defined. When
you remove the '/'s, you make it small enough so that it fits in the buffer;
removing spaces would do the same thing.

The fix is to either rethink how come you're using such a big macro and
perhaps there's another/better way to do it; or to rebuild flex's own
scan.c with a larger value for

    #define YY_BUF_SIZE 16384

- Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-83
@unnumberedsec unnamed-faq-83
@example
@verbatim
To: Jan Kort <jan@research.techforce.nl>
Subject: Re: Flex
In-reply-to: Your message of Fri, 04 Sep 1998 12:18:43 +0200.
Date: Sat, 05 Sep 1998 00:59:49 PDT
From: Vern Paxson <vern>

> %%
>
> "TEST1\n"       { fprintf(stderr, "TEST1\n"); yyless(5); }
> ^\n             { fprintf(stderr, "empty line\n"); }
> .               { }
> \n              { fprintf(stderr, "new line\n"); }
>
> %%
> -- input ---------------------------------------
> TEST1
> -- output --------------------------------------
> TEST1
> empty line
> ------------------------------------------------

IMHO, it's not clear whether or not this is in fact a bug. It depends
on whether you view yyless() as backing up in the input stream, or as
pushing new characters onto the beginning of the input stream.
Flex interprets it as the latter (for implementation convenience, I'll admit),
and so considers the newline as in fact matching at the beginning of a
line, as after all the last token scanned an entire line and so the
scanner is now at the beginning of a new line.

I agree that this is counter-intuitive for yyless(), given its
functional description (it's less so for unput(), depending on whether
you're unput()'ing new text or scanned text). But I don't plan to
change it any time soon, as it's a pain to do so. Consequently,
you do indeed need to use yy_set_bol() and YY_AT_BOL() to tweak
your scanner into the behavior you desire.

Sorry for the less-than-completely-satisfactory answer.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-84
@unnumberedsec unnamed-faq-84
@example
@verbatim
To: Patrick Krusenotto <krusenot@mac-info-link.de>
Subject: Re: Problems with restarting flex-2.5.2-generated scanner
In-reply-to: Your message of Thu, 24 Sep 1998 10:14:07 PDT.
Date: Thu, 24 Sep 1998 23:28:43 PDT
From: Vern Paxson <vern>

> I am using flex-2.5.2 and bison 1.25 for Solaris and I am desperately
> trying to make my scanner restart with a new file after my parser stops
> with a parse error. When my compiler restarts, the parser always
> receives the token after the token (in the old file!) that caused the
> parser error.

I suspect the problem is that your parser has read ahead in order
to attempt to resolve an ambiguity, and when it's restarted it picks
up with that token rather than reading a fresh one. If you're using
yacc, then the special "error" production can sometimes be used to
consume tokens in an attempt to get the parser into a consistent state.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
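
The read-ahead effect described in the reply above can be illustrated
with a toy model in plain C. This is only a sketch under stated
assumptions: the types and functions below (token_source, toy_parser,
parser_peek and so on) are hypothetical names invented for the
illustration and are not part of flex, yacc or bison. It shows why a
parser that has buffered one lookahead token hands back a token from
the old input after the input is switched, unless that lookahead is
discarded first. In yacc and bison the yyclearin macro plays roughly
the role of the discard step when used inside an error-recovery action.

```c
#include <stddef.h>

/* Hypothetical token source: an array of token codes, 0 marks end of input. */
struct token_source {
    const int *tokens;
    size_t pos;
};

/* Hypothetical parser state holding a single token of lookahead. */
struct toy_parser {
    struct token_source *src;
    int lookahead;      /* token already read but not yet consumed */
    int has_lookahead;  /* nonzero while 'lookahead' is valid */
};

/* Fetch the next token from the source; stays at 0 once input is exhausted. */
static int source_next(struct token_source *s)
{
    int t = s->tokens[s->pos];
    if (t != 0)
        s->pos++;
    return t;
}

/* Read one token ahead without consuming it (what a parser does to decide). */
static int parser_peek(struct toy_parser *p)
{
    if (!p->has_lookahead) {
        p->lookahead = source_next(p->src);
        p->has_lookahead = 1;
    }
    return p->lookahead;
}

/* Consume a token, preferring the buffered lookahead if there is one. */
static int parser_next(struct toy_parser *p)
{
    if (p->has_lookahead) {
        p->has_lookahead = 0;
        return p->lookahead;
    }
    return source_next(p->src);
}

/* Switch to a new input; 'discard' says whether to drop stale lookahead. */
static void parser_restart(struct toy_parser *p, struct token_source *src,
                           int discard)
{
    p->src = src;
    if (discard)
        p->has_lookahead = 0;
}
```

Calling parser_restart with discard set to 0 reproduces the symptom in
the question: the next token delivered still comes from the old file.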
@node unnamed-faq-85
@unnumberedsec unnamed-faq-85
@example
@verbatim
To: Henric Jungheim <junghelh@pe-nelson.com>
Subject: Re: flex 2.5.4a
In-reply-to: Your message of Tue, 27 Oct 1998 16:41:42 PST.
Date: Tue, 27 Oct 1998 16:50:14 PST
From: Vern Paxson <vern>

> This brings up a feature request: How about a command line
> option to specify the filename when reading from stdin? That way one
> doesn't need to create a temporary file in order to get the "#line"
> directives to make sense.

Use -o combined with -t (per the man page description of -o).

> P.S., Is there any simple way to use non-blocking IO to parse multiple
> streams?

Simple, no.

One approach might be to return a magic character on EWOULDBLOCK and
have a rule

    .*<magic-character>    // put back .*, eat magic character

This is off the top of my head, not sure it'll work.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-86
@unnumberedsec unnamed-faq-86
@example
@verbatim
To: "Repko, Billy D" <billy.d.repko@intel.com>
Subject: Re: Compiling scanners
In-reply-to: Your message of Wed, 13 Jan 1999 10:52:47 PST.
Date: Thu, 14 Jan 1999 00:25:30 PST
From: Vern Paxson <vern>

> It appears that maybe it cannot find the lfl library.

The Makefile in the distribution builds it, so you should have it.
It's exceedingly trivial, just a main() that calls yylex() and
a yywrap() that always returns 1.

> %%
>       \n      ++num_lines; ++num_chars;
>       .       ++num_chars;

You can't indent your rules like this - that's where the errors are coming
from. Flex copies indented text to the output file, it's how you do things
like

    int num_lines_seen = 0;

to declare local variables.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
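
To make the remark about the library concrete, here is a rough sketch
in C of the two routines it is described as providing. Two things in
it are assumptions made only so the fragment stands alone: the stub
yylex(), which stands in for the scanner that flex would normally
generate, and the name fl_main(), used in place of main() so the
fragment can be called from other code. A program that cannot link
against the library could supply its own equivalents along these lines.

```c
/* Hypothetical stub state: how many tokens the stand-in scanner still has. */
static int remaining_tokens = 3;

/*
 * Stand-in for the flex-generated yylex(); a real build gets this from
 * lex.yy.c.  Returns a nonzero token code, then 0 at end of input.
 */
int yylex(void)
{
    return remaining_tokens > 0 ? remaining_tokens-- : 0;
}

/* Returning 1 tells the scanner there are no further input files. */
int yywrap(void)
{
    return 1;
}

/*
 * The default driver: keep calling yylex() until it reports end of input.
 * Named fl_main() here (an illustrative name) rather than main().
 */
int fl_main(void)
{
    while (yylex() != 0)
        ;
    return 0;
}
```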
@node unnamed-faq-87
@unnumberedsec unnamed-faq-87
@example
@verbatim
To: Erick Branderhorst <Erick.Branderhorst@asml.nl>
Subject: Re: flex input buffer
In-reply-to: Your message of Tue, 09 Feb 1999 13:53:46 PST.
Date: Tue, 09 Feb 1999 21:03:37 PST
From: Vern Paxson <vern>

> In the flex.skl file the size of the default input buffers is set. Can you
> explain why this size is set and why it is such a high number.

It's large to optimize performance when scanning large files. You can
safely make it a lot lower if needed.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-88
@unnumberedsec unnamed-faq-88
@example
@verbatim
To: "Guido Minnen" <guidomi@cogs.susx.ac.uk>
Subject: Re: Flex error message
In-reply-to: Your message of Wed, 24 Feb 1999 15:31:46 PST.
Date: Thu, 25 Feb 1999 00:11:31 PST
From: Vern Paxson <vern>

> I'm extending a larger scanner written in Flex and I keep running into
> problems. More specifically, I get the error message:
> "flex: input rules are too complicated (>= 32000 NFA states)"

Increase the definitions in flexdef.h for:

#define JAMSTATE -32766 /* marks a reference to the state that always jams */
#define MAXIMUM_MNS 31999
#define BAD_SUBSCRIPT -32767

recompile everything, and it should all work.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-90
@unnumberedsec unnamed-faq-90
@example
@verbatim
To: "Dmitriy Goldobin" <gold@ems.chel.su>
Subject: Re: FLEX trouble
In-reply-to: Your message of Mon, 31 May 1999 18:44:49 PDT.
Date: Tue, 01 Jun 1999 00:15:07 PDT
From: Vern Paxson <vern>

> I have a trouble with FLEX. Why rule "/*".*"*/" work properly,
> but rule "/*"(.|\n)*"*/" don't work ?

The second of these will have to scan the entire input stream (because
"(.|\n)*" matches an arbitrary amount of any text) in order to see if
it ends with "*/", terminating the comment. That potentially will overflow
the input buffer.

> More complex rule "/*"([^*]|(\*/[^/]))*"*/ give an error
> 'unrecognized rule'.

You can't use the '/' operator inside parentheses. It's not clear
what "(a/b)*" actually means.

> I now use workaround with state <comment>, but single-rule is
> better, i think.

Single-rule is nice but will always have the problem of either setting
restrictions on comments (like not allowing multi-line comments) and/or
running the risk of consuming the entire input stream, as noted above.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-91
@unnumberedsec unnamed-faq-91
@example
@verbatim
Received: from mc-qout4.whowhere.com (mc-qout4.whowhere.com [209.185.123.18])
        by ee.lbl.gov (8.9.3/8.9.3) with SMTP id IAA05100
        for <vern@ee.lbl.gov>; Tue, 15 Jun 1999 08:56:06 -0700 (PDT)
Received: from Unknown/Local ([?.?.?.?]) by my-deja.com; Tue Jun 15 08:55:43 1999
To: vern@ee.lbl.gov
Date: Tue, 15 Jun 1999 08:55:43 -0700
From: "Aki Niimura" <neko@my-deja.com>
Message-ID: <KNONDOHDOBGAEAAA@my-deja.com>
Mime-Version: 1.0
Cc:
X-Sent-Mail: on
Reply-To:
X-Mailer: MailCity Service
Subject: A question on flex C++ scanner
X-Sender-Ip: 12.72.207.61
Organization: My Deja Email (http://www.my-deja.com:80)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Dear Dr. Paxson,

I have been using flex for years.
It works very well on many projects.
Most case, I used it to generate a scanner on C language.
However, one project I needed to generate a scanner
on C++ language.
Thanks to your enhancement, flex did
the job.

Currently, I'm working on enhancing my previous project.
I need to deal with multiple input streams (recursive
inclusion) in this scanner (C++).
I did similar thing for another scanner (C) as you
explained in your documentation.

The generated scanner (C++) has necessary methods:
- switch_to_buffer(struct yy_buffer_state *b)
- yy_create_buffer(istream *is, int sz)
- yy_delete_buffer(struct yy_buffer_state *b)

However, I couldn't figure out how to access current
buffer (yy_current_buffer).

yy_current_buffer is a protected member of yyFlexLexer.
I can't access it directly.
Then, I thought yy_create_buffer() with is = 0 might
return current stream buffer. But it seems not as far
as I checked the source. (flex 2.5.4)

I went through the Web in addition to Flex documentation.
However, it hasn't been successful, so far.

It is not my intention to bother you, but, can you
comment about how to obtain the current stream buffer?

Your response would be highly appreciated.

Best regards,
Aki Niimura

--== Sent via Deja.com http://www.deja.com/ ==--
Share what you know. Learn what you don't.
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-92
@unnumberedsec unnamed-faq-92
@example
@verbatim
To: neko@my-deja.com
Subject: Re: A question on flex C++ scanner
In-reply-to: Your message of Tue, 15 Jun 1999 08:55:43 PDT.
Date: Tue, 15 Jun 1999 09:04:24 PDT
From: Vern Paxson <vern>

> However, I couldn't figure out how to access current
> buffer (yy_current_buffer).

Derive your own subclass from yyFlexLexer.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-93
@unnumberedsec unnamed-faq-93
@example
@verbatim
To: "Stones, Darren" <Darren.Stones@nectech.co.uk>
Subject: Re: You're the man to see?
In-reply-to: Your message of Wed, 23 Jun 1999 11:10:29 PDT.
Date: Wed, 23 Jun 1999 09:01:40 PDT
From: Vern Paxson <vern>

> I hope you can help me. I am using Flex and Bison to produce an interpreted
> language. However all goes well until I try to implement an IF statement or
> a WHILE. I cannot get this to work as the parser parses all the conditions
> eg. the TRUE and FALSE conditions to check for a rule match. So I cannot
> make a decision!!

You need to use the parser to build a parse tree (= abstract syntax tree),
and when that's all done you recursively evaluate the tree, binding variables
to values at that time.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-94
@unnumberedsec unnamed-faq-94
@example
@verbatim
To: Petr Danecek <petr@ics.cas.cz>
Subject: Re: flex - question
In-reply-to: Your message of Mon, 28 Jun 1999 19:21:41 PDT.
Date: Fri, 02 Jul 1999 16:52:13 PDT
From: Vern Paxson <vern>

> file, it takes an enormous amount of time. It is funny, because the
> source code has only 12 rules!!! I think it looks like an exponential
> growth.

Right, that's the problem - some patterns (those with a lot of
ambiguity, which yours has because at any given time the scanner can
be in the middle of all sorts of combinations of the different
rules) blow up exponentially.

For your rules, there is an easy fix. Change the ".*" that comes after
the directory name to "[^ ]*". With that in place, the rules are no
longer nearly so ambiguous, because then once one of the directories
has been matched, no other can be matched (since they all require a
leading blank).

If that's not an acceptable solution, then you can enter a start state
to pick up the .*\n after each directory is matched.

Also note that for speed, you'll want to add a ".*" rule at the end,
otherwise text that doesn't match any of the patterns will be matched
very slowly, a character at a time.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-95
@unnumberedsec unnamed-faq-95
@example
@verbatim
To: Tielman Koekemoer <tielman@spi.co.za>
Subject: Re: Please help.
In-reply-to: Your message of Thu, 08 Jul 1999 13:20:37 PDT.
Date: Thu, 08 Jul 1999 08:20:39 PDT
From: Vern Paxson <vern>

> I was hoping you could help me with my problem.
>
> I tried compiling (gnu)flex on a Solaris 2.4 machine
> but when I ran make (after configure) I got an error.
>
> --------------------------------------------------------------
> gcc -c -I. -I. -g -O parse.c
> ./flex -t -p ./scan.l >scan.c
> sh: ./flex: not found
> *** Error code 1
> make: Fatal error: Command failed for target `scan.c'
> -------------------------------------------------------------
>
> What's strange to me is that I'm only
> trying to install flex now. I then edited the Makefile to
> and changed where it says "FLEX = flex" to "FLEX = lex"
> ( lex: the native Solaris one ) but then it complains about
> the "-p" option. Is there any way I can compile flex without
> using flex or lex?
>
> Thanks so much for your time.

You managed to step on the bootstrap sequence, which first copies
initscan.c to scan.c in order to build flex. Try fetching a fresh
distribution from ftp.ee.lbl.gov. (Or you can first try removing
".bootstrap" and doing a make again.)

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-96
@unnumberedsec unnamed-faq-96
@example
@verbatim
To: Tielman Koekemoer <tielman@spi.co.za>
Subject: Re: Please help.
In-reply-to: Your message of Fri, 09 Jul 1999 09:16:14 PDT.
Date: Fri, 09 Jul 1999 00:27:20 PDT
From: Vern Paxson <vern>

> First I removed .bootstrap (and ran make) - no luck. I downloaded the
> software but I still have the same problem. Is there anything else I
> could try.

Try:

        cp initscan.c scan.c
        touch scan.c
        make scan.o

If this last tries to first build scan.c from scan.l using ./flex, then
your "make" is broken, in which case compile scan.c to scan.o by hand.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-97
@unnumberedsec unnamed-faq-97
@example
@verbatim
To: Sumanth Kamenani <skamenan@crl.nmsu.edu>
Subject: Re: Error
In-reply-to: Your message of Mon, 19 Jul 1999 23:08:41 PDT.
Date: Tue, 20 Jul 1999 00:18:26 PDT
From: Vern Paxson <vern>

> I am getting a compilation error. The error is given as "unknown symbol- yylex".

The parser relies on calling yylex(), but you're instead using the C++ scanning
class, so you need to supply a yylex() "glue" function that calls an instance
of the scanner (e.g., "scanner->yylex()").

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-98
@unnumberedsec unnamed-faq-98
@example
@verbatim
To: daniel@synchrods.synchrods.COM (Daniel Senderowicz)
Subject: Re: lex
In-reply-to: Your message of Mon, 22 Nov 1999 11:19:04 PST.
Date: Tue, 23 Nov 1999 15:54:30 PST
From: Vern Paxson <vern>

Well, your problem is the

switch (yybgin-yysvec-1) {      /* witchcraft */

at the beginning of lex rules.  "witchcraft" == "non-portable".
It's assuming knowledge of the AT&T lex's internal variables.

For flex, you can probably do the equivalent using a switch on YYSTATE.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-99
@unnumberedsec unnamed-faq-99
@example
@verbatim
To: archow@hss.hns.com
Subject: Re: Regarding distribution of flex and yacc based grammars
In-reply-to: Your message of Sun, 19 Dec 1999 17:50:24 +0530.
Date: Wed, 22 Dec 1999 01:56:24 PST
From: Vern Paxson <vern>

> When we provide the customer with an object code distribution, is it
> necessary for us to provide source
> for the generated C files from flex and bison since they are generated by
> flex and bison ?

For flex, no.  I don't know what the current state of this is for bison.

> Also, is there any requirement for us to necessarily provide source for
> the grammar files which are fed into flex and bison ?

Again, for flex, no.

See the file "COPYING" in the flex distribution for the legalese.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
@node unnamed-faq-100
@unnumberedsec unnamed-faq-100
@example
@verbatim
To: Martin Gallwey <gallweym@hyperion.moe.ul.ie>
Subject: Re: Flex, and self referencing rules
In-reply-to: Your message of Sun, 20 Feb 2000 01:01:21 PST.
Date: Sat, 19 Feb 2000 18:33:16 PST
From: Vern Paxson <vern>

> However, I do not use unput anywhere. I do use self-referencing
> rules like this:
>
> UnaryExpr               ({UnionExpr})|("-"{UnaryExpr})

You can't do this - flex is *not* a parser like yacc (which does indeed
allow recursion), it is a scanner that's confined to regular expressions.

                Vern
@end verbatim
@end example

@c TODO: Evaluate this faq.
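
The recursion asked about in unnamed-faq-100 belongs in the parser. As a rough
sketch of what the parser side does, assuming a hypothetical grammar
@samp{UnaryExpr: NUMBER | '-' UnaryExpr}, a recursive function can handle any
number of leading minus signs. The names below are illustrative only and are
not part of flex or bison.

@example
@verbatim
```c
/* Sketch only: the demo_* names are hypothetical, not flex or bison API. */
#include <assert.h>
#include <stdlib.h>

/* UnaryExpr: NUMBER | '-' UnaryExpr
   Advances *p past the text it consumed and returns the value. */
static long demo_unary(const char **p)
{
    char *end;
    long value;

    if (**p == '-') {
        (*p)++;                    /* consume '-' and recurse */
        return -demo_unary(p);
    }
    value = strtol(*p, &end, 10);  /* base case: a decimal number */
    *p = end;
    return value;
}

/* Convenience wrapper: evaluate a whole string such as "--5". */
long demo_eval_unary(const char *text)
{
    assert(text != NULL);
    return demo_unary(&text);
}
```
@end verbatim
@end example

For this particular right-linear rule, the non-recursive pattern
@code{"-"*@{UnionExpr@}} is also an option within flex itself.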
@node unnamed-faq-101
@unnumberedsec unnamed-faq-101
@example
@verbatim
To: slg3@lehigh.edu (SAMUEL L. GULDEN)
Subject: Re: Flex problem
In-reply-to: Your message of Thu, 02 Mar 2000 12:29:04 PST.
Date: Thu, 02 Mar 2000 23:00:46 PST
From: Vern Paxson <vern>

If this is exactly your program:

> digit [0-9]
> digits {digit}+
> whitespace [ \t\n]+
>
> %%
> "[" { printf("open_brac\n");}
> "]" { printf("close_brac\n");}
> "+" { printf("addop\n");}
> "*" { printf("multop\n");}
> {digits} { printf("NUMBER = %s\n", yytext);}
> whitespace ;

then the problem is that the last rule needs to be "{whitespace}" !

                Vern
@end verbatim
@end example

@node What is the difference between YYLEX_PARAM and YY_DECL?
@unnumberedsec What is the difference between YYLEX_PARAM and YY_DECL?

YYLEX_PARAM is not a flex symbol. It is for Bison. It tells Bison to pass extra
parameters when it calls yylex() from the parser.

YY_DECL is the Flex declaration of yylex. The default is similar to this:

@example
@verbatim
#define YY_DECL int yylex ()
@end verbatim
@end example


@node Why do I get "conflicting types for yylex" error?
@unnumberedsec Why do I get "conflicting types for yylex" error?

This is a compiler error regarding a generated Bison parser, not a Flex scanner.
It means you need a prototype of yylex() at the top of the Bison file.
Be sure the prototype matches YY_DECL.

@node How do I access the values set in a Flex action from within a Bison action?
@unnumberedsec How do I access the values set in a Flex action from within a Bison action?

With $1, $2, $3, etc. These are called "Semantic Values" in the Bison manual.
See @ref{Top, , , bison, the GNU Bison Manual}.
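
The mechanism can be imitated without running either tool. The following is a
minimal, hand-written sketch, not generated code: @code{yylex} stores a value
in the global @code{yylval} (the default, non-reentrant convention; with
@samp{%option bison-bridge}, @code{yylval} is a pointer instead), and the
caller reads it back the way a Bison action would read @code{$1} and
@code{$3}. The token code, the @code{demo_} names, and the one-rule grammar
are assumptions made for illustration.

@example
@verbatim
```c
/* Hand-written stand-in for a flex scanner and one bison action.
   The demo_* names and the grammar are hypothetical. */
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

enum { NUMBER = 258 };               /* token code, as y.tab.h would define it */

typedef union { int num; } YYSTYPE;  /* what "%union { int num; }" describes */
YYSTYPE yylval;                      /* written by the scanner, read by the parser */

static const char *demo_input = "";  /* hypothetical input buffer */

/* Plays the role of the flex rule
   [[:digit:]]+  { yylval.num = atoi(yytext); return NUMBER; } */
int yylex(void)
{
    while (*demo_input == ' ')
        demo_input++;
    if (*demo_input == '\0')
        return 0;                    /* end of input */
    if (isdigit((unsigned char)*demo_input)) {
        char *end;
        yylval.num = (int)strtol(demo_input, &end, 10);
        demo_input = end;
        return NUMBER;
    }
    return (unsigned char)*demo_input++;  /* single-character token */
}

/* Plays the role of:  sum: NUMBER '+' NUMBER { $$ = $1 + $3; }
   Returns -1 if the input does not have that shape. */
int demo_parse_sum(const char *text)
{
    int v1, v3;

    assert(text != NULL);
    demo_input = text;
    if (yylex() != NUMBER) return -1;
    v1 = yylval.num;                 /* the value bison would expose as $1 */
    if (yylex() != '+') return -1;
    if (yylex() != NUMBER) return -1;
    v3 = yylval.num;                 /* the value bison would expose as $3 */
    return v1 + v3;
}
```
@end verbatim
@end example

In a real parser, Bison copies @code{yylval} onto its value stack after each
call to @code{yylex}, which is why the sketch saves each value before asking
for the next token.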

@node Appendices, Indices, FAQ, Top
@appendix Appendices

@menu
* Makefiles and Flex::
* Bison Bridge::
* M4 Dependency::
* Common Patterns::
@end menu

@node Makefiles and Flex, Bison Bridge, Appendices, Appendices
@appendixsec Makefiles and Flex

@cindex Makefile, syntax

In this appendix, we provide tips for writing Makefiles to build your scanners.

In a traditional build environment, we say that the @file{.c} files are the
sources, and the @file{.o} files are the intermediate files. When using
@code{flex}, however, the @file{.l} files are the sources, and the generated
@file{.c} files (along with the @file{.o} files) are the intermediate files.
This requires you to carefully plan your Makefile.

Modern @command{make} programs understand that @file{foo.l} is intended to
generate @file{lex.yy.c} or @file{foo.c}, and will behave
accordingly@footnote{GNU @command{make} and GNU @command{automake} are two such
programs that provide implicit rules for flex-generated scanners.}@footnote{GNU @command{automake}
may generate code to execute flex in lex-compatible mode, or to stdout. If this is not what you want,
then you should provide an explicit rule in your Makefile.am}.  The
following Makefile does not explicitly instruct @command{make} how to build
@file{scan.c} from @file{scan.l}. Instead, it relies on the implicit rules of the
@command{make} program to build the intermediate file, @file{scan.c}:

@cindex Makefile, example of implicit rules
@example
@verbatim
    # Basic Makefile -- relies on implicit rules
    # Creates "myprogram" from "scan.l" and "myprogram.c"
    #
    LEX=flex
    myprogram: scan.o myprogram.o
    scan.o: scan.l

@end verbatim
@end example


For simple cases, the above may be sufficient.
For other cases,
you may have to explicitly instruct @command{make} how to build your scanner.
The following is an example of a Makefile containing explicit rules:

@cindex Makefile, explicit example
@example
@verbatim
    # Basic Makefile -- provides explicit rules
    # Creates "myprogram" from "scan.l" and "myprogram.c"
    #
    LEX=flex
    myprogram: scan.o myprogram.o
            $(CC) -o $@  $(LDFLAGS) $^

    myprogram.o: myprogram.c
            $(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^

    scan.o: scan.c
            $(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^

    scan.c: scan.l
            $(LEX) $(LFLAGS) -o $@ $^

    clean:
            $(RM) *.o scan.c

@end verbatim
@end example

Notice in the above example that @file{scan.c} is in the @code{clean} target.
This is because we consider the file @file{scan.c} to be an intermediate file.

Finally, we provide a realistic example of a @code{flex} scanner used with a
@code{bison} parser@footnote{This example also applies to yacc parsers.}.
There is a tricky problem we have to deal with. Since a @code{flex} scanner
will typically include a header file (e.g., @file{y.tab.h}) generated by the
parser, we need to be sure that the header file is generated BEFORE the scanner
is compiled. We handle this case in the following example:

@example
@verbatim
    # Makefile example -- scanner and parser.
    # Creates "myprogram" from "scan.l", "parse.y", and "myprogram.c"
    #
    LEX     = flex
    YACC    = bison -y
    YFLAGS  = -d
    objects = scan.o parse.o myprogram.o

    myprogram: $(objects)
    scan.o: scan.l parse.c
    parse.o: parse.y
    myprogram.o: myprogram.c

@end verbatim
@end example

In the above example, notice the line,

@example
@verbatim
    scan.o: scan.l parse.c
@end verbatim
@end example

, which lists the file @file{parse.c} (the generated parser) as a dependency of
@file{scan.o}. We want to ensure that the parser is created before the scanner
is compiled, and the above line seems to do the trick. Feel free to experiment
with your specific implementation of @command{make}.


For more details on writing Makefiles, see @ref{Top, , , make, The
GNU Make Manual}.

@node Bison Bridge, M4 Dependency, Makefiles and Flex, Appendices
@section C Scanners with Bison Parsers

@cindex bison, bridging with flex
@vindex yylval
@vindex yylloc
@tindex YYLTYPE
@tindex YYSTYPE

This section describes the @code{flex} features useful when integrating
@code{flex} with @code{GNU bison}@footnote{The features described here are
purely optional, and are by no means the only way to use flex with bison.
We merely provide some glue to ease development of your parser-scanner pair.}.
Skip this section if you are not using
@code{bison} with your scanner.  Here we discuss only the @code{flex}
half of the @code{flex} and @code{bison} pair.  We do not discuss
@code{bison} in any detail.  For more information about generating
@code{bison} parsers, see @ref{Top, , , bison, the GNU Bison Manual}.

A compatible @code{bison} scanner is generated by declaring @samp{%option
bison-bridge} or by supplying @samp{--bison-bridge} when invoking @code{flex}
from the command line.
This instructs @code{flex} that the macro
@code{yylval} may be used. The data type for
@code{yylval}, @code{YYSTYPE},
is typically defined in a header file, included in section 1 of the
@code{flex} input file.  For a list of functions and macros
available, see @ref{bison-functions}.

The declaration of yylex becomes,

@findex yylex (reentrant version)
@example
@verbatim
      int yylex ( YYSTYPE * lvalp, yyscan_t scanner );
@end verbatim
@end example

If @code{%option bison-locations} is specified, then the declaration
becomes,

@findex yylex (reentrant version)
@example
@verbatim
      int yylex ( YYSTYPE * lvalp, YYLTYPE * llocp, yyscan_t scanner );
@end verbatim
@end example

Note that the macros @code{yylval} and @code{yylloc} evaluate to pointers.
Support for @code{yylloc} is optional in @code{bison}, so it is optional in
@code{flex} as well. The following is an example of a @code{flex} scanner that
is compatible with @code{bison}.

@cindex bison, scanner to be called from bison
@example
@verbatim
    /* Scanner for "C" assignment statements... sort of. */
    %{
    #include <stdlib.h>  /* atoi */
    #include <string.h>  /* strdup */
    #include "y.tab.h"  /* Generated by bison. */
    %}

    %option bison-bridge bison-locations
    %%

    [[:digit:]]+  { yylval->num = atoi(yytext);   return NUMBER;}
    [[:alnum:]]+  { yylval->str = strdup(yytext); return STRING;}
    "="|";"       { return yytext[0];}
    .  {}
    %%
@end verbatim
@end example

As you can see, there really is no magic here. We just use
@code{yylval} as we would any other variable. The data type of
@code{yylval} is generated by @code{bison}, and included in the file
@file{y.tab.h}. Here is the corresponding @code{bison} parser:

@cindex bison, parser
@example
@verbatim
    /* Parser to convert "C" assignments to lisp. */
    %{
    /* Pass the argument to yyparse through to yylex.
       */
    #define YYPARSE_PARAM scanner
    #define YYLEX_PARAM   scanner
    %}
    %locations
    %pure_parser
    %union {
        int num;
        char* str;
    }
    %token <str> STRING
    %token <num> NUMBER
    %%
    assignment:
        STRING '=' NUMBER ';' {
            printf( "(setf %s %d)", $1, $3 );
       }
    ;
@end verbatim
@end example

@node M4 Dependency, Common Patterns, Bison Bridge, Appendices
@section M4 Dependency
@cindex m4
The macro processor @code{m4}@footnote{The use of m4 is subject to change in
future revisions of flex. It is not part of the public API of flex. Do not depend on it.}
must be installed wherever flex is installed.
@code{flex} invokes @samp{m4}, found by searching the directories in the
@code{PATH} environment variable. Any code you place in section 1 or in the
actions will be sent through m4. Please follow these rules to protect your
code from unwanted @code{m4} processing.

@itemize

@item Do not use symbols that begin with @samp{m4_}, such as @samp{m4_define}
or @samp{m4_include}, since those are reserved for @code{m4} macro names. If for
some reason you need @samp{m4_} as a prefix, use a preprocessor @code{#define} to get your
symbol past m4 unmangled.

@item Do not use the strings @samp{[[} or @samp{]]} anywhere in your code. The
former is not valid in C, except within comments and strings, but the latter is valid in
code such as @code{x[y[z]]}. The solution is simple. To get the literal string
@code{"]]"}, use @code{"]""]"}. To get the array notation @code{x[y[z]]},
use @code{x[y[z] ]}. Flex will attempt to detect these sequences in user code, and
escape them. However, it's best to avoid this complexity where possible, by
removing such sequences from your code.

@end itemize

@code{m4} is only required at the time you run @code{flex}.
The generated
scanner is ordinary C or C++, and does @emph{not} require @code{m4}.

@node Common Patterns, ,M4 Dependency, Appendices
@section Common Patterns
@cindex patterns, common

This appendix provides examples of common regular expressions you might use
in your scanner.

@menu
* Numbers::
* Identifiers::
* Quoted Constructs::
* Addresses::
@end menu


@node Numbers, Identifiers, ,Common Patterns
@subsection Numbers

@table @asis

@item C99 decimal constant
@code{([[:digit:]]@{-@}[0])[[:digit:]]*}

@item C99 hexadecimal constant
@code{0[xX][[:xdigit:]]+}

@item C99 octal constant
@code{0[01234567]*}

@item C99 floating point constant
@verbatim
 dseq      ([[:digit:]]+)
 dseq_opt  ([[:digit:]]*)
 frac      (({dseq_opt}"."{dseq})|{dseq}".")
 exp       ([eE][+-]?{dseq})
 exp_opt   ({exp}?)
 fsuff     [flFL]
 fsuff_opt ({fsuff}?)
 hpref     (0[xX])
 hdseq     ([[:xdigit:]]+)
 hdseq_opt ([[:xdigit:]]*)
 hfrac     (({hdseq_opt}"."{hdseq})|({hdseq}"."))
 bexp      ([pP][+-]?{dseq})
 dfc       (({frac}{exp_opt}{fsuff_opt})|({dseq}{exp}{fsuff_opt}))
 hfc       (({hpref}{hfrac}{bexp}{fsuff_opt})|({hpref}{hdseq}{bexp}{fsuff_opt}))

 c99_floating_point_constant  ({dfc}|{hfc})
@end verbatim

See C99 section 6.4.4.2 for the gory details.

@end table

@node Identifiers, Quoted Constructs, Numbers, Common Patterns
@subsection Identifiers

@table @asis

@item C99 Identifier
@verbatim
ucn        ((\\u([[:xdigit:]]{4}))|(\\U([[:xdigit:]]{8})))
nondigit    [_[:alpha:]]
c99_id     ([_[:alpha:]]|{ucn})([_[:alnum:]]|{ucn})*
@end verbatim

Technically, the above pattern does not encompass all possible C99 identifiers, since C99 allows for
"implementation-defined" characters.
In practice, C compilers follow the above pattern, with the
addition of the @samp{$} character.

@item UTF-8 Encoded Unicode Code Point
@verbatim
[\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF]([\x80-\xBF]{2})|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF]([\x80-\xBF]{2})|[\xF1-\xF3]([\x80-\xBF]{3})|\xF4[\x80-\x8F]([\x80-\xBF]{2})
@end verbatim

@end table

@node Quoted Constructs, Addresses, Identifiers, Common Patterns
@subsection Quoted Constructs

@table @asis
@item C99 String Literal
@code{L?\"([^\"\\\n]|(\\['\"?\\abfnrtv])|(\\([01234567]@{1,3@}))|(\\x[[:xdigit:]]+)|(\\u([[:xdigit:]]@{4@}))|(\\U([[:xdigit:]]@{8@})))*\"}

@item C99 Comment
@code{("/*"([^*]|"*"+[^*/])*"*"+"/")|("/"(\\\n)*"/"[^\n]*)}

Note that in C99, a @samp{//}-style comment may be split across lines, and, contrary to popular belief,
does not include the trailing @samp{\n} character.

A better way to scan @samp{/* */} comments is by line, rather than matching
possibly huge comments all at once. This will allow you to scan comments of
unlimited length, as long as line breaks appear at sane intervals. This is also
more efficient when used with automatic line number processing. @xref{option-yylineno}.

@verbatim
<INITIAL>{
    "/*"      BEGIN(COMMENT);
}
<COMMENT>{
    "*/"      BEGIN(INITIAL);
    [^*\n]+   ;
    "*"       ;
    \n        ;
}
@end verbatim

@end table

@node Addresses, ,Quoted Constructs, Common Patterns
@subsection Addresses

@table @asis

@item IPv4 Address
@code{(([[:digit:]]@{1,3@}".")@{3@}([[:digit:]]@{1,3@}))}

@item IPv6 Address
@verbatim
hex4         ([[:xdigit:]]{1,4})
hexseq       ({hex4}(:{hex4})*)
hexpart      ({hexseq}|({hexseq}::({hexseq}?))|::{hexseq})
IPv6address  ({hexpart}(":"{IPv4address})?)
@end verbatim

See RFC2373 for details.

@item URI
@code{(([^:/?#]+):)?("//"([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?}

This pattern is nearly useless, since it allows just about any character to
appear in a URI, including spaces and control characters. See RFC2396 for
details.

@end table


@node Indices, , Appendices, Top
@unnumbered Indices

@menu
* Concept Index::
* Index of Functions and Macros::
* Index of Variables::
* Index of Data Types::
* Index of Hooks::
* Index of Scanner Options::
@end menu

@node Concept Index, Index of Functions and Macros, Indices, Indices
@unnumberedsec Concept Index

@printindex cp

@node Index of Functions and Macros, Index of Variables, Concept Index, Indices
@unnumberedsec Index of Functions and Macros

This is an index of functions and preprocessor macros that look like functions.
For macros that expand to variables or constants, see @ref{Index of Variables}.

@printindex fn

@node Index of Variables, Index of Data Types, Index of Functions and Macros, Indices
@unnumberedsec Index of Variables

This is an index of variables, constants, and preprocessor macros
that expand to variables or constants.

@printindex vr

@node Index of Data Types, Index of Hooks, Index of Variables, Indices
@unnumberedsec Index of Data Types
@printindex tp

@node Index of Hooks, Index of Scanner Options, Index of Data Types, Indices
@unnumberedsec Index of Hooks

This is an index of "hooks" that the user may define. These hooks typically correspond
to specific locations in the generated scanner, and may be used to insert arbitrary code.

@printindex hk

@node Index of Scanner Options, , Index of Hooks, Indices
@unnumberedsec Index of Scanner Options

@printindex op

@c A vim script to name the faq entries.
@c Delete this when faqs are no longer
@c named "unnamed-faq-XXX".
@c
@c fu! Faq2 () range abort
@c     let @r=input("Rename to: ")
@c     exe "%s/" . @w . "/" . @r . "/g"
@c     normal 'f
@c endf
@c nnoremap <F5> 1G/@node\s\+unnamed-faq-\d\+<cr>mfww"wy5ezt:call Faq2()<cr>

@bye