1\input texinfo.tex @c -*-texinfo-*-
2@c %**start of header
3@setfilename flex.info
4@include version.texi
5@settitle Lexical Analysis With Flex, for Flex @value{VERSION}
6@set authors Vern Paxson, Will Estes and John Millaway
7@c  "Macro Hooks" index
8@defindex hk
9@c  "Options" index
10@defindex op
11@dircategory Programming
12@direntry
13* flex: (flex).      Fast lexical analyzer generator (lex replacement).
14@end direntry
15@c %**end of header
16
17@copying
18
19The flex manual is placed under the same licensing conditions as the
20rest of flex:
21
22Copyright @copyright{} 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2012
23The Flex Project.
24
25Copyright @copyright{} 1990, 1997 The Regents of the University of California.
26All rights reserved.
27
28This code is derived from software contributed to Berkeley by
29Vern Paxson.
30
31The United States Government has rights in this work pursuant
32to contract no. DE-AC03-76SF00098 between the United States
33Department of Energy and the University of California.
34
35Redistribution and use in source and binary forms, with or without
36modification, are permitted provided that the following conditions
37are met:
38
39@enumerate
40@item
41 Redistributions of source code must retain the above copyright
42notice, this list of conditions and the following disclaimer.
43
44@item
45Redistributions in binary form must reproduce the above copyright
46notice, this list of conditions and the following disclaimer in the
47documentation and/or other materials provided with the distribution.
48@end enumerate
49
50Neither the name of the University nor the names of its contributors
51may be used to endorse or promote products derived from this software
52without specific prior written permission.
53
54THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
55IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
56WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
57PURPOSE.
58@end copying
59
60@titlepage
61@title Lexical Analysis with Flex
62@subtitle Edition @value{EDITION}, @value{UPDATED}
63@author @value{authors}
64@page
65@vskip 0pt plus 1filll
66@insertcopying
67@end titlepage
68@contents
69@ifnottex
70@node Top, Copyright, (dir), (dir)
71@top flex
72
73This manual describes @code{flex}, a tool for generating programs that
74perform pattern-matching on text.  The manual includes both tutorial and
75reference sections.
76
77This edition of @cite{The flex Manual} documents @code{flex} version
78@value{VERSION}. It was last updated on @value{UPDATED}.
79
80This manual was written by @value{authors}.
81
82@menu
83* Copyright::                   
84* Reporting Bugs::              
85* Introduction::                
86* Simple Examples::             
87* Format::                      
88* Patterns::                    
89* Matching::                    
90* Actions::                     
91* Generated Scanner::           
92* Start Conditions::            
93* Multiple Input Buffers::      
94* EOF::                         
95* Misc Macros::                 
96* User Values::                 
97* Yacc::                        
98* Scanner Options::             
99* Performance::                 
100* Cxx::                         
101* Reentrant::                   
102* Lex and Posix::               
103* Memory Management::           
104* Serialized Tables::           
105* Diagnostics::                 
106* Limitations::                 
107* Bibliography::                
108* FAQ::                         
109* Appendices::                  
110* Indices::                     
111
112@detailmenu
113 --- The Detailed Node Listing ---
114
115Format of the Input File
116
117* Definitions Section::         
118* Rules Section::               
119* User Code Section::           
120* Comments in the Input::       
121
122Scanner Options
123
124* Options for Specifying Filenames::  
125* Options Affecting Scanner Behavior::  
126* Code-Level And API Options::  
127* Options for Scanner Speed and Size::  
128* Debugging Options::           
129* Miscellaneous Options::       
130
131Reentrant C Scanners
132
133* Reentrant Uses::              
134* Reentrant Overview::          
135* Reentrant Example::           
136* Reentrant Detail::            
137* Reentrant Functions::         
138
139The Reentrant API in Detail
140
141* Specify Reentrant::           
142* Extra Reentrant Argument::    
143* Global Replacement::          
144* Init and Destroy Functions::  
145* Accessor Methods::            
146* Extra Data::                  
147* About yyscan_t::              
148
149Memory Management
150
151* The Default Memory Management::  
152* Overriding The Default Memory Management::  
153* A Note About yytext And Memory::  
154
155Serialized Tables
156
157* Creating Serialized Tables::  
158* Loading and Unloading Serialized Tables::  
159* Tables File Format::          
160
161FAQ
162
163* When was flex born?::         
164* How do I expand backslash-escape sequences in C-style quoted strings?::  
165* Why do flex scanners call fileno if it is not ANSI compatible?::  
166* Does flex support recursive pattern definitions?::  
167* How do I skip huge chunks of input (tens of megabytes) while using flex?::  
168* Flex is not matching my patterns in the same order that I defined them.::  
169* My actions are executing out of order or sometimes not at all.::  
170* How can I have multiple input sources feed into the same scanner at the same time?::  
171* Can I build nested parsers that work with the same input file?::  
172* How can I match text only at the end of a file?::  
173* How can I make REJECT cascade across start condition boundaries?::  
174* Why cant I use fast or full tables with interactive mode?::  
175* How much faster is -F or -f than -C?::  
176* If I have a simple grammar cant I just parse it with flex?::  
177* Why doesn't yyrestart() set the start state back to INITIAL?::  
178* How can I match C-style comments?::  
179* The period isn't working the way I expected.::  
180* Can I get the flex manual in another format?::  
181* Does there exist a "faster" NDFA->DFA algorithm?::  
182* How does flex compile the DFA so quickly?::  
183* How can I use more than 8192 rules?::  
184* How do I abandon a file in the middle of a scan and switch to a new file?::  
185* How do I execute code only during initialization (only before the first scan)?::  
186* How do I execute code at termination?::  
187* Where else can I find help?::  
188* Can I include comments in the "rules" section of the file?::  
189* I get an error about undefined yywrap().::  
190* How can I change the matching pattern at run time?::  
191* How can I expand macros in the input?::  
192* How can I build a two-pass scanner?::  
193* How do I match any string not matched in the preceding rules?::  
194* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::  
195* Is there a way to make flex treat NULL like a regular character?::  
196* Whenever flex can not match the input it says "flex scanner jammed".::  
197* Why doesn't flex have non-greedy operators like perl does?::  
198* Memory leak - 16386 bytes allocated by malloc.::  
199* How do I track the byte offset for lseek()?::  
200* How do I use my own I/O classes in a C++ scanner?::  
201* How do I skip as many chars as possible?::  
202* deleteme00::              
203* Are certain equivalent patterns faster than others?::              
204* Is backing up a big deal?::              
205* Can I fake multi-byte character support?::              
206* deleteme01::              
207* Can you discuss some flex internals?::              
208* unput() messes up yy_at_bol::              
209* The | operator is not doing what I want::              
210* Why can't flex understand this variable trailing context pattern?::              
211* The ^ operator isn't working::              
212* Trailing context is getting confused with trailing optional patterns::              
213* Is flex GNU or not?::              
214* ERASEME53::              
215* I need to scan if-then-else blocks and while loops::              
216* ERASEME55::              
217* ERASEME56::              
218* ERASEME57::              
219* Is there a repository for flex scanners?::              
220* How can I conditionally compile or preprocess my flex input file?::              
221* Where can I find grammars for lex and yacc?::              
222* I get an end-of-buffer message for each character scanned.::              
223* unnamed-faq-62::              
224* unnamed-faq-63::              
225* unnamed-faq-64::              
226* unnamed-faq-65::              
227* unnamed-faq-66::              
228* unnamed-faq-67::              
229* unnamed-faq-68::              
230* unnamed-faq-69::              
231* unnamed-faq-70::              
232* unnamed-faq-71::              
233* unnamed-faq-72::              
234* unnamed-faq-73::              
235* unnamed-faq-74::              
236* unnamed-faq-75::              
237* unnamed-faq-76::              
238* unnamed-faq-77::              
239* unnamed-faq-78::              
240* unnamed-faq-79::              
241* unnamed-faq-80::              
242* unnamed-faq-81::              
243* unnamed-faq-82::              
244* unnamed-faq-83::              
245* unnamed-faq-84::              
246* unnamed-faq-85::              
247* unnamed-faq-86::              
248* unnamed-faq-87::              
249* unnamed-faq-88::              
250* unnamed-faq-90::              
251* unnamed-faq-91::              
252* unnamed-faq-92::              
253* unnamed-faq-93::              
254* unnamed-faq-94::              
255* unnamed-faq-95::              
256* unnamed-faq-96::              
257* unnamed-faq-97::              
258* unnamed-faq-98::              
259* unnamed-faq-99::              
260* unnamed-faq-100::             
261* unnamed-faq-101::             
262* What is the difference between YYLEX_PARAM and YY_DECL?::
263* Why do I get "conflicting types for yylex" error?::
264* How do I access the values set in a Flex action from within a Bison action?::
265
266Appendices
267
268* Makefiles and Flex::          
269* Bison Bridge::                
270* M4 Dependency::               
271* Common Patterns::               
272
273Indices
274
275* Concept Index::               
276* Index of Functions and Macros::  
277* Index of Variables::          
278* Index of Data Types::         
279* Index of Hooks::              
280* Index of Scanner Options::    
281
282@end detailmenu
283@end menu
284@end ifnottex
285@node Copyright, Reporting Bugs, Top, Top
286@chapter Copyright
287
288@cindex copyright of flex
289@cindex distributing flex
290@insertcopying
291
292@node Reporting Bugs, Introduction, Copyright, Top
293@chapter Reporting Bugs
294
295@cindex bugs, reporting
296@cindex reporting bugs
297
298If you find a bug in @code{flex}, please report it using
299the SourceForge Bug Tracking facilities which can be found on
300@url{http://sourceforge.net/projects/flex,flex's SourceForge Page}.
301
302@node Introduction, Simple Examples, Reporting Bugs, Top
303@chapter Introduction
304
305@cindex scanner, definition of
306@code{flex} is a tool for generating @dfn{scanners}.  A scanner is a
307program which recognizes lexical patterns in text.  The @code{flex}
308program reads the given input files, or its standard input if no file
309names are given, for a description of a scanner to generate.  The
310description is in the form of pairs of regular expressions and C code,
311called @dfn{rules}. @code{flex} generates as output a C source file,
312@file{lex.yy.c} by default, which defines a routine @code{yylex()}.
313This file can be compiled and linked with the flex runtime library to
314produce an executable.  When the executable is run, it analyzes its
315input for occurrences of the regular expressions.  Whenever it finds
316one, it executes the corresponding C code.
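
The workflow just described can be sketched as the following shell
session.  The file names @file{scan.l} and @file{input.txt} are
hypothetical, the C compiler is assumed to be invoked as @command{cc},
and the runtime library is assumed to be installed under the name
@samp{-lfl}; the details may differ on your system.

@example
@verbatim
    flex scan.l                  # writes lex.yy.c by default
    cc -o scan lex.yy.c -lfl     # compile and link with the flex library
    ./scan < input.txt           # run the scanner on some input
@end verbatim
@end example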
317
318@node Simple Examples, Format, Introduction, Top
319@chapter Some Simple Examples
320
321First some simple examples to get the flavor of how one uses
322@code{flex}.
323
324@cindex username expansion
The following @code{flex} input specifies a scanner which, when it
encounters the string @samp{username}, will replace it with the user's
login name:
328
@example
@verbatim
    %%
    username    printf( "%s", getlogin() );
@end verbatim
@end example
335
336@cindex default rule
337@cindex rules, default
338By default, any text not matched by a @code{flex} scanner is copied to
339the output, so the net effect of this scanner is to copy its input file
340to its output with each occurrence of @samp{username} expanded.  In this
341input, there is just one rule.  @samp{username} is the @dfn{pattern} and
342the @samp{printf} is the @dfn{action}.  The @samp{%%} symbol marks the
343beginning of the rules.
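
The default behavior can also be written out as an explicit rule.  The
following sketch should behave like the scanner above; its second rule
spells out what @code{flex} otherwise supplies implicitly, using the
@code{ECHO} action to copy the matched text to the output:

@example
@verbatim
    %%
    username    printf( "%s", getlogin() );
    .|\n        ECHO;   /* explicit form of the default rule */
@end verbatim
@end example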
344
345Here's another simple example:
346
347@cindex counting characters and lines
@example
@verbatim
            int num_lines = 0, num_chars = 0;

    %%
    \n      ++num_lines; ++num_chars;
    .       ++num_chars;

    %%

    int main()
            {
            yylex();
            printf( "# of lines = %d, # of chars = %d\n",
                    num_lines, num_chars );
            }
@end verbatim
@end example
366
367This scanner counts the number of characters and the number of lines in
368its input. It produces no output other than the final report on the
369character and line counts.  The first line declares two globals,
370@code{num_lines} and @code{num_chars}, which are accessible both inside
371@code{yylex()} and in the @code{main()} routine declared after the
372second @samp{%%}.  There are two rules, one which matches a newline
373(@samp{\n}) and increments both the line count and the character count,
374and one which matches any character other than a newline (indicated by
375the @samp{.} regular expression).
376
377A somewhat more complicated example:
378
379@cindex Pascal-like language
@example
@verbatim
    /* scanner for a toy Pascal-like language */

    %{
    /* need this for the calls to atoi() and atof() below */
    #include <stdlib.h>
    %}

    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*

    %%

    {DIGIT}+    {
                printf( "An integer: %s (%d)\n", yytext,
                        atoi( yytext ) );
                }

    {DIGIT}+"."{DIGIT}*        {
                printf( "A float: %s (%g)\n", yytext,
                        atof( yytext ) );
                }

    if|then|begin|end|procedure|function        {
                printf( "A keyword: %s\n", yytext );
                }

    {ID}        printf( "An identifier: %s\n", yytext );

    "+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );

    "{"[^}\n]*"}"     /* eat up one-line comments */

    [ \t\n]+          /* eat up whitespace */

    .           printf( "Unrecognized character: %s\n", yytext );

    %%

    int main( int argc, char **argv )
        {
        ++argv, --argc;  /* skip over program name */
        if ( argc > 0 )
                yyin = fopen( argv[0], "r" );
        else
                yyin = stdin;

        yylex();
        }
@end verbatim
@end example
432
433This is the beginnings of a simple scanner for a language like Pascal.
434It identifies different types of @dfn{tokens} and reports on what it has
435seen.
436
437The details of this example will be explained in the following
438sections.
439
440@node Format, Patterns, Simple Examples, Top
441@chapter Format of the Input File
442
443
444@cindex format of flex input
445@cindex input, format of
446@cindex file format
447@cindex sections of flex input
448
449The @code{flex} input file consists of three sections, separated by a
450line containing only @samp{%%}.
451
452@cindex format of input file
@example
@verbatim
    definitions
    %%
    rules
    %%
    user code
@end verbatim
@end example
462
463@menu
464* Definitions Section::         
465* Rules Section::               
466* User Code Section::           
467* Comments in the Input::       
468@end menu
469
470@node Definitions Section, Rules Section, Format, Format
471@section Format of the Definitions Section
472
473@cindex input file, Definitions section
474@cindex Definitions, in flex input
475The @dfn{definitions section} contains declarations of simple @dfn{name}
476definitions to simplify the scanner specification, and declarations of
477@dfn{start conditions}, which are explained in a later section.
478
479@cindex aliases, how to define
480@cindex pattern aliases, how to define
481Name definitions have the form:
482
@example
@verbatim
    name definition
@end verbatim
@end example
488
489The @samp{name} is a word beginning with a letter or an underscore
490(@samp{_}) followed by zero or more letters, digits, @samp{_}, or
491@samp{-} (dash).  The definition is taken to begin at the first
492non-whitespace character following the name and continuing to the end of
493the line.  The definition can subsequently be referred to using
494@samp{@{name@}}, which will expand to @samp{(definition)}.  For example,
495
496@cindex pattern aliases, defining
497@cindex defining pattern aliases
@example
@verbatim
    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*
@end verbatim
@end example
504
defines @samp{DIGIT} to be a regular expression which matches a single
digit, and @samp{ID} to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits.  A subsequent reference to
508
509@cindex pattern aliases, use of
@example
@verbatim
    {DIGIT}+"."{DIGIT}*
@end verbatim
@end example
515
516is identical to
517
@example
@verbatim
    ([0-9])+"."([0-9])*
@end verbatim
@end example
523
524and matches one-or-more digits followed by a @samp{.} followed by
525zero-or-more digits.
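
The parentheses that @code{flex} adds during expansion matter when a
definition contains a low-precedence operator such as @samp{|}.  A
minimal sketch, in which the name @samp{SIGN} is purely illustrative:

@example
@verbatim
    SIGN     "+"|"-"
    DIGIT    [0-9]

    %%
    {SIGN}?{DIGIT}+    printf( "A signed integer: %s\n", yytext );
@end verbatim
@end example

Here @samp{@{SIGN@}?@{DIGIT@}+} expands to @samp{("+"|"-")?([0-9])+}, so
the @samp{?} applies to the whole alternation rather than to
@samp{"-"} alone.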
526
527@cindex comments in flex input
528An unindented comment (i.e., a line
529beginning with @samp{/*}) is copied verbatim to the output up
530to the next @samp{*/}.
531
532@cindex %@{ and %@}, in Definitions Section
533@cindex embedding C code in flex input
534@cindex C code in flex input
535Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}}
536is also copied verbatim to the output (with the %@{ and %@} symbols
537removed).  The %@{ and %@} symbols must appear unindented on lines by
538themselves.
539
540@cindex %top
541
542A @code{%top} block is similar to a @samp{%@{} ... @samp{%@}} block, except
543that the code in a @code{%top} block is relocated to the @emph{top} of the
544generated file, before any flex definitions @footnote{Actually,
545@code{yyIN_HEADER} is defined before the @samp{%top} block.}. 
546The @code{%top} block is useful when you want certain preprocessor macros to be
547defined or certain files to be included before the generated code.
The single characters @samp{@{} and @samp{@}} are used to delimit the
@code{%top} block, as shown in the example below:
550
@example
@verbatim
    %top{
        /* This code goes at the "top" of the generated file. */
        #include <stdint.h>
        #include <inttypes.h>
    }
@end verbatim
@end example
560
561Multiple @code{%top} blocks are allowed, and their order is preserved.
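
As a further sketch, a @code{%top} block is a natural place for a
feature-test macro, which has to be defined before any system header is
included.  The macro @code{_POSIX_C_SOURCE} and its value are only an
example of such a definition; whether you need one depends on your
platform and on the code in your actions.

@example
@verbatim
    %top{
        /* Must be seen before any #include in the generated file. */
        #define _POSIX_C_SOURCE 200809L
    }

    %{
        /* An ordinary block: copied to the output, but not relocated
           to the top of the generated file. */
        #include <string.h>
    %}
@end verbatim
@end example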
562
563@node Rules Section, User Code Section, Definitions Section, Format
564@section Format of the Rules Section
565
566@cindex input file, Rules Section
567@cindex rules, in flex input
568The @dfn{rules} section of the @code{flex} input contains a series of
569rules of the form:
570
@example
@verbatim
    pattern   action
@end verbatim
@end example
576
577where the pattern must be unindented and the action must begin
578on the same line.
579@xref{Patterns}, for a further description of patterns and actions.
580
581In the rules section, any indented or %@{ %@} enclosed text appearing
582before the first rule may be used to declare variables which are local
583to the scanning routine and (after the declarations) code which is to be
584executed whenever the scanning routine is entered.  Other indented or
585%@{ %@} text in the rule section is still copied to the output, but its
586meaning is not well-defined and it may well cause compile-time errors
587(this feature is present for @acronym{POSIX} compliance. @xref{Lex and
588Posix}, for other such features).
589
590Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}}
591is copied verbatim to the output (with the %@{ and %@} symbols removed).
592The %@{ and %@} symbols must appear unindented on lines by themselves.
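
The following sketch illustrates indented text placed before the first
rule.  It is shown flush left because indentation is significant here.
The variable name @code{words} is hypothetical, and the
@samp{%option noyywrap} line is present only so that the example links
without the runtime library.

@example
@verbatim
%option noyywrap
%%
        /* Indented text before the first rule: this declaration is
           local to yylex() and is initialized on every entry. */
        int words = 0;

[a-zA-Z]+   ++words;
\n          printf( "%d word(s)\n", words ); return 1;
.           /* ignore everything else */

%%

int main( void )
        {
        /* Each call scans one line; yylex() returns 0 at end-of-file. */
        while ( yylex() != 0 )
                ;
        return 0;
        }
@end verbatim
@end example

Because @code{words} is declared inside the scanning routine, each call
to @code{yylex()} starts counting from zero, so the scanner reports a
per-line count rather than a running total.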
593
594@node User Code Section, Comments in the Input, Rules Section, Format
595@section Format of the User Code Section
596
597@cindex input file, user code Section
598@cindex user code, in flex input
599The user code section is simply copied to @file{lex.yy.c} verbatim.  It
600is used for companion routines which call or are called by the scanner.
601The presence of this section is optional; if it is missing, the second
602@samp{%%} in the input file may be skipped, too.
603
604@node Comments in the Input,  , User Code Section, Format
605@section Comments in the Input
606
607@cindex comments, syntax of
608Flex supports C-style comments, that is, anything between @samp{/*} and
609@samp{*/} is
610considered a comment. Whenever flex encounters a comment, it copies the
611entire comment verbatim to the generated source code. Comments may
612appear just about anywhere, but with the following exceptions:
613
614@itemize
615@cindex comments, in rules section
616@item
617Comments may not appear in the Rules Section wherever flex is expecting
618a regular expression. This means comments may not appear at the
619beginning of a line, or immediately following a list of scanner states.
620@item
621Comments may not appear on an @samp{%option} line in the Definitions
622Section.
623@end itemize
624
If you want to follow a simple rule, then always begin a comment on a
new line, with one or more whitespace characters before the initial
@samp{/*}.  This rule will work anywhere in the input file.
628
629All the comments in the following example are valid:
630
631@cindex comments, valid uses of
632@cindex comments in the input
@example
@verbatim
%{
/* code block */
%}

/* Definitions Section */
%x STATE_X

%%
    /* Rules Section */
ruleA   /* after regex */ { /* code block */ } /* after code block */
        /* Rules Section (indented) */
<STATE_X>{
ruleC   ECHO;
ruleD   ECHO;
%{
/* code block */
%}
}
%%
/* User Code Section */

@end verbatim
@end example
658
659@node Patterns, Matching, Format, Top
660@chapter Patterns
661
662@cindex patterns, in rules section
663@cindex regular expressions, in patterns
664The patterns in the input (see @ref{Rules Section}) are written using an
665extended set of regular expressions.  These are:
666
667@cindex patterns, syntax
669@table @samp
670@item x
671match the character 'x'
672
673@item .
674any character (byte) except newline
675
676@cindex [] in patterns
677@cindex character classes in patterns, syntax of
678@cindex POSIX, character classes in patterns, syntax of
679@item [xyz]
680a @dfn{character class}; in this case, the pattern
681matches either an 'x', a 'y', or a 'z'
682
683@cindex ranges in patterns
684@item [abj-oZ]
685a "character class" with a range in it; matches
686an 'a', a 'b', any letter from 'j' through 'o',
687or a 'Z'
688
689@cindex ranges in patterns, negating
690@cindex negating ranges in patterns
691@item [^A-Z]
692a "negated character class", i.e., any character
693but those in the class.  In this case, any
694character EXCEPT an uppercase letter.
695
696@item [^A-Z\n]
697any character EXCEPT an uppercase letter or
698a newline
699
700@item [a-z]@{-@}[aeiou]
701the lowercase consonants
702
703@item r*
704zero or more r's, where r is any regular expression
705
706@item r+
707one or more r's
708
709@item r?
710zero or one r's (that is, ``an optional r'')
711
712@cindex braces in patterns
713@item r@{2,5@}
714anywhere from two to five r's
715
716@item r@{2,@}
717two or more r's
718
719@item r@{4@}
720exactly 4 r's
721
722@cindex pattern aliases, expansion of
723@item @{name@}
724the expansion of the @samp{name} definition
725(@pxref{Format}).
726
727@cindex literal text in patterns, syntax of
728@cindex verbatim text in patterns, syntax of
729@item "[xyz]\"foo"
730the literal string: @samp{[xyz]"foo}
731
732@cindex escape sequences in patterns, syntax of
733@item \X
if X is @samp{a}, @samp{b}, @samp{f}, @samp{n}, @samp{r}, @samp{t}, or
@samp{v}, then the ANSI-C interpretation of @samp{\X}.  Otherwise, a
literal @samp{X} (used to escape operators such as @samp{*})
737
738@cindex NULL character in patterns, syntax of
739@item \0
740a NUL character (ASCII code 0)
741
742@cindex octal characters in patterns
743@item \123
744the character with octal value 123
745
746@item \x2a
747the character with hexadecimal value 2a
748
749@item (r)
750match an @samp{r}; parentheses are used to override precedence (see below)
751
752@item (?r-s:pattern)
753apply option @samp{r} and omit option @samp{s} while interpreting pattern.
754Options may be zero or more of the characters @samp{i}, @samp{s}, or @samp{x}.
755
756@samp{i} means case-insensitive. @samp{-i} means case-sensitive.
757
758@samp{s} alters the meaning of the @samp{.} syntax to match any single byte whatsoever.
759@samp{-s} alters the meaning of @samp{.} to match any byte except @samp{\n}.
760
761@samp{x} ignores comments and whitespace in patterns. Whitespace is ignored unless
762it is backslash-escaped, contained within @samp{""}s, or appears inside a 
763character class.
764
765The following are all valid:
766
@verbatim
(?:foo)         same as  (foo)
(?i:ab7)        same as  ([aA][bB]7)
(?-i:ab)        same as  (ab)
(?s:.)          same as  [\x00-\xFF]
(?-s:.)         same as  [^\n]
(?ix-s: a . b)  same as  ([Aa][^\n][bB])
(?x:a  b)       same as  ("ab")
(?x:a\ b)       same as  ("a b")
(?x:a" "b)      same as  ("a b")
(?x:a[ ]b)      same as  ("a b")
(?x:a
    /* comment */
    b
    c)          same as  (abc)
@end verbatim
783
784@item (?# comment )
omit everything within @samp{()}. The first @samp{)}
character encountered ends the comment. It is not possible for the comment
to contain a @samp{)} character. The comment may span lines.
788
789@cindex concatenation, in patterns
790@item rs
791the regular expression @samp{r} followed by the regular expression @samp{s}; called
792@dfn{concatenation}
793
794@item r|s
795either an @samp{r} or an @samp{s}
796
797@cindex trailing context, in patterns
798@item r/s
799an @samp{r} but only if it is followed by an @samp{s}.  The text matched by @samp{s} is
800included when determining whether this rule is the longest match, but is
801then returned to the input before the action is executed.  So the action
802only sees the text matched by @samp{r}.  This type of pattern is called
803@dfn{trailing context}.  (There are some combinations of @samp{r/s} that flex
804cannot match correctly. @xref{Limitations}, regarding dangerous trailing
805context.)
806
807@cindex beginning of line, in patterns
808@cindex BOL, in patterns
809@item ^r
810an @samp{r}, but only at the beginning of a line (i.e.,
811when just starting to scan, or right after a
812newline has been scanned).
813
814@cindex end of line, in patterns
815@cindex EOL, in patterns
816@item r$
817an @samp{r}, but only at the end of a line (i.e., just before a
818newline).  Equivalent to @samp{r/\n}.
819
820@cindex newline, matching in patterns
821Note that @code{flex}'s notion of ``newline'' is exactly
822whatever the C compiler used to compile @code{flex}
823interprets @samp{\n} as; in particular, on some DOS
824systems you must either filter out @samp{\r}s in the
825input yourself, or explicitly use @samp{r/\r\n} for @samp{r$}.
826
827@cindex start conditions, in patterns
828@item <s>r
829an @samp{r}, but only in start condition @code{s} (see @ref{Start
830Conditions} for discussion of start conditions).
831
832@item <s1,s2,s3>r
833same, but in any of start conditions @code{s1}, @code{s2}, or @code{s3}.
834
835@item <*>r
836an @samp{r} in any start condition, even an exclusive one.
837
838@cindex end of file, in patterns
839@cindex EOF in patterns, syntax of
840@item <<EOF>>
841an end-of-file.
842
843@item <s1,s2><<EOF>>
844an end-of-file when in start condition @code{s1} or @code{s2}
845@end table
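
Several of the context-sensitive patterns from the table are combined in
the sketch below.  It is shown flush left because @samp{^} and the rule
layout depend on column position.  The inputs it looks for (lines
starting with @samp{#}, numbers followed by @samp{px}) are invented for
illustration, and @samp{%option noyywrap} is used only so that the
example links without the runtime library.

@example
@verbatim
%option noyywrap
%%
^"#".*          printf( "comment: %s\n", yytext );
[0-9]+/"px"     printf( "pixel count: %s\n", yytext );
[ \t]+$         /* discard blanks at the end of a line */
<<EOF>>         printf( "end of input\n" ); return 0;
.|\n            /* ignore anything else */

%%

int main( void )
        {
        yylex();
        return 0;
        }
@end verbatim
@end example

In the second rule the trailing context @samp{"px"} must be present for
the rule to match, but it is returned to the input, so @code{yytext}
holds only the digits.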
846
Note that inside of a character class, all regular expression operators
lose their special meaning except escape (@samp{\}) and the character class
operators, @samp{-}, @samp{]}, and, at the beginning of the class, @samp{^}.
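
For instance, each class in the following sketch matches exactly one of
the listed characters, taken literally:

@example
@verbatim
    [*+?|()]    one of the operator characters themselves
    [a\-z]      an 'a', a '-', or a 'z' (the escaped dash is not a range)
    [\^\]]      a '^' or a ']'
    [a^b]       an 'a', a '^', or a 'b' ('^' negates only when first)
@end verbatim
@end example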
850
851@cindex patterns, precedence of operators
852The regular expressions listed above are grouped according to
853precedence, from highest precedence at the top to lowest at the bottom.
854Those grouped together have equal precedence (see special note on the
855precedence of the repeat operator, @samp{@{@}}, under the documentation
856for the @samp{--posix} POSIX compliance option).  For example,
857
858@cindex patterns, grouping and precedence
@example
@verbatim
    foo|bar*
@end verbatim
@end example
864
865is the same as
866
@example
@verbatim
    (foo)|(ba(r*))
@end verbatim
@end example
872
873since the @samp{*} operator has higher precedence than concatenation,
874and concatenation higher than alternation (@samp{|}).  This pattern
875therefore matches @emph{either} the string @samp{foo} @emph{or} the
876string @samp{ba} followed by zero-or-more @samp{r}'s.  To match
877@samp{foo} or zero-or-more repetitions of the string @samp{bar}, use:
878
@example
@verbatim
    foo|(bar)*
@end verbatim
@end example
884
885And to match a sequence of zero or more repetitions of @samp{foo} and
886@samp{bar}:
887
888@cindex patterns, repetitions with grouping
@example
@verbatim
    (foo|bar)*
@end verbatim
@end example
894
895@cindex character classes in patterns
In addition to characters and ranges of characters, character classes
can also contain @dfn{character class expressions}.  These are
expressions enclosed inside @samp{[:} and @samp{:]} delimiters, which
themselves must appear between the @samp{[} and @samp{]} of the
character class; other elements may occur inside the character class,
too.  The valid expressions are:
902
903@cindex patterns, valid character classes
@example
@verbatim
    [:alnum:] [:alpha:] [:blank:]
    [:cntrl:] [:digit:] [:graph:]
    [:lower:] [:print:] [:punct:]
    [:space:] [:upper:] [:xdigit:]
@end verbatim
@end example
912
913These expressions all designate a set of characters equivalent to the
914corresponding standard C @code{isXXX} function.  For example,
915@samp{[:alnum:]} designates those characters for which @code{isalnum()}
916returns true - i.e., any alphabetic or numeric character.  Some systems
917don't provide @code{isblank()}, so flex defines @samp{[:blank:]} as a
918blank or a tab.
919
920For example, the following character classes are all equivalent:
921
922@cindex character classes, equivalence of
923@cindex patterns, character class equivalence
@example
@verbatim
    [[:alnum:]]
    [[:alpha:][:digit:]]
    [[:alpha:][0-9]]
    [a-zA-Z0-9]
@end verbatim
@end example
932
933A word of caution. Character classes are expanded immediately when seen in the @code{flex} input. 
934This means the character classes are sensitive to the locale in which @code{flex}
935is executed, and the resulting scanner will not be sensitive to the runtime locale.
936This may or may not be desirable.
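
As a minimal sketch of character class expressions in use, the following
scanner reports identifiers and unsigned integers.  The rules and
messages are illustrative assumptions rather than anything defined by
@code{flex}; @code{%option noyywrap} is used only so that the example
links without @samp{-lfl}.  Because the classes are expanded when
@code{flex} runs, the sets they denote are fixed at generation time.

@example
@verbatim
    %option noyywrap
    %%
    [[:alpha:]_][[:alnum:]_]*   printf( "identifier: %s\n", yytext );
    [[:digit:]]+                printf( "integer: %s\n", yytext );
    [[:space:]]+                /* skip whitespace */
    .                           printf( "other: %s\n", yytext );
    %%
    int main( void )
        {
        yylex();
        return 0;
        }
@end verbatim
@end example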
937
938
939@itemize
940@cindex case-insensitive, effect on character classes
941@item If your scanner is case-insensitive (the @samp{-i} flag), then
942@samp{[:upper:]} and @samp{[:lower:]} are equivalent to
943@samp{[:alpha:]}.
944
945@anchor{case and character ranges}
@item Character classes with ranges, such as @samp{[A-z]}, should be used with
947caution in a case-insensitive scanner if the range spans upper or lowercase
948characters. Flex does not know if you want to fold all upper and lowercase
949characters together, or if you want the literal numeric range specified (with
950no case folding). When in doubt, flex will assume that you meant the literal
951numeric range, and will issue a warning. The exception to this rule is a
952character range such as @samp{[a-z]} or @samp{[S-W]} where it is obvious that you
953want case-folding to occur. Here are some examples with the @samp{-i} flag
954enabled:
955
956@multitable {@samp{[a-zA-Z]}} {ambiguous} {@samp{[A-Z\[\\\]_`a-t]}} {@samp{[@@A-Z\[\\\]_`abc]}}
957@item Range @tab Result @tab Literal Range @tab Alternate Range
958@item @samp{[a-t]} @tab ok @tab @samp{[a-tA-T]} @tab
959@item @samp{[A-T]} @tab ok @tab @samp{[a-tA-T]} @tab
960@item @samp{[A-t]} @tab ambiguous @tab @samp{[A-Z\[\\\]_`a-t]} @tab @samp{[a-tA-T]}
961@item @samp{[_-@{]} @tab ambiguous @tab @samp{[_`a-z@{]} @tab @samp{[_`a-zA-Z@{]}
962@item @samp{[@@-C]} @tab ambiguous @tab @samp{[@@ABC]} @tab @samp{[@@A-Z\[\\\]_`abc]}
963@end multitable
964
965@cindex end of line, in negated character classes
966@cindex EOL, in negated character classes
967@item
968A negated character class such as the example @samp{[^A-Z]} above
969@emph{will} match a newline unless @samp{\n} (or an equivalent escape
970sequence) is one of the characters explicitly present in the negated
971character class (e.g., @samp{[^A-Z\n]}).  This is unlike how many other
972regular expression tools treat negated character classes, but
973unfortunately the inconsistency is historically entrenched.  Matching
974newlines means that a pattern like @samp{[^"]*} can match the entire
975input unless there's another quote in the input.
976
977Flex allows negation of character class expressions by prepending @samp{^} to
978the POSIX character class name.
979
980@example
981@verbatim
982    [:^alnum:] [:^alpha:] [:^blank:]
983    [:^cntrl:] [:^digit:] [:^graph:]
984    [:^lower:] [:^print:] [:^punct:]
985    [:^space:] [:^upper:] [:^xdigit:]
986@end verbatim
987@end example
988
989Flex will issue a warning if the expressions @samp{[:^upper:]} and
990@samp{[:^lower:]} appear in a case-insensitive scanner, since their meaning is
991unclear. The current behavior is to skip them entirely, but this may change
992without notice in future revisions of flex.
993
994@item
995
996The @samp{@{-@}} operator computes the difference of two character classes. For
997example, @samp{[a-c]@{-@}[b-z]} represents all the characters in the class
998@samp{[a-c]} that are not in the class @samp{[b-z]} (which in this case, is
999just the single character @samp{a}). The @samp{@{-@}} operator is left
1000associative, so @samp{[abc]@{-@}[b]@{-@}[c]} is the same as @samp{[a]}. Be careful
1001not to accidentally create an empty set, which will never match.
1002
1003@item
1004
1005The @samp{@{+@}} operator computes the union of two character classes. For
1006example, @samp{[a-z]@{+@}[0-9]} is the same as @samp{[a-z0-9]}. This operator
1007is useful when preceded by the result of a difference operation, as in,
1008@samp{[[:alpha:]]@{-@}[[:lower:]]@{+@}[q]}, which is equivalent to
1009@samp{[A-Zq]} in the "C" locale.
1010
1011@cindex trailing context, limits of
1012@cindex ^ as non-special character in patterns
1013@cindex $ as normal character in patterns
1014@item
1015A rule can have at most one instance of trailing context (the @samp{/} operator
1016or the @samp{$} operator).  The start condition, @samp{^}, and @samp{<<EOF>>} patterns
can only occur at the beginning of a pattern and, like @samp{/} and @samp{$},
1018cannot be grouped inside parentheses.  A @samp{^} which does not occur at
1019the beginning of a rule or a @samp{$} which does not occur at the end of
1020a rule loses its special properties and is treated as a normal character.
1021
1022@item
1023The following are invalid:
1024
1025@cindex patterns, invalid trailing context
1026@example
1027@verbatim
1028    foo/bar$
1029    <sc1>foo<sc2>bar
1030@end verbatim
1031@end example
1032
1033Note that the first of these can be written @samp{foo/bar\n}.
1034
1035@item
1036The following will result in @samp{$} or @samp{^} being treated as a normal character:
1037
1038@cindex patterns, special characters treated as non-special
1039@example
1040@verbatim
1041    foo|(bar$)
1042    foo|^bar
1043@end verbatim
1044@end example
1045
1046If the desired meaning is a @samp{foo} or a
1047@samp{bar}-followed-by-a-newline, the following could be used (the
1048special @code{|} action is explained below, @pxref{Actions}):
1049
1050@cindex patterns, end of line
1051@example
1052@verbatim
1053    foo      |
1054    bar$     /* action goes here */
1055@end verbatim
1056@end example
1057
1058A similar trick will work for matching a @samp{foo} or a
1059@samp{bar}-at-the-beginning-of-a-line.
1060@end itemize
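
The @samp{@{-@}} operator from the notes above can be sketched as
follows.  This is an illustrative scanner, assuming the "C" locale; the
messages are made up for the example.  The first rule uses a difference
to describe the lowercase consonants, and the third uses one to describe
the alphanumeric characters that are not lowercase letters.

@example
@verbatim
    %option noyywrap
    %%
    [a-z]{-}[aeiou]         printf( "consonant: %s\n", yytext );
    [aeiou]                 printf( "vowel: %s\n", yytext );
    [[:alnum:]]{-}[a-z]     printf( "other alphanumeric: %s\n", yytext );
    .|\n                    /* ignore everything else */
    %%
    int main( void )
        {
        yylex();
        return 0;
        }
@end verbatim
@end example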
1061
1062@node Matching, Actions, Patterns, Top
1063@chapter How the Input Is Matched
1064
1065@cindex patterns, matching
1066@cindex input, matching
1067@cindex trailing context, matching
1068@cindex matching, and trailing context
1069@cindex matching, length of
1070@cindex matching, multiple matches
1071When the generated scanner is run, it analyzes its input looking for
1072strings which match any of its patterns.  If it finds more than one
1073match, it takes the one matching the most text (for trailing context
1074rules, this includes the length of the trailing part, even though it
1075will then be returned to the input).  If it finds two or more matches of
1076the same length, the rule listed first in the @code{flex} input file is
1077chosen.
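
The two selection rules can be seen in a small sketch (the rule set is
an illustrative assumption).  On the input @samp{if}, both of the first
two patterns match two characters, so the rule listed first is chosen.
On the input @samp{iffy}, the second pattern matches four characters
and therefore wins, even though the first pattern matches a prefix.

@example
@verbatim
    %option noyywrap
    %%
    if          printf( "keyword: %s\n", yytext );
    [a-z]+      printf( "identifier: %s\n", yytext );
    [ \t\n]+    /* skip whitespace */
    %%
    int main( void )
        {
        yylex();
        return 0;
        }
@end verbatim
@end example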
1078
1079@cindex token
1080@cindex yytext
1081@cindex yyleng
1082Once the match is determined, the text corresponding to the match
1083(called the @dfn{token}) is made available in the global character
1084pointer @code{yytext}, and its length in the global integer
1085@code{yyleng}.  The @dfn{action} corresponding to the matched pattern is
1086then executed (@pxref{Actions}), and then the remaining input is scanned
1087for another match.
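
A minimal sketch of an action that uses both variables follows.  The
message format is an assumption for illustration, and @code{yyleng} is
cast to @code{int} so that the @code{printf()} format is correct
whatever integer type a given @code{flex} version uses for it.

@example
@verbatim
    %option noyywrap
    %%
    [^ \t\n]+   {
                printf( "token \"%s\" has length %d\n",
                        yytext, (int) yyleng );
                }
    [ \t\n]+    /* skip whitespace */
    %%
    int main( void )
        {
        yylex();
        return 0;
        }
@end verbatim
@end example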
1088
1089@cindex default rule
1090If no match is found, then the @dfn{default rule} is executed: the next
1091character in the input is considered matched and copied to the standard
1092output.  Thus, the simplest valid @code{flex} input is:
1093
1094@cindex minimal scanner
1095@example
1096@verbatim
1097    %%
1098@end verbatim
1099@end example
1100
1101which generates a scanner that simply copies its input (one character at
1102a time) to its output.
1103
1104@cindex yytext, two types of
1105@cindex %array, use of
1106@cindex %pointer, use of
1107@vindex yytext
1108Note that @code{yytext} can be defined in two different ways: either as
1109a character @emph{pointer} or as a character @emph{array}. You can
1110control which definition @code{flex} uses by including one of the
1111special directives @code{%pointer} or @code{%array} in the first
1112(definitions) section of your flex input.  The default is
1113@code{%pointer}, unless you use the @samp{-l} lex compatibility option,
1114in which case @code{yytext} will be an array.  The advantage of using
1115@code{%pointer} is substantially faster scanning and no buffer overflow
1116when matching very large tokens (unless you run out of dynamic memory).
1117The disadvantage is that you are restricted in how your actions can
modify @code{yytext} (@pxref{Actions}), and calls to the @code{unput()}
function destroy the present contents of @code{yytext}, which can be a
1120considerable porting headache when moving between different @code{lex}
1121versions.
1122
1123@cindex %array, advantages of
1124The advantage of @code{%array} is that you can then modify @code{yytext}
1125to your heart's content, and calls to @code{unput()} do not destroy
1126@code{yytext} (@pxref{Actions}).  Furthermore, existing @code{lex}
1127programs sometimes access @code{yytext} externally using declarations of
1128the form:
1129
1130@example
1131@verbatim
1132    extern char yytext[];
1133@end verbatim
1134@end example
1135
1136This definition is erroneous when used with @code{%pointer}, but correct
1137for @code{%array}.
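
For a scanner built with @code{%pointer}, the matching external
declaration in another source file would instead be a pointer, as in
this short sketch:

@example
@verbatim
    /* For a %pointer scanner: yytext is a pointer, not an array. */
    extern char *yytext;
@end verbatim
@end example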
1138
1139The @code{%array} declaration defines @code{yytext} to be an array of
1140@code{YYLMAX} characters, which defaults to a fairly large value.  You
1141can change the size by simply #define'ing @code{YYLMAX} to a different
1142value in the first section of your @code{flex} input.  As mentioned
above, with @code{%pointer} @code{yytext} grows dynamically to accommodate
1144large tokens.  While this means your @code{%pointer} scanner can
1145accommodate very large tokens (such as matching entire blocks of
1146comments), bear in mind that each time the scanner must resize
1147@code{yytext} it also must rescan the entire token from the beginning,
1148so matching such tokens can prove slow.  @code{yytext} presently does
1149@emph{not} dynamically grow if a call to @code{unput()} results in too
1150much text being pushed back; instead, a run-time error results.
1151
1152@cindex %array, with C++
1153Also note that you cannot use @code{%array} with C++ scanner classes
1154(@pxref{Cxx}).
1155
1156@node Actions, Generated Scanner, Matching, Top
1157@chapter Actions
1158
1159@cindex actions
1160Each pattern in a rule has a corresponding @dfn{action}, which can be
1161any arbitrary C statement.  The pattern ends at the first non-escaped
1162whitespace character; the remainder of the line is its action.  If the
1163action is empty, then when the pattern is matched the input token is
1164simply discarded.  For example, here is the specification for a program
1165which deletes all occurrences of @samp{zap me} from its input:
1166
1167@cindex deleting lines from input
1168@example
1169@verbatim
1170    %%
1171    "zap me"
1172@end verbatim
1173@end example
1174
1175This example will copy all other characters in the input to the output
1176since they will be matched by the default rule.
1177
1178Here is a program which compresses multiple blanks and tabs down to a
1179single blank, and throws away whitespace found at the end of a line:
1180
1181@cindex whitespace, compressing
1182@cindex compressing whitespace
1183@example
1184@verbatim
1185    %%
1186    [ \t]+        putchar( ' ' );
1187    [ \t]+$       /* ignore this token */
1188@end verbatim
1189@end example
1190
1191@cindex %@{ and %@}, in Rules Section
1192@cindex actions, use of @{ and @}
1193@cindex actions, embedded C strings
1194@cindex C-strings, in actions
1195@cindex comments, in actions
If the action contains a @samp{@{}, then the action extends until the
balancing @samp{@}} is found, and the action may cross multiple lines.
1198@code{flex} knows about C strings and comments and won't be fooled by
1199braces found within them, but also allows actions to begin with
1200@samp{%@{} and will consider the action to be all the text up to the
1201next @samp{%@}} (regardless of ordinary braces inside the action).
1202
1203@cindex |, in actions
1204An action consisting solely of a vertical bar (@samp{|}) means ``same as the
1205action for the next rule''.  See below for an illustration.
1206
1207Actions can include arbitrary C code, including @code{return} statements
1208to return a value to whatever routine called @code{yylex()}.  Each time
1209@code{yylex()} is called it continues processing tokens from where it
1210last left off until it either reaches the end of the file or executes a
1211return.
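
The following sketch shows actions that return to the caller, together
with a driver that calls @code{yylex()} repeatedly.  The token codes
@code{NUMBER} and @code{WORD} are hypothetical names chosen for the
example; the value 0 is left free because @code{yylex()} returns it at
end-of-file.

@example
@verbatim
    %{
    /* Hypothetical token codes; 0 is left for end-of-file. */
    enum { NUMBER = 1, WORD = 2 };
    %}
    %option noyywrap
    %%
    [0-9]+      return NUMBER;
    [a-z]+      return WORD;
    .|\n        /* skip everything else */
    %%
    int main( void )
        {
        int tok;

        /* Each call resumes where the previous one left off. */
        while ( (tok = yylex()) != 0 )
            printf( "%s: %s\n",
                    tok == NUMBER ? "NUMBER" : "WORD", yytext );
        return 0;
        }
@end verbatim
@end example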
1212
1213@cindex yytext, modification of
1214Actions are free to modify @code{yytext} except for lengthening it
1215(adding characters to its end--these will overwrite later characters in
1216the input stream).  This however does not apply when using @code{%array}
1217(@pxref{Matching}). In that case, @code{yytext} may be freely modified
1218in any way.
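
As a sketch of a modification that is safe with either definition of
@code{yytext}, the following action rewrites the token in place without
changing its length.  The choice of upper-casing is only an
illustration.

@example
@verbatim
    %{
    #include <ctype.h>
    %}
    %option noyywrap
    %%
    [a-z]+      {
                char *p;

                /* Overwrite each character; nothing is appended. */
                for ( p = yytext; *p != '\0'; ++p )
                    *p = toupper( (unsigned char) *p );
                ECHO;
                }
    %%
    int main( void )
        {
        yylex();
        return 0;
        }
@end verbatim
@end example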
1219
1220@cindex yyleng, modification of
1221@cindex yymore, and yyleng
1222Actions are free to modify @code{yyleng} except they should not do so if
1223the action also includes use of @code{yymore()} (see below).
1224
1225@cindex preprocessor macros, for use in actions
1226There are a number of special directives which can be included within an
1227action:
1228
1229@table @code
1230@item  ECHO
1231@cindex ECHO
1232copies yytext to the scanner's output.
1233
1234@item  BEGIN
1235@cindex BEGIN
1236followed by the name of a start condition places the scanner in the
1237corresponding start condition (see below).
1238
1239@item  REJECT
1240@cindex REJECT
1241directs the scanner to proceed on to the ``second best'' rule which
1242matched the input (or a prefix of the input).  The rule is chosen as
1243described above in @ref{Matching}, and @code{yytext} and @code{yyleng}
1244set up appropriately.  It may either be one which matched as much text
1245as the originally chosen rule but came later in the @code{flex} input
1246file, or one which matched less text.  For example, the following will
1247both count the words in the input and call the routine @code{special()}
1248whenever @samp{frob} is seen:
1249
1250@example
1251@verbatim
1252            int word_count = 0;
1253    %%
1254
1255    frob        special(); REJECT;
1256    [^ \t\n]+   ++word_count;
1257@end verbatim
1258@end example
1259
1260Without the @code{REJECT}, any occurrences of @samp{frob} in the input
1261would not be counted as words, since the scanner normally executes only
1262one action per token.  Multiple uses of @code{REJECT} are allowed, each
1263one finding the next best choice to the currently active rule.  For
1264example, when the following scanner scans the token @samp{abcd}, it will
1265write @samp{abcdabcaba} to the output:
1266
1267@cindex REJECT, calling multiple times
1268@cindex |, use of
1269@example
1270@verbatim
1271    %%
1272    a        |
1273    ab       |
1274    abc      |
1275    abcd     ECHO; REJECT;
1276    .|\n     /* eat up any unmatched character */
1277@end verbatim
1278@end example
1279
1280The first three rules share the fourth's action since they use the
1281special @samp{|} action.
1282
1283@code{REJECT} is a particularly expensive feature in terms of scanner
1284performance; if it is used in @emph{any} of the scanner's actions it
1285will slow down @emph{all} of the scanner's matching.  Furthermore,
1286@code{REJECT} cannot be used with the @samp{-Cf} or @samp{-CF} options
1287(@pxref{Scanner Options}).
1288
1289Note also that unlike the other special actions, @code{REJECT} is a
1290@emph{branch}.  Code immediately following it in the action will
1291@emph{not} be executed.
1292
1293@item  yymore()
1294@cindex yymore()
1295tells the scanner that the next time it matches a rule, the
1296corresponding token should be @emph{appended} onto the current value of
1297@code{yytext} rather than replacing it.  For example, given the input
1298@samp{mega-kludge} the following will write @samp{mega-mega-kludge} to
1299the output:
1300
1301@cindex yymore(), mega-kludge
1302@cindex yymore() to append token to previous token
1303@example
1304@verbatim
1305    %%
1306    mega-    ECHO; yymore();
1307    kludge   ECHO;
1308@end verbatim
1309@end example
1310
1311First @samp{mega-} is matched and echoed to the output.  Then @samp{kludge}
1312is matched, but the previous @samp{mega-} is still hanging around at the
1313beginning of
1314@code{yytext}
1315so the
1316@code{ECHO}
1317for the @samp{kludge} rule will actually write @samp{mega-kludge}.
1318@end table
1319
1320@cindex yymore, performance penalty of
1321Two notes regarding use of @code{yymore()}.  First, @code{yymore()}
1322depends on the value of @code{yyleng} correctly reflecting the size of
1323the current token, so you must not modify @code{yyleng} if you are using
1324@code{yymore()}.  Second, the presence of @code{yymore()} in the
1325scanner's action entails a minor performance penalty in the scanner's
1326matching speed.
1327
1328@cindex yyless()
1329@code{yyless(n)} returns all but the first @code{n} characters of the
1330current token back to the input stream, where they will be rescanned
1331when the scanner looks for the next match.  @code{yytext} and
1332@code{yyleng} are adjusted appropriately (e.g., @code{yyleng} will now
1333be equal to @code{n}).  For example, on the input @samp{foobar} the
1334following will write out @samp{foobarbar}:
1335
1336@cindex yyless(), pushing back characters
1337@cindex pushing back characters with yyless
1338@example
1339@verbatim
1340    %%
1341    foobar    ECHO; yyless(3);
1342    [a-z]+    ECHO;
1343@end verbatim
1344@end example
1345
1346An argument of 0 to @code{yyless()} will cause the entire current input
1347string to be scanned again.  Unless you've changed how the scanner will
1348subsequently process its input (using @code{BEGIN}, for example), this
1349will result in an endless loop.
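
One way to avoid the endless loop is to change the start condition
before the text is rescanned.  The following contrived sketch does so;
start conditions and @code{BEGIN} are covered in @ref{Start Conditions},
and the condition name @code{WORD} is made up for the example.  The
first rule pushes its single letter back and switches condition, so the
letter is rescanned by the second rule rather than by the first.

@example
@verbatim
    %x WORD
    %option noyywrap
    %%
    [a-z]           { yyless( 0 ); BEGIN(WORD); }
    <WORD>[a-z]+    { printf( "word: %s\n", yytext ); BEGIN(INITIAL); }
    .|\n            /* skip everything else */
    %%
    int main( void )
        {
        yylex();
        return 0;
        }
@end verbatim
@end example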
1350
1351Note that @code{yyless()} is a macro and can only be used in the flex
1352input file, not from other source files.
1353
1354@cindex unput()
1355@cindex pushing back characters with unput
1356@code{unput(c)} puts the character @code{c} back onto the input stream.
1357It will be the next character scanned.  The following action will take
1358the current token and cause it to be rescanned enclosed in parentheses.
1359
1360@cindex unput(), pushing back characters
1361@cindex pushing back characters with unput()
1362@example
1363@verbatim
1364    {
1365    int i;
1366    /* Copy yytext because unput() trashes yytext */
1367    char *yycopy = strdup( yytext );
1368    unput( ')' );
1369    for ( i = yyleng - 1; i >= 0; --i )
1370        unput( yycopy[i] );
1371    unput( '(' );
1372    free( yycopy );
1373    }
1374@end verbatim
1375@end example
1376
1377Note that since each @code{unput()} puts the given character back at the
1378@emph{beginning} of the input stream, pushing back strings must be done
1379back-to-front.
1380
1381@cindex %pointer, and unput()
1382@cindex unput(), and %pointer
1383An important potential problem when using @code{unput()} is that if you
1384are using @code{%pointer} (the default), a call to @code{unput()}
1385@emph{destroys} the contents of @code{yytext}, starting with its
1386rightmost character and devouring one character to the left with each
1387call.  If you need the value of @code{yytext} preserved after a call to
1388@code{unput()} (as in the above example), you must either first copy it
1389elsewhere, or build your scanner using @code{%array} instead
1390(@pxref{Matching}).
1391
1392@cindex pushing back EOF
1393@cindex EOF, pushing back
1394Finally, note that you cannot put back @samp{EOF} to attempt to mark the
1395input stream with an end-of-file.
1396
1397@cindex input()
1398@code{input()} reads the next character from the input stream.  For
1399example, the following is one way to eat up C comments:
1400
1401@cindex comments, discarding
1402@cindex discarding C comments
1403@example
1404@verbatim
1405    %%
1406    "/*"        {
1407                int c;
1408
1409                for ( ; ; )
1410                    {
1411                    while ( (c = input()) != '*' &&
1412                            c != EOF )
1413                        ;    /* eat up text of comment */
1414
1415                    if ( c == '*' )
1416                        {
1417                        while ( (c = input()) == '*' )
1418                            ;
1419                        if ( c == '/' )
1420                            break;    /* found the end */
1421                        }
1422
1423                    if ( c == EOF )
1424                        {
1425                        error( "EOF in comment" );
1426                        break;
1427                        }
1428                    }
1429                }
1430@end verbatim
1431@end example
1432
1433@cindex input(), and C++
1434@cindex yyinput()
1435(Note that if the scanner is compiled using @code{C++}, then
1436@code{input()} is instead referred to as @b{yyinput()}, in order to
1437avoid a name clash with the @code{C++} stream by the name of
1438@code{input}.)
1439
1440@cindex flushing the internal buffer
1441@cindex YY_FLUSH_BUFFER
1442@code{YY_FLUSH_BUFFER;} flushes the scanner's internal buffer so that
1443the next time the scanner attempts to match a token, it will first
1444refill the buffer using @code{YY_INPUT()} (@pxref{Generated Scanner}).
1445This action is a special case of the more general
@code{yy_flush_buffer()} function, described below (@pxref{Multiple
Input Buffers}).
1448
1449@cindex yyterminate()
1450@cindex terminating with yyterminate()
1451@cindex exiting with yyterminate()
1452@cindex halting with yyterminate()
1453@code{yyterminate()} can be used in lieu of a return statement in an
1454action.  It terminates the scanner and returns a 0 to the scanner's
1455caller, indicating ``all done''.  By default, @code{yyterminate()} is
1456also called when an end-of-file is encountered.  It is a macro and may
1457be redefined.
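
A minimal sketch of @code{yyterminate()} in an action follows.  The
@samp{__END__} marker is a hypothetical convention chosen for the
example: the scanner copies its input to its output until it sees a
line consisting of that marker, and then stops as if it had reached
end-of-file.

@example
@verbatim
    %option noyywrap
    %%
    ^"__END__"\n    yyterminate();
    .|\n            ECHO;
    %%
    int main( void )
        {
        yylex();
        return 0;
        }
@end verbatim
@end example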
1458
1459@node Generated Scanner, Start Conditions, Actions, Top
1460@chapter The Generated Scanner
1461
1462@cindex yylex(), in generated scanner
1463The output of @code{flex} is the file @file{lex.yy.c}, which contains
1464the scanning routine @code{yylex()}, a number of tables used by it for
1465matching tokens, and a number of auxiliary routines and macros.  By
1466default, @code{yylex()} is declared as follows:
1467
1468@example
1469@verbatim
1470    int yylex()
1471        {
1472        ... various definitions and the actions in here ...
1473        }
1474@end verbatim
1475@end example
1476
1477@cindex yylex(), overriding
1478(If your environment supports function prototypes, then it will be
1479@code{int yylex( void )}.)  This definition may be changed by defining
1480the @code{YY_DECL} macro.  For example, you could use:
1481
1482@cindex yylex, overriding the prototype of
1483@example
1484@verbatim
1485    #define YY_DECL float lexscan( a, b ) float a, b;
1486@end verbatim
1487@end example
1488
1489to give the scanning routine the name @code{lexscan}, returning a float,
1490and taking two floats as arguments.  Note that if you give arguments to
1491the scanning routine using a K&R-style/non-prototyped function
1492declaration, you must terminate the definition with a semi-colon (;).
1493
1494@code{flex} generates @samp{C99} function definitions by
1495default. However flex does have the ability to generate obsolete, er,
1496@samp{traditional}, function definitions. This is to support
1497bootstrapping gcc on old systems.  Unfortunately, traditional
1498definitions prevent us from using any standard data types smaller than
1499int (such as short, char, or bool) as function arguments.  For this
1500reason, future versions of @code{flex} may generate standard C99 code
1501only, leaving K&R-style functions to the historians.  Currently, if you
1502do @strong{not} want @samp{C99} definitions, then you must use 
1503@code{%option noansi-definitions}.
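
With prototype-style definitions, @code{YY_DECL} can carry an ordinary
parameter list and needs no trailing semicolon.  The following is a
sketch under that assumption; the @code{count} parameter and the
counting action are hypothetical choices made for the example.

@example
@verbatim
    %{
    /* Hypothetical signature: the caller passes a counter. */
    #define YY_DECL int lexscan( int *count )
    %}
    %option noyywrap
    %%
    [a-z]+      ++*count;
    .|\n        /* skip everything else */
    %%
    int main( void )
        {
        int words = 0;

        lexscan( &words );
        printf( "%d\n", words );
        return 0;
        }
@end verbatim
@end example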
1504
1505@cindex stdin, default for yyin
1506@cindex yyin
1507Whenever @code{yylex()} is called, it scans tokens from the global input
1508file @file{yyin} (which defaults to stdin).  It continues until it
1509either reaches an end-of-file (at which point it returns the value 0) or
1510one of its actions executes a @code{return} statement.
1511
1512@cindex EOF and yyrestart()
1513@cindex end-of-file, and yyrestart()
1514@cindex yyrestart()
1515If the scanner reaches an end-of-file, subsequent calls are undefined
1516unless either @file{yyin} is pointed at a new input file (in which case
1517scanning continues from that file), or @code{yyrestart()} is called.
1518@code{yyrestart()} takes one argument, a @code{FILE *} pointer (which
1519can be NULL, if you've set up @code{YY_INPUT} to scan from a source other
1520than @code{yyin}), and initializes @file{yyin} for scanning from that
1521file.  Essentially there is no difference between just assigning
1522@file{yyin} to a new input file or using @code{yyrestart()} to do so;
1523the latter is available for compatibility with previous versions of
1524@code{flex}, and because it can be used to switch input files in the
1525middle of scanning.  It can also be used to throw away the current input
1526buffer, by calling it with an argument of @file{yyin}; but it would be
1527better to use @code{YY_FLUSH_BUFFER} (@pxref{Actions}).  Note that
1528@code{yyrestart()} does @emph{not} reset the start condition to
1529@code{INITIAL} (@pxref{Start Conditions}).
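
As a sketch of @code{yyrestart()} in use, the following driver scans
each file named on the command line in turn and prints the total number
of newlines seen.  The counter @code{lines} and the error handling are
assumptions made for the example.

@example
@verbatim
    %{
    static int lines = 0;
    %}
    %option noyywrap
    %%
    \n          ++lines;
    .           /* ignore everything else */
    %%
    int main( int argc, char **argv )
        {
        int i;

        for ( i = 1; i < argc; ++i )
            {
            FILE *f = fopen( argv[i], "r" );

            if ( f == NULL )
                {
                perror( argv[i] );
                continue;
                }
            yyrestart( f );     /* scan from the new file */
            yylex();            /* returns 0 at end-of-file */
            fclose( f );
            }
        printf( "%d\n", lines );
        return 0;
        }
@end verbatim
@end example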
1530
1531@cindex RETURN, within actions
1532If @code{yylex()} stops scanning due to executing a @code{return}
1533statement in one of the actions, the scanner may then be called again
1534and it will resume scanning where it left off.
1535
1536@cindex YY_INPUT
1537By default (and for purposes of efficiency), the scanner uses
1538block-reads rather than simple @code{getc()} calls to read characters
1539from @file{yyin}.  The nature of how it gets its input can be controlled
1540by defining the @code{YY_INPUT} macro.  The calling sequence for
1541@code{YY_INPUT()} is @code{YY_INPUT(buf,result,max_size)}.  Its action
1542is to place up to @code{max_size} characters in the character array
1543@code{buf} and return in the integer variable @code{result} either the
1544number of characters read or the constant @code{YY_NULL} (0 on Unix
1545systems) to indicate @samp{EOF}.  The default @code{YY_INPUT} reads from
1546the global file-pointer @file{yyin}.
1547
1548@cindex YY_INPUT, overriding
1549Here is a sample definition of @code{YY_INPUT} (in the definitions
1550section of the input file):
1551
1552@example
1553@verbatim
1554    %{
1555    #define YY_INPUT(buf,result,max_size) \
1556        { \
1557        int c = getchar(); \
1558        result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
1559        }
1560    %}
1561@end verbatim
1562@end example
1563
1564This definition will change the input processing to occur one character
1565at a time.
1566
1567@cindex yywrap()
When the scanner receives an end-of-file indication from @code{YY_INPUT}, it
1569then checks the @code{yywrap()} function.  If @code{yywrap()} returns
1570false (zero), then it is assumed that the function has gone ahead and
1571set up @file{yyin} to point to another input file, and scanning
1572continues.  If it returns true (non-zero), then the scanner terminates,
1573returning 0 to its caller.  Note that in either case, the start
1574condition remains unchanged; it does @emph{not} revert to
1575@code{INITIAL}.
1576
1577@cindex yywrap, default for
1578@cindex noyywrap, %option
@cindex %option noyywrap
1580If you do not supply your own version of @code{yywrap()}, then you must
1581either use @code{%option noyywrap} (in which case the scanner behaves as
1582though @code{yywrap()} returned 1), or you must link with @samp{-lfl} to
1583obtain the default version of the routine, which always returns 1.
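
A user-supplied @code{yywrap()} might look like the following sketch,
which copies each file named on the command line to the output.  The
variable @code{next_file} is a hypothetical helper introduced for the
example, and reusing @code{yywrap()} to open the first file is a
convenience of this sketch rather than a requirement.

@example
@verbatim
    %{
    static char **next_file;    /* hypothetical: names still to scan */
    %}
    %%
    .|\n        ECHO;
    %%
    int yywrap( void )
        {
        if ( yyin != NULL && yyin != stdin )
            fclose( yyin );
        while ( *next_file != NULL )
            {
            yyin = fopen( *next_file++, "r" );
            if ( yyin != NULL )
                return 0;       /* more input: keep scanning */
            }
        return 1;               /* no more files: terminate */
        }

    int main( int argc, char **argv )
        {
        next_file = argv + 1;
        if ( yywrap() )         /* open the first file, if any */
            return 0;
        yylex();
        return 0;
        }
@end verbatim
@end example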
1584
1585For scanning from in-memory buffers (e.g., scanning strings), see
1586@ref{Scanning Strings}. @xref{Multiple Input Buffers}.
1587
1588@cindex ECHO, and yyout
1589@cindex yyout
1590@cindex stdout, as default for yyout
1591The scanner writes its @code{ECHO} output to the @file{yyout} global
1592(default, @file{stdout}), which may be redefined by the user simply by
1593assigning it to some other @code{FILE} pointer.
1594
1595@node Start Conditions, Multiple Input Buffers, Generated Scanner, Top
1596@chapter Start Conditions
1597
1598@cindex start conditions
1599@code{flex} provides a mechanism for conditionally activating rules.
1600Any rule whose pattern is prefixed with @samp{<sc>} will only be active
1601when the scanner is in the @dfn{start condition} named @code{sc}.  For
1602example,
1603
1604@example
1605@verbatim
1606    <STRING>[^"]*        { /* eat up the string body ... */
1607                ...
1608                }
1609@end verbatim
1610@end example
1611
1612will be active only when the scanner is in the @code{STRING} start
1613condition, and
1614
1615@cindex start conditions, multiple
1616@example
1617@verbatim
    <INITIAL,STRING,QUOTE>\\.       { /* handle an escape ... */
1619                ...
1620                }
1621@end verbatim
1622@end example
1623
1624will be active only when the current start condition is either
1625@code{INITIAL}, @code{STRING}, or @code{QUOTE}.
1626
@cindex start conditions, inclusive vs.@: exclusive
1628Start conditions are declared in the definitions (first) section of the
1629input using unindented lines beginning with either @samp{%s} or
1630@samp{%x} followed by a list of names.  The former declares
1631@dfn{inclusive} start conditions, the latter @dfn{exclusive} start
1632conditions.  A start condition is activated using the @code{BEGIN}
1633action.  Until the next @code{BEGIN} action is executed, rules with the
1634given start condition will be active and rules with other start
1635conditions will be inactive.  If the start condition is inclusive, then
1636rules with no start conditions at all will also be active.  If it is
1637exclusive, then @emph{only} rules qualified with the start condition
1638will be active.  A set of rules contingent on the same exclusive start
condition describes a scanner which is independent of any of the other
1640rules in the @code{flex} input.  Because of this, exclusive start
1641conditions make it easy to specify ``mini-scanners'' which scan portions
1642of the input that are syntactically different from the rest (e.g.,
1643comments).
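
As a sketch of such a mini-scanner, the following rules discard C
comments using an exclusive start condition.  The condition name
@code{comment} is chosen for the example.  While the scanner is in
@code{comment}, only the four qualified rules are active, so text
inside a comment is never matched by rules written for ordinary input.

@example
@verbatim
    %x comment
    %option noyywrap
    %%
    "/*"                    BEGIN(comment);
    <comment>[^*\n]+        /* eat anything that's not a '*' */
    <comment>"*"+[^*/\n]*   /* eat '*'s not followed by '/' */
    <comment>\n             /* newlines inside the comment */
    <comment>"*"+"/"        BEGIN(INITIAL);
    %%
    int main( void )
        {
        yylex();
        return 0;
        }
@end verbatim
@end example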
1644
1645If the distinction between inclusive and exclusive start conditions
1646is still a little vague, here's a simple example illustrating the
1647connection between the two.  The set of rules:
1648
1649@cindex start conditions, inclusive
1650@example
1651@verbatim
1652    %s example
1653    %%
1654
1655    <example>foo   do_something();
1656
1657    bar            something_else();
1658@end verbatim
1659@end example
1660
1661is equivalent to
1662
1663@cindex start conditions, exclusive
1664@example
1665@verbatim
1666    %x example
1667    %%
1668
1669    <example>foo   do_something();
1670
1671    <INITIAL,example>bar    something_else();
1672@end verbatim
1673@end example
1674
1675Without the @code{<INITIAL,example>} qualifier, the @code{bar} pattern in
1676the second example wouldn't be active (i.e., couldn't match) when in
1677start condition @code{example}.  If we just used @code{<example>} to
1678qualify @code{bar}, though, then it would only be active in
1679@code{example} and not in @code{INITIAL}, while in the first example
1680it's active in both, because in the first example the @code{example}
1681start condition is an inclusive @code{(%s)} start condition.
1682
1683@cindex start conditions, special wildcard condition
1684Also note that the special start-condition specifier
1685@code{<*>}
1686matches every start condition.  Thus, the above example could also
1687have been written:
1688
1689@cindex start conditions, use of wildcard condition (<*>)
1690@example
1691@verbatim
1692    %x example
1693    %%
1694
1695    <example>foo   do_something();
1696
1697    <*>bar    something_else();
1698@end verbatim
1699@end example
1700
1701The default rule (to @code{ECHO} any unmatched character) remains active
1702in start conditions.  It is equivalent to:
1703
1704@cindex start conditions, behavior of default rule
1705@example
1706@verbatim
1707    <*>.|\n     ECHO;
1708@end verbatim
1709@end example
1710
1711@cindex BEGIN, explanation
1712@findex BEGIN
1713@vindex INITIAL
1714@code{BEGIN(0)} returns to the original state where only the rules with
1715no start conditions are active.  This state can also be referred to as
1716the start-condition @code{INITIAL}, so @code{BEGIN(INITIAL)} is
1717equivalent to @code{BEGIN(0)}.  (The parentheses around the start
1718condition name are not required but are considered good style.)
1719
1720@code{BEGIN} actions can also be given as indented code at the beginning
1721of the rules section.  For example, the following will cause the scanner
1722to enter the @code{SPECIAL} start condition whenever @code{yylex()} is
1723called and the global variable @code{enter_special} is true:
1724
1725@cindex start conditions, using BEGIN
1726@example
1727@verbatim
1728            int enter_special;
1729
1730    %x SPECIAL
1731    %%
1732            if ( enter_special )
1733                BEGIN(SPECIAL);
1734
1735    <SPECIAL>blahblahblah
1736    ...more rules follow...
1737@end verbatim
1738@end example
1739
1740To illustrate the uses of start conditions, here is a scanner which
1741provides two different interpretations of a string like @samp{123.456}.
1742By default it will treat it as three tokens, the integer @samp{123}, a
1743dot (@samp{.}), and the integer @samp{456}.  But if the string is
1744preceded earlier in the line by the string @samp{expect-floats} it will
1745treat it as a single token, the floating-point number @samp{123.456}:
1746
1747@cindex start conditions, for different interpretations of same input
1748@example
1749@verbatim
1750    %{
    #include <stdio.h>
    #include <stdlib.h>
1752    %}
1753    %s expect
1754
1755    %%
1756    expect-floats        BEGIN(expect);
1757
    <expect>[0-9]+"."[0-9]+    {
1759                printf( "found a float, = %f\n",
1760                        atof( yytext ) );
1761                }
1762    <expect>\n           {
1763                /* that's the end of the line, so
                 * we need another "expect-floats"
1765                 * before we'll recognize any more
1766                 * numbers
1767                 */
1768                BEGIN(INITIAL);
1769                }
1770
1771    [0-9]+      {
1772                printf( "found an integer, = %d\n",
1773                        atoi( yytext ) );
1774                }
1775
1776    "."         printf( "found a dot\n" );
1777@end verbatim
1778@end example
1779
1780@cindex comments, example of scanning C comments
1781Here is a scanner which recognizes (and discards) C comments while
1782maintaining a count of the current input line.
1783
1784@cindex recognizing C comments
1785@example
1786@verbatim
1787    %x comment
1788    %%
1789            int line_num = 1;
1790
1791    "/*"         BEGIN(comment);
1792
1793    <comment>[^*\n]*        /* eat anything that's not a '*' */
1794    <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
1795    <comment>\n             ++line_num;
1796    <comment>"*"+"/"        BEGIN(INITIAL);
1797@end verbatim
1798@end example
1799
This scanner goes to a bit of trouble to match as much
text as possible with each rule.  In general, when writing
a high-speed scanner, try to match as much as possible in each rule, as
it's a big win.
1804
Note that start condition names are really integer values and
can be stored as such.  Thus, the above could be extended in the
following fashion:
1808
1809@cindex start conditions, integer values
1810@cindex using integer values of start condition names
1811@example
1812@verbatim
1813    %x comment foo
1814    %%
1815            int line_num = 1;
1816            int comment_caller;
1817
1818    "/*"         {
1819                 comment_caller = INITIAL;
1820                 BEGIN(comment);
1821                 }
1822
1823    ...
1824
1825    <foo>"/*"    {
1826                 comment_caller = foo;
1827                 BEGIN(comment);
1828                 }
1829
1830    <comment>[^*\n]*        /* eat anything that's not a '*' */
1831    <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
1832    <comment>\n             ++line_num;
1833    <comment>"*"+"/"        BEGIN(comment_caller);
1834@end verbatim
1835@end example
1836
1837@cindex YY_START, example
1838Furthermore, you can access the current start condition using the
1839integer-valued @code{YY_START} macro.  For example, the above
1840assignments to @code{comment_caller} could instead be written
1841
1842@cindex getting current start state with YY_START
1843@example
1844@verbatim
1845    comment_caller = YY_START;
1846@end verbatim
1847@end example
1848
1849@vindex YY_START
1850Flex provides @code{YYSTATE} as an alias for @code{YY_START} (since that
1851is what's used by AT&T @code{lex}).
1852
1853For historical reasons, start conditions do not have their own
1854name-space within the generated scanner. The start condition names are
1855unmodified in the generated scanner and generated header.
1856@xref{option-header}. @xref{option-prefix}.
1857
1858
1859
1860Finally, here's an example of how to match C-style quoted strings using
1861exclusive start conditions, including expanded escape sequences (but
1862not including checking for a string that's too long):
1863
1864@cindex matching C-style double-quoted strings
1865@example
1866@verbatim
1867    %x str
1868
1869    %%
1870            char string_buf[MAX_STR_CONST];
1871            char *string_buf_ptr;
1872
1873
1874    \"      string_buf_ptr = string_buf; BEGIN(str);
1875
1876    <str>\"        { /* saw closing quote - all done */
1877            BEGIN(INITIAL);
1878            *string_buf_ptr = '\0';
1879            /* return string constant token type and
1880             * value to parser
1881             */
1882            }
1883
1884    <str>\n        {
1885            /* error - unterminated string constant */
1886            /* generate error message */
1887            }
1888
1889    <str>\\[0-7]{1,3} {
1890            /* octal escape sequence */
            unsigned int result;

            (void) sscanf( yytext + 1, "%o", &result );

            if ( result > 0xff )
                    {
                    /* error, constant is out-of-bounds */
                    }
            else
                    *string_buf_ptr++ = result;
1899            }
1900
1901    <str>\\[0-9]+ {
1902            /* generate error - bad escape sequence; something
1903             * like '\48' or '\0777777'
1904             */
1905            }
1906
1907    <str>\\n  *string_buf_ptr++ = '\n';
1908    <str>\\t  *string_buf_ptr++ = '\t';
1909    <str>\\r  *string_buf_ptr++ = '\r';
1910    <str>\\b  *string_buf_ptr++ = '\b';
1911    <str>\\f  *string_buf_ptr++ = '\f';
1912
1913    <str>\\(.|\n)  *string_buf_ptr++ = yytext[1];
1914
1915    <str>[^\\\n\"]+        {
1916            char *yptr = yytext;
1917
1918            while ( *yptr )
1919                    *string_buf_ptr++ = *yptr++;
1920            }
1921@end verbatim
1922@end example
1923
1924@cindex start condition, applying to multiple patterns
1925Often, such as in some of the examples above, you wind up writing a
1926whole bunch of rules all preceded by the same start condition(s).  Flex
1927makes this a little easier and cleaner by introducing a notion of start
1928condition @dfn{scope}.  A start condition scope is begun with:
1929
1930@example
1931@verbatim
1932    <SCs>{
1933@end verbatim
1934@end example
1935
1936where @code{<SCs>} is a list of one or more start conditions.  Inside the
1937start condition scope, every rule automatically has the prefix
1938@code{<SCs>} applied to it, until a @samp{@}} which matches the initial
1939@samp{@{}.  So, for example,
1940
1941@cindex extended scope of start conditions
1942@example
1943@verbatim
1944    <ESC>{
1945        "\\n"   return '\n';
1946        "\\r"   return '\r';
1947        "\\f"   return '\f';
1948        "\\0"   return '\0';
1949    }
1950@end verbatim
1951@end example
1952
1953is equivalent to:
1954
1955@example
1956@verbatim
1957    <ESC>"\\n"  return '\n';
1958    <ESC>"\\r"  return '\r';
1959    <ESC>"\\f"  return '\f';
1960    <ESC>"\\0"  return '\0';
1961@end verbatim
1962@end example
1963
1964Start condition scopes may be nested.
1965
1966@cindex stacks, routines for manipulating
1967@cindex start conditions, use of a stack
1968
1969The following routines are available for manipulating stacks of start conditions:
1970
1971@deftypefun  void yy_push_state ( int @code{new_state} )
1972pushes the current start condition onto the top of the start condition
1973stack and switches to
1974@code{new_state}
1975as though you had used
1976@code{BEGIN new_state}
1977(recall that start condition names are also integers).
1978@end deftypefun
1979
1980@deftypefun void yy_pop_state ()
1981pops the top of the stack and switches to it via
1982@code{BEGIN}.
1983@end deftypefun
1984
1985@deftypefun int yy_top_state ()
1986returns the top of the stack without altering the stack's contents.
1987@end deftypefun
1988
1989@cindex memory, for start condition stacks
1990The start condition stack grows dynamically and so has no built-in size
1991limitation.  If memory is exhausted, program execution aborts.
1992
1993To use start condition stacks, your scanner must include a @code{%option
1994stack} directive (@pxref{Scanner Options}).
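
As a rough sketch of how the stack routines fit together, the scanner
below reworks the earlier comment example so that it no longer needs a
@code{comment_caller} variable.  The @code{block} start condition and
the brace rules are hypothetical, added only to give comments a second
context to return to:

@example
@verbatim
    %option stack noyywrap noyy_top_state
    %x comment block
    %%
    "{"                     BEGIN(block);
    <block>"}"              BEGIN(INITIAL);

    <INITIAL,block>"/*"     yy_push_state(comment);

    <comment>{
        [^*\n]*             /* eat anything that's not a '*' */
        "*"+[^*/\n]*        /* eat up '*'s not followed by '/'s */
        \n                  /* eat newlines inside the comment */
        "*"+"/"             yy_pop_state();  /* resume the caller */
    }
@end verbatim
@end example

Because @code{yy_push_state()} records the start condition that was
active when the comment began, @code{yy_pop_state()} returns to
@code{INITIAL} or @code{block} as appropriate.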
1995
1996@node Multiple Input Buffers, EOF, Start Conditions, Top
1997@chapter Multiple Input Buffers
1998
1999@cindex multiple input streams
2000Some scanners (such as those which support ``include'' files) require
2001reading from several input streams.  As @code{flex} scanners do a large
2002amount of buffering, one cannot control where the next input will be
2003read from by simply writing a @code{YY_INPUT()} which is sensitive to
2004the scanning context.  @code{YY_INPUT()} is only called when the scanner
2005reaches the end of its buffer, which may be a long time after scanning a
2006statement such as an @code{include} statement which requires switching
2007the input source.
2008
2009To negotiate these sorts of problems, @code{flex} provides a mechanism
2010for creating and switching between multiple input buffers.  An input
2011buffer is created by using:
2012
2013@cindex memory, allocating input buffers
2014@deftypefun YY_BUFFER_STATE yy_create_buffer ( FILE *file, int size )
2015@end deftypefun
2016
2017which takes a @code{FILE} pointer and a size and creates a buffer
2018associated with the given file and large enough to hold @code{size}
2019characters (when in doubt, use @code{YY_BUF_SIZE} for the size).  It
2020returns a @code{YY_BUFFER_STATE} handle, which may then be passed to
2021other routines (see below).
2022@tindex YY_BUFFER_STATE
2023The @code{YY_BUFFER_STATE} type is a
2024pointer to an opaque @code{struct yy_buffer_state} structure, so you may
2025safely initialize @code{YY_BUFFER_STATE} variables to @code{((YY_BUFFER_STATE)
20260)} if you wish, and also refer to the opaque structure in order to
2027correctly declare input buffers in source files other than that of your
2028scanner.  Note that the @code{FILE} pointer in the call to
2029@code{yy_create_buffer} is only used as the value of @file{yyin} seen by
2030@code{YY_INPUT}.  If you redefine @code{YY_INPUT()} so it no longer uses
2031@file{yyin}, then you can safely pass a NULL @code{FILE} pointer to
2032@code{yy_create_buffer}.  You select a particular buffer to scan from
2033using:
2034
2035@deftypefun void yy_switch_to_buffer ( YY_BUFFER_STATE new_buffer )
2036@end deftypefun
2037
2038The above function switches the scanner's input buffer so subsequent tokens
2039will come from @code{new_buffer}.  Note that @code{yy_switch_to_buffer()} may
2040be used by @code{yywrap()} to set things up for continued scanning, instead of
2041opening a new file and pointing @file{yyin} at it. If you are looking for a
2042stack of input buffers, then you want to use @code{yypush_buffer_state()}
2043instead of this function. Note also that switching input sources via either
2044@code{yy_switch_to_buffer()} or @code{yywrap()} does @emph{not} change the
2045start condition.
2046
2047@cindex memory, deleting input buffers
2048@deftypefun void yy_delete_buffer ( YY_BUFFER_STATE buffer )
2049@end deftypefun
2050
is used to reclaim the storage associated with a buffer.  (@code{buffer}
can be NULL, in which case the routine does nothing.)
2054
2055@cindex pushing an input buffer
2056@cindex stack, input buffer push
2057@deftypefun void yypush_buffer_state ( YY_BUFFER_STATE buffer )
2058@end deftypefun
2059
2060This function pushes the new buffer state onto an internal stack. The pushed
2061state becomes the new current state. The stack is maintained by flex and will
2062grow as required. This function is intended to be used instead of
2063@code{yy_switch_to_buffer}, when you want to change states, but preserve the
2064current state for later use. 
2065
2066@cindex popping an input buffer
2067@cindex stack, input buffer pop
2068@deftypefun void yypop_buffer_state ( )
2069@end deftypefun
2070
2071This function removes the current state from the top of the stack, and deletes
2072it by calling @code{yy_delete_buffer}.  The next state on the stack, if any,
2073becomes the new current state.
2074
2075@cindex clearing an input buffer
2076@cindex flushing an input buffer
2077@deftypefun void yy_flush_buffer ( YY_BUFFER_STATE buffer )
2078@end deftypefun
2079
2080This function discards the buffer's contents,
2081so the next time the scanner attempts to match a token from the
2082buffer, it will first fill the buffer anew using
2083@code{YY_INPUT()}.
2084
2085@deftypefun YY_BUFFER_STATE yy_new_buffer ( FILE *file, int size )
2086@end deftypefun
2087
2088is an alias for @code{yy_create_buffer()},
2089provided for compatibility with the C++ use of @code{new} and
2090@code{delete} for creating and destroying dynamic objects.
2091
@cindex YY_CURRENT_BUFFER, and multiple buffers
Finally, the @code{YY_CURRENT_BUFFER} macro returns a
@code{YY_BUFFER_STATE} handle to the current buffer.  It should not be
used as an lvalue.
2095
2096@cindex EOF, example using multiple input buffers
2097Here are two examples of using these features for writing a scanner
2098which expands include files (the
2099@code{<<EOF>>}
2100feature is discussed below).
2101
This first example uses @code{yypush_buffer_state()} and
@code{yypop_buffer_state()}.  Flex maintains the stack internally.
2104
2105@cindex handling include files with multiple input buffers
2106@example
2107@verbatim
2108    /* the "incl" state is used for picking up the name
2109     * of an include file
2110     */
2111    %x incl
2112    %%
2113    include             BEGIN(incl);
2114
2115    [a-z]+              ECHO;
2116    [^a-z\n]*\n?        ECHO;
2117
2118    <incl>[ \t]*      /* eat the whitespace */
2119    <incl>[^ \t\n]+   { /* got the include file name */
2120            yyin = fopen( yytext, "r" );
2121
2122            if ( ! yyin )
2123                error( ... );
2124
            yypush_buffer_state( yy_create_buffer( yyin, YY_BUF_SIZE ) );
2126
2127            BEGIN(INITIAL);
2128            }
2129
2130    <<EOF>> {
            yypop_buffer_state();
2132
2133            if ( !YY_CURRENT_BUFFER )
2134                {
2135                yyterminate();
2136                }
2137            }
2138@end verbatim
2139@end example
2140
2141The second example, below, does the same thing as the previous example did, but
2142manages its own input buffer stack manually (instead of letting flex do it).
2143
2144@cindex handling include files with multiple input buffers
2145@example
2146@verbatim
2147    /* the "incl" state is used for picking up the name
2148     * of an include file
2149     */
2150    %x incl
2151
2152    %{
2153    #define MAX_INCLUDE_DEPTH 10
2154    YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
2155    int include_stack_ptr = 0;
2156    %}
2157
2158    %%
2159    include             BEGIN(incl);
2160
2161    [a-z]+              ECHO;
2162    [^a-z\n]*\n?        ECHO;
2163
2164    <incl>[ \t]*      /* eat the whitespace */
2165    <incl>[^ \t\n]+   { /* got the include file name */
2166            if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
2167                {
                fprintf( stderr, "Includes nested too deeply\n" );
2169                exit( 1 );
2170                }
2171
2172            include_stack[include_stack_ptr++] =
2173                YY_CURRENT_BUFFER;
2174
2175            yyin = fopen( yytext, "r" );
2176
2177            if ( ! yyin )
2178                error( ... );
2179
2180            yy_switch_to_buffer(
2181                yy_create_buffer( yyin, YY_BUF_SIZE ) );
2182
2183            BEGIN(INITIAL);
2184            }
2185
2186    <<EOF>> {
            if ( --include_stack_ptr < 0 )
2188                {
2189                yyterminate();
2190                }
2191
2192            else
2193                {
2194                yy_delete_buffer( YY_CURRENT_BUFFER );
2195                yy_switch_to_buffer(
2196                     include_stack[include_stack_ptr] );
2197                }
2198            }
2199@end verbatim
2200@end example
2201
2202@anchor{Scanning Strings}
2203@cindex strings, scanning strings instead of files
2204The following routines are available for setting up input buffers for
2205scanning in-memory strings instead of files.  All of them create a new
2206input buffer for scanning the string, and return a corresponding
2207@code{YY_BUFFER_STATE} handle (which you should delete with
2208@code{yy_delete_buffer()} when done with it).  They also switch to the
2209new buffer using @code{yy_switch_to_buffer()}, so the next call to
2210@code{yylex()} will start scanning the string.
2211
2212@deftypefun YY_BUFFER_STATE yy_scan_string ( const char *str )
2213scans a NUL-terminated string.
2214@end deftypefun
2215
2216@deftypefun YY_BUFFER_STATE yy_scan_bytes ( const char *bytes, int len )
2217scans @code{len} bytes (including possibly @code{NUL}s) starting at location
2218@code{bytes}.
2219@end deftypefun
2220
2221Note that both of these functions create and scan a @emph{copy} of the
2222string or bytes.  (This may be desirable, since @code{yylex()} modifies
2223the contents of the buffer it is scanning.)  You can avoid the copy by
2224using:
2225
2226@vindex YY_END_OF_BUFFER_CHAR
2227@deftypefun YY_BUFFER_STATE yy_scan_buffer (char *base, yy_size_t size)
2228which scans in place the buffer starting at @code{base}, consisting of
2229@code{size} bytes, the last two bytes of which @emph{must} be
2230@code{YY_END_OF_BUFFER_CHAR} (ASCII NUL).  These last two bytes are not
scanned; thus, scanning consists of @code{base[0]} through
@code{base[size-3]}, inclusive.
2233@end deftypefun
2234
2235If you fail to set up @code{base} in this manner (i.e., forget the final
2236two @code{YY_END_OF_BUFFER_CHAR} bytes), then @code{yy_scan_buffer()}
2237returns a NULL pointer instead of creating a new input buffer.
2238
2239@deftp  {Data type} yy_size_t
2240is an integral type to which you can cast an integer expression
2241reflecting the size of the buffer.
2242@end deftp
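
The scanner below is a minimal sketch of both approaches; the rules
and the input strings are illustrative only.  It first scans a copy of
a string with @code{yy_scan_string()}, then scans a character array in
place with @code{yy_scan_buffer()}.  The array ends in an explicit
@samp{\0} followed by the string literal's own terminating NUL, which
supplies the two required @code{YY_END_OF_BUFFER_CHAR} bytes:

@example
@verbatim
    %option noyywrap nounput noinput
    %%
    [0-9]+      printf( "number: %s\n", yytext );
    [a-z]+      printf( "word: %s\n", yytext );
    .|\n        /* ignore everything else */
    %%
    int main( void )
        {
        /* sizeof in_place counts both trailing NUL bytes */
        static char in_place[] = "xyz 42\0";
        YY_BUFFER_STATE buf;

        buf = yy_scan_string( "abc 123" );   /* scans a copy */
        yylex();
        yy_delete_buffer( buf );

        buf = yy_scan_buffer( in_place, sizeof in_place );
        if ( buf )                           /* NULL if badly formed */
            {
            yylex();
            yy_delete_buffer( buf );
            }

        return 0;
        }
@end verbatim
@end example

Deleting a buffer created by @code{yy_scan_buffer()} releases only the
buffer handle; the array itself still belongs to the caller.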
2243
2244@node EOF, Misc Macros, Multiple Input Buffers, Top
2245@chapter End-of-File Rules
2246
2247@cindex EOF, explanation
2248The special rule @code{<<EOF>>} indicates
2249actions which are to be taken when an end-of-file is
2250encountered and @code{yywrap()} returns non-zero (i.e., indicates
2251no further files to process).  The action must finish
2252by doing one of the following things:
2253
2254@itemize
2255@item
2256@findex YY_NEW_FILE  (now obsolete)
assigning @file{yyin} to a new input file (in previous versions of
@code{flex}, after doing the assignment you had to call the special
action @code{YY_NEW_FILE}; this is no longer necessary);
2260
2261@item
2262executing a @code{return} statement;
2263
2264@item
2265executing the special @code{yyterminate()} action.
2266
2267@item
2268or, switching to a new buffer using @code{yy_switch_to_buffer()} as
2269shown in the example above.
2270@end itemize
2271
@code{<<EOF>>} rules may not be used with other patterns; they may only be
qualified with a list of start conditions.  If an unqualified @code{<<EOF>>}
rule is given, it applies to @emph{all} start conditions which do not
already have @code{<<EOF>>} actions.  To specify an @code{<<EOF>>} rule for
only the initial start condition, use:
2277
2278@example
2279@verbatim
2280    <INITIAL><<EOF>>
2281@end verbatim
2282@end example
2283
2284These rules are useful for catching things like unclosed comments.  An
2285example:
2286
2287@cindex <<EOF>>, use of
2288@example
2289@verbatim
2290    %x quote
2291    %%
2292
2293    ...other rules for dealing with quotes...
2294
2295    <quote><<EOF>>   {
2296             error( "unterminated quote" );
2297             yyterminate();
2298             }
    <<EOF>>  {
             if ( *++filelist )
                 yyin = fopen( *filelist, "r" );
             else
                 yyterminate();
             }
2305@end verbatim
2306@end example
2307
2308@node Misc Macros, User Values, EOF, Top
2309@chapter Miscellaneous Macros
2310
2311@hkindex YY_USER_ACTION
The macro @code{YY_USER_ACTION} can be defined to provide an action
which is always executed prior to the matched rule's action.  For
example, it could be @code{#define}d to call a routine to convert
@code{yytext} to lowercase.  When @code{YY_USER_ACTION} is invoked, the
variable @code{yy_act} gives the number of the matched rule (rules are
numbered starting with 1).  Suppose you want to profile how often each of
your rules is matched.  The following would do the trick:
2319
2320@cindex YY_USER_ACTION to track each time a rule is matched
2321@example
2322@verbatim
2323    #define YY_USER_ACTION ++ctr[yy_act]
2324@end verbatim
2325@end example
2326
2327@vindex YY_NUM_RULES
2328where @code{ctr} is an array to hold the counts for the different rules.
2329Note that the macro @code{YY_NUM_RULES} gives the total number of rules
(including the default rule), even if you use @samp{-s}, so a correct
2331declaration for @code{ctr} is:
2332
2333@example
2334@verbatim
2335    int ctr[YY_NUM_RULES];
2336@end verbatim
2337@end example
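
Another use of @code{YY_USER_ACTION} is tracking the position of each
token.  The following is only a sketch: the variables
@code{token_start} and @code{token_end} are hypothetical names, and the
offsets are correct only for scanners that use none of
@code{yymore()}, @code{yyless()}, @code{unput()} and @code{REJECT}:

@example
@verbatim
    %option noyywrap nounput noinput
    %{
    #include <stdio.h>

    static int token_start = 0;  /* offset of the current token */
    static int token_end = 0;    /* offset just past the current token */

    /* runs after yytext and yyleng are set, before the rule's action */
    #define YY_USER_ACTION \
        token_start = token_end; token_end += yyleng;
    %}
    %%
    [a-z]+      printf( "word at offset %d: %s\n", token_start, yytext );
    .|\n        /* skipped, but still counted by YY_USER_ACTION */
@end verbatim
@end example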
2338
2339@hkindex YY_USER_INIT
2340The macro @code{YY_USER_INIT} may be defined to provide an action which
2341is always executed before the first scan (and before the scanner's
2342internal initializations are done).  For example, it could be used to
2343call a routine to read in a data table or open a logging file.
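
For instance, a scanner might open a log file this way.  This is a
sketch only; the @code{log_file} variable and the file name
@file{scanner.log} are made up for the example:

@example
@verbatim
    %option noyywrap nounput noinput
    %{
    #include <stdio.h>

    static FILE *log_file;

    /* executed once, the first time yylex() is called */
    #define YY_USER_INIT \
        { \
        log_file = fopen( "scanner.log", "w" ); \
        if ( ! log_file ) \
            log_file = stderr;  /* fall back if the open fails */ \
        }
    %}
    %%
    [a-z]+      fprintf( log_file, "word: %s\n", yytext );
    .|\n        /* ignore everything else */
@end verbatim
@end example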
2344
2345@findex yy_set_interactive
2346The macro @code{yy_set_interactive(is_interactive)} can be used to
2347control whether the current buffer is considered @dfn{interactive}.  An
2348interactive buffer is processed more slowly, but must be used when the
2349scanner's input source is indeed interactive to avoid problems due to
2350waiting to fill buffers (see the discussion of the @samp{-I} flag in
2351@ref{Scanner Options}).  A non-zero value in the macro invocation marks
2352the buffer as interactive, a zero value as non-interactive.  Note that
2353use of this macro overrides @code{%option always-interactive} or
2354@code{%option never-interactive} (@pxref{Scanner Options}).
2355@code{yy_set_interactive()} must be invoked prior to beginning to scan
2356the buffer that is (or is not) to be considered interactive.
2357
2358@cindex BOL, setting it
2359@findex yy_set_bol
2360The macro @code{yy_set_bol(at_bol)} can be used to control whether the
2361current buffer's scanning context for the next token match is done as
2362though at the beginning of a line.  A non-zero macro argument makes
2363rules anchored with @samp{^} active, while a zero argument makes
2364@samp{^} rules inactive.
2365
2366@cindex BOL, checking the BOL flag
2367@findex YY_AT_BOL
2368The macro @code{YY_AT_BOL()} returns true if the next token scanned from
2369the current buffer will have @samp{^} rules active, false otherwise.
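
As an illustration (the rules here are invented for the example), the
following scanner treats the text after a semicolon as though it began
a new line, so that the @samp{^}-anchored rule can match there:

@example
@verbatim
    %option noyywrap nounput noinput
    %%
    ^"#".*      printf( "directive: %s\n", yytext );
    ";"         yy_set_bol( 1 );  /* next token starts a "line" */
    .|\n        /* ignore everything else */
@end verbatim
@end example

Given the input @samp{x;#define}, the scanner reports @samp{#define} as
a directive even though it does not follow a newline.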
2370
2371@cindex actions, redefining YY_BREAK
2372@hkindex YY_BREAK
2373In the generated scanner, the actions are all gathered in one large
2374switch statement and separated using @code{YY_BREAK}, which may be
2375redefined.  By default, it is simply a @code{break}, to separate each
rule's action from the following rule's.  Redefining @code{YY_BREAK}
allows, for example, C++ users to @code{#define YY_BREAK} to do nothing
(while being very careful that every rule ends with a @code{break} or a
@code{return}!).  This avoids unreachable-statement warnings, which
arise when a rule's action ends with @code{return} and the
@code{YY_BREAK} that follows it can never be reached.
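
A sketch of such a scanner follows; the token codes are hypothetical
stand-ins for values a parser would normally supply.  Note the
catch-all rule and @code{%option nodefault}: with an empty
@code{YY_BREAK}, the default rule's @code{ECHO} action would otherwise
fall through into the next case of the switch statement:

@example
@verbatim
    %option noyywrap nounput noinput nodefault
    %{
    /* empty: each action below ends with its own return or break */
    #define YY_BREAK

    enum { TOK_NUMBER = 258, TOK_OTHER };  /* hypothetical token codes */
    %}
    %%
    [0-9]+      return TOK_NUMBER;
    [ \t\n]+    break;              /* skip whitespace explicitly */
    .           return TOK_OTHER;
@end verbatim
@end example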
2382
2383@node User Values, Yacc, Misc Macros, Top
2384@chapter Values Available To the User
2385
2386This chapter summarizes the various values available to the user in the
2387rule actions.
2388
2389@table @code
2390@vindex yytext
2391@item  char *yytext
2392holds the text of the current token.  It may be modified but not
2393lengthened (you cannot append characters to the end).
2394
2395@cindex yytext, default array size
2396@cindex array, default size for yytext
2397@vindex YYLMAX
2398If the special directive @code{%array} appears in the first section of
2399the scanner description, then @code{yytext} is instead declared
2400@code{char yytext[YYLMAX]}, where @code{YYLMAX} is a macro definition
2401that you can redefine in the first section if you don't like the default
2402value (generally 8KB).  Using @code{%array} results in somewhat slower
2403scanners, but the value of @code{yytext} becomes immune to calls to
2404@code{unput()}, which potentially destroy its value when @code{yytext} is
2405a character pointer.  The opposite of @code{%array} is @code{%pointer},
2406which is the default.
2407
2408@cindex C++ and %array
2409You cannot use @code{%array} when generating C++ scanner classes (the
2410@samp{-+} flag).
2411
2412@vindex yyleng
2413@item  int yyleng
2414holds the length of the current token.
2415
2416@vindex yyin
2417@item  FILE *yyin
2418is the file which by default @code{flex} reads from.  It may be
2419redefined but doing so only makes sense before scanning begins or after
2420an EOF has been encountered.  Changing it in the midst of scanning will
2421have unexpected results since @code{flex} buffers its input; use
@code{yyrestart()} instead.  Once scanning terminates because an
end-of-file has been seen, you can point @file{yyin} at the new input
file and then call the scanner again to continue scanning.
2425
2426@findex yyrestart
2427@item  void yyrestart( FILE *new_file )
2428may be called to point @file{yyin} at the new input file.  The
2429switch-over to the new file is immediate (any previously buffered-up
2430input is lost).  Note that calling @code{yyrestart()} with @file{yyin}
2431as an argument thus throws away the current input buffer and continues
2432scanning the same input file.
2433
2434@vindex yyout
2435@item  FILE *yyout
2436is the file to which @code{ECHO} actions are done.  It can be reassigned
2437by the user.
2438
2439@vindex YY_CURRENT_BUFFER
2440@item  YY_CURRENT_BUFFER
2441returns a @code{YY_BUFFER_STATE} handle to the current buffer.
2442
2443@vindex YY_START
2444@item  YY_START
2445returns an integer value corresponding to the current start condition.
2446You can subsequently use this value with @code{BEGIN} to return to that
2447start condition.
2448@end table
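
The loop below sketches a common use of @code{yyrestart()}: scanning
several files one after another from a hand-written @code{main()}.  The
line-counting rules and the @code{num_lines} variable are illustrative
only:

@example
@verbatim
    %option noyywrap nounput noinput
    %{
    #include <stdio.h>

    static int num_lines = 0;
    %}
    %%
    \n      ++num_lines;
    .       /* ignore everything else */
    %%
    int main( int argc, char **argv )
        {
        int i;

        for ( i = 1; i < argc; ++i )
            {
            FILE *f = fopen( argv[i], "r" );

            if ( ! f )
                {
                perror( argv[i] );
                continue;
                }

            yyrestart( f );  /* point yyin at f; drop buffered input */
            yylex();         /* returns 0 once end-of-file is reached */
            fclose( f );
            }

        printf( "%d lines\n", num_lines );
        return 0;
        }
@end verbatim
@end example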
2449
2450@node Yacc, Scanner Options, User Values, Top
2451@chapter Interfacing with Yacc
2452
2453@cindex yacc, interface
2454
2455@vindex yylval, with yacc
2456One of the main uses of @code{flex} is as a companion to the @code{yacc}
2457parser-generator.  @code{yacc} parsers expect to call a routine named
2458@code{yylex()} to find the next input token.  The routine is supposed to
2459return the type of the next token as well as putting any associated
2460value in the global @code{yylval}.  To use @code{flex} with @code{yacc},
2461one specifies the @samp{-d} option to @code{yacc} to instruct it to
2462generate the file @file{y.tab.h} containing definitions of all the
@code{%token}s appearing in the @code{yacc} input.  This file is then
2464included in the @code{flex} scanner.  For example, if one of the tokens
2465is @code{TOK_NUMBER}, part of the scanner might look like:
2466
2467@cindex yacc interface
2468@example
2469@verbatim
2470    %{
2471    #include "y.tab.h"
2472    %}
2473
2474    %%
2475
2476    [0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;
2477@end verbatim
2478@end example
2479
2480@node Scanner Options, Performance, Yacc, Top
2481@chapter Scanner Options
2482
2483@cindex command-line options
2484@cindex options, command-line
2485@cindex arguments, command-line
2486
The various @code{flex} options are categorized by function in the following
menu.  To look up a particular option by name, see @ref{Index of Scanner Options}.
2489
2490@menu
2491* Options for Specifying Filenames::  
2492* Options Affecting Scanner Behavior::  
2493* Code-Level And API Options::  
2494* Options for Scanner Speed and Size::  
2495* Debugging Options::           
2496* Miscellaneous Options::       
2497@end menu
2498
2499Even though there are many scanner options, a typical scanner might only
2500specify the following options:
2501
2502@example
2503@verbatim
2504%option   8bit reentrant bison-bridge
2505%option   warn nodefault
2506%option   yylineno
2507%option   outfile="scanner.c" header-file="scanner.h"
2508@end verbatim
2509@end example
2510
The first line specifies the general type of scanner we want. The second line
specifies that we are being careful. The third line asks flex to track line
numbers. The last line tells flex what to name the files. (The options can be
specified in any order; they are grouped here by purpose only.)
2515
2516@code{flex} also provides a mechanism for controlling options within the
2517scanner specification itself, rather than from the flex command-line.
2518This is done by including @code{%option} directives in the first section
2519of the scanner specification.  You can specify multiple options with a
2520single @code{%option} directive, and multiple directives in the first
2521section of your flex input file.
2522
2523Most options are given simply as names, optionally preceded by the
2524word @samp{no} (with no intervening whitespace) to negate their meaning.
2525The names are the same as their long-option equivalents (but without the
2526leading @samp{--} ).
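
For example, the following directives (an arbitrary selection) turn
off the call to @code{yywrap()} and the default rule while turning
warnings on:

@example
@verbatim
%option noyywrap
%option warn nodefault
@end verbatim
@end example

They have the same effect as running @code{flex} with the command-line
options @samp{--noyywrap --warn --nodefault}.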
2527
@code{flex} scans your rule actions to determine whether you use the
@code{REJECT} or @code{yymore()} features.  The @code{reject} and
@code{yymore} options are available to override its decision as to
whether you use these features, either by setting them (e.g.,
@code{%option reject}) to indicate the feature is indeed used, or unsetting
them to indicate it actually is not used (e.g., @code{%option noyymore}).
2534
2535
2536A number of options are available for lint purists who want to suppress
2537the appearance of unneeded routines in the generated scanner.  Each of
2538the following, if unset (e.g., @code{%option nounput}), results in the
2539corresponding routine not appearing in the generated scanner:
2540
2541@example
2542@verbatim
2543    input, unput
2544    yy_push_state, yy_pop_state, yy_top_state
2545    yy_scan_buffer, yy_scan_bytes, yy_scan_string
2546
2547    yyget_extra, yyset_extra, yyget_leng, yyget_text,
2548    yyget_lineno, yyset_lineno, yyget_in, yyset_in,
2549    yyget_out, yyset_out, yyget_lval, yyset_lval,
2550    yyget_lloc, yyset_lloc, yyget_debug, yyset_debug
2551@end verbatim
2552@end example
2553
(though @code{yy_push_state()} and friends won't appear anyway unless
you use @code{%option stack}).
2556
2557@node Options for Specifying Filenames, Options Affecting Scanner Behavior, Scanner Options, Scanner Options
2558@section Options for Specifying Filenames
2559
2560@table @samp
2561
2562@anchor{option-header}
2563@opindex ---header-file
2564@opindex header-file
2565@item --header-file=FILE, @code{%option header-file="FILE"}
2566instructs flex to write a C header to @file{FILE}. This file contains
2567function prototypes, extern variables, and types used by the scanner.
2568Only the external API is exported by the header file. Many macros that
2569are usable from within scanner actions are not exported to the header
2570file. This is due to namespace problems and the goal of a clean
2571external API.
2572
2573While in the header, the macro @code{yyIN_HEADER} is defined, where @samp{yy}
2574is substituted with the appropriate prefix.
2575
2576The @samp{--header-file} option is not compatible with the @samp{--c++} option,
2577since the C++ scanner provides its own header in @file{yyFlexLexer.h}.
2578
2579
2580
2581@anchor{option-outfile}
2582@opindex -o
2583@opindex ---outfile
2584@opindex outfile
2585@item -oFILE, --outfile=FILE, @code{%option outfile="FILE"}
2586directs flex to write the scanner to the file @file{FILE} instead of
2587@file{lex.yy.c}.  If you combine @samp{--outfile} with the @samp{--stdout} option,
then the scanner is written to @file{stdout} but its @code{#line}
directives (see the @samp{-L} option) refer to the file
@file{FILE}.
2591
2592
2593
2594@anchor{option-stdout}
2595@opindex -t
2596@opindex ---stdout
2597@opindex stdout
2598@item -t, --stdout, @code{%option stdout}
2599instructs @code{flex} to write the scanner it generates to standard
2600output instead of @file{lex.yy.c}.
2601
2602
2603
2604@opindex ---skel
2605@item -SFILE, --skel=FILE
2606overrides the default skeleton file from which
2607@code{flex}
2608constructs its scanners.  You'll never need this option unless you are doing
2609@code{flex}
2610maintenance or development.
2611
2612@opindex ---tables-file
2613@opindex tables-file
2614@item --tables-file=FILE
Write serialized scanner DFA tables to @file{FILE}. The generated scanner will not
contain the tables, and requires them to be loaded at runtime.
@xref{serialization}.
2618
2619@opindex ---tables-verify
2620@opindex tables-verify
2621@item --tables-verify
2622This option is for flex development. We document it here in case you stumble
2623upon it by accident or in case you suspect some inconsistency in the serialized
tables.  Flex will serialize the scanner DFA tables but will also generate the
in-code tables as it normally does. At runtime, the scanner will verify that
the serialized tables match the in-code tables, instead of loading them.
2627
2628@end table
2629
2630@node Options Affecting Scanner Behavior, Code-Level And API Options, Options for Specifying Filenames, Scanner Options
2631@section Options Affecting Scanner Behavior
2632
2633@table @samp
2634@anchor{option-case-insensitive}
2635@opindex -i
2636@opindex ---case-insensitive
2637@opindex case-insensitive
2638@item -i, --case-insensitive, @code{%option case-insensitive}
2639instructs @code{flex} to generate a @dfn{case-insensitive} scanner.  The
2640case of letters given in the @code{flex} input patterns will be ignored,
2641and tokens in the input will be matched regardless of case.  The matched
2642text given in @code{yytext} will have the preserved case (i.e., it will
2643not be folded).  For tricky behavior, see @ref{case and character ranges}.
2644
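The following sketch illustrates the behavior (the rule and message are
made up for this example):

@example
@verbatim
    %option case-insensitive main
    %%
    select    printf("keyword: %s\n", yytext);
    .|\n      ;
@end verbatim
@end example

Given the input @samp{SeLeCt}, the first rule should match, and
@code{yytext} holds the text as it was typed, i.e., @samp{SeLeCt} rather
than @samp{select}.
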
2645
2646
2647@anchor{option-lex-compat}
2648@opindex -l
2649@opindex ---lex-compat
2650@opindex lex-compat
2651@item -l, --lex-compat, @code{%option lex-compat}
2652turns on maximum compatibility with the original AT&T @code{lex}
2653implementation.  Note that this does not mean @emph{full} compatibility.
2654Use of this option costs a considerable amount of performance, and it
2655cannot be used with the @samp{--c++}, @samp{--full}, @samp{--fast}, @samp{-Cf}, or
2656@samp{-CF} options.  For details on the compatibilities it provides, see
2657@ref{Lex and Posix}.  This option also results in the name
2658@code{YY_FLEX_LEX_COMPAT} being @code{#define}'d in the generated scanner.
2659
2660
2661
2662@anchor{option-batch}
2663@opindex -B
2664@opindex ---batch
2665@opindex batch
2666@item -B, --batch, @code{%option batch}
2667instructs @code{flex} to generate a @dfn{batch} scanner, the opposite of
2668@emph{interactive} scanners generated by @samp{--interactive} (see below).  In
2669general, you use @samp{-B} when you are @emph{certain} that your scanner
2670will never be used interactively, and you want to squeeze a
2671@emph{little} more performance out of it.  If your goal is instead to
2672squeeze out a @emph{lot} more performance, you should be using the
2673@samp{-Cf} or @samp{-CF} options, which turn on @samp{--batch} automatically
2674anyway.
2675
2676
2677
2678@anchor{option-interactive}
2679@opindex -I
2680@opindex ---interactive
2681@opindex interactive
2682@item -I, --interactive, @code{%option interactive}
2683instructs @code{flex} to generate an @i{interactive} scanner.  An
2684interactive scanner is one that only looks ahead to decide what token
2685has been matched if it absolutely must.  It turns out that always
2686looking one extra character ahead, even if the scanner has already seen
2687enough text to disambiguate the current token, is a bit faster than only
2688looking ahead when necessary.  But scanners that always look ahead give
2689dreadful interactive performance; for example, when a user types a
2690newline, it is not recognized as a newline token until they enter
2691@emph{another} token, which often means typing in another whole line.
2692
2693@code{flex} scanners default to @code{interactive} unless you use the
2694@samp{-Cf} or @samp{-CF} table-compression options
2695(@pxref{Performance}).  That's because if you're looking for
2696high-performance you should be using one of these options, so if you
2697didn't, @code{flex} assumes you'd rather trade off a bit of run-time
2698performance for intuitive interactive behavior.  Note also that you
2699@emph{cannot} use @samp{--interactive} in conjunction with @samp{-Cf} or
2700@samp{-CF}.  Thus, this option is not really needed; it is on by default
2701for all those cases in which it is allowed.
2702
You can force a scanner to
@emph{not}
be interactive by using
@samp{--batch}.
2707
2708
2709
2710@anchor{option-7bit}
2711@opindex -7
2712@opindex ---7bit
2713@opindex 7bit
2714@item -7, --7bit, @code{%option 7bit}
2715instructs @code{flex} to generate a 7-bit scanner, i.e., one which can
2716only recognize 7-bit characters in its input.  The advantage of using
2717@samp{--7bit} is that the scanner's tables can be up to half the size of
those generated using the @samp{--8bit} option.  The disadvantage is that such
2719scanners often hang or crash if their input contains an 8-bit character.
2720
2721Note, however, that unless you generate your scanner using the
2722@samp{-Cf} or @samp{-CF} table compression options, use of @samp{--7bit}
2723will save only a small amount of table space, and make your scanner
considerably less portable.  @code{Flex}'s default behavior is to
generate an 8-bit scanner unless you use the @samp{-Cf} or @samp{-CF} options,
in which case @code{flex} defaults to generating 7-bit scanners unless
your site was always configured to generate 8-bit scanners (as will
often be the case with non-USA sites).  You can tell whether flex
generated a 7-bit or an 8-bit scanner by inspecting the flag summary in
the @samp{--verbose} output (@pxref{option-verbose}).
2731
Note that if you use @samp{-Cfe} or @samp{-CFe}, @code{flex} still
2733defaults to generating an 8-bit scanner, since usually with these
2734compression options full 8-bit tables are not much more expensive than
27357-bit tables.
2736
2737
2738
2739@anchor{option-8bit}
2740@opindex -8
2741@opindex ---8bit
2742@opindex 8bit
2743@item -8, --8bit, @code{%option 8bit}
2744instructs @code{flex} to generate an 8-bit scanner, i.e., one which can
2745recognize 8-bit characters.  This flag is only needed for scanners
2746generated using @samp{-Cf} or @samp{-CF}, as otherwise flex defaults to
2747generating an 8-bit scanner anyway.
2748
2749See the discussion of
2750@samp{--7bit}
2751above for @code{flex}'s default behavior and the tradeoffs between 7-bit
2752and 8-bit scanners.
2753
2754
2755
2756@anchor{option-default}
2757@opindex ---default
2758@opindex default
2759@item --default, @code{%option default}
generates the default rule, which echoes unmatched input to @file{stdout}.
2761
2762
2763
2764@anchor{option-always-interactive}
2765@opindex ---always-interactive
2766@opindex always-interactive
2767@item --always-interactive, @code{%option always-interactive}
2768instructs flex to generate a scanner which always considers its input
2769@emph{interactive}.  Normally, on each new input file the scanner calls
2770@code{isatty()} in an attempt to determine whether the scanner's input
2771source is interactive and thus should be read a character at a time.
2772When this option is used, however, then no such call is made.
2773
2774
2775
2776@opindex ---never-interactive
@item --never-interactive, @code{%option never-interactive}
2778instructs flex to generate a scanner which never considers its input
2779interactive.  This is the opposite of @code{always-interactive}.
2780
2781
2782@anchor{option-posix}
2783@opindex -X
2784@opindex ---posix
2785@opindex posix
2786@item -X, --posix, @code{%option posix}
2787turns on maximum compatibility with the POSIX 1003.2-1992 definition of
2788@code{lex}.  Since @code{flex} was originally designed to implement the
2789POSIX definition of @code{lex} this generally involves very few changes
2790in behavior.  At the current writing the known differences between
2791@code{flex} and the POSIX standard are:
2792
2793@itemize
2794@item
2795In POSIX and AT&T @code{lex}, the repeat operator, @samp{@{@}}, has lower
2796precedence than concatenation (thus @samp{ab@{3@}} yields @samp{ababab}).
2797Most POSIX utilities use an Extended Regular Expression (ERE) precedence
2798that has the precedence of the repeat operator higher than concatenation
2799(which causes @samp{ab@{3@}} to yield @samp{abbb}).  By default, @code{flex}
2800places the precedence of the repeat operator higher than concatenation
2801which matches the ERE processing of other POSIX utilities.  When either
2802@samp{--posix} or @samp{-l} are specified, @code{flex} will use the
2803traditional AT&T and POSIX-compliant precedence for the repeat operator
2804where concatenation has higher precedence than the repeat operator.
2805@end itemize
2806
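A small scanner makes the difference visible (a sketch; the message
text is arbitrary):

@example
@verbatim
    %option main
    %%
    ab{3}    printf("matched: %s\n", yytext);
    .|\n     ;
@end verbatim
@end example

Built with default options, the first rule matches @samp{abbb}.  Built
with @samp{--posix} or @samp{-l}, the same rule matches @samp{ababab}
instead.
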
2807
2808@anchor{option-stack}
2809@opindex ---stack
2810@opindex stack
2811@item --stack, @code{%option stack}
2812enables the use of
2813start condition stacks (@pxref{Start Conditions}).
2814
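As a sketch, the following scanner uses the stack to strip C-style
comments from its input.  The start condition name @code{COMMENT} is
chosen for this example:

@example
@verbatim
    %option stack main noyy_top_state
    %x COMMENT
    %%
    "/*"            yy_push_state(COMMENT);
    <COMMENT>"*/"   yy_pop_state();
    <COMMENT>.|\n   ;
    .|\n            ECHO;
@end verbatim
@end example

@code{yy_push_state()} enters @code{COMMENT} while remembering the
current start condition, and @code{yy_pop_state()} returns to it.
@samp{noyy_top_state} is given only because this example never calls
@code{yy_top_state()}.
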
2815
2816
2817@anchor{option-stdinit}
2818@opindex ---stdinit
2819@opindex stdinit
2820@item --stdinit, @code{%option stdinit}
if set (i.e., @code{%option stdinit}), initializes @code{yyin} and
2822@code{yyout} to @file{stdin} and @file{stdout}, instead of the default of
2823@file{NULL}.  Some existing @code{lex} programs depend on this behavior,
2824even though it is not compliant with ANSI C, which does not require
2825@file{stdin} and @file{stdout} to be compile-time constant. In a
2826reentrant scanner, however, this is not a problem since initialization
2827is performed in @code{yylex_init} at runtime.
2828
2829
2830
2831@anchor{option-yylineno}
2832@opindex ---yylineno
2833@opindex yylineno
2834@item --yylineno, @code{%option yylineno}
2835directs @code{flex} to generate a scanner
2836that maintains the number of the current line read from its input in the
2837global variable @code{yylineno}.  This option is implied by @code{%option
2838lex-compat}.  In a reentrant C scanner, the macro @code{yylineno} is
accessible regardless of the value of @code{%option yylineno}; however, its
2840value is not modified by @code{flex} unless @code{%option yylineno} is enabled.
2841
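For example, the following sketch prefixes each word it finds with the
number of the line on which it appears:

@example
@verbatim
    %option yylineno main
    %%
    [a-z]+   printf("%d: %s\n", yylineno, yytext);
    .|\n     ;
@end verbatim
@end example
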
2842
2843
2844@anchor{option-yywrap}
2845@opindex ---yywrap
2846@opindex yywrap
2847@item --yywrap, @code{%option yywrap}
if unset (i.e., @code{%option noyywrap}), makes the scanner not call
2849@code{yywrap()} upon an end-of-file, but simply assume that there are no
2850more files to scan (until the user points @file{yyin} at a new file and
2851calls @code{yylex()} again).
2852
2853@end table
2854
2855@node Code-Level And API Options, Options for Scanner Speed and Size, Options Affecting Scanner Behavior, Scanner Options
2856@section Code-Level And API Options
2857
2858@table @samp
2859
2860@anchor{option-ansi-definitions}
2861@opindex ---option-ansi-definitions
2862@opindex ansi-definitions
2863@item --ansi-definitions, @code{%option ansi-definitions}
instructs flex to generate ANSI C99 definitions for functions.
2865This option is enabled by default.
2866If @code{%option noansi-definitions} is specified, then the obsolete style
2867is generated.
2868
2869@anchor{option-ansi-prototypes}
2870@opindex ---option-ansi-prototypes
2871@opindex ansi-prototypes
2872@item --ansi-prototypes, @code{%option ansi-prototypes}
2873instructs flex to generate ANSI C99 prototypes for functions. 
2874This option is enabled by default.
2875If @code{noansi-prototypes} is specified, then
2876prototypes will have empty parameter lists.
2877
2878@anchor{option-bison-bridge}
2879@opindex ---bison-bridge
2880@opindex bison-bridge
2881@item --bison-bridge, @code{%option bison-bridge}
2882instructs flex to generate a C scanner that is
2883meant to be called by a
2884@code{GNU bison}
2885parser. The scanner has minor API changes for
2886@code{bison}
2887compatibility. In particular, the declaration of
2888@code{yylex}
2889is modified to take an additional parameter,
2890@code{yylval}.
2891@xref{Bison Bridge}.
2892
2893@anchor{option-bison-locations}
2894@opindex ---bison-locations
2895@opindex bison-locations
2896@item --bison-locations, @code{%option bison-locations}
instructs flex that
2898@code{GNU bison} @code{%locations} are being used.
2899This means @code{yylex} will be passed
2900an additional parameter, @code{yylloc}. This option
2901implies @code{%option bison-bridge}.
2902@xref{Bison Bridge}.
2903
2904@anchor{option-noline}
2905@opindex -L
2906@opindex ---noline
2907@opindex noline
2908@item -L, --noline, @code{%option noline}
2909instructs
2910@code{flex}
2911not to generate
2912@code{#line}
2913directives.  Without this option,
2914@code{flex}
2915peppers the generated scanner
2916with @code{#line} directives so error messages in the actions will be correctly
2917located with respect to either the original
2918@code{flex}
2919input file (if the errors are due to code in the input file), or
2920@file{lex.yy.c}
2921(if the errors are
2922@code{flex}'s
2923fault -- you should report these sorts of errors to the email address
2924given in @ref{Reporting Bugs}).
2925
2926
2927
2928@anchor{option-reentrant}
2929@opindex -R
2930@opindex ---reentrant
2931@opindex reentrant
2932@item -R, --reentrant, @code{%option reentrant}
2933instructs flex to generate a reentrant C scanner.  The generated scanner
2934may safely be used in a multi-threaded environment. The API for a
2935reentrant scanner is different than for a non-reentrant scanner
(@pxref{Reentrant}).  Because of the API difference between
2937reentrant and non-reentrant @code{flex} scanners, non-reentrant flex
2938code must be modified before it is suitable for use with this option.
2939This option is not compatible with the @samp{--c++} option.
2940
2941The option @samp{--reentrant} does not affect the performance of
2942the scanner.
2943
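A minimal sketch of a reentrant scanner and its driver is shown below.
The rule is arbitrary; the point is that the scanner state lives in a
@code{yyscan_t} object that is created, passed to @code{yylex()}, and
destroyed explicitly:

@example
@verbatim
    %option reentrant noyywrap
    %%
    [0-9]+   printf("number: %s\n", yytext);
    .|\n     ;
    %%
    int main(void)
    {
        yyscan_t scanner;

        if (yylex_init(&scanner) != 0)
            return 1;
        yylex(scanner);          /* reads from stdin by default */
        yylex_destroy(scanner);
        return 0;
    }
@end verbatim
@end example
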
2944
2945
2946@anchor{option-c++}
2947@opindex -+
2948@opindex ---c++
2949@opindex c++
2950@item -+, --c++, @code{%option c++}
2951specifies that you want flex to generate a C++
2952scanner class.  @xref{Cxx}, for
2953details.
2954
2955
2956
2957@anchor{option-array}
2958@opindex ---array
2959@opindex array
2960@item --array, @code{%option array}
specifies that you want @code{yytext} to be an array instead of a @code{char *}.
2962
2963
2964
2965@anchor{option-pointer}
2966@opindex ---pointer
2967@opindex pointer
2968@item --pointer, @code{%option pointer}
specifies that @code{yytext} should be a @code{char *}, not an array.
This is the default.
2971
2972
2973
2974@anchor{option-prefix}
2975@opindex -P
2976@opindex ---prefix
2977@opindex prefix
2978@item -PPREFIX, --prefix=PREFIX, @code{%option prefix="PREFIX"}
2979changes the default @samp{yy} prefix used by @code{flex} for all
2980globally-visible variable and function names to instead be
2981@samp{PREFIX}.  For example, @samp{--prefix=foo} changes the name of
2982@code{yytext} to @code{footext}.  It also changes the name of the default
2983output file from @file{lex.yy.c} to @file{lex.foo.c}.  Here is a partial
2984list of the names affected:
2985
2986@example
2987@verbatim
2988    yy_create_buffer
2989    yy_delete_buffer
2990    yy_flex_debug
2991    yy_init_buffer
2992    yy_flush_buffer
2993    yy_load_buffer_state
2994    yy_switch_to_buffer
2995    yyin
2996    yyleng
2997    yylex
2998    yylineno
2999    yyout
3000    yyrestart
3001    yytext
3002    yywrap
3003    yyalloc
3004    yyrealloc
3005    yyfree
3006@end verbatim
3007@end example
3008
3009(If you are using a C++ scanner, then only @code{yywrap} and
3010@code{yyFlexLexer} are affected.)  Within your scanner itself, you can
3011still refer to the global variables and functions using either version
3012of their name; but externally, they have the modified name.
3013
3014This option lets you easily link together multiple
3015@code{flex}
3016programs into the same executable.  Note, though, that using this
3017option also renames
3018@code{yywrap()},
3019so you now
3020@emph{must}
3021either
3022provide your own (appropriately-named) version of the routine for your
3023scanner, or use
3024@code{%option noyywrap},
3025as linking with
3026@samp{-lfl}
3027no longer provides one for you by default.
3028
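As a sketch, a scanner built with the prefix @samp{foo} (an arbitrary
name) can supply the renamed routine in its user code section:

@example
@verbatim
    %option prefix="foo"
    %%
    [0-9]+   printf("number: %s\n", yytext);
    .|\n     ;
    %%
    /* Called at end-of-file; returning 1 means there is no more input. */
    int foowrap(void)
    {
        return 1;
    }
@end verbatim
@end example

Inside the scanner the action can still say @code{yytext}, but other
files must call @code{foolex()} and refer to @code{footext}.
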
3029
3030
3031@anchor{option-main}
3032@opindex ---main
3033@opindex main
3034@item --main, @code{%option main}
directs flex to provide a default @code{main()} program for the
scanner, which simply calls @code{yylex()}.  This option implies
@code{noyywrap} (@pxref{option-yywrap}).
3038
3039
3040
3041@anchor{option-nounistd}
3042@opindex ---nounistd
3043@opindex nounistd
3044@item --nounistd, @code{%option nounistd}
3045suppresses inclusion of the non-ANSI header file @file{unistd.h}. This option
3046is meant to target environments in which @file{unistd.h} does not exist. Be aware
3047that certain options may cause flex to generate code that relies on functions
normally found in @file{unistd.h} (e.g., @code{isatty()}, @code{read()}).
3049If you wish to use these functions, you will have to inform your compiler where
3050to find them.
3051@xref{option-always-interactive}. @xref{option-read}.
3052
3053
3054
3055@anchor{option-yyclass}
3056@opindex ---yyclass
3057@opindex yyclass
3058@item --yyclass=NAME, @code{%option yyclass="NAME"}
3059only applies when generating a C++ scanner (the @samp{--c++} option).  It
3060informs @code{flex} that you have derived @code{NAME} as a subclass of
3061@code{yyFlexLexer}, so @code{flex} will place your actions in the member
function @code{NAME::yylex()} instead of @code{yyFlexLexer::yylex()}.  It
also generates a @code{yyFlexLexer::yylex()} member function that emits
a run-time error (by invoking @code{yyFlexLexer::LexerError()}) if
3065called.  @xref{Cxx}.
3066
3067@end table
3068
3069@node Options for Scanner Speed and Size, Debugging Options, Code-Level And API Options, Scanner Options
3070@section Options for Scanner Speed and Size
3071
3072@table @samp
3073
3074@item -C[aefFmr]
3075controls the degree of table compression and, more generally, trade-offs
3076between small scanners and fast scanners.
3077
3078@table @samp
3079@opindex -C
3080@item -C
3081A lone @samp{-C} specifies that the scanner tables should be compressed
3082but neither equivalence classes nor meta-equivalence classes should be
3083used.
3084
3085@anchor{option-align}
3086@opindex -Ca
3087@opindex ---align
3088@opindex align
3089@item -Ca, --align, @code{%option align}
3090(``align'') instructs flex to trade off larger tables in the
3091generated scanner for faster performance because the elements of
3092the tables are better aligned for memory access and computation.  On some
3093RISC architectures, fetching and manipulating longwords is more efficient
3094than with smaller-sized units such as shortwords.  This option can
3095quadruple the size of the tables used by your scanner.
3096
3097@anchor{option-ecs}
3098@opindex -Ce
3099@opindex ---ecs
3100@opindex ecs
3101@item -Ce, --ecs, @code{%option ecs}
3102directs @code{flex} to construct @dfn{equivalence classes}, i.e., sets
3103of characters which have identical lexical properties (for example, if
3104the only appearance of digits in the @code{flex} input is in the
3105character class ``[0-9]'' then the digits '0', '1', ..., '9' will all be
3106put in the same equivalence class).  Equivalence classes usually give
3107dramatic reductions in the final table/object file sizes (typically a
3108factor of 2-5) and are pretty cheap performance-wise (one array look-up
3109per character scanned).
3110
3111@opindex -Cf
3112@item -Cf
specifies that the @dfn{full} scanner tables should be generated;
@code{flex} should not compress the tables by taking advantage of
similar transition functions for different states.
3116
3117@opindex -CF
3118@item -CF
specifies that the alternate fast scanner representation (described
under the @samp{--fast} flag) should be used.  This option cannot be
used with @samp{--c++}.
3122
3123@anchor{option-meta-ecs}
3124@opindex -Cm
3125@opindex ---meta-ecs
3126@opindex meta-ecs
3127@item -Cm, --meta-ecs, @code{%option meta-ecs}
3128directs
3129@code{flex}
3130to construct
3131@dfn{meta-equivalence classes},
3132which are sets of equivalence classes (or characters, if equivalence
3133classes are not being used) that are commonly used together.  Meta-equivalence
3134classes are often a big win when using compressed tables, but they
3135have a moderate performance impact (one or two @code{if} tests and one
3136array look-up per character scanned).
3137
3138@anchor{option-read}
3139@opindex -Cr
3140@opindex ---read
3141@opindex read
3142@item -Cr, --read, @code{%option read}
3143causes the generated scanner to @emph{bypass} use of the standard I/O
3144library (@code{stdio}) for input.  Instead of calling @code{fread()} or
3145@code{getc()}, the scanner will use the @code{read()} system call,
3146resulting in a performance gain which varies from system to system, but
3147in general is probably negligible unless you are also using @samp{-Cf}
3148or @samp{-CF}.  Using @samp{-Cr} can cause strange behavior if, for
3149example, you read from @file{yyin} using @code{stdio} prior to calling
3150the scanner (because the scanner will miss whatever text your previous
3151reads left in the @code{stdio} input buffer).  @samp{-Cr} has no effect
3152if you define @code{YY_INPUT()} (@pxref{Generated Scanner}).
3153@end table
3154
The options @samp{-Cf} or @samp{-CF} and @samp{-Cm} do not make sense
together, since there is no opportunity for meta-equivalence classes if the
table is not being compressed.  Otherwise the options may be freely
mixed, and are cumulative.
3159
3160The default setting is @samp{-Cem}, which specifies that @code{flex}
3161should generate equivalence classes and meta-equivalence classes.  This
setting provides the highest degree of table compression.  You can obtain
faster-executing scanners at the cost of larger tables, with the
following ordering generally holding:
3165
3166@example
3167@verbatim
3168    slowest & smallest
3169          -Cem
3170          -Cm
3171          -Ce
3172          -C
3173          -C{f,F}e
3174          -C{f,F}
3175          -C{f,F}a
3176    fastest & largest
3177@end verbatim
3178@end example
3179
3180Note that scanners with the smallest tables are usually generated and
3181compiled the quickest, so during development you will usually want to
3182use the default, maximal compression.
3183
3184@samp{-Cfe} is often a good compromise between speed and size for
3185production scanners.
3186
3187@anchor{option-full}
3188@opindex -f
3189@opindex ---full
3190@opindex full
3191@item -f, --full, @code{%option full}
specifies a
@dfn{fast scanner}.
No table compression is done and @code{stdio} is bypassed.
The result is large but fast.  This option is equivalent to
@samp{-Cfr}.
3197
3198
3199@anchor{option-fast}
3200@opindex -F
3201@opindex ---fast
3202@opindex fast
3203@item -F, --fast, @code{%option fast}
3204specifies that the @emph{fast} scanner table representation should be
3205used (and @code{stdio} bypassed).  This representation is about as fast
3206as the full table representation @samp{--full}, and for some sets of
3207patterns will be considerably smaller (and for others, larger).  In
3208general, if the pattern set contains both @emph{keywords} and a
3209catch-all, @emph{identifier} rule, such as in the set:
3210
3211@example
3212@verbatim
3213    "case"    return TOK_CASE;
3214    "switch"  return TOK_SWITCH;
3215    ...
3216    "default" return TOK_DEFAULT;
3217    [a-z]+    return TOK_ID;
3218@end verbatim
3219@end example
3220
3221then you're better off using the full table representation.  If only
3222the @emph{identifier} rule is present and you then use a hash table or some such
3223to detect the keywords, you're better off using
3224@samp{--fast}.
3225
3226This option is equivalent to @samp{-CFr}.  It cannot be used
3227with @samp{--c++}.
3228
3229@end table
3230
3231@node Debugging Options, Miscellaneous Options, Options for Scanner Speed and Size, Scanner Options
3232@section Debugging Options
3233
3234@table @samp
3235
3236@anchor{option-backup}
3237@opindex -b
3238@opindex ---backup
3239@opindex backup
3240@item -b, --backup, @code{%option backup}
3241Generate backing-up information to @file{lex.backup}.  This is a list of
3242scanner states which require backing up and the input characters on
3243which they do so.  By adding rules one can remove backing-up states.  If
@emph{all} backing-up states are eliminated and @samp{-Cf} or @samp{-CF}
is used, the generated scanner will run faster (see the @samp{--perf-report} flag).
Only users who wish to squeeze every last cycle out of their scanners
need worry about this option (@pxref{Performance}).
3248
3249
3250
3251@anchor{option-debug}
3252@opindex -d
3253@opindex ---debug
3254@opindex debug
3255@item -d, --debug, @code{%option debug}
3256makes the generated scanner run in @dfn{debug} mode.  Whenever a pattern
3257is recognized and the global variable @code{yy_flex_debug} is non-zero
3258(which is the default), the scanner will write to @file{stderr} a line
3259of the form:
3260
3261@example
3262@verbatim
3263    -accepting rule at line 53 ("the matched text")
3264@end verbatim
3265@end example
3266
3267The line number refers to the location of the rule in the file defining
3268the scanner (i.e., the file that was fed to flex).  Messages are also
3269generated when the scanner backs up, accepts the default rule, reaches
3270the end of its input buffer (or encounters a NUL; at this point, the two
3271look the same as far as the scanner's concerned), or reaches an
3272end-of-file.
3273
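Because @code{yy_flex_debug} is a global variable, the trace can be
switched off at run time.  The sketch below assumes a hypothetical
@samp{-q} command-line flag for that purpose:

@example
@verbatim
    %option debug noyywrap
    %{
    #include <string.h>
    %}
    %%
    [a-z]+   ;
    .|\n     ;
    %%
    int main(int argc, char **argv)
    {
        /* -q is a made-up flag for this example: silence the trace. */
        if (argc > 1 && strcmp(argv[1], "-q") == 0)
            yy_flex_debug = 0;
        yylex();
        return 0;
    }
@end verbatim
@end example
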
3274
3275
3276@anchor{option-perf-report}
3277@opindex -p
3278@opindex ---perf-report
3279@opindex perf-report
3280@item -p, --perf-report, @code{%option perf-report}
3281generates a performance report to @file{stderr}.  The report consists of
3282comments regarding features of the @code{flex} input file which will
3283cause a serious loss of performance in the resulting scanner.  If you
3284give the flag twice, you will also get comments regarding features that
3285lead to minor performance losses.
3286
Note that the use of @code{REJECT} and
variable trailing context (@pxref{Limitations}) entails a substantial
performance penalty; use of @code{yymore()}, the @samp{^} operator, and
the @samp{--interactive} flag entails minor performance penalties.
3291
3292
3293
3294@anchor{option-nodefault}
3295@opindex -s
3296@opindex ---nodefault
3297@opindex nodefault
3298@item -s, --nodefault, @code{%option nodefault}
3299causes the @emph{default rule} (that unmatched scanner input is echoed
to @file{stdout}) to be suppressed.  If the scanner encounters input
3301that does not match any of its rules, it aborts with an error.  This
3302option is useful for finding holes in a scanner's rule set.
3303
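For example, the following sketch has no rule for letters or
punctuation:

@example
@verbatim
    %option nodefault main
    %%
    [0-9]+     printf("number: %s\n", yytext);
    [ \t\n]+   ;
@end verbatim
@end example

With @samp{nodefault}, input such as @samp{abc} makes the scanner abort
with an error instead of being echoed, which points directly at the gap
in the rule set.
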
3304
3305
3306@anchor{option-trace}
3307@opindex -T
3308@opindex ---trace
3309@opindex trace
3310@item -T, --trace, @code{%option trace}
3311makes @code{flex} run in @dfn{trace} mode.  It will generate a lot of
3312messages to @file{stderr} concerning the form of the input and the
3313resultant non-deterministic and deterministic finite automata.  This
3314option is mostly for use in maintaining @code{flex}.
3315
3316
3317
3318@anchor{option-nowarn}
3319@opindex -w
3320@opindex ---nowarn
3321@opindex nowarn
3322@item -w, --nowarn, @code{%option nowarn}
3323suppresses warning messages.
3324
3325
3326
3327@anchor{option-verbose}
3328@opindex -v
3329@opindex ---verbose
3330@opindex verbose
3331@item -v, --verbose, @code{%option verbose}
3332specifies that @code{flex} should write to @file{stderr} a summary of
3333statistics regarding the scanner it generates.  Most of the statistics
3334are meaningless to the casual @code{flex} user, but the first line
3335identifies the version of @code{flex} (same as reported by @samp{--version}),
3336and the next line the flags used when generating the scanner, including
3337those that are on by default.
3338
3339
3340
3341@anchor{option-warn}
3342@opindex ---warn
3343@opindex warn
3344@item --warn, @code{%option warn}
warns about certain things. In particular, if the default rule can be
matched but no default rule has been given, flex will warn you.
We recommend always using this option.
3348
3349@end table
3350
3351@node Miscellaneous Options,  , Debugging Options, Scanner Options
3352@section Miscellaneous Options
3353
3354@table @samp
3355@opindex -c
3356@item -c
3357A do-nothing option included for POSIX compliance.
3358
3359@opindex -h
3360@opindex ---help
3361@item -h, -?, --help
3362generates a ``help'' summary of @code{flex}'s options to @file{stdout}
3363and then exits.
3364
3365@opindex -n
3366@item -n
3367Another do-nothing option included for
3368POSIX compliance.
3369
3370@opindex -V
3371@opindex ---version
3372@item -V, --version
3373prints the version number to @file{stdout} and exits.
3374
3375@end table
3376
3377
3378@node Performance, Cxx, Scanner Options, Top
3379@chapter Performance Considerations
3380
3381@cindex performance, considerations
3382The main design goal of @code{flex} is that it generate high-performance
3383scanners.  It has been optimized for dealing well with large sets of
3384rules.  Aside from the effects on scanner speed of the table compression
3385@samp{-C} options outlined above, there are a number of options/actions
3386which degrade performance.  These are, from most expensive to least:
3387
3388@cindex REJECT, performance costs
3389@cindex yylineno, performance costs
3390@cindex trailing context, performance costs
3391@example
3392@verbatim
3393    REJECT
3394    arbitrary trailing context
3395
3396    pattern sets that require backing up
3397    %option yylineno
3398    %array
3399
3400    %option interactive
3401    %option always-interactive
3402
3403    ^ beginning-of-line operator
3404    yymore()
3405@end verbatim
3406@end example
3407
with the first two being quite expensive and the last two being
3409quite cheap.  Note also that @code{unput()} is implemented as a routine
3410call that potentially does quite a bit of work, while @code{yyless()} is
3411a quite-cheap macro. So if you are just putting back some excess text
3412you scanned, use @code{yyless()}.
3413
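As a minimal sketch of this advice (the rules and messages are
illustrative, not taken from a real scanner), the following complete
input uses @code{yyless()} to hand back the tail of an over-long match.
On the input @samp{foobar}, the first rule keeps @samp{foo} and returns
@samp{bar} to the input, where the second rule then matches it:

@example
@verbatim
%{
#include <stdio.h>
%}
%option noyywrap
%%
foobar    {
              /* Keep "foo"; the trailing "bar" is rescanned. */
              yyless(3);
              printf("keyword prefix: %s\n", yytext);
          }
[a-z]+    printf("word: %s\n", yytext);
.|\n      /* skip everything else */
%%
int main(void)
{
    yylex();
    return 0;
}
@end verbatim
@end example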
3414@code{REJECT} should be avoided at all costs when performance is
important.  It is a particularly expensive action.
3416
3417There is one case when @code{%option yylineno} can be expensive. That is when
3418your patterns match long tokens that could @emph{possibly} contain a newline
character. There is no performance penalty for rules that cannot possibly
match newlines, since @code{flex} does not need to check them for newlines.  In
3421general, you should avoid rules such as @code{[^f]+}, which match very long
3422tokens, including newlines, and may possibly match your entire file! A better
3423approach is to separate @code{[^f]+} into two rules:
3424
3425@example
3426@verbatim
3427%option yylineno
3428%%
3429    [^f\n]+
3430    \n+
3431@end verbatim
3432@end example
3433
3434The above scanner does not incur a performance penalty.
3435
3436@cindex patterns, tuning for performance
3437@cindex performance, backing up
3438@cindex backing up, example of eliminating
3439Getting rid of backing up is messy and often may be an enormous amount
of work for a complicated scanner.  In principle, one begins by using
3441the @samp{-b} flag to generate a @file{lex.backup} file.  For example,
3442on the input:
3443
3444@cindex backing up, eliminating
3445@example
3446@verbatim
3447    %%
3448    foo        return TOK_KEYWORD;
3449    foobar     return TOK_KEYWORD;
3450@end verbatim
3451@end example
3452
3453the file looks like:
3454
3455@example
3456@verbatim
3457    State #6 is non-accepting -
3458     associated rule line numbers:
3459           2       3
3460     out-transitions: [ o ]
3461     jam-transitions: EOF [ \001-n  p-\177 ]
3462
3463    State #8 is non-accepting -
3464     associated rule line numbers:
3465           3
3466     out-transitions: [ a ]
3467     jam-transitions: EOF [ \001-`  b-\177 ]
3468
3469    State #9 is non-accepting -
3470     associated rule line numbers:
3471           3
3472     out-transitions: [ r ]
3473     jam-transitions: EOF [ \001-q  s-\177 ]
3474
3475    Compressed tables always back up.
3476@end verbatim
3477@end example
3478
3479The first few lines tell us that there's a scanner state in which it can
3480make a transition on an 'o' but not on any other character, and that in
3481that state the currently scanned text does not match any rule.  The
3482state occurs when trying to match the rules found at lines 2 and 3 in
3483the input file.  If the scanner is in that state and then reads
3484something other than an 'o', it will have to back up to find a rule
3485which is matched.  With a bit of headscratching one can see that this
3486must be the state it's in when it has seen @samp{fo}.  When this has
3487happened, if anything other than another @samp{o} is seen, the scanner
3488will have to back up to simply match the @samp{f} (by the default rule).
3489
3490The comment regarding State #8 indicates there's a problem when
3491@samp{foob} has been scanned.  Indeed, on any character other than an
@samp{a}, the scanner will have to back up to accept @samp{foo}.  Similarly,
3493the comment for State #9 concerns when @samp{fooba} has been scanned and
3494an @samp{r} does not follow.
3495
3496The final comment reminds us that there's no point going to all the
3497trouble of removing backing up from the rules unless we're using
3498@samp{-Cf} or @samp{-CF}, since there's no performance gain doing so
3499with compressed scanners.
3500
3501@cindex error rules, to eliminate backing up
3502The way to remove the backing up is to add ``error'' rules:
3503
3504@cindex backing up, eliminating by adding error rules
3505@example
3506@verbatim
3507    %%
3508    foo         return TOK_KEYWORD;
3509    foobar      return TOK_KEYWORD;
3510
3511    fooba       |
3512    foob        |
3513    fo          {
3514                /* false alarm, not really a keyword */
3515                return TOK_ID;
3516                }
3517@end verbatim
3518@end example
3519
3520Eliminating backing up among a list of keywords can also be done using a
3521``catch-all'' rule:
3522
3523@cindex backing up, eliminating with catch-all rule
3524@example
3525@verbatim
3526    %%
3527    foo         return TOK_KEYWORD;
3528    foobar      return TOK_KEYWORD;
3529
3530    [a-z]+      return TOK_ID;
3531@end verbatim
3532@end example
3533
3534This is usually the best solution when appropriate.
3535
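For reference, here is a self-contained sketch of the catch-all
approach.  The token codes and the driver in @code{main()} are
assumptions made for illustration only.  Running @code{flex} with
@samp{-b} on input of this shape is expected to report that no backing
up is needed, because every prefix of a keyword is itself accepted by
the catch-all rule:

@example
@verbatim
%{
#include <stdio.h>
enum { TOK_KEYWORD = 1, TOK_ID };   /* hypothetical token codes */
%}
%option noyywrap
%%
foo         return TOK_KEYWORD;
foobar      return TOK_KEYWORD;
[a-z]+      return TOK_ID;   /* also accepts f, fo, foob, fooba */
.|\n        /* skip anything that is not a lowercase word */
%%
int main(void)
{
    int tok;

    while ((tok = yylex()) != 0)
        printf("%s: %s\n",
               tok == TOK_KEYWORD ? "keyword" : "identifier", yytext);
    return 0;
}
@end verbatim
@end example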
3536Backing up messages tend to cascade.  With a complicated set of rules
3537it's not uncommon to get hundreds of messages.  If one can decipher
3538them, though, it often only takes a dozen or so rules to eliminate the
backing up (though it's easy to make a mistake and have an error rule
accidentally match a valid token).  A possible future @code{flex} feature
will be to automatically add rules to eliminate backing up.
3542
3543It's important to keep in mind that you gain the benefits of eliminating
3544backing up only if you eliminate @emph{every} instance of backing up.
3545Leaving just one means you gain nothing.
3546
3547@emph{Variable} trailing context (where both the leading and trailing
3548parts do not have a fixed length) entails almost the same performance
3549loss as @code{REJECT} (i.e., substantial).  So when possible a rule
3550like:
3551
3552@cindex trailing context, variable length
3553@example
3554@verbatim
3555    %%
3556    mouse|rat/(cat|dog)   run();
3557@end verbatim
3558@end example
3559
3560is better written:
3561
3562@example
3563@verbatim
3564    %%
3565    mouse/cat|dog         run();
3566    rat/cat|dog           run();
3567@end verbatim
3568@end example
3569
3570or as
3571
3572@example
3573@verbatim
3574    %%
3575    mouse|rat/cat         run();
3576    mouse|rat/dog         run();
3577@end verbatim
3578@end example
3579
3580Note that here the special '|' action does @emph{not} provide any
3581savings, and can even make things worse (@pxref{Limitations}).
3582
3583Another area where the user can increase a scanner's performance (and
3584one that's easier to implement) arises from the fact that the longer the
3585tokens matched, the faster the scanner will run.  This is because with
3586long tokens the processing of most input characters takes place in the
3587(short) inner scanning loop, and does not often have to go through the
3588additional work of setting up the scanning environment (e.g.,
3589@code{yytext}) for the action.  Recall the scanner for C comments:
3590
3591@cindex performance optimization, matching longer tokens
3592@example
3593@verbatim
3594    %x comment
3595    %%
3596            int line_num = 1;
3597
3598    "/*"         BEGIN(comment);
3599
3600    <comment>[^*\n]*
3601    <comment>"*"+[^*/\n]*
3602    <comment>\n             ++line_num;
3603    <comment>"*"+"/"        BEGIN(INITIAL);
3604@end verbatim
3605@end example
3606
3607This could be sped up by writing it as:
3608
3609@example
3610@verbatim
3611    %x comment
3612    %%
3613            int line_num = 1;
3614
3615    "/*"         BEGIN(comment);
3616
3617    <comment>[^*\n]*
3618    <comment>[^*\n]*\n      ++line_num;
3619    <comment>"*"+[^*/\n]*
3620    <comment>"*"+[^*/\n]*\n ++line_num;
3621    <comment>"*"+"/"        BEGIN(INITIAL);
3622@end verbatim
3623@end example
3624
3625Now instead of each newline requiring the processing of another action,
3626recognizing the newlines is distributed over the other rules to keep the
3627matched text as long as possible.  Note that @emph{adding} rules does
3628@emph{not} slow down the scanner!  The speed of the scanner is
3629independent of the number of rules or (modulo the considerations given
3630at the beginning of this section) how complicated the rules are with
3631regard to operators such as @samp{*} and @samp{|}.
3632
3633@cindex keywords, for performance
3634@cindex performance, using keywords
3635A final example in speeding up a scanner: suppose you want to scan
3636through a file containing identifiers and keywords, one per line
3637and with no other extraneous characters, and recognize all the
3638keywords.  A natural first approach is:
3639
3640@cindex performance optimization, recognizing keywords
3641@example
3642@verbatim
3643    %%
3644    asm      |
3645    auto     |
3646    break    |
3647    ... etc ...
3648    volatile |
3649    while    /* it's a keyword */
3650
3651    .|\n     /* it's not a keyword */
3652@end verbatim
3653@end example
3654
3655To eliminate the back-tracking, introduce a catch-all rule:
3656
3657@example
3658@verbatim
3659    %%
3660    asm      |
3661    auto     |
3662    break    |
3663    ... etc ...
3664    volatile |
3665    while    /* it's a keyword */
3666
3667    [a-z]+   |
3668    .|\n     /* it's not a keyword */
3669@end verbatim
3670@end example
3671
3672Now, if it's guaranteed that there's exactly one word per line, then we
3673can reduce the total number of matches by a half by merging in the
3674recognition of newlines with that of the other tokens:
3675
3676@example
3677@verbatim
3678    %%
3679    asm\n    |
3680    auto\n   |
3681    break\n  |
3682    ... etc ...
3683    volatile\n |
3684    while\n  /* it's a keyword */
3685
3686    [a-z]+\n |
3687    .|\n     /* it's not a keyword */
3688@end verbatim
3689@end example
3690
3691One has to be careful here, as we have now reintroduced backing up
3692into the scanner.  In particular, while
3693@emph{we}
3694know that there will never be any characters in the input stream
3695other than letters or newlines,
3696@code{flex}
3697can't figure this out, and it will plan for possibly needing to back up
3698when it has scanned a token like @samp{auto} and then the next character
3699is something other than a newline or a letter.  Previously it would
3700then just match the @samp{auto} rule and be done, but now it has no @samp{auto}
3701rule, only a @samp{auto\n} rule.  To eliminate the possibility of backing up,
3702we could either duplicate all rules but without final newlines, or,
since we never expect to encounter such an input and therefore don't
care how it's classified, we can introduce one more catch-all rule, this
3705one which doesn't include a newline:
3706
3707@example
3708@verbatim
3709    %%
3710    asm\n    |
3711    auto\n   |
3712    break\n  |
3713    ... etc ...
3714    volatile\n |
3715    while\n  /* it's a keyword */
3716
3717    [a-z]+\n |
3718    [a-z]+   |
3719    .|\n     /* it's not a keyword */
3720@end verbatim
3721@end example
3722
3723Compiled with @samp{-Cf}, this is about as fast as one can get a
3724@code{flex} scanner to go for this particular problem.
3725
3726A final note: @code{flex} is slow when matching @code{NUL}s,
3727particularly when a token contains multiple @code{NUL}s.  It's best to
3728write rules which match @emph{short} amounts of text if it's anticipated
3729that the text will often include @code{NUL}s.
3730
3731Another final note regarding performance: as mentioned in
3732@ref{Matching}, dynamically resizing @code{yytext} to accommodate huge
3733tokens is a slow process because it presently requires that the (huge)
3734token be rescanned from the beginning.  Thus if performance is vital,
3735you should attempt to match ``large'' quantities of text but not
3736``huge'' quantities, where the cutoff between the two is at about 8K
3737characters per token.
3738
3739@node Cxx, Reentrant, Performance, Top
3740@chapter Generating C++ Scanners
3741
3742@cindex c++, experimental form of scanner class
3743@cindex experimental form of c++ scanner class
3744@strong{IMPORTANT}: the present form of the scanning class is @emph{experimental}
3745and may change considerably between major releases.
3746
3747@cindex C++
3748@cindex member functions, C++
3749@cindex methods, c++
3750@code{flex} provides two different ways to generate scanners for use
3751with C++.  The first way is to simply compile a scanner generated by
3752@code{flex} using a C++ compiler instead of a C compiler.  You should
3753not encounter any compilation errors (@pxref{Reporting Bugs}).  You can
3754then use C++ code in your rule actions instead of C code.  Note that the
3755default input source for your scanner remains @file{yyin}, and default
3756echoing is still done to @file{yyout}.  Both of these remain @code{FILE
3757*} variables and not C++ @emph{streams}.
3758
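A minimal sketch of this first approach follows; the word-counting task
and the name @code{counts} are assumptions chosen for illustration.
The file is processed by @code{flex} as usual and the generated scanner
is then compiled with a C++ compiler.  The actions use C++ library
types, while input still arrives through @code{yyin}, a @code{FILE *}:

@example
@verbatim
%{
#include <iostream>
#include <map>
#include <string>

/* Hypothetical table of how often each word was seen. */
static std::map<std::string, int> counts;
%}
%option noyywrap
%%
[A-Za-z]+   ++counts[std::string(yytext, yyleng)];
.|\n        /* ignore everything else */
%%
int main()
{
    yylex();    // reads from yyin, which defaults to stdin

    for (std::map<std::string, int>::const_iterator it = counts.begin();
         it != counts.end(); ++it)
        std::cout << it->first << ' ' << it->second << '\n';
    return 0;
}
@end verbatim
@end example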
3759You can also use @code{flex} to generate a C++ scanner class, using the
@samp{-+} option (or, equivalently, @code{%option c++}), which is
3761automatically specified if the name of the @code{flex} executable ends
3762in a '+', such as @code{flex++}.  When using this option, @code{flex}
3763defaults to generating the scanner to the file @file{lex.yy.cc} instead
3764of @file{lex.yy.c}.  The generated scanner includes the header file
3765@file{FlexLexer.h}, which defines the interface to two C++ classes.
3766
3767The first class in @file{FlexLexer.h}, @code{FlexLexer},
3768provides an abstract base class defining the general scanner class
3769interface.  It provides the following member functions:
3770
3771@table @code
3772@findex YYText (C++ only)
3773@item const char* YYText()
3774returns the text of the most recently matched token, the equivalent of
3775@code{yytext}.
3776
3777@findex YYLeng (C++ only)
3778@item int YYLeng()
3779returns the length of the most recently matched token, the equivalent of
3780@code{yyleng}.
3781
3782@findex lineno (C++ only)
3783@item int lineno() const
returns the current input line number (see @code{%option yylineno}), or
3785@code{1} if @code{%option yylineno} was not used.
3786
3787@findex set_debug (C++ only)
3788@item void set_debug( int flag )
3789sets the debugging flag for the scanner, equivalent to assigning to
3790@code{yy_flex_debug} (@pxref{Scanner Options}).  Note that you must build
3791the scanner using @code{%option debug} to include debugging information
3792in it.
3793
3794@findex  debug (C++ only)
3795@item int debug() const
3796returns the current setting of the debugging flag.
3797@end table
3798
3799Also provided are member functions equivalent to
3800@code{yy_switch_to_buffer()}, @code{yy_create_buffer()} (though the
3801first argument is an @code{istream&} object reference and not a
@code{FILE*}), @code{yy_flush_buffer()}, @code{yy_delete_buffer()}, and
@code{yyrestart()} (again, the first argument is an @code{istream&}
3804object reference).
3805
3806@tindex yyFlexLexer (C++ only)
3807@tindex FlexLexer (C++ only)
3808The second class defined in @file{FlexLexer.h} is @code{yyFlexLexer},
3809which is derived from @code{FlexLexer}.  It defines the following
3810additional member functions:
3811
3812@table @code
3813@findex yyFlexLexer constructor (C++ only)
3814@item yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )
3815@item yyFlexLexer( istream& arg_yyin, ostream& arg_yyout )
3816constructs a @code{yyFlexLexer} object using the given streams for input
3817and output.  If not specified, the streams default to @code{cin} and
3818@code{cout}, respectively.  @code{yyFlexLexer} does not take ownership of
3819its stream arguments.  It's up to the user to ensure the streams pointed
3820to remain alive at least as long as the @code{yyFlexLexer} instance.
3821
3822@findex yylex (C++ version)
3823@item virtual int yylex()
performs the same role as @code{yylex()} does for ordinary @code{flex}
3825scanners: it scans the input stream, consuming tokens, until a rule's
3826action returns a value.  If you derive a subclass @code{S} from
3827@code{yyFlexLexer} and want to access the member functions and variables
3828of @code{S} inside @code{yylex()}, then you need to use @code{%option
3829yyclass="S"} to inform @code{flex} that you will be using that subclass
3830instead of @code{yyFlexLexer}.  In this case, rather than generating
3831@code{yyFlexLexer::yylex()}, @code{flex} generates @code{S::yylex()}
3832(and also generates a dummy @code{yyFlexLexer::yylex()} that calls
3833@code{yyFlexLexer::LexerError()} if called).
3834
3835@findex switch_streams (C++ only)
3836@item virtual void switch_streams(istream* new_in = 0, ostream* new_out = 0)
3837@item virtual void switch_streams(istream& new_in, ostream& new_out)
3838reassigns @code{yyin} to @code{new_in} (if non-null) and @code{yyout} to
3839@code{new_out} (if non-null), deleting the previous input buffer if
3840@code{yyin} is reassigned.
3841
3842@item int yylex( istream* new_in, ostream* new_out = 0 )
3843@item int yylex( istream& new_in, ostream& new_out )
3844first switches the input streams via @code{switch_streams( new_in,
3845new_out )} and then returns the value of @code{yylex()}.
3846@end table
3847
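The following sketch shows how @code{%option yyclass} and
@code{switch_streams()} might be combined.  The class
@code{WordCounter} and its member @code{words} are hypothetical names.
The guard on @code{yyFlexLexerOnce}, a macro defined by
@file{FlexLexer.h}, is a precaution so that the class definition sees
@code{yyFlexLexer} regardless of where the generated code includes the
header.  The input stream is declared before the lexer so that it
outlives it:

@example
@verbatim
%{
#include <fstream>
#include <iostream>

#if !defined(yyFlexLexerOnce)
#include <FlexLexer.h>
#endif

// Hypothetical subclass that counts words as it scans.
class WordCounter : public yyFlexLexer {
public:
    WordCounter() : words(0) {}
    virtual int yylex();    // body generated by flex (see yyclass)
    int words;
};
%}

%option c++ noyywrap yyclass="WordCounter"

%%
[A-Za-z]+   ++words;        /* a data member of WordCounter */
.|\n        /* ignore everything else */
%%

int main( int argc, char** argv )
{
    std::ifstream file;     // must outlive the lexer that reads it
    WordCounter lexer;

    if( argc > 1 )
        {
        file.open( argv[1] );
        if( !file )
            {
            std::cerr << "cannot open " << argv[1] << '\n';
            return 1;
            }
        lexer.switch_streams( &file, 0 );   // keep the output stream
        }

    lexer.yylex();
    std::cout << lexer.words << " words\n";
    return 0;
}
@end verbatim
@end example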
3848In addition, @code{yyFlexLexer} defines the following protected virtual
3849functions which you can redefine in derived classes to tailor the
3850scanner:
3851
3852@table @code
3853@findex LexerInput (C++ only)
3854@item virtual int LexerInput( char* buf, int max_size )
3855reads up to @code{max_size} characters into @code{buf} and returns the
3856number of characters read.  To indicate end-of-input, return 0
3857characters.  Note that @code{interactive} scanners (see the @samp{-B}
3858and @samp{-I} flags in @ref{Scanner Options}) define the macro
3859@code{YY_INTERACTIVE}.  If you redefine @code{LexerInput()} and need to
3860take different actions depending on whether or not the scanner might be
3861scanning an interactive input source, you can test for the presence of
3862this name via @code{#ifdef} statements.
3863
3864@findex LexerOutput (C++ only)
3865@item virtual void LexerOutput( const char* buf, int size )
3866writes out @code{size} characters from the buffer @code{buf}, which, while
3867@code{NUL}-terminated, may also contain internal @code{NUL}s if the
3868scanner's rules can match text with @code{NUL}s in them.
3869
3870@cindex error reporting, in C++
3871@findex LexerError (C++ only)
3872@item virtual void LexerError( const char* msg )
3873reports a fatal error message.  The default version of this function
3874writes the message to the stream @code{cerr} and exits.
3875@end table
3876
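As a hedged illustration of redefining one of these functions, the
sketch below overrides @code{LexerError()} so that a fatal scanner error
raises a C++ exception instead of terminating the program.  The names
@code{ThrowingLexer} and @code{scan_all} are assumptions, and the code
must be linked with a scanner generated using @code{%option c++}.  The
lexer object should be discarded after such an error, since its internal
state may no longer be consistent:

@example
@verbatim
#include <iostream>
#include <stdexcept>
#include <FlexLexer.h>

// Hypothetical lexer that reports fatal errors as exceptions.
class ThrowingLexer : public yyFlexLexer {
protected:
    virtual void LexerError( const char* msg )
    {
        throw std::runtime_error( msg );
    }
};

// Scans all of the stream; returns 0 on success, 1 after a fatal error.
int scan_all( std::istream& in )
{
    ThrowingLexer lexer;

    lexer.switch_streams( &in, 0 );
    try {
        while( lexer.yylex() != 0 )
            ;
    } catch( const std::runtime_error& e ) {
        std::cerr << "scanner error: " << e.what() << '\n';
        return 1;
    }
    return 0;
}
@end verbatim
@end example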
3877Note that a @code{yyFlexLexer} object contains its @emph{entire}
3878scanning state.  Thus you can use such objects to create reentrant
3879scanners, but see also @ref{Reentrant}.  You can instantiate multiple
3880instances of the same @code{yyFlexLexer} class, and you can also combine
3881multiple C++ scanner classes together in the same program using the
3882@samp{-P} option discussed above.
3883
3884Finally, note that the @code{%array} feature is not available to C++
3885scanner classes; you must use @code{%pointer} (the default).
3886
3887Here is an example of a simple C++ scanner:
3888
3889@cindex C++ scanners, use of
3890@example
3891@verbatim
3892     // An example of using the flex C++ scanner class.
3893
3894    %{
3895    #include <iostream>
3896    using namespace std;
3897    int mylineno = 0;
3898    %}
3899
3900    %option noyywrap c++
3901
3902    string  \"[^\n"]+\"
3903
3904    ws      [ \t]+
3905
3906    alpha   [A-Za-z]
3907    dig     [0-9]
3908    name    ({alpha}|{dig}|\$)({alpha}|{dig}|[_.\-/$])*
3909    num1    [-+]?{dig}+\.?([eE][-+]?{dig}+)?
3910    num2    [-+]?{dig}*\.{dig}+([eE][-+]?{dig}+)?
3911    number  {num1}|{num2}
3912
3913    %%
3914
3915    {ws}    /* skip blanks and tabs */
3916
3917    "/*"    {
3918            int c;
3919
3920            while((c = yyinput()) != 0)
3921                {
3922                if(c == '\n')
3923                    ++mylineno;
3924
3925                else if(c == '*')
3926                    {
3927                    if((c = yyinput()) == '/')
3928                        break;
3929                    else
3930                        unput(c);
3931                    }
3932                }
3933            }
3934
3935    {number}  cout << "number " << YYText() << '\n';
3936
3937    \n        mylineno++;
3938
3939    {name}    cout << "name " << YYText() << '\n';
3940
3941    {string}  cout << "string " << YYText() << '\n';
3942
3943    %%
3944
    // This include is required if main() is in another source file.
    //#include <FlexLexer.h>
3947
3948    int main( int /* argc */, char** /* argv */ )
3949    {
3950        FlexLexer* lexer = new yyFlexLexer;
3951        while(lexer->yylex() != 0)
            ;
        delete lexer;
        return 0;
3954    }
3955@end verbatim
3956@end example
3957
3958@cindex C++, multiple different scanners
3959If you want to create multiple (different) lexer classes, you use the
3960@samp{-P} flag (or the @code{prefix=} option) to rename each
3961@code{yyFlexLexer} to some other @samp{xxFlexLexer}.  You then can
3962include @file{<FlexLexer.h>} in your other sources once per lexer class,
3963first renaming @code{yyFlexLexer} as follows:
3964
3965@cindex include files, with C++
3966@cindex header files, with C++
3967@cindex C++ scanners, including multiple scanners
3968@example
3969@verbatim
3970    #undef yyFlexLexer
3971    #define yyFlexLexer xxFlexLexer
3972    #include <FlexLexer.h>
3973
3974    #undef yyFlexLexer
3975    #define yyFlexLexer zzFlexLexer
3976    #include <FlexLexer.h>
3977@end verbatim
3978@end example
3979
3980if, for example, you used @code{%option prefix="xx"} for one of your
3981scanners and @code{%option prefix="zz"} for the other.
3982
3983@node Reentrant, Lex and Posix, Cxx, Top
3984@chapter Reentrant C Scanners
3985
3986@cindex reentrant, explanation
3987@code{flex} has the ability to generate a reentrant C scanner. This is
accomplished by specifying @code{%option reentrant} (@samp{-R}).  The generated
scanner is both portable and safe to use in one or more separate threads of
3990control.  The most common use for reentrant scanners is from within
3991multi-threaded applications.  Any thread may create and execute a reentrant
3992@code{flex} scanner without the need for synchronization with other threads.
3993
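The sketch below illustrates the multi-threaded case under some
assumptions: POSIX threads are available, and the function
@code{count_numbers} and the sample strings are invented for the
example.  Each thread creates, uses, and destroys its own scanner, so
the scanners themselves need no locking.  The order of the two output
lines is not specified:

@example
@verbatim
%{
#include <pthread.h>
#include <stdio.h>
%}
%option reentrant noyywrap
%%
[0-9]+      return 1;   /* one token per run of digits */
.|\n        /* skip everything else */
%%
/* Thread body: scan one string with a private scanner instance. */
static void *count_numbers(void *arg)
{
    const char *text = arg;
    yyscan_t scanner;
    int n = 0;

    if (yylex_init(&scanner) != 0)
        return NULL;

    yy_scan_string(text, scanner);
    while (yylex(scanner) != 0)
        ++n;
    yylex_destroy(scanner);

    printf("\"%s\": %d numbers\n", text, n);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, count_numbers, "1 22 333");
    pthread_create(&t2, NULL, count_numbers, "4 five 6 7");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
@end verbatim
@end example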
3994@menu
3995* Reentrant Uses::              
3996* Reentrant Overview::          
3997* Reentrant Example::           
3998* Reentrant Detail::            
3999* Reentrant Functions::         
4000@end menu
4001
4002@node Reentrant Uses, Reentrant Overview, Reentrant, Reentrant
4003@section Uses for Reentrant Scanners
4004
4005However, there are other uses for a reentrant scanner.  For example, you
4006could scan two or more files simultaneously to implement a @code{diff} at
4007the token level (i.e., instead of at the character level):
4008
4009@cindex reentrant scanners, multiple interleaved scanners
4010@example
4011@verbatim
4012    /* Example of maintaining more than one active scanner. */
4013
    int tok1, tok2;

    do {
        tok1 = yylex( scanner_1 );
        tok2 = yylex( scanner_2 );

        if( tok1 != tok2 ) {
            printf("Files are different.\n");
            break;
        }

    } while ( tok1 && tok2 );
4024@end verbatim
4025@end example
4026
4027Another use for a reentrant scanner is recursion.
4028(Note that a recursive scanner can also be created using a non-reentrant scanner and
4029buffer states. @xref{Multiple Input Buffers}.)
4030
4031The following crude scanner supports the @samp{eval} command by invoking
4032another instance of itself.
4033
4034@cindex reentrant scanners, recursive invocation
4035@example
4036@verbatim
4037    /* Example of recursive invocation. */
4038
4039    %option reentrant
4040
4041    %%
4042    "eval(".+")"  {
4043                      yyscan_t scanner;
4044                      YY_BUFFER_STATE buf;
4045
4046                      yylex_init( &scanner );
4047                      yytext[yyleng-1] = ' ';
4048
4049                      buf = yy_scan_string( yytext + 5, scanner );
4050                      yylex( scanner );
4051
4052                      yy_delete_buffer(buf,scanner);
4053                      yylex_destroy( scanner );
4054                 }
4055    ...
4056    %%
4057@end verbatim
4058@end example
4059
4060@node Reentrant Overview, Reentrant Example, Reentrant Uses, Reentrant
4061@section An Overview of the Reentrant API
4062
4063@cindex reentrant, API explanation
The API for reentrant scanners is different from that for non-reentrant
4065scanners. Here is a quick overview of the API:
4066
4067@itemize
@item
@code{%option reentrant} must be specified.
4069
4070@item
All functions take one additional argument: @code{yyscanner}.
4072
4073@item
4074All global variables are replaced by their macro equivalents.
4075(We tell you this because it may be important to you during debugging.)
4076
4077@item
4078@code{yylex_init} and @code{yylex_destroy} must be called before and
4079after @code{yylex}, respectively.
4080
4081@item
4082Accessor methods (get/set functions) provide access to common
4083@code{flex} variables.
4084
4085@item
4086User-specific data can be stored in @code{yyextra}.
4087@end itemize
4088
4089@node Reentrant Example, Reentrant Detail, Reentrant Overview, Reentrant
4090@section Reentrant Example
4091
4092First, an example of a reentrant scanner:
4093@cindex reentrant, example of
4094@example
4095@verbatim
4096    /* This scanner prints "//" comments. */
4097
4098    %option reentrant stack noyywrap
4099    %x COMMENT
4100
4101    %%
4102
4103    "//"                 yy_push_state( COMMENT, yyscanner);
4104    .|\n
4105
4106    <COMMENT>\n          yy_pop_state( yyscanner );
4107    <COMMENT>[^\n]+      fprintf( yyout, "%s\n", yytext);
4108
4109    %%
4110
4111    int main ( int argc, char * argv[] )
4112    {
4113        yyscan_t scanner;
4114
4115        yylex_init ( &scanner );
4116        yylex ( scanner );
4117        yylex_destroy ( scanner );
        return 0;
    }
4120@end verbatim
4121@end example
4122
4123@node Reentrant Detail, Reentrant Functions, Reentrant Example, Reentrant
4124@section The Reentrant API in Detail
4125
4126Here are the things you need to do or know to use the reentrant C API of
4127@code{flex}.
4128
4129@menu
4130* Specify Reentrant::           
4131* Extra Reentrant Argument::    
4132* Global Replacement::          
4133* Init and Destroy Functions::  
4134* Accessor Methods::            
4135* Extra Data::                  
4136* About yyscan_t::              
4137@end menu
4138
4139@node Specify Reentrant, Extra Reentrant Argument, Reentrant Detail, Reentrant Detail
4140@subsection Declaring a Scanner As Reentrant
4141
@code{%option reentrant} (@samp{--reentrant}) must be specified.
4143
4144Notice that @code{%option reentrant} is specified in the above example
(@pxref{Reentrant Example}).  Had this option not been specified,
4146@code{flex} would have happily generated a non-reentrant scanner without
4147complaining. You may explicitly specify @code{%option noreentrant}, if
4148you do @emph{not} want a reentrant scanner, although it is not
4149necessary. The default is to generate a non-reentrant scanner.
4150
4151@node Extra Reentrant Argument, Global Replacement, Specify Reentrant, Reentrant Detail
4152@subsection The Extra Argument
4153
4154@cindex reentrant, calling functions
4155@vindex yyscanner (reentrant only)
4156All functions take one additional argument: @code{yyscanner}.
4157
4158Notice that the calls to @code{yy_push_state} and @code{yy_pop_state}
both have an argument, @code{yyscanner}, that is not present in a
4160non-reentrant scanner.  Here are the declarations of
4161@code{yy_push_state} and @code{yy_pop_state} in the reentrant scanner:
4162
4163@example
4164@verbatim
4165    static void yy_push_state  ( int new_state , yyscan_t yyscanner ) ;
4166    static void yy_pop_state  ( yyscan_t yyscanner  ) ;
4167@end verbatim
4168@end example
4169
4170Notice that the argument @code{yyscanner} appears in the declaration of
4171both functions.  In fact, all @code{flex} functions in a reentrant
4172scanner have this additional argument.  It is always the last argument
4173in the argument list, it is always of type @code{yyscan_t} (which is
4174typedef'd to @code{void *}) and it is
4175always named @code{yyscanner}.  As you may have guessed,
4176@code{yyscanner} is a pointer to an opaque data structure encapsulating
4177the current state of the scanner.  For a list of function declarations,
4178see @ref{Reentrant Functions}. Note that preprocessor macros, such as
4179@code{BEGIN}, @code{ECHO}, and @code{REJECT}, do not take this
4180additional argument.
4181
4182@node Global Replacement, Init and Destroy Functions, Extra Reentrant Argument, Reentrant Detail
4183@subsection Global Variables Replaced By Macros
4184
4185@cindex reentrant, accessing flex variables
4186All global variables in traditional flex have been replaced by macro equivalents.
4187
4188Note that in the above example, @code{yyout} and @code{yytext} are
4189not plain variables. These are macros that will expand to their equivalent lvalue.
4190All of the familiar @code{flex} globals have been replaced by their macro
4191equivalents. In particular, @code{yytext}, @code{yyleng}, @code{yylineno},
4192@code{yyin}, @code{yyout}, @code{yyextra}, @code{yylval}, and @code{yylloc}
4193are macros. You may safely use these macros in actions as if they were plain
4194variables. We only tell you this so you don't expect to link to these variables
4195externally. Currently, each macro expands to a member of an internal struct, e.g.,
4196
4197@example
4198@verbatim
4199#define yytext (((struct yyguts_t*)yyscanner)->yytext_r)
4200@end verbatim
4201@end example
4202
One important thing to remember about @code{yytext} and friends is that,
because @code{yytext} is not a global variable in a reentrant scanner,
you cannot access it directly from outside an action or from other
functions.  You must use an accessor method, e.g., @code{yyget_text},
to accomplish this (@pxref{Accessor Methods}).
4212
4213@node Init and Destroy Functions, Accessor Methods, Global Replacement, Reentrant Detail
4214@subsection Init and Destroy Functions
4215
4216@cindex memory, considerations for reentrant scanners
4217@cindex reentrant, initialization
4218@findex yylex_init
4219@findex yylex_destroy
4220
4221@code{yylex_init} and @code{yylex_destroy} must be called before and
4222after @code{yylex}, respectively.
4223
4224@example
4225@verbatim
4226    int yylex_init ( yyscan_t * ptr_yy_globals ) ;
4227    int yylex_init_extra ( YY_EXTRA_TYPE user_defined, yyscan_t * ptr_yy_globals ) ;
4228    int yylex ( yyscan_t yyscanner ) ;
4229    int yylex_destroy ( yyscan_t yyscanner ) ;
4230@end verbatim
4231@end example
4232
4233The function @code{yylex_init} must be called before calling any other
4234function. The argument to @code{yylex_init} is the address of an
4235uninitialized pointer to be filled in by @code{yylex_init}, overwriting
4236any previous contents. The function @code{yylex_init_extra} may be used
4237instead, taking as its first argument a variable of type @code{YY_EXTRA_TYPE}.
@xref{Extra Data}, for more details.
4239
4240The value stored in @code{ptr_yy_globals} should
4241thereafter be passed to @code{yylex} and @code{yylex_destroy}.  Flex
4242does not save the argument passed to @code{yylex_init}, so it is safe to
4243pass the address of a local pointer to @code{yylex_init} so long as it remains
4244in scope for the duration of all calls to the scanner, up to and including
4245the call to @code{yylex_destroy}.
4246
4247The function
4248@code{yylex} should be familiar to you by now. The reentrant version
4249takes one argument, which is the value returned (via an argument) by
4250@code{yylex_init}.  Otherwise, it behaves the same as the non-reentrant
4251version of @code{yylex}.
4252
Both @code{yylex_init} and @code{yylex_init_extra} return 0 (zero) on success,
or non-zero on failure, in which case @code{errno} is set to one of the following values:
4255
4256@itemize
4257@item ENOMEM
4258Memory allocation error. @xref{memory-management}.
4259@item EINVAL
4260Invalid argument.
4261@end itemize
4262
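The following sketch shows one way to check these return values when
using @code{yylex_init_extra}.  The structure @code{struct counts} and
the counting rules are assumptions made for the example.
@code{%option extra-type} sets the type of @code{yyextra}, so the
actions can update the caller's structure directly:

@example
@verbatim
%{
#include <stdio.h>

/* Hypothetical per-scanner state supplied by the caller. */
struct counts { int lines; int words; };
%}
%option reentrant noyywrap
%option extra-type="struct counts *"
%%
[^ \t\n]+   yyextra->words++;
\n          yyextra->lines++;
[ \t]+      /* skip blanks and tabs */
%%
int main(void)
{
    struct counts c = { 0, 0 };
    yyscan_t scanner;

    /* Non-zero means failure; errno holds ENOMEM or EINVAL. */
    if (yylex_init_extra(&c, &scanner) != 0) {
        perror("yylex_init_extra");
        return 1;
    }

    yylex(scanner);
    yylex_destroy(scanner);

    printf("%d lines, %d words\n", c.lines, c.words);
    return 0;
}
@end verbatim
@end example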
4263
4264The function @code{yylex_destroy} should be
4265called to free resources used by the scanner. After @code{yylex_destroy}
4266is called, the contents of @code{yyscanner} should not be used.  Of
4267course, there is no need to destroy a scanner if you plan to reuse it.
4268A @code{flex} scanner (both reentrant and non-reentrant) may be
4269restarted by calling @code{yyrestart}.
4270
4271Below is an example of a program that creates a scanner, uses it, then destroys
4272it when done:
4273
4274@example
4275@verbatim
4276    int main ()
4277    {
4278        yyscan_t scanner;
4279        int tok;
4280
4281        yylex_init(&scanner);
4282
4283        while ((tok=yylex(scanner)) > 0)
4284            printf("tok=%d  yytext=%s\n", tok, yyget_text(scanner));
4285
4286        yylex_destroy(scanner);
4287        return 0;
4288    }
4289@end verbatim
4290@end example
4291
4292@node Accessor Methods, Extra Data, Init and Destroy Functions, Reentrant Detail
4293@subsection Accessing Variables with Reentrant Scanners
4294
4295@cindex reentrant, accessor functions
4296Accessor methods (get/set functions) provide access to common
4297@code{flex} variables.
4298
4299Many scanners that you build will be part of a larger project. Portions
4300of your project will need access to @code{flex} values, such as
4301@code{yytext}.  In a non-reentrant scanner, these values are global, so
4302there is no problem accessing them. However, in a reentrant scanner, there are no
global @code{flex} values. You cannot access them directly.  Instead,
4304you must access @code{flex} values using accessor methods (get/set
4305functions). Each accessor method is named @code{yyget_NAME} or
4306@code{yyset_NAME}, where @code{NAME} is the name of the @code{flex}
4307variable you want. For example:
4308
4309@cindex accessor functions, use of
4310@example
4311@verbatim
    /* Set the last character of yytext to NUL. */
4313    void chop ( yyscan_t scanner )
4314    {
4315        int len = yyget_leng( scanner );
4316        yyget_text( scanner )[len - 1] = '\0';
4317    }
4318@end verbatim
4319@end example
4320
4321The above code may be called from within an action like this:
4322
4323@example
4324@verbatim
4325    %%
4326    .+\n    { chop( yyscanner );}
4327@end verbatim
4328@end example
4329
4330You may find that @code{%option header-file} is particularly useful for generating
4331prototypes of all the accessor functions. @xref{option-header}.
4332
4333@node Extra Data, About yyscan_t, Accessor Methods, Reentrant Detail
4334@subsection Extra Data
4335
4336@cindex reentrant, extra data
4337@vindex yyextra
4338User-specific data can be stored in @code{yyextra}.
4339
4340In a reentrant scanner, it is unwise to use global variables to
4341communicate with or maintain state between different pieces of your program.
4342However, you may need access to external data or invoke external functions
4343from within the scanner actions.
4344Likewise, you may need to pass information to your scanner
4345(e.g., open file descriptors, or database connections).
4346In a non-reentrant scanner, the only way to do this would be through the
4347use of global variables.
4348@code{Flex} allows you to store arbitrary, ``extra'' data in a scanner.
4349This data is accessible through the accessor methods
4350@code{yyget_extra} and @code{yyset_extra}
4351from outside the scanner, and through the shortcut macro
4352@code{yyextra}
4353from within the scanner itself. They are defined as follows:
4354
4355@tindex YY_EXTRA_TYPE (reentrant only)
4356@findex yyget_extra
4357@findex yyset_extra
4358@example
4359@verbatim
4360    #define YY_EXTRA_TYPE  void*
4361    YY_EXTRA_TYPE  yyget_extra ( yyscan_t scanner );
4362    void           yyset_extra ( YY_EXTRA_TYPE arbitrary_data , yyscan_t scanner);
4363@end verbatim
4364@end example
4365
4366In addition, an extra form of @code{yylex_init} is provided,
4367@code{yylex_init_extra}. This function is provided so that the yyextra value can
4368be accessed from within the very first yyalloc, used to allocate
4369the scanner itself.
4370
4371By default, @code{YY_EXTRA_TYPE} is defined as type @code{void *}.  You
4372may redefine this type using @code{%option extra-type="your_type"} in 
4373the scanner:
4374
4375@cindex YY_EXTRA_TYPE, defining your own type
4376@example
4377@verbatim
4378    /* An example of overriding YY_EXTRA_TYPE. */
4379    %{
4380    #include <sys/stat.h>
4381    #include <unistd.h>
4382    %}
4383    %option reentrant
4384    %option extra-type="struct stat *"
4385    %%
4386
    __filesize__     printf( "%ld", (long) yyextra->st_size  );
    __lastmod__      printf( "%ld", (long) yyextra->st_mtime );
4389    %%
4390    void scan_file( char* filename )
4391    {
4392        yyscan_t scanner;
4393        struct stat buf;
4394        FILE *in;
4395
4396        in = fopen( filename, "r" );
4397        stat( filename, &buf );
4398
        yylex_init_extra( &buf, &scanner );
4400        yyset_in( in, scanner );
4401        yylex( scanner );
4402        yylex_destroy( scanner );
4403
4404        fclose( in );
    }
4406@end verbatim
4407@end example
4408
4409
4410@node About yyscan_t,  , Extra Data, Reentrant Detail
4411@subsection About yyscan_t
4412
4413@tindex yyscan_t (reentrant only)
4414@code{yyscan_t} is defined as:
4415
4416@example
4417@verbatim
4418     typedef void* yyscan_t;
4419@end verbatim
4420@end example
4421
4422It is initialized by @code{yylex_init()} to point to
4423an internal structure. You should never access this value
4424directly. In particular, you should never attempt to free it
(use @code{yylex_destroy()} instead).
4426
4427@node Reentrant Functions,  , Reentrant Detail, Reentrant
4428@section Functions and Macros Available in Reentrant C Scanners
4429
The following functions are available in a reentrant scanner:
4431
4432@findex yyget_text
4433@findex yyget_leng
4434@findex yyget_in
4435@findex yyget_out
4436@findex yyget_lineno
4437@findex yyset_in
4438@findex yyset_out
4439@findex yyset_lineno
4440@findex yyget_debug
4441@findex yyset_debug
4442@findex yyget_extra
4443@findex yyset_extra
4444
4445@example
4446@verbatim
4447    char *yyget_text ( yyscan_t scanner );
4448    int yyget_leng ( yyscan_t scanner );
4449    FILE *yyget_in ( yyscan_t scanner );
4450    FILE *yyget_out ( yyscan_t scanner );
4451    int yyget_lineno ( yyscan_t scanner );
4452    YY_EXTRA_TYPE yyget_extra ( yyscan_t scanner );
4453    int  yyget_debug ( yyscan_t scanner );
4454
4455    void yyset_debug ( int flag, yyscan_t scanner );
4456    void yyset_in  ( FILE * in_str , yyscan_t scanner );
4457    void yyset_out  ( FILE * out_str , yyscan_t scanner );
4458    void yyset_lineno ( int line_number , yyscan_t scanner );
4459    void yyset_extra ( YY_EXTRA_TYPE user_defined , yyscan_t scanner );
4460@end verbatim
4461@end example
4462
4463There are no ``set'' functions for yytext and yyleng. This is intentional.
4464
The following macro shortcuts are available in actions in a reentrant
4466scanner:
4467
4468@example
4469@verbatim
4470    yytext
4471    yyleng
4472    yyin
4473    yyout
4474    yylineno
4475    yyextra
4476    yy_flex_debug
4477@end verbatim
4478@end example
4479
4480@cindex yylineno, in a reentrant scanner
4481In a reentrant C scanner, support for yylineno is always present
4482(i.e., you may access yylineno), but the value is never modified by
4483@code{flex} unless @code{%option yylineno} is enabled. This is to allow
4484the user to maintain the line count independently of @code{flex}.
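
A minimal sketch of maintaining the line count by hand, assuming a reentrant
scanner built @emph{without} @code{%option yylineno}.  The rules are
illustrative only; a real scanner would do its counting wherever its own
patterns consume newlines.

@example
@verbatim
    %option reentrant noyywrap
    %%
    \n      { yylineno++;  /* maintained by the user, not by flex */ }
    .       ;
    %%
@end verbatim
@end example

Code outside the scanner can then read the value with @code{yyget_lineno}.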
4485
4486@anchor{bison-functions}
4487The following functions and macros are made available when @code{%option
4488bison-bridge} (@samp{--bison-bridge}) is specified:
4489
4490@example
4491@verbatim
4492    YYSTYPE * yyget_lval ( yyscan_t scanner );
4493    void yyset_lval ( YYSTYPE * yylvalp , yyscan_t scanner );
4494    yylval
4495@end verbatim
4496@end example
4497
4498The following functions and macros are made available
4499when @code{%option bison-locations} (@samp{--bison-locations}) is specified:
4500
4501@example
4502@verbatim
4503    YYLTYPE *yyget_lloc ( yyscan_t scanner );
4504    void yyset_lloc ( YYLTYPE * yyllocp , yyscan_t scanner );
4505    yylloc
4506@end verbatim
4507@end example
4508
Support for yylval assumes that @code{YYSTYPE} is a valid type.  Support for
yylloc assumes that @code{YYLTYPE} is a valid type.  Typically, these types are
4511generated by @code{bison}, and are included in section 1 of the @code{flex}
4512input.
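
As an illustration of the @code{yylval} macro, the sketch below stores a
semantic value from within an action.  It rests on several assumptions that
are not part of @code{flex} itself: a @code{bison} grammar that declares
@code{%union @{ int num; @}} and a token named @code{NUMBER}, and a
@code{bison}-generated header assumed to be called @file{parser.tab.h}.
With @code{%option bison-bridge}, @code{yylval} is a pointer, as the
accessor prototypes above indicate.

@example
@verbatim
    %{
    #include <stdlib.h>
    #include "parser.tab.h"   /* assumed: defines YYSTYPE and NUMBER */
    %}
    %option reentrant bison-bridge noyywrap
    %%
    [0-9]+      { yylval->num = atoi( yytext ); return NUMBER; }
    [ \t\n]+    ;
    %%
@end verbatim
@end example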
4513
4514@node Lex and Posix, Memory Management, Reentrant, Top
4515@chapter Incompatibilities with Lex and Posix
4516
4517@cindex POSIX and lex
4518@cindex lex (traditional) and POSIX
4519
4520@code{flex} is a rewrite of the AT&T Unix @emph{lex} tool (the two
4521implementations do not share any code, though), with some extensions and
4522incompatibilities, both of which are of concern to those who wish to
4523write scanners acceptable to both implementations.  @code{flex} is fully
4524compliant with the POSIX @code{lex} specification, except that when
4525using @code{%pointer} (the default), a call to @code{unput()} destroys
4526the contents of @code{yytext}, which is counter to the POSIX
4527specification.  In this section we discuss all of the known areas of
4528incompatibility between @code{flex}, AT&T @code{lex}, and the POSIX
4529specification.  @code{flex}'s @samp{-l} option turns on maximum
4530compatibility with the original AT&T @code{lex} implementation, at the
4531cost of a major loss in the generated scanner's performance.  We note
4532below which incompatibilities can be overcome using the @samp{-l}
4533option.  @code{flex} is fully compatible with @code{lex} with the
4534following exceptions:
4535
4536@itemize
4537@item
4538The undocumented @code{lex} scanner internal variable @code{yylineno} is
4539not supported unless @samp{-l} or @code{%option yylineno} is used.
4540
4541@item
4542@code{yylineno} should be maintained on a per-buffer basis, rather than
4543a per-scanner (single global variable) basis.
4544
4545@item
4546@code{yylineno} is not part of the POSIX specification.
4547
4548@item
4549The @code{input()} routine is not redefinable, though it may be called
4550to read characters following whatever has been matched by a rule.  If
4551@code{input()} encounters an end-of-file the normal @code{yywrap()}
4552processing is done.  A ``real'' end-of-file is returned by
4553@code{input()} as @code{EOF}.
4554
4555@item
4556Input is instead controlled by defining the @code{YY_INPUT()} macro.
4557
4558@item
4559The @code{flex} restriction that @code{input()} cannot be redefined is
4560in accordance with the POSIX specification, which simply does not
4561specify any way of controlling the scanner's input other than by making
4562an initial assignment to @file{yyin}.
4563
4564@item
4565The @code{unput()} routine is not redefinable.  This restriction is in
4566accordance with POSIX.
4567
4568@item
4569@code{flex} scanners are not as reentrant as @code{lex} scanners.  In
4570particular, if you have an interactive scanner and an interrupt handler
4571which long-jumps out of the scanner, and the scanner is subsequently
4572called again, you may get the following message:
4573
4574@cindex error messages, end of buffer missed
4575@example
4576@verbatim
4577    fatal flex scanner internal error--end of buffer missed
4578@end verbatim
4579@end example
4580
4581To reenter the scanner, first use:
4582
4583@cindex restarting the scanner
4584@example
4585@verbatim
4586    yyrestart( yyin );
4587@end verbatim
4588@end example
4589
4590Note that this call will throw away any buffered input; usually this
4591isn't a problem with an interactive scanner. @xref{Reentrant}, for
4592@code{flex}'s reentrant API.
4593
4594@item
4595Also note that @code{flex} C++ scanner classes
4596@emph{are}
4597reentrant, so if using C++ is an option for you, you should use
4598them instead.  @xref{Cxx}, and @ref{Reentrant}  for details.
4599
4600@item
4601@code{output()} is not supported.  Output from the @b{ECHO} macro is
done to the file-pointer @code{yyout} (default @file{stdout}).
4603
4604@item
4605@code{output()} is not part of the POSIX specification.
4606
4607@item
4608@code{lex} does not support exclusive start conditions (%x), though they
4609are in the POSIX specification.
4610
4611@item
4612When definitions are expanded, @code{flex} encloses them in parentheses.
4613With @code{lex}, the following:
4614
4615@cindex name definitions, not POSIX
4616@example
4617@verbatim
4618    NAME    [A-Z][A-Z0-9]*
4619    %%
4620    foo{NAME}?      printf( "Found it\n" );
4621    %%
4622@end verbatim
4623@end example
4624
4625will not match the string @samp{foo} because when the macro is expanded
4626the rule is equivalent to @samp{foo[A-Z][A-Z0-9]*?}  and the precedence
4627is such that the @samp{?} is associated with @samp{[A-Z0-9]*}.  With
4628@code{flex}, the rule will be expanded to @samp{foo([A-Z][A-Z0-9]*)?}
4629and so the string @samp{foo} will match.
4630
4631@item
4632Note that if the definition begins with @samp{^} or ends with @samp{$}
4633then it is @emph{not} expanded with parentheses, to allow these
4634operators to appear in definitions without losing their special
4635meanings.  But the @samp{<s>}, @samp{/}, and @code{<<EOF>>} operators
4636cannot be used in a @code{flex} definition.
4637
4638@item
4639Using @samp{-l} results in the @code{lex} behavior of no parentheses
4640around the definition.
4641
4642@item
4643The POSIX specification is that the definition be enclosed in parentheses.
4644
4645@item
4646Some implementations of @code{lex} allow a rule's action to begin on a
4647separate line, if the rule's pattern has trailing whitespace:
4648
4649@cindex patterns and actions on different lines
4650@example
4651@verbatim
4652    %%
4653    foo|bar<space here>
4654      { foobar_action();}
4655@end verbatim
4656@end example
4657
4658@code{flex} does not support this feature.
4659
4660@item
4661The @code{lex} @code{%r} (generate a Ratfor scanner) option is not
4662supported.  It is not part of the POSIX specification.
4663
4664@item
4665After a call to @code{unput()}, @emph{yytext} is undefined until the
4666next token is matched, unless the scanner was built using @code{%array}.
4667This is not the case with @code{lex} or the POSIX specification.  The
4668@samp{-l} option does away with this incompatibility.
4669
4670@item
4671The precedence of the @samp{@{,@}} (numeric range) operator is
4672different.  The AT&T and POSIX specifications of @code{lex}
interpret @samp{abc@{1,3@}} as ``match one, two,
4674or three occurrences of @samp{abc}'', whereas @code{flex} interprets it
4675as ``match @samp{ab} followed by one, two, or three occurrences of
4676@samp{c}''.  The @samp{-l} and @samp{--posix} options do away with this
4677incompatibility.
4678
4679@item
4680The precedence of the @samp{^} operator is different.  @code{lex}
4681interprets @samp{^foo|bar} as ``match either 'foo' at the beginning of a
4682line, or 'bar' anywhere'', whereas @code{flex} interprets it as ``match
4683either @samp{foo} or @samp{bar} if they come at the beginning of a
4684line''.  The latter is in agreement with the POSIX specification.
4685
4686@item
4687The special table-size declarations such as @code{%a} supported by
@code{lex} are not required by @code{flex} scanners.  @code{flex}
4689ignores them.
4690@item
4691The name @code{FLEX_SCANNER} is @code{#define}'d so scanners may be
4692written for use with either @code{flex} or @code{lex}.  Scanners also
4693include @code{YY_FLEX_MAJOR_VERSION},  @code{YY_FLEX_MINOR_VERSION}
4694and @code{YY_FLEX_SUBMINOR_VERSION}
4695indicating which version of @code{flex} generated the scanner. For
4696example, for the 2.5.22 release, these defines would be 2,  5 and 22
4697respectively. If the version of @code{flex} being used is a beta
4698version, then the symbol @code{FLEX_BETA} is defined.
4699
4700@item
4701The symbols @samp{[[} and @samp{]]} in the code sections of the input
4702may conflict with the m4 delimiters. @xref{M4 Dependency}.
4703
4704
4705@end itemize
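
The @code{FLEX_SCANNER} and version symbols mentioned in the list above can
be tested with ordinary preprocessor conditionals.  The following sketch
uses a hypothetical helper, @code{describe_generator}, which is meant to be
placed in the user code section of the scanner input so that the symbols
are visible when it is compiled.

@example
@verbatim
    #include <stdio.h>

    /* Hypothetical helper: writes a description of the scanner
       generator into buf and returns the result of snprintf(). */
    int describe_generator ( char * buf, size_t size )
    {
    #ifdef FLEX_SCANNER
        return snprintf( buf, size, "flex %d.%d.%d",
                         YY_FLEX_MAJOR_VERSION,
                         YY_FLEX_MINOR_VERSION,
                         YY_FLEX_SUBMINOR_VERSION );
    #else
        return snprintf( buf, size, "lex" );
    #endif
    }
@end verbatim
@end example

When the same code is compiled outside a @code{flex}-generated file, the
symbols are undefined and the @code{lex} branch is taken.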
4706
@cindex POSIX compliance
4708@cindex non-POSIX features of flex
4709The following @code{flex} features are not included in @code{lex} or the
4710POSIX specification:
4711
4712@itemize
4713@item
4714C++ scanners
4715@item
4716%option
4717@item
4718start condition scopes
4719@item
4720start condition stacks
4721@item
4722interactive/non-interactive scanners
4723@item
4724yy_scan_string() and friends
4725@item
4726yyterminate()
4727@item
4728yy_set_interactive()
4729@item
4730yy_set_bol()
4731@item
YY_AT_BOL()
@item
<<EOF>>
4734@item
4735<*>
4736@item
4737YY_DECL
4738@item
4739YY_START
4740@item
4741YY_USER_ACTION
4742@item
4743YY_USER_INIT
4744@item
4745#line directives
4746@item
4747%@{@}'s around actions
4748@item
4749reentrant C API
4750@item
4751multiple actions on a line
4752@item
4753almost all of the @code{flex} command-line options
4754@end itemize
4755
4756The feature ``multiple actions on a line''
4757refers to the fact that with @code{flex} you can put multiple actions on
4758the same line, separated with semi-colons, while with @code{lex}, the
4759following:
4760
4761@example
4762@verbatim
4763    foo    handle_foo(); ++num_foos_seen;
4764@end verbatim
4765@end example
4766
4767is (rather surprisingly) truncated to
4768
4769@example
4770@verbatim
4771    foo    handle_foo();
4772@end verbatim
4773@end example
4774
4775@code{flex} does not truncate the action.  Actions that are not enclosed
4776in braces are simply terminated at the end of the line.
4777
4778@node Memory Management, Serialized Tables, Lex and Posix, Top
4779@chapter Memory Management
4780
4781@cindex memory management
4782@anchor{memory-management}
4783This chapter describes how flex handles dynamic memory, and how you can
4784override the default behavior.
4785
4786@menu
4787* The Default Memory Management::  
4788* Overriding The Default Memory Management::  
4789* A Note About yytext And Memory::  
4790@end menu
4791
4792@node The Default Memory Management, Overriding The Default Memory Management, Memory Management, Memory Management
4793@section The Default Memory Management
4794
4795Flex allocates dynamic memory during initialization, and once in a while from
4796within a call to yylex(). Initialization takes place during the first call to
4797yylex(). Thereafter, flex may reallocate more memory if it needs to enlarge a
buffer. As of version 2.5.9, Flex will clean up all memory when you call @code{yylex_destroy}.
@xref{faq-memory-leak}.
4800
Flex allocates dynamic memory for the purposes listed below.@footnote{The
quantities given here are approximate, and may vary due to host architecture,
compiler configuration, or due to future enhancements to flex.}
4804
4805@table @asis
4806
4807@item 16kB for the input buffer.
4808Flex allocates memory for the character buffer used to perform pattern
4809matching.  Flex must read ahead from the input stream and store it in a large
4810character buffer.  This buffer is typically the largest chunk of dynamic memory
4811flex consumes. This buffer will grow if necessary, doubling the size each time.
4812Flex frees this memory when you call yylex_destroy().  The default size of this
4813buffer (16384 bytes) is almost always too large.  The ideal size for this
4814buffer is the length of the longest token expected, in bytes, plus a little more.  Flex will allocate a few
4815extra bytes for housekeeping. Currently, to override the size of the input buffer
4816you must @code{#define YY_BUF_SIZE} to whatever number of bytes you want. We don't plan
4817to change this in the near future, but we reserve the right to do so if we ever add a more robust memory management
4818API. 
4819
@item 64kB for the REJECT state.
This will only be allocated if you use REJECT.
The size is large enough to hold the same number of states as characters in the input buffer. If you override the size of the
input buffer (via @code{YY_BUF_SIZE}), then you automatically override the size of this buffer as well.
4823
4824@item 100 bytes for the start condition stack.
4825Flex allocates memory for the start condition stack. This is the stack used
4826for pushing start states, i.e., with yy_push_state(). It will grow if
4827necessary.  Since the states are simply integers, this stack doesn't consume
4828much memory.  This stack is not present if @code{%option stack} is not
4829specified.  You will rarely need to tune this buffer. The ideal size for this
4830stack is the maximum depth expected.  The memory for this stack is
4831automatically destroyed when you call yylex_destroy(). @xref{option-stack}.
4832
4833@item 40 bytes for each YY_BUFFER_STATE.
4834Flex allocates memory for each YY_BUFFER_STATE. The buffer state itself
4835is about 40 bytes, plus an additional large character buffer (described above.)
4836The initial buffer state is created during initialization, and with each call
4837to yy_create_buffer(). You can't tune the size of this, but you can tune the
4838character buffer as described above. Any buffer state that you explicitly
4839create by calling yy_create_buffer() is @emph{NOT} destroyed automatically. You
4840must call yy_delete_buffer() to free the memory. The exception to this rule is
4841that flex will delete the current buffer automatically when you call
4842yylex_destroy(). If you delete the current buffer, be sure to set it to NULL.
4843That way, flex will not try to delete the buffer a second time (possibly
4844crashing your program!) At the time of this writing, flex does not provide a
4845growable stack for the buffer states.  You have to manage that yourself.
4846@xref{Multiple Input Buffers}.
4847
4848@item 84 bytes for the reentrant scanner guts
4849Flex allocates about 84 bytes for the reentrant scanner structure when
4850you call yylex_init(). It is destroyed when the user calls yylex_destroy().
4851
4852@end table
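
One possible way to apply the input buffer tuning described above is
sketched below.  It assumes a version of @code{flex} that supports the
@code{%top} block, so that the definition appears before the default value
is set in the generated file; with other versions, defining the macro on
the compiler command line may be an alternative.  The value 4096 is an
arbitrary illustration, not a recommendation.

@example
@verbatim
    %top{
    /* Assumed size: longest expected token plus some slack. */
    #define YY_BUF_SIZE 4096
    }
    %option noyywrap
    %%
    .|\n    ;
    %%
@end verbatim
@end example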
4853
4854
4855@node Overriding The Default Memory Management, A Note About yytext And Memory, The Default Memory Management, Memory Management
4856@section Overriding The Default Memory Management
4857
4858@cindex yyalloc, overriding
4859@cindex yyrealloc, overriding
4860@cindex yyfree, overriding
4861
4862Flex calls the functions @code{yyalloc}, @code{yyrealloc}, and @code{yyfree}
4863when it needs to allocate or free memory. By default, these functions are
4864wrappers around the standard C functions, @code{malloc}, @code{realloc}, and
4865@code{free}, respectively. You can override the default implementations by telling
4866flex that you will provide your own implementations.
4867
4868To override the default implementations, you must do two things:
4869
4870@enumerate
4871
4872@item Suppress the default implementations by specifying one or more of the
4873following options:
4874
4875@itemize
4876@opindex noyyalloc
4877@item @code{%option noyyalloc}
4878@item @code{%option noyyrealloc}
4879@item @code{%option noyyfree}.
4880@end itemize
4881
4882@item Provide your own implementation of the following functions: @footnote{It
4883is not necessary to override all (or any) of the memory management routines.
4884You may, for example, override @code{yyrealloc}, but not @code{yyfree} or
4885@code{yyalloc}.}
4886
4887@example
4888@verbatim
4889// For a non-reentrant scanner
4890void * yyalloc (size_t bytes);
4891void * yyrealloc (void * ptr, size_t bytes);
4892void   yyfree (void * ptr);
4893
4894// For a reentrant scanner
4895void * yyalloc (size_t bytes, void * yyscanner);
4896void * yyrealloc (void * ptr, size_t bytes, void * yyscanner);
4897void   yyfree (void * ptr, void * yyscanner);
4898@end verbatim
4899@end example
4900
4901@end enumerate
4902
4903In the following example, we will override all three memory routines. We assume
4904that there is a custom allocator with garbage collection. In order to make this
4905example interesting, we will use a reentrant scanner, passing a pointer to the
4906custom allocator through @code{yyextra}.
4907
4908@cindex overriding the memory routines
4909@example
4910@verbatim
4911%{
4912#include "some_allocator.h"
4913%}
4914
4915/* Suppress the default implementations. */
4916%option noyyalloc noyyrealloc noyyfree
4917%option reentrant
4918
4919/* Initialize the allocator. */
4920%{
4921#define YY_EXTRA_TYPE  struct allocator*
4922#define YY_USER_INIT  yyextra = allocator_create();
4923%}
4924
4925%%
4926.|\n   ;
4927%%
4928
4929/* Provide our own implementations. */
4930void * yyalloc (size_t bytes, void* yyscanner) {
4931    return allocator_alloc (yyextra, bytes);
4932}
4933
4934void * yyrealloc (void * ptr, size_t bytes, void* yyscanner) {
    return allocator_realloc (yyextra, ptr, bytes);
4936}
4937
4938void yyfree (void * ptr, void * yyscanner) {      
4939    /* Do nothing -- we leave it to the garbage collector. */
4940}
4941
4942@end verbatim
4943@end example
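
For comparison, here is a smaller sketch that uses the non-reentrant
signatures and needs no external allocator.  It wraps the standard C
routines and keeps a running total of live bytes.  The counter
@code{yy_live_bytes} and the type @code{yy_block_header} are hypothetical
names introduced for this illustration; the scanner would still need
@code{%option noyyalloc noyyrealloc noyyfree} to suppress the defaults.

@example
@verbatim
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bookkeeping: each block is preceded by its size.
   The union keeps the user data suitably aligned (C11). */
typedef union { size_t size; max_align_t align; } yy_block_header;

size_t yy_live_bytes = 0;   /* bytes currently allocated */

void * yyalloc (size_t bytes) {
    yy_block_header *h = malloc (sizeof *h + bytes);
    if (h == NULL)
        return NULL;
    h->size = bytes;
    yy_live_bytes += bytes;
    return h + 1;
}

void * yyrealloc (void * ptr, size_t bytes) {
    yy_block_header *old = ptr ? (yy_block_header *) ptr - 1 : NULL;
    size_t old_size = old ? old->size : 0;
    yy_block_header *h = realloc (old, sizeof *h + bytes);
    if (h == NULL)
        return NULL;          /* the old block is left untouched */
    h->size = bytes;
    yy_live_bytes += bytes;
    yy_live_bytes -= old_size;
    return h + 1;
}

void yyfree (void * ptr) {
    if (ptr == NULL)
        return;
    yy_block_header *h = (yy_block_header *) ptr - 1;
    yy_live_bytes -= h->size;
    free (h);
}
@end verbatim
@end example

After the scanner has been destroyed, a non-zero counter would point to
memory that was allocated through these routines but never released.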
4944
4945
4946@node A Note About yytext And Memory,  , Overriding The Default Memory Management, Memory Management
4947@section A Note About yytext And Memory
4948
4949@cindex yytext, memory considerations
4950
4951When flex finds a match, @code{yytext} points to the first character of the
4952match in the input buffer. The string itself is part of the input buffer, and
4953is @emph{NOT} allocated separately. The value of yytext will be overwritten the next
4954time yylex() is called. In short, the value of yytext is only valid from within
4955the matched rule's action.
4956
Often, you want the value of yytext to persist for later processing, e.g., by a
parser with non-zero lookahead. In order to preserve yytext, you will have to
copy it with strdup() or a similar function. But this introduces some headache
because your parser is now responsible for freeing the copy of yytext. If you
use a yacc or bison parser (commonly used with flex), you will discover that
the error recovery mechanisms can cause memory to be leaked.
4963
4964To prevent memory leaks from strdup'd yytext, you will have to track the memory
4965somehow. Our experience has shown that a garbage collection mechanism or a
4966pooled memory mechanism will save you a lot of grief when writing parsers.
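
A pooled approach can be sketched as follows.  This is one possible design,
not something @code{flex} provides: the type @code{str_pool} and the
functions @code{pool_strdup} and @code{pool_release} are hypothetical.
Every copy is recorded in the pool, so all copies can be released together
once parsing is finished, whether or not the parser's error recovery
discarded some of them.

@example
@verbatim
#include <stdlib.h>
#include <string.h>

/* Hypothetical pool: remembers every copy it hands out. */
struct str_pool {
    char  **items;
    size_t  count;
    size_t  cap;
};

/* Copy len bytes of text into the pool; returns NULL on failure. */
char * pool_strdup (struct str_pool * pool, const char * text, size_t len) {
    if (pool->count == pool->cap) {
        size_t new_cap = pool->cap ? pool->cap * 2 : 16;
        char **grown = realloc (pool->items, new_cap * sizeof *grown);
        if (grown == NULL)
            return NULL;
        pool->items = grown;
        pool->cap = new_cap;
    }
    char *copy = malloc (len + 1);
    if (copy == NULL)
        return NULL;
    memcpy (copy, text, len);
    copy[len] = '\0';
    pool->items[pool->count++] = copy;
    return copy;
}

/* Free every copy and reset the pool to its empty state. */
void pool_release (struct str_pool * pool) {
    for (size_t i = 0; i < pool->count; i++)
        free (pool->items[i]);
    free (pool->items);
    pool->items = NULL;
    pool->count = 0;
    pool->cap = 0;
}
@end verbatim
@end example

An action would then call something like
@code{pool_strdup( &pool, yytext, yyleng )} in place of strdup(), with the
pool initialized to all zeros before scanning begins.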
4967
4968@node Serialized Tables, Diagnostics, Memory Management, Top
4969@chapter Serialized Tables
4970@cindex serialization
4971@cindex memory, serialized tables
4972
4973@anchor{serialization}
4974A @code{flex} scanner has the ability to save the DFA tables to a file, and
4975load them at runtime when needed.  The motivation for this feature is to reduce
4976the runtime memory footprint.  Traditionally, these tables have been compiled into
4977the scanner as C arrays, and are sometimes quite large.  Since the tables are
4978compiled into the scanner, the memory used by the tables can never be freed.
4979This is a waste of memory, especially if an application uses several scanners,
4980but none of them at the same time.
4981
4982The serialization feature allows the tables to be loaded at runtime, before
4983scanning begins. The tables may be discarded when scanning is finished.
4984
4985@menu
4986* Creating Serialized Tables::  
4987* Loading and Unloading Serialized Tables::  
4988* Tables File Format::          
4989@end menu
4990
4991@node Creating Serialized Tables, Loading and Unloading Serialized Tables, Serialized Tables, Serialized Tables
4992@section Creating Serialized Tables
4993@cindex tables, creating serialized
4994@cindex serialization of tables
4995
4996You may create a scanner with serialized tables by specifying:
4997
4998@example
4999@verbatim
5000    %option tables-file=FILE
5001or
5002    --tables-file=FILE
5003@end verbatim
5004@end example
5005
5006These options instruct flex to save the DFA tables to the file @var{FILE}. The tables
5007will @emph{not} be embedded in the generated scanner. The scanner will not
5008function on its own. The scanner will be dependent upon the serialized tables. You must
5009load the tables from this file at runtime before you can scan anything. 
5010
5011If you do not specify a filename to @code{--tables-file}, the tables will be
5012saved to @file{lex.yy.tables}, where @samp{yy} is the appropriate prefix.
5013
5014If your project uses several different scanners, you can concatenate the
5015serialized tables into one file, and flex will find the correct set of tables,
5016using the scanner prefix as part of the lookup key. An example follows:
5017
5018@cindex serialized tables, multiple scanners
5019@example
5020@verbatim
5021$ flex --tables-file --prefix=cpp cpp.l
5022$ flex --tables-file --prefix=c   c.l
5023$ cat lex.cpp.tables lex.c.tables  >  all.tables
5024@end verbatim
5025@end example
5026
The above example created two scanners, @samp{cpp} and @samp{c}. Since we did
not specify a filename, the tables were serialized to @file{lex.cpp.tables} and
@file{lex.c.tables}, respectively. Then, we concatenated the two files
5030together into @file{all.tables}, which we will distribute with our project. At
5031runtime, we will open the file and tell flex to load the tables from it.  Flex
5032will find the correct tables automatically. (See next section).
5033
5034@node Loading and Unloading Serialized Tables, Tables File Format, Creating Serialized Tables, Serialized Tables
5035@section Loading and Unloading Serialized Tables
5036@cindex tables, loading and unloading
5037@cindex loading tables at runtime
5038@cindex tables, freeing
5039@cindex freeing tables
5040@cindex memory, serialized tables
5041
5042If you've built your scanner with @code{%option tables-file}, then you must
5043load the scanner tables at runtime. This can be accomplished with the following
5044function:
5045
5046@deftypefun int yytables_fload (FILE* @var{fp} [, yyscan_t @var{scanner}])
5047Locates scanner tables in the stream pointed to by @var{fp} and loads them.
5048Memory for the tables is allocated via @code{yyalloc}.  You must call this
5049function before the first call to @code{yylex}. The argument @var{scanner}
5050only appears in the reentrant scanner.
5051This function returns @samp{0} (zero) on success, or non-zero on error.
5052@end deftypefun
5053
5054The loaded tables are @strong{not} automatically destroyed (unloaded) when you
5055call @code{yylex_destroy}. The reason is that you may create several scanners
5056of the same type (in a reentrant scanner), each of which needs access to these
5057tables.  To avoid a nasty memory leak, you must call the following function:
5058
5059@deftypefun int yytables_destroy ([yyscan_t @var{scanner}])
5060Unloads the scanner tables. The tables must be loaded again before you can scan
5061any more data.  The argument @var{scanner} only appears in the reentrant
5062scanner.  This function returns @samp{0} (zero) on success, or non-zero on
5063error.
5064@end deftypefun
5065
5066@strong{The functions @code{yytables_fload} and @code{yytables_destroy} are not
5067thread-safe.} You must ensure that these functions are called exactly once (for
5068each scanner type) in a threaded program, before any thread calls @code{yylex}.
5069After the tables are loaded, they are never written to, and no thread
5070protection is required thereafter -- until you destroy them.
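
The load and unload sequence for a single-threaded program might look like
the sketch below.  It assumes a reentrant scanner built with
@code{%option tables-file}, and that the prototypes are visible through a
generated header, here assumed to be named @file{lex.yy.h}.  The helper
name @code{scan_with_tables} is hypothetical.

@example
@verbatim
    #include <stdio.h>
    #include "lex.yy.h"   /* assumed name of the generated header */

    /* Hypothetical helper: returns 0 on success, -1 on failure. */
    int scan_with_tables ( const char * tables_path, FILE * input )
    {
        yyscan_t scanner;
        FILE *fp;
        int status = -1;

        if (yylex_init( &scanner ) != 0)
            return -1;

        fp = fopen( tables_path, "rb" );
        if (fp != NULL) {
            /* Tables must be loaded before the first call to yylex. */
            if (yytables_fload( fp, scanner ) == 0) {
                yyset_in( input, scanner );
                while (yylex( scanner ) > 0)
                    ;   /* discard tokens */
                /* Not done by yylex_destroy; unload explicitly. */
                yytables_destroy( scanner );
                status = 0;
            }
            fclose( fp );
        }

        yylex_destroy( scanner );
        return status;
    }
@end verbatim
@end example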
5071
5072@node Tables File Format,  , Loading and Unloading Serialized Tables, Serialized Tables
5073@section Tables File Format
5074@cindex tables, file format
5075@cindex file format, serialized tables
5076
5077This section defines the file format of serialized @code{flex} tables.
5078
5079The tables format allows for one or more sets of tables to be
5080specified, where each set corresponds to a given scanner. Scanners are
5081indexed by name, as described below. The file format is as follows:
5082
5083@example
5084@verbatim
5085                 TABLE SET 1
5086                +-------------------------------+
5087        Header  | uint32          th_magic;     |
5088                | uint32          th_hsize;     |
5089                | uint32          th_ssize;     |
5090                | uint16          th_flags;     |
5091                | char            th_version[]; |
5092                | char            th_name[];    |
5093                | uint8           th_pad64[];   |
5094                +-------------------------------+
5095        Table 1 | uint16          td_id;        |
5096                | uint16          td_flags;     |
5097                | uint32          td_hilen;     |
5098                | uint32          td_lolen;     |
5099                | void            td_data[];    |
5100                | uint8           td_pad64[];   |
5101                +-------------------------------+
5102        Table 2 |                               |
5103           .    .                               .
5104           .    .                               .
5105           .    .                               .
5106           .    .                               .
5107        Table n |                               |
5108                +-------------------------------+
5109                 TABLE SET 2
5110                      .
5111                      .
5112                      .
5113                 TABLE SET N
5114@end verbatim
5115@end example
5116
5117The above diagram shows that a complete set of tables consists of a header
5118followed by multiple individual tables. Furthermore, multiple complete sets may
5119be present in the same file, each set with its own header and tables. The sets
5120are contiguous in the file. The only way to know if another set follows is to
5121check the next four bytes for the magic number (or check for EOF). The header
5122and tables sections are padded to 64-bit boundaries. Below we describe each
5123field in detail. This format does not specify how the scanner will expand the
5124given data, i.e., data may be serialized as int8, but expanded to an int32
5125array at runtime. This is to reduce the size of the serialized data where
5126possible.  Remember, @emph{all integer values are in network byte order}. 
5127
5128@noindent
5129Fields of a table header:
5130
5131@table @code
5132@item th_magic
5133Magic number, always 0xF13C57B1.
5134
5135@item th_hsize
5136Size of this entire header, in bytes, including all fields plus any padding.
5137
5138@item th_ssize
5139Size of this entire set, in bytes, including the header, all tables, plus
5140any padding.
5141
5142@item th_flags
5143Bit flags for this table set. Currently unused.
5144
5145@item th_version[]
Flex version as a NUL-terminated string, e.g., @samp{2.5.13a}. This is
the version of flex that was used to create the serialized tables.
5148
5149@item th_name[]
Contains the name of this table set, as a NUL-terminated string. The default
is @samp{yytables}; if the scanner uses a prefix, the name is prefixed
accordingly, e.g., @samp{footables}.
5152
5153@item th_pad64[]
Zero or more NUL bytes, padding the entire header to the next 64-bit boundary
as calculated from the beginning of the header.
5156@end table
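
As an illustration only, the header fields above could be decoded from a
byte buffer roughly as follows. This is a minimal sketch, not the loader
that @code{flex} generates: the names @code{tables_header},
@code{read_be32}, @code{read_be16} and @code{parse_tables_header} are
hypothetical, and the sketch assumes that the fields are stored back to
back in the order shown in the diagram and that the whole header is
already in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLES_MAGIC 0xF13C57B1u  /* th_magic value given above */

/* Hypothetical decoded view of a header (not a flex type). */
struct tables_header {
    uint32_t magic, hsize, ssize;
    uint16_t flags;
    const char *version;  /* points into the caller's buffer */
    const char *name;     /* points into the caller's buffer */
};

/* Read big-endian (network byte order) integers one byte at a time. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint16_t read_be16(const unsigned char *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}

/* Returns 0 on success, -1 if buf does not hold a plausible header. */
static int parse_tables_header(const unsigned char *buf, size_t len,
                               struct tables_header *out)
{
    const size_t fixed = 4 + 4 + 4 + 2;  /* magic, hsize, ssize, flags */
    const unsigned char *nul;
    size_t name_off;

    assert(buf != NULL && out != NULL);
    if (len < fixed)
        return -1;
    out->magic = read_be32(buf);
    out->hsize = read_be32(buf + 4);
    out->ssize = read_be32(buf + 8);
    out->flags = read_be16(buf + 12);
    if (out->magic != TABLES_MAGIC)
        return -1;
    /* The header is padded to a 64-bit boundary and must fit in buf. */
    if (out->hsize % 8 != 0 || out->hsize > len || out->hsize < fixed + 2)
        return -1;
    if (out->ssize < out->hsize)
        return -1;

    /* th_version: NUL-terminated string right after the fixed fields. */
    nul = memchr(buf + fixed, '\0', out->hsize - fixed);
    if (nul == NULL)
        return -1;
    out->version = (const char *)(buf + fixed);

    /* th_name: NUL-terminated string right after th_version. */
    name_off = (size_t)(nul - buf) + 1;
    if (name_off >= out->hsize)
        return -1;
    if (memchr(buf + name_off, '\0', out->hsize - name_off) == NULL)
        return -1;
    out->name = (const char *)(buf + name_off);
    return 0;
}
```

Reading the integers byte by byte avoids any dependence on the host's byte
order or alignment rules. A caller looking for a particular scanner could
compare @code{name} against the wanted table set and, on a mismatch, skip
@code{ssize} bytes from the start of the header to reach the next set.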
5157
5158@noindent
5159Fields of a table:
5160
5161@table @code
5162@item td_id
5163Specifies the table identifier. Possible values are:
5164@table @code
5165@item YYTD_ID_ACCEPT (0x01)
5166@code{yy_accept}
5167@item YYTD_ID_BASE   (0x02)
5168@code{yy_base}
5169@item YYTD_ID_CHK    (0x03)
5170@code{yy_chk}
5171@item YYTD_ID_DEF    (0x04)
5172@code{yy_def}
5173@item YYTD_ID_EC     (0x05)
@code{yy_ec}
5175@item YYTD_ID_META   (0x06)
5176@code{yy_meta}
5177@item YYTD_ID_NUL_TRANS (0x07)
5178@code{yy_NUL_trans}
5179@item YYTD_ID_NXT (0x08)
5180@code{yy_nxt}. This array may be two dimensional. See the @code{td_hilen}
5181field below.
5182@item YYTD_ID_RULE_CAN_MATCH_EOL (0x09)
5183@code{yy_rule_can_match_eol}
5184@item YYTD_ID_START_STATE_LIST (0x0A)
5185@code{yy_start_state_list}. This array is handled specially because it is an
5186array of pointers to structs. See the @code{td_flags} field below.
5187@item YYTD_ID_TRANSITION (0x0B)
5188@code{yy_transition}. This array is handled specially because it is an array of
5189structs. See the @code{td_lolen} field below.
5190@item YYTD_ID_ACCLIST (0x0C)
5191@code{yy_acclist}
5192@end table
5193
5194@item td_flags
5195Bit flags describing how to interpret the data in @code{td_data}.
5196The data arrays are one-dimensional by default, but may be
5197two dimensional as specified in the @code{td_hilen} field.
5198
5199@table @code
5200@item YYTD_DATA8 (0x01)
5201The data is serialized as an array of type int8.
5202@item YYTD_DATA16 (0x02)
5203The data is serialized as an array of type int16.
5204@item YYTD_DATA32 (0x04)
5205The data is serialized as an array of type int32.
5206@item YYTD_PTRANS (0x08)
5207The data is a list of indexes of entries in the expanded @code{yy_transition}
5208array.  Each index should be expanded to a pointer to the corresponding entry
5209in the @code{yy_transition} array. We count on the fact that the
5210@code{yy_transition} array has already been seen.
5211@item YYTD_STRUCT (0x10)
The data is a list of @code{yy_trans_info} structs, each of which consists of
two integers. There is no padding between struct elements or between structs.
The type of each member is determined by the @code{YYTD_DATA*} bits.
5215@end table
5216
5217@item td_hilen
5218If @code{td_hilen} is non-zero, then the data is a two-dimensional array.
5219Otherwise, the data is a one-dimensional array. @code{td_hilen} contains the
5220number of elements in the higher dimensional array, and @code{td_lolen} contains
5221the number of elements in the lowest dimension.
5222
5223Conceptually, @code{td_data} is either @code{sometype td_data[td_lolen]}, or
5224@code{sometype td_data[td_hilen][td_lolen]}, where @code{sometype} is specified
5225by the @code{td_flags} field.  It is possible for both @code{td_lolen} and
5226@code{td_hilen} to be zero, in which case @code{td_data} is a zero length
5227array, and no data is loaded, i.e., this table is simply skipped. Flex does not
5228currently generate tables of zero length.
5229
5230@item td_lolen
5231Specifies the number of elements in the lowest dimension array. If this is
5232a one-dimensional array, then it is simply the number of elements in this array.
5233The element size is determined by the @code{td_flags} field.
5234
5235@item td_data[]
5236The table data. This array may be a one- or two-dimensional array, of type
5237@code{int8}, @code{int16}, @code{int32}, @code{struct yy_trans_info}, or
5238@code{struct yy_trans_info*},  depending upon the values in the
5239@code{td_flags}, @code{td_hilen}, and @code{td_lolen} fields.
5240
5241@item td_pad64[]
Zero or more NUL bytes, padding the entire table to the next 64-bit boundary as
calculated from the beginning of this table.
5244@end table
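
The relationship between @code{td_flags}, @code{td_hilen}, @code{td_lolen}
and the size of @code{td_data} and @code{td_pad64} can be sketched as
follows. The function names are hypothetical, and the sketch assumes that
the four fixed fields occupy 12 bytes with no gaps, as the diagram
suggests. It is an illustration of the arithmetic, not code taken from
@code{flex}.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as listed for td_flags above. */
#define YYTD_DATA8   0x01
#define YYTD_DATA16  0x02
#define YYTD_DATA32  0x04
#define YYTD_STRUCT  0x10

#define TD_FIXED_BYTES 12u  /* td_id + td_flags + td_hilen + td_lolen */

/* Bytes per serialized integer, or 0 if no YYTD_DATA* bit is set. */
static size_t td_elem_size(uint16_t flags)
{
    if (flags & YYTD_DATA8)  return 1;
    if (flags & YYTD_DATA16) return 2;
    if (flags & YYTD_DATA32) return 4;
    return 0;
}

/* Size of td_data[] in bytes. */
static size_t td_data_bytes(uint16_t flags, uint32_t hilen, uint32_t lolen)
{
    /* A zero td_hilen means a one-dimensional array of td_lolen items. */
    size_t n = (size_t)lolen * (hilen ? hilen : 1);

    if (flags & YYTD_STRUCT)
        n *= 2;  /* each struct is serialized as two integers */
    return n * td_elem_size(flags);
}

/* Number of td_pad64[] bytes that follow td_data[]. */
static size_t td_pad_bytes(uint16_t flags, uint32_t hilen, uint32_t lolen)
{
    size_t used = TD_FIXED_BYTES + td_data_bytes(flags, hilen, lolen);

    assert(td_elem_size(flags) != 0);
    return (8 - used % 8) % 8;
}
```

For example, a one-dimensional @code{YYTD_DATA16} table with five elements
has 10 bytes of data, so the table occupies 22 bytes before padding and
needs 2 bytes of padding. A reader that wants to be robust against very
large length fields would also need to guard the multiplications against
overflow, which this sketch does not do.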
5245
5246@node Diagnostics, Limitations, Serialized Tables, Top
5247@chapter Diagnostics
5248
5249@cindex error reporting, diagnostic messages
5250@cindex warnings, diagnostic messages
5251
5252The following is a list of @code{flex} diagnostic messages:
5253
5254@itemize
5255@item
5256@samp{warning, rule cannot be matched} indicates that the given rule
5257cannot be matched because it follows other rules that will always match
5258the same text as it.  For example, in the following @samp{foo} cannot be
5259matched because it comes after an identifier ``catch-all'' rule:
5260
5261@cindex warning, rule cannot be matched
5262@example
5263@verbatim
5264    [a-z]+    got_identifier();
5265    foo       got_foo();
5266@end verbatim
5267@end example
5268
5269Using @code{REJECT} in a scanner suppresses this warning.
5270
5271@item
5272@samp{warning, -s option given but default rule can be matched} means
5273that it is possible (perhaps only in a particular start condition) that
5274the default rule (match any single character) is the only one that will
5275match a particular input.  Since @samp{-s} was given, presumably this is
5276not intended.
5277
5278@item
5279@code{reject_used_but_not_detected undefined} or
5280@code{yymore_used_but_not_detected undefined}. These errors can occur
5281at compile time.  They indicate that the scanner uses @code{REJECT} or
5282@code{yymore()} but that @code{flex} failed to notice the fact, meaning
5283that @code{flex} scanned the first two sections looking for occurrences
5284of these actions and failed to find any, but somehow you snuck some in
5285(via a #include file, for example).  Use @code{%option reject} or
5286@code{%option yymore} to indicate to @code{flex} that you really do use
5287these features.
5288
5289@item
@samp{flex scanner jammed}. A scanner compiled with
@samp{-s} has encountered an input string that was not matched by any of
its rules.  This error can also occur due to internal problems.
5293
5294@item
@samp{token too large, exceeds YYLMAX}. Your scanner uses @code{%array}
and one of its rules matched a string longer than the @code{YYLMAX}
constant (8K bytes by default).  You can increase the value by
defining @code{YYLMAX} with @code{#define} in the definitions section of
your @code{flex} input.
5300
5301@item
@samp{scanner requires -8 flag to use the character 'x'}. Your scanner
specification recognizes the 8-bit character @samp{'x'}, but
you did not specify the @samp{-8} flag, and the scanner defaulted to 7-bit
because you used the @samp{-Cf} or @samp{-CF} table compression options.
5306See the discussion of the @samp{-7} flag, @ref{Scanner Options}, for
5307details.
5308
5309@item
@samp{flex scanner push-back overflow}. You used @code{unput()} to push
back so much text that the scanner's buffer could not hold both the
pushed-back text and the current token in @code{yytext}.  Ideally the
scanner should dynamically resize the buffer in this case, but at
present it does not.
5315
5316@item
@samp{input buffer overflow, can't enlarge buffer because scanner uses
REJECT}.  The scanner was working on matching an extremely large token
and needed to expand the input buffer.  This doesn't work with scanners
that use @code{REJECT}.
5321
5322@item
@samp{fatal flex scanner internal error--end of buffer missed}. This can
occur in a scanner which is reentered after a long-jump has jumped out of
(or over) the scanner's activation frame.  Before reentering the
scanner, use:
5327@example
5328@verbatim
5329    yyrestart( yyin );
5330@end verbatim
5331@end example
5332or, as noted above, switch to using the C++ scanner class.
5333
5334@item
@samp{too many start conditions in <> construct!}.  You listed more start
conditions in a <> construct than exist (so you must have listed at
least one of them twice).
5338@end itemize
5339
5340@node Limitations, Bibliography, Diagnostics, Top
5341@chapter Limitations
5342
5343@cindex limitations of flex
5344
Some trailing context patterns cannot be properly matched and generate
warning messages (@samp{dangerous trailing context}).  These are
patterns where the ending of the first part of the rule matches the
beginning of the second part, such as @samp{zx*/xy*}, where the @samp{x*}
matches the @samp{x} at the beginning of the trailing context.  (Note that
the POSIX draft states that the text matched by such patterns is
undefined.)

For some trailing context rules, parts which are actually
fixed-length are not recognized as such, leading to the abovementioned
performance loss.  In particular, parts using @samp{|} or @samp{@{n@}}
(such as @samp{foo@{3@}}) are always considered variable-length.

Combining trailing context with the special @samp{|} action can result
in @emph{fixed} trailing context being turned into the more expensive
@emph{variable} trailing context.  This happens, for example, in the following:
5358
5359@cindex warning, dangerous trailing context
5360@example
5361@verbatim
5362    %%
5363    abc      |
5364    xyz/def
5365@end verbatim
5366@end example
5367
@itemize
@item
Use of @code{unput()} invalidates @code{yytext} and @code{yyleng}, unless the
@code{%array} directive or the @samp{-l} option has been used.

@item
Pattern-matching of @code{NUL}s is substantially slower than matching
other characters.

@item
Dynamic resizing of the input buffer is slow, as it
entails rescanning all the text matched so far by the current (generally
huge) token.

@item
Due to both buffering of input and read-ahead, you cannot
intermix calls to @file{<stdio.h>} routines, such as @code{getchar()},
with @code{flex} rules and expect it to work.  Call @code{input()}
instead.

@item
The total number of table entries listed by the @samp{-v} flag excludes
the table entries needed to determine what rule has been
matched.  The number of those entries is equal to the number of DFA states if
the scanner does not use @code{REJECT}, and somewhat greater than the
number of states if it does.

@item
@code{REJECT} cannot be used with the @samp{-f} or @samp{-F} options.
@end itemize
5382
5383The @code{flex} internal algorithms need documentation.
5384
5385@node Bibliography, FAQ, Limitations, Top
5386@chapter Additional Reading
5387
5388You may wish to read more about the following programs:
5389@itemize
5390@item lex
5391@item yacc
5392@item sed
5393@item awk
5394@end itemize
5395
5396The following books may contain material of interest:
5397
5398John Levine, Tony Mason, and Doug Brown,
5399@emph{Lex & Yacc},
5400O'Reilly and Associates.  Be sure to get the 2nd edition.
5401
5402M. E. Lesk and E. Schmidt,
@emph{LEX -- Lexical Analyzer Generator}.
5404
5405Alfred Aho, Ravi Sethi and Jeffrey Ullman, @emph{Compilers: Principles,
5406Techniques and Tools}, Addison-Wesley (1986).  Describes the
5407pattern-matching techniques used by @code{flex} (deterministic finite
5408automata).
5409
5410@node FAQ, Appendices, Bibliography, Top
5411@unnumbered FAQ
5412
5413From time to time, the @code{flex} maintainer receives certain
5414questions. Rather than repeat answers to well-understood problems, we
5415publish them here.
5416
5417@menu
5418* When was flex born?::         
5419* How do I expand backslash-escape sequences in C-style quoted strings?::  
5420* Why do flex scanners call fileno if it is not ANSI compatible?::  
5421* Does flex support recursive pattern definitions?::  
5422* How do I skip huge chunks of input (tens of megabytes) while using flex?::  
5423* Flex is not matching my patterns in the same order that I defined them.::  
5424* My actions are executing out of order or sometimes not at all.::  
5425* How can I have multiple input sources feed into the same scanner at the same time?::  
5426* Can I build nested parsers that work with the same input file?::  
5427* How can I match text only at the end of a file?::  
5428* How can I make REJECT cascade across start condition boundaries?::  
5429* Why cant I use fast or full tables with interactive mode?::  
5430* How much faster is -F or -f than -C?::  
5431* If I have a simple grammar cant I just parse it with flex?::  
5432* Why doesn't yyrestart() set the start state back to INITIAL?::  
5433* How can I match C-style comments?::  
5434* The period isn't working the way I expected.::  
5435* Can I get the flex manual in another format?::  
5436* Does there exist a "faster" NDFA->DFA algorithm?::  
5437* How does flex compile the DFA so quickly?::  
5438* How can I use more than 8192 rules?::  
5439* How do I abandon a file in the middle of a scan and switch to a new file?::  
5440* How do I execute code only during initialization (only before the first scan)?::  
5441* How do I execute code at termination?::  
5442* Where else can I find help?::  
5443* Can I include comments in the "rules" section of the file?::  
5444* I get an error about undefined yywrap().::  
5445* How can I change the matching pattern at run time?::  
5446* How can I expand macros in the input?::  
5447* How can I build a two-pass scanner?::  
5448* How do I match any string not matched in the preceding rules?::  
5449* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::  
5450* Is there a way to make flex treat NULL like a regular character?::  
5451* Whenever flex can not match the input it says "flex scanner jammed".::  
5452* Why doesn't flex have non-greedy operators like perl does?::  
5453* Memory leak - 16386 bytes allocated by malloc.::  
5454* How do I track the byte offset for lseek()?::  
5455* How do I use my own I/O classes in a C++ scanner?::  
5456* How do I skip as many chars as possible?::  
5457* deleteme00::              
5458* Are certain equivalent patterns faster than others?::              
5459* Is backing up a big deal?::              
5460* Can I fake multi-byte character support?::              
5461* deleteme01::              
5462* Can you discuss some flex internals?::              
5463* unput() messes up yy_at_bol::              
5464* The | operator is not doing what I want::              
5465* Why can't flex understand this variable trailing context pattern?::              
5466* The ^ operator isn't working::              
5467* Trailing context is getting confused with trailing optional patterns::              
5468* Is flex GNU or not?::              
5469* ERASEME53::              
5470* I need to scan if-then-else blocks and while loops::              
5471* ERASEME55::              
5472* ERASEME56::              
5473* ERASEME57::              
5474* Is there a repository for flex scanners?::              
5475* How can I conditionally compile or preprocess my flex input file?::              
5476* Where can I find grammars for lex and yacc?::              
5477* I get an end-of-buffer message for each character scanned.::              
5478* unnamed-faq-62::              
5479* unnamed-faq-63::              
5480* unnamed-faq-64::              
5481* unnamed-faq-65::              
5482* unnamed-faq-66::              
5483* unnamed-faq-67::              
5484* unnamed-faq-68::              
5485* unnamed-faq-69::              
5486* unnamed-faq-70::              
5487* unnamed-faq-71::              
5488* unnamed-faq-72::              
5489* unnamed-faq-73::              
5490* unnamed-faq-74::              
5491* unnamed-faq-75::              
5492* unnamed-faq-76::              
5493* unnamed-faq-77::              
5494* unnamed-faq-78::              
5495* unnamed-faq-79::              
5496* unnamed-faq-80::              
5497* unnamed-faq-81::              
5498* unnamed-faq-82::              
5499* unnamed-faq-83::              
5500* unnamed-faq-84::              
5501* unnamed-faq-85::              
5502* unnamed-faq-86::              
5503* unnamed-faq-87::              
5504* unnamed-faq-88::              
5505* unnamed-faq-90::              
5506* unnamed-faq-91::              
5507* unnamed-faq-92::              
5508* unnamed-faq-93::              
5509* unnamed-faq-94::              
5510* unnamed-faq-95::              
5511* unnamed-faq-96::              
5512* unnamed-faq-97::              
5513* unnamed-faq-98::              
5514* unnamed-faq-99::              
5515* unnamed-faq-100::             
5516* unnamed-faq-101::             
5517* What is the difference between YYLEX_PARAM and YY_DECL?::
5518* Why do I get "conflicting types for yylex" error?::
5519* How do I access the values set in a Flex action from within a Bison action?::
5520@end menu
5521
5522@node  When was flex born?
5523@unnumberedsec When was flex born?
5524
5525Vern Paxson took over
5526the @cite{Software Tools} lex project from Jef Poskanzer in 1982.  At that point it
5527was written in Ratfor.  Around 1987 or so, Paxson translated it into C, and
5528a legend was born :-).
5529
5530@node How do I expand backslash-escape sequences in C-style quoted strings?
5531@unnumberedsec How do I expand backslash-escape sequences in C-style quoted strings?
5532
5533A key point when scanning quoted strings is that you cannot (easily) write
5534a single rule that will precisely match the string if you allow things
5535like embedded escape sequences and newlines.  If you try to match strings
5536with a single rule then you'll wind up having to rescan the string anyway
5537to find any escape sequences.
5538
5539Instead you can use exclusive start conditions and a set of rules, one for
5540matching non-escaped text, one for matching a single escape, one for
5541matching an embedded newline, and one for recognizing the end of the
5542string.  Each of these rules is then faced with the question of where to
5543put its intermediary results.  The best solution is for the rules to
5544append their local value of @code{yytext} to the end of a ``string literal''
5545buffer.  A rule like the escape-matcher will append to the buffer the
5546meaning of the escape sequence rather than the literal text in @code{yytext}.
5547In this way, @code{yytext} does not need to be modified at all.
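
The buffer-appending approach described above might look roughly like the
following helper code, which the actions of the individual rules would
call. This is a minimal sketch under stated assumptions: the names
@code{strbuf}, @code{strbuf_append}, @code{escape_value} and
@code{strbuf_append_escape} are hypothetical, the buffer has an arbitrary
fixed size, and only single-character escapes are handled (octal and
hexadecimal escapes are left out).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STR_MAX 256  /* arbitrary limit chosen for this sketch */

/* Hypothetical "string literal" buffer that the rules append to. */
struct strbuf {
    char   data[STR_MAX];
    size_t len;
};

static void strbuf_reset(struct strbuf *b)
{
    b->len = 0;
    b->data[0] = '\0';
}

/* Append n bytes (e.g. yytext, yyleng); returns -1 if they do not fit. */
static int strbuf_append(struct strbuf *b, const char *text, size_t n)
{
    assert(b != NULL && text != NULL);
    if (n >= STR_MAX - b->len)
        return -1;  /* no room for the text plus the terminating NUL */
    memcpy(b->data + b->len, text, n);
    b->len += n;
    b->data[b->len] = '\0';
    return 0;
}

/* Meaning of the character that follows a backslash.  Unknown escapes
 * stand for themselves, which covers a backslash and a double quote. */
static char escape_value(char c)
{
    switch (c) {
    case 'n': return '\n';
    case 't': return '\t';
    case 'r': return '\r';
    default:  return c;
    }
}

/* What an escape rule's action would do with a two-character match
 * such as backslash followed by 'n'. */
static int strbuf_append_escape(struct strbuf *b, const char *matched)
{
    char c;

    assert(matched[0] == '\\' && matched[1] != '\0');
    c = escape_value(matched[1]);
    return strbuf_append(b, &c, 1);
}
```

In a scanner, the rule for non-escaped text would call
@code{strbuf_append} with @code{yytext} and @code{yyleng}, the rule for a
single escape would call @code{strbuf_append_escape} with @code{yytext},
and the rule that opens the string would call @code{strbuf_reset}.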
5548
5549@node  Why do flex scanners call fileno if it is not ANSI compatible?
5550@unnumberedsec Why do flex scanners call fileno if it is not ANSI compatible?
5551
Flex scanners call @code{fileno()} in order to get the file descriptor
corresponding to @code{yyin}. The file descriptor may be passed to
@code{isatty()} or @code{read()}, depending upon which @code{%option}s you specified.
If your system does not support @code{fileno()}, you can avoid the
@code{read()} call by not specifying @code{%option read}. To avoid the @code{isatty()}
call, specify either @code{%option always-interactive} or
@code{%option never-interactive}.
5559
5560@node  Does flex support recursive pattern definitions?
5561@unnumberedsec Does flex support recursive pattern definitions?
5562
5563e.g.,
5564
5565@example
5566@verbatim
5567%%
5568block   "{"({block}|{statement})*"}"
5569@end verbatim
5570@end example
5571
5572No. You cannot have recursive definitions.  The pattern-matching power of
5573regular expressions in general (and therefore flex scanners, too) is
5574limited.  In particular, regular expressions cannot ``balance'' parentheses
5575to an arbitrary degree.  For example, it's impossible to write a regular
5576expression that matches all strings containing the same number of '@{'s
5577as '@}'s.  For more powerful pattern matching, you need a parser, such
5578as @cite{GNU bison}.
5579
5580@node  How do I skip huge chunks of input (tens of megabytes) while using flex?
5581@unnumberedsec How do I skip huge chunks of input (tens of megabytes) while using flex?
5582
Use @code{fseek()} (or @code{lseek()}) to position @code{yyin}, then call @code{yyrestart()}.
5584
5585@node  Flex is not matching my patterns in the same order that I defined them.
5586@unnumberedsec Flex is not matching my patterns in the same order that I defined them.
5587
5588@code{flex} picks the
5589rule that matches the most text (i.e., the longest possible input string).
This is because @code{flex} does not try the rules one at a time in the
order given.  Instead it uses a matching technique
(``deterministic finite automata'') that actually does all of the matching
5592simultaneously, in parallel.  (Seems impossible, but it's actually a fairly
5593simple technique once you understand the principles.)
5594
A side-effect of this parallel matching is that when the input matches more
than one rule, @code{flex} scanners pick the rule that matched the @emph{most} text. This
is explained further in @ref{Matching}.
5598
5599If you want @code{flex} to choose a shorter match, then you can work around this
5600behavior by expanding your short
5601rule to match more text, then put back the extra:
5602
5603@example
5604@verbatim
5605data_.*        yyless( 5 ); BEGIN BLOCKIDSTATE;
5606@end verbatim
5607@end example
5608
5609Another fix would be to make the second rule active only during the
5610@code{<BLOCKIDSTATE>} start condition, and make that start condition exclusive
5611by declaring it with @code{%x} instead of @code{%s}.
5612
5613A final fix is to change the input language so that the ambiguity for
5614@samp{data_} is removed, by adding characters to it that don't match the
5615identifier rule, or by removing characters (such as @samp{_}) from the
5616identifier rule so it no longer matches @samp{data_}.  (Of course, you might
5617also not have the option of changing the input language.)
5618
5619@node  My actions are executing out of order or sometimes not at all.
5620@unnumberedsec My actions are executing out of order or sometimes not at all.
5621
5622Most likely, you have (in error) placed the opening @samp{@{} of the action
5623block on a different line than the rule, e.g.,
5624
5625@example
5626@verbatim
5627^(foo|bar)
5628{  <<<--- WRONG!
5629
5630}
5631@end verbatim
5632@end example
5633
5634@code{flex} requires that the opening @samp{@{} of an action associated with a rule
5635begin on the same line as does the rule.  You need instead to write your rules
5636as follows:
5637
5638@example
5639@verbatim
5640^(foo|bar)   {  // CORRECT!
5641
5642}
5643@end verbatim
5644@end example
5645
5646@node  How can I have multiple input sources feed into the same scanner at the same time?
5647@unnumberedsec How can I have multiple input sources feed into the same scanner at the same time?
5648
5649If @dots{}
5650@itemize
5651@item
5652your scanner is free of backtracking (verified using @code{flex}'s @samp{-b} flag),
5653@item
5654AND you run your scanner interactively (@samp{-I} option; default unless using special table
5655compression options),
5656@item
5657AND you feed it one character at a time by redefining @code{YY_INPUT} to do so,
5658@end itemize
5659
5660then every time it matches a token, it will have exhausted its input
5661buffer (because the scanner is free of backtracking).  This means you
can safely use @code{select()} at that point and only call @code{yylex()} for another
5663token if @code{select()} indicates there's data available.
5664
5665That is, move the @code{select()} out from the input function to a point where
5666it determines whether @code{yylex()} gets called for the next token.
5667
5668With this approach, you will still have problems if your input can arrive
5669piecemeal; @code{select()} could inform you that the beginning of a token is
5670available, you call @code{yylex()} to get it, but it winds up blocking waiting
5671for the later characters in the token.
5672
5673Here's another way:  Move your input multiplexing inside of @code{YY_INPUT}.  That
5674is, whenever @code{YY_INPUT} is called, it @code{select()}'s to see where input is
5675available.  If input is available for the scanner, it reads and returns the
5676next byte.  If input is available from another source, it calls whatever
5677function is responsible for reading from that source.  (If no input is
5678available, it blocks until some input is available.)  I've used this technique in an
5679interpreter I wrote that both reads keyboard input using a @code{flex} scanner and
5680IPC traffic from sockets, and it works fine.
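
The second approach, multiplexing inside @code{YY_INPUT}, could be
sketched along the following lines for a POSIX system. The function
@code{next_scanner_byte} and the example handler @code{drain_other} are
hypothetical names, the sketch handles exactly one extra input source, and
it treats any @code{select()} failure (including an interrupted call) as
end of input, which a real program would probably want to refine.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

static int other_bytes_seen;  /* updated by the example handler below */

/* Example handler for the non-scanner source: consumes one byte.
 * A handler must consume input, or the loop below would spin. */
static void drain_other(int fd)
{
    char c;

    if (read(fd, &c, 1) == 1)
        other_bytes_seen++;
}

/* Body for a YY_INPUT replacement.  Blocks until input is available,
 * services the other source whenever it is readable, and returns 1
 * after storing one scanner byte in *out, or 0 at end of input. */
static int next_scanner_byte(int scanner_fd, int other_fd,
                             void (*handle_other)(int fd), char *out)
{
    assert(handle_other != NULL && out != NULL);
    for (;;) {
        fd_set readable;
        int maxfd = scanner_fd > other_fd ? scanner_fd : other_fd;

        FD_ZERO(&readable);
        FD_SET(scanner_fd, &readable);
        FD_SET(other_fd, &readable);
        if (select(maxfd + 1, &readable, NULL, NULL, NULL) < 0)
            return 0;  /* treat errors as end of input in this sketch */
        if (FD_ISSET(other_fd, &readable))
            handle_other(other_fd);
        if (FD_ISSET(scanner_fd, &readable))
            return read(scanner_fd, out, 1) == 1;
    }
}
```

A scanner would then define @code{YY_INPUT} so that it stores the byte in
its buffer argument and sets its result argument to the value returned by
@code{next_scanner_byte}, with @code{fileno(yyin)} as the scanner's
descriptor.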
5681
5682@node  Can I build nested parsers that work with the same input file?
5683@unnumberedsec Can I build nested parsers that work with the same input file?
5684
5685This is not going to work without some additional effort.  The reason is
5686that @code{flex} block-buffers the input it reads from @code{yyin}.  This means that the
5687``outermost'' @code{yylex()}, when called, will automatically slurp up the first 8K
of input available on @code{yyin}, and subsequent calls to other @code{yylex()}'s won't
5689see that input.  You might be tempted to work around this problem by
5690redefining @code{YY_INPUT} to only return a small amount of text, but it turns out
5691that that approach is quite difficult.  Instead, the best solution is to
5692combine all of your scanners into one large scanner, using a different
5693exclusive start condition for each.
5694
5695@node  How can I match text only at the end of a file?
5696@unnumberedsec How can I match text only at the end of a file?
5697
5698There is no way to write a rule which is ``match this text, but only if
5699it comes at the end of the file''.  You can fake it, though, if you happen
5700to have a character lying around that you don't allow in your input.
5701Then you redefine @code{YY_INPUT} to call your own routine which, if it sees
5702an @samp{EOF}, returns the magic character first (and remembers to return a
5703real @code{EOF} next time it's called).  Then you could write:
5704
5705@example
5706@verbatim
5707<COMMENT>(.|\n)*{EOF_CHAR}    /* saw comment at EOF */
5708@end verbatim
5709@end example
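
The input routine that @code{YY_INPUT} calls could be sketched like this.
The names @code{EOF_CHAR}, @code{eof_marker_src} and
@code{read_with_eof_marker} are hypothetical, the choice of the byte
@code{'\001'} as the magic character is only an assumption about what the
input never contains, and a read error is treated the same as end of file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define EOF_CHAR '\001'  /* assumed never to occur in real input */

/* Hypothetical input state; marker_sent records whether the magic
 * character has already been delivered. */
struct eof_marker_src {
    FILE *fp;
    int   marker_sent;
};

/* Fills buf with up to max_size bytes and returns the count.
 * At end of file it first returns the single byte EOF_CHAR, and only
 * on the following call returns 0 (the real end of input). */
static size_t read_with_eof_marker(struct eof_marker_src *src,
                                   char *buf, size_t max_size)
{
    size_t n;

    assert(src != NULL && buf != NULL && max_size > 0);
    if (src->marker_sent)
        return 0;
    n = fread(buf, 1, max_size, src->fp);
    if (n == 0) {
        buf[0] = EOF_CHAR;
        src->marker_sent = 1;
        return 1;
    }
    return n;
}
```

A scanner would redefine @code{YY_INPUT} to call this routine and assign
its return value to the macro's result argument; a return value of zero
corresponds to the normal end-of-input indication.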
5710
5711@node  How can I make REJECT cascade across start condition boundaries?
5712@unnumberedsec How can I make REJECT cascade across start condition boundaries?
5713
5714You can do this as follows.  Suppose you have a start condition @samp{A}, and
5715after exhausting all of the possible matches in @samp{<A>}, you want to try
5716matches in @samp{<INITIAL>}.  Then you could use the following:
5717
5718@example
5719@verbatim
5720%x A
5721%%
5722<A>rule_that_is_long    ...; REJECT;
5723<A>rule                 ...; REJECT; /* shorter rule */
5724<A>etc.
5725...
5726<A>.|\n  {
5727/* Shortest and last rule in <A>, so
5728* cascaded REJECTs will eventually
5729* wind up matching this rule.  We want
5730* to now switch to the initial state
5731* and try matching from there instead.
5732*/
5733yyless(0);    /* put back matched text */
5734BEGIN(INITIAL);
5735}
5736@end verbatim
5737@end example
5738
5739@node  Why cant I use fast or full tables with interactive mode?
5740@unnumberedsec Why can't I use fast or full tables with interactive mode?
5741
One of the assumptions
flex makes is that interactive applications are inherently slow (they're
waiting on a human after all).
The restriction has to do with how the scanner detects that it must be finished scanning
5746a token.  For interactive scanners, after scanning each character the current
5747state is looked up in a table (essentially) to see whether there's a chance
5748of another input character possibly extending the length of the match.  If
5749not, the scanner halts.  For non-interactive scanners, the end-of-token test
5750is much simpler, basically a compare with 0, so no memory bus cycles.  Since
5751the test occurs in the innermost scanning loop, one would like to make it go
5752as fast as possible.
5753
5754Still, it seems reasonable to allow the user to choose to trade off a bit
5755of performance in this area to gain the corresponding flexibility.  There
5756might be another reason, though, why fast scanners don't support the
5757interactive option.
5758
5759@node  How much faster is -F or -f than -C?
5760@unnumberedsec How much faster is -F or -f than -C?
5761
5762Much faster (factor of 2-3).
5763
5764@node  If I have a simple grammar cant I just parse it with flex?
5765@unnumberedsec If I have a simple grammar can't I just parse it with flex?
5766
5767Is your grammar recursive? That's almost always a sign that you're
5768better off using a parser/scanner rather than just trying to use a scanner
5769alone.
5770
5771@node  Why doesn't yyrestart() set the start state back to INITIAL?
5772@unnumberedsec Why doesn't yyrestart() set the start state back to INITIAL?
5773
5774There are two reasons.  The first is that there might
5775be programs that rely on the start state not changing across file changes.
5776The second is that beginning with @code{flex} version 2.4, use of @code{yyrestart()} is no longer required,
5777so fixing the problem there doesn't solve the more general problem.
5778
5779@node  How can I match C-style comments?
5780@unnumberedsec How can I match C-style comments?
5781
5782You might be tempted to try something like this:
5783
5784@example
5785@verbatim
5786"/*".*"*/"       // WRONG!
5787@end verbatim
5788@end example
5789
5790or, worse, this:
5791
5792@example
5793@verbatim
"/*"(.|\n)*"*/"  // WRONG!
5795@end verbatim
5796@end example
5797
5798The above rules will eat too much input, and blow up on things like:
5799
5800@example
5801@verbatim
5802/* a comment */ do_my_thing( "oops */" );
5803@end verbatim
5804@end example
5805
5806Here is one way which allows you to track line information:
5807
5808@example
5809@verbatim
5810<INITIAL>{
5811"/*"              BEGIN(IN_COMMENT);
5812}
5813<IN_COMMENT>{
5814"*/"      BEGIN(INITIAL);
5815[^*\n]+   // eat comment in chunks
5816"*"       // eat the lone star
5817\n        yylineno++;
5818}
5819@end verbatim
5820@end example
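
As a minimal sketch, the fragment above could be completed into a
stand-alone scanner along the following lines.  The
@code{%x IN_COMMENT} declaration, the @code{line_count} variable (used here
in place of @code{yylineno}) and the @code{main()} driver are assumptions
added for illustration; they are not part of the fragment above.

@example
@verbatim
%option noyywrap
%x IN_COMMENT
%{
#include <stdio.h>
static int line_count = 1;   /* illustrative counter, not flex's yylineno */
%}
%%
<INITIAL>{
"/*"        BEGIN(IN_COMMENT);
\n          { line_count++; ECHO; }
}
<IN_COMMENT>{
"*/"        BEGIN(INITIAL);
[^*\n]+     /* eat comment in chunks */
"*"         /* eat the lone star */
\n          line_count++;
}
%%
int main(void)
{
    yylex();   /* text outside comments is echoed by the default rule */
    fprintf(stderr, "input ended on line %d\n", line_count);
    return 0;
}
@end verbatim
@end example

Because @code{IN_COMMENT} is an exclusive start condition, none of the
ordinary rules are active inside a comment, so a string such as
@samp{"oops */"} outside a comment is never mistaken for a comment
terminator.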
5821
5822@node  The period isn't working the way I expected.
5823@unnumberedsec The '.' isn't working the way I expected.
5824
5825Here are some tips for using @samp{.}:
5826
5827@itemize
5828@item
A common mistake is to place the grouping parentheses AFTER an operator, when
you really meant to place them BEFORE the operator, e.g., you
probably want this @code{(foo|bar)+} and NOT this @code{(foo|bar+)}.

The first pattern matches the words @samp{foo} or @samp{bar} one or more
times, e.g., it matches the text @samp{barfoofoobarfoo}. The
second pattern matches a single instance of @code{foo} or a single instance of
@code{ba} followed by one or more @samp{r}s, e.g., it matches the text @code{barrrr}.
5837@item
A @samp{.} inside @samp{[]}'s just means a literal @samp{.} (period),
and NOT ``any character except newline''.
@item
Remember that @samp{.} matches any character EXCEPT @samp{\n} (and @samp{EOF}).
If you really want to match ANY character, including newlines, then use @code{(.|\n)}.
Beware that the regex @code{(.|\n)+} will match your entire input!
@item
Finally, if you want to match a literal @samp{.} (a period), then use @samp{[.]} or @samp{"."}.
5846@end itemize
5847
5848@node  Can I get the flex manual in another format?
5849@unnumberedsec Can I get the flex manual in another format?
5850
5851The @code{flex} source distribution  includes a texinfo manual. You are
5852free to convert that texinfo into whatever format you desire. The
5853@code{texinfo} package includes tools for conversion to a number of formats.
5854
5855@node  Does there exist a "faster" NDFA->DFA algorithm?
5856@unnumberedsec Does there exist a "faster" NDFA->DFA algorithm?
5857
There's no way around the potential exponential running time: it
can take exponential time just to enumerate all of the DFA states.
5860In practice, though, the running time is closer to linear, or sometimes
5861quadratic.
5862
5863@node  How does flex compile the DFA so quickly?
5864@unnumberedsec How does flex compile the DFA so quickly?
5865
5866There are two big speed wins that @code{flex} uses:
5867
5868@enumerate
5869@item
5870It analyzes the input rules to construct equivalence classes for those
5871characters that always make the same transitions.  It then rewrites the NFA
5872using equivalence classes for transitions instead of characters.  This cuts
5873down the NFA->DFA computation time dramatically, to the point where, for
5874uncompressed DFA tables, the DFA generation is often I/O bound in writing out
5875the tables.
5876@item
5877It maintains hash values for previously computed DFA states, so testing
5878whether a newly constructed DFA state is equivalent to a previously constructed
5879state can be done very quickly, by first comparing hash values.
5880@end enumerate
5881
5882@node  How can I use more than 8192 rules?
5883@unnumberedsec How can I use more than 8192 rules?
5884
5885@code{Flex} is compiled with an upper limit of 8192 rules per scanner.
5886If you need more than 8192 rules in your scanner, you'll have to recompile @code{flex}
5887with the following changes in @file{flexdef.h}:
5888
5889@example
5890@verbatim
5891<    #define YY_TRAILING_MASK 0x2000
5892<    #define YY_TRAILING_HEAD_MASK 0x4000
5893--
5894>    #define YY_TRAILING_MASK 0x20000000
5895>    #define YY_TRAILING_HEAD_MASK 0x40000000
5896@end verbatim
5897@end example
5898
This should work okay as long as your C compiler uses 32-bit integers.
5900But you might want to think about whether using such a huge number of rules
5901is the best way to solve your problem.
5902
5903The following may also be relevant:
5904
With luck, you should be able to increase the definitions in @file{flexdef.h} for:
5906
5907@example
5908@verbatim
5909#define JAMSTATE -32766 /* marks a reference to the state that always jams */
5910#define MAXIMUM_MNS 31999
5911#define BAD_SUBSCRIPT -32767
5912@end verbatim
5913@end example
5914
Then recompile everything, and it should all work.  @code{Flex} only has these
16-bit-like values built into it because a long time ago it was developed on
a machine with 16-bit ints.  I've given this advice to others in the past but
haven't heard back from them on whether it worked.
5919
5920@node  How do I abandon a file in the middle of a scan and switch to a new file?
5921@unnumberedsec How do I abandon a file in the middle of a scan and switch to a new file?
5922
Just call @code{yyrestart(newfile)}. Be sure to reset the start state if you want a
``fresh start'', since @code{yyrestart()} does NOT reset the start state back to @code{INITIAL}.
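
A minimal sketch of such a switch is shown below.  The helper name
@code{switch_input()} and the @code{QUOTED} start condition are
hypothetical, chosen only for illustration.  The helper is placed in the
user code section of the scanner file because @code{BEGIN} is a macro that
is only available there.

@example
@verbatim
%option noyywrap
%x QUOTED
%{
#include <stdio.h>
%}
%%
\"              BEGIN(QUOTED);
<QUOTED>\"      BEGIN(INITIAL);
<QUOTED>[^\"]+  /* discard quoted text */
%%
/* Hypothetical helper: abandon the current input and start on newfile. */
void switch_input(FILE *newfile)
{
    yyrestart(newfile);
    BEGIN(INITIAL);     /* yyrestart() leaves the start state alone */
}
@end verbatim
@end example

Without the @code{BEGIN(INITIAL)}, abandoning a file in the middle of a
quoted string would leave the scanner in @code{QUOTED} when it starts
reading the new file.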
5925
5926@node  How do I execute code only during initialization (only before the first scan)?
5927@unnumberedsec How do I execute code only during initialization (only before the first scan)?
5928
5929You can specify an initial action by defining the macro @code{YY_USER_INIT} (though
5930note that @code{yyout} may not be available at the time this macro is executed).  Or you
5931can add to the beginning of your rules section:
5932
5933@example
5934@verbatim
%%
    /* Must be indented! */
    static int did_init = 0;

    if ( ! did_init ){
        do_my_init();
        did_init = 1;
    }
5943@end verbatim
5944@end example
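
The @code{YY_USER_INIT} alternative might look roughly like the sketch
below.  The body of @code{do_my_init()}, the @code{tokens_seen} counter and
the @code{main()} driver are assumptions made up for this illustration.

@example
@verbatim
%option noyywrap
%{
#include <stdio.h>
static int tokens_seen;

/* Hypothetical one-time setup. */
static void do_my_init(void)
{
    tokens_seen = 0;
}

/* Expanded once, inside yylex(), before the first scan. */
#define YY_USER_INIT do_my_init()
%}
%%
[a-z]+      tokens_seen++;
.|\n        /* ignore everything else */
%%
int main(void)
{
    yylex();
    printf("%d\n", tokens_seen);
    return 0;
}
@end verbatim
@end example

The macro has to be defined in the definitions section so that it is
visible by the time the scanning routine is compiled.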
5945
5946@node  How do I execute code at termination?
5947@unnumberedsec How do I execute code at termination?
5948
5949You can specify an action for the @code{<<EOF>>} rule.
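
For instance, a sketch along these lines prints a summary once the input
is exhausted.  The @code{words} counter and the @code{main()} driver are
assumptions added for illustration.

@example
@verbatim
%option noyywrap
%{
#include <stdio.h>
static int words;
%}
%%
[A-Za-z]+   words++;
.|\n        /* ignore everything else */
<<EOF>>     {
                /* runs when the scanner reaches end of input */
                printf("%d words\n", words);
                yyterminate();
            }
%%
int main(void)
{
    yylex();
    return 0;
}
@end verbatim
@end example

Note that an @code{<<EOF>>} action should end by calling
@code{yyterminate()}, returning, or switching to a new input source;
otherwise the scanner has nothing left to read.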
5950
5951@node  Where else can I find help?
5952@unnumberedsec Where else can I find help?
5953
5954You can find the flex homepage on the web at
5955@uref{http://flex.sourceforge.net/}. See that page for details about flex
5956mailing lists as well.
5957
5958@node  Can I include comments in the "rules" section of the file?
5959@unnumberedsec Can I include comments in the "rules" section of the file?
5960
5961Yes, just about anywhere you want to. See the manual for the specific syntax.
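
As a rough sketch of the usual placements (the rules themselves are
placeholders invented for this example): a comment in the rules section
should not start in the first column, where @code{flex} expects a pattern,
but it may be indented, follow a pattern as part of the action, or sit
inside a braced action.

@example
@verbatim
%option noyywrap
%%
    /* An indented comment before the first rule is copied to the
       output instead of being read as a pattern. */
[0-9]+      /* a comment can stand in for an empty action */
[a-z]+      {
                /* or sit inside a braced action */
                ECHO;
            }
.|\n        ;
%%
int main(void)
{
    yylex();
    return 0;
}
@end verbatim
@end example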
5962
5963@node  I get an error about undefined yywrap().
5964@unnumberedsec I get an error about undefined yywrap().
5965
5966You must supply a @code{yywrap()} function of your own, or link to @file{libfl.a}
5967(which provides one), or use
5968
5969@example
5970@verbatim
5971%option noyywrap
5972@end verbatim
5973@end example
5974
5975in your source to say you don't want a @code{yywrap()} function.
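
If you prefer to supply the function yourself, a minimal sketch is shown
below.  The echo rule and the @code{main()} driver are only there to make
the example complete.

@example
@verbatim
%{
#include <stdio.h>
%}
%%
.|\n    ECHO;
%%
/* Returning 1 tells the scanner that there is no further input. */
int yywrap(void)
{
    return 1;
}

int main(void)
{
    yylex();
    return 0;
}
@end verbatim
@end example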
5976
5977@node  How can I change the matching pattern at run time?
5978@unnumberedsec How can I change the matching pattern at run time?
5979
You can't; the patterns are compiled into static tables when @code{flex} builds the scanner.
5981
5982@node How can I expand macros in the input?
5983@unnumberedsec How can I expand macros in the input?
5984
5985The best way to approach this problem is at a higher level, e.g., in the parser.
5986
5987However, you can do this using multiple input buffers.
5988
5989@example
5990@verbatim
5991%%
macro/[a-z]+	{
    /* Saw the macro "macro" followed by extra stuff. */
    main_buffer = YY_CURRENT_BUFFER;
    expansion_buffer = yy_scan_string(expand(yytext));
    yy_switch_to_buffer(expansion_buffer);
    }

<<EOF>>	{
    if ( expansion_buffer )
        {
        // We were doing an expansion, return to where
        // we were.
        yy_switch_to_buffer(main_buffer);
        yy_delete_buffer(expansion_buffer);
        expansion_buffer = 0;
        }
    else
        yyterminate();
    }
6011@end verbatim
6012@end example
6013
You will probably want a stack of expansion buffers to allow nested macros,
but the idea should be clear from the above.
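
One possible shape for such a stack is sketched below.  Everything that
does not appear in the example above is an assumption made for
illustration: the @samp{$name} macro syntax, the fixed-size @code{saved}
array, the @code{MAX_DEPTH} limit, and the trivial @code{expand()}
function, which maps every macro to the same text.

@example
@verbatim
%option noyywrap
%{
#include <stdio.h>
#define MAX_DEPTH 16
static YY_BUFFER_STATE saved[MAX_DEPTH];   /* buffers to return to */
static int depth = 0;
static const char *expand(const char *name);
%}
%%
"$"[a-z]+   {
                const char *text = expand(yytext + 1);
                if ( depth >= MAX_DEPTH )
                    {
                    fprintf(stderr, "macros nested too deeply\n");
                    yyterminate();
                    }
                saved[depth++] = YY_CURRENT_BUFFER;
                yy_scan_string(text);   /* creates and switches to a buffer */
            }

<<EOF>>     {
                if ( depth > 0 )
                    {
                    /* Finished an expansion: drop its buffer and
                       resume the one that was interrupted. */
                    yy_delete_buffer(YY_CURRENT_BUFFER);
                    yy_switch_to_buffer(saved[--depth]);
                    }
                else
                    yyterminate();
            }
%%
/* Hypothetical expansion: every macro expands to the same text. */
static const char *expand(const char *name)
{
    (void) name;
    return "[expanded]";
}

int main(void)
{
    yylex();
    return 0;
}
@end verbatim
@end example

Because the expanded text is scanned by the same rules, a macro whose
expansion contains another @samp{$name} simply pushes one more entry; the
depth limit guards against a macro that expands to itself.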
6016
6017@node How can I build a two-pass scanner?
6018@unnumberedsec How can I build a two-pass scanner?
6019
6020One way to do it is to filter the first pass to a temporary file,
6021then process the temporary file on the second pass. You will probably see a
6022performance hit, due to all the disk I/O.
6023
6024When you need to look ahead far forward like this, it almost always means
6025that the right solution is to build a parse tree of the entire input, then
6026walk it after the parse in order to generate the output.  In a sense, this
6027is a two-pass approach, once through the text and once through the parse
6028tree, but the performance hit for the latter is usually an order of magnitude
6029smaller, since everything is already classified, in binary format, and
6030residing in memory.
6031
6032@node How do I match any string not matched in the preceding rules?
6033@unnumberedsec How do I match any string not matched in the preceding rules?
6034
One way to assign precedence is to place the more specific rules first. If
two rules would match the same input (same sequence of characters) then the
first rule listed in the @code{flex} input wins, e.g.,
6038
6039@example
6040@verbatim
6041%%
6042foo[a-zA-Z_]+    return FOO_ID;
6043bar[a-zA-Z_]+    return BAR_ID;
6044[a-zA-Z_]+       return GENERIC_ID;
6045@end verbatim
6046@end example
6047
Note that the rule @code{[a-zA-Z_]+} must come @emph{after} the others.  It will match the
same amount of text as the more specific rules, and in that case the
@code{flex} scanner will pick the first rule listed in your scanner as the
one to match.
6052
6053@node I am trying to port code from AT&T lex that uses yysptr and yysbuf.
6054@unnumberedsec I am trying to port code from AT&T lex that uses yysptr and yysbuf.
6055
6056Those are internal variables pointing into the AT&T scanner's input buffer.  I
6057imagine they're being manipulated in user versions of the @code{input()} and @code{unput()}
6058functions.  If so, what you need to do is analyze those functions to figure out
6059what they're doing, and then replace @code{input()} with an appropriate definition of
6060@code{YY_INPUT}.  You shouldn't need to (and must not) replace
6061@code{flex}'s @code{unput()} function.
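
A minimal sketch of such a @code{YY_INPUT} definition follows.  The
function @code{my_read()} is a hypothetical stand-in for whatever the old
@code{input()} did; here it simply reads from @code{stdin}.

@example
@verbatim
%option noyywrap
%{
#include <stdio.h>

/* Hypothetical replacement for the old input() logic.  Returns the
   number of bytes stored in buf, or 0 at end of input. */
static int my_read(char *buf, int max)
{
    return (int) fread(buf, 1, (size_t) max, stdin);
}

#define YY_INPUT(buf, result, max_size) \
    { (result) = my_read((buf), (int) (max_size)); }
%}
%%
.|\n    ECHO;
%%
int main(void)
{
    yylex();
    return 0;
}
@end verbatim
@end example

The scanner calls @code{YY_INPUT} whenever its buffer runs dry, so
@code{my_read()} may return fewer bytes than requested; returning 0
signals end of input.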
6062
6063@node Is there a way to make flex treat NULL like a regular character?
6064@unnumberedsec Is there a way to make flex treat NULL like a regular character?
6065
6066Yes, @samp{\0} and @samp{\x00} should both do the trick.  Perhaps you have an ancient
6067version of @code{flex}.  The latest release is version @value{VERSION}.
6068
6069@node Whenever flex can not match the input it says "flex scanner jammed".
6070@unnumberedsec Whenever flex can not match the input it says "flex scanner jammed".
6071
6072You need to add a rule that matches the otherwise-unmatched text,
6073e.g.,
6074
6075@example
6076@verbatim
6077%option yylineno
6078%%
6079[[a bunch of rules here]]
6080
6081.	printf("bad input character '%s' at line %d\n", yytext, yylineno);
6082@end verbatim
6083@end example
6084
6085See @code{%option default} for more information.
6086
6087@node Why doesn't flex have non-greedy operators like perl does?
6088@unnumberedsec Why doesn't flex have non-greedy operators like perl does?
6089
6090A DFA can do a non-greedy match by stopping
6091the first time it enters an accepting state, instead of consuming input until
6092it determines that no further matching is possible (a ``jam'' state).  This
6093is actually easier to implement than longest leftmost match (which flex does).
6094
6095But it's also much less useful than longest leftmost match.  In general,
6096when you find yourself wishing for non-greedy matching, that's usually a
6097sign that you're trying to make the scanner do some parsing.  That's
6098generally the wrong approach, since it lacks the power to do a decent job.
6099Better is to either introduce a separate parser, or to split the scanner
6100into multiple scanners using (exclusive) start conditions.
6101
For example, you might enter a separate start state once you've seen the
@samp{BEGIN}.  In that state, you might then have a regex that will match
@samp{END} (to kick you out of the state), and perhaps @samp{(.|\n)} to
consume a single character within the chunk.
6106
6107This approach also has much better error-reporting properties.
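
A sketch of that arrangement is given below.  The start condition name
@code{CHUNK}, the error message and the @code{main()} driver are
assumptions chosen for this example.

@example
@verbatim
%option noyywrap
%x CHUNK
%{
#include <stdio.h>
%}
%%
"BEGIN"         { BEGIN(CHUNK); }
<CHUNK>"END"    { BEGIN(INITIAL); }
<CHUNK>(.|\n)   { /* one character of the chunk; collect or discard */ }
<CHUNK><<EOF>>  {
                    fprintf(stderr, "unterminated BEGIN block\n");
                    yyterminate();
                }
.|\n            ECHO;
%%
int main(void)
{
    yylex();
    return 0;
}
@end verbatim
@end example

Inside @code{CHUNK} the three-character match @samp{END} is longer than the
one-character alternative, so the chunk ends at the first @samp{END}, which
is the effect a non-greedy operator would have given.  The
@code{<<EOF>>} rule is where the better error reporting comes from.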
6108
6109@node Memory leak - 16386 bytes allocated by malloc.
6110@unnumberedsec Memory leak - 16386 bytes allocated by malloc.
6111@anchor{faq-memory-leak}
6112
6113UPDATED 2002-07-10: As of @code{flex} version 2.5.9, this leak means that you did not
6114call @code{yylex_destroy()}. If you are using an earlier version of @code{flex}, then read
6115on.
6116
6117The leak is about 16426 bytes.  That is, (8192 * 2 + 2) for the read-buffer, and
6118about 40 for @code{struct yy_buffer_state} (depending upon alignment). The leak is in
6119the non-reentrant C scanner only (NOT in the reentrant scanner, NOT in the C++
6120scanner). Since @code{flex} doesn't know when you are done, the buffer is never freed.
6121
6122However, the leak won't multiply since the buffer is reused no matter how many
6123times you call @code{yylex()}.
6124
6125If you want to reclaim the memory when you are completely done scanning, then
6126you might try this:
6127
6128@example
6129@verbatim
6130/* For non-reentrant C scanner only. */
6131yy_delete_buffer(YY_CURRENT_BUFFER);
6132yy_init = 1;
6133@end verbatim
6134@end example
6135
Note: @code{yy_init} is an ``internal variable'', and resetting it this way
hasn't been tested. It is possible that some other globals may need resetting as well.
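
With @code{flex} 2.5.9 or later, the @code{yylex_destroy()} call mentioned
at the start of this answer might be placed as in the following sketch
(the echo rule and the @code{main()} driver are just scaffolding for the
example):

@example
@verbatim
%option noyywrap
%%
.|\n    ECHO;
%%
int main(void)
{
    yylex();
    yylex_destroy();   /* release the buffer once scanning is finished */
    return 0;
}
@end verbatim
@end example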
6138
6139@node How do I track the byte offset for lseek()?
6140@unnumberedsec How do I track the byte offset for lseek()?
6141
6142@example
6143@verbatim
6144>   We thought that it would be possible to have this number through the
6145>   evaluation of the following expression:
6146>
6147>   seek_position = (no_buffers)*YY_READ_BUF_SIZE + yy_c_buf_p - YY_CURRENT_BUFFER->yy_ch_buf
6148@end verbatim
6149@end example
6150
6151While this is the right idea, it has two problems.  The first is that
6152it's possible that @code{flex} will request less than @code{YY_READ_BUF_SIZE} during
6153an invocation of @code{YY_INPUT} (or that your input source will return less
6154even though @code{YY_READ_BUF_SIZE} bytes were requested).  The second problem
6155is that when refilling its internal buffer, @code{flex} keeps some characters
6156from the previous buffer (because usually it's in the middle of a match,
6157and needs those characters to construct @code{yytext} for the match once it's
6158done).  Because of this, @code{yy_c_buf_p - YY_CURRENT_BUFFER->yy_ch_buf} won't
6159be exactly the number of characters already read from the current buffer.
6160
6161An alternative solution is to count the number of characters you've matched
6162since starting to scan.  This can be done by using @code{YY_USER_ACTION}.  For
6163example,
6164
6165@example
6166@verbatim
6167#define YY_USER_ACTION num_chars += yyleng;
6168@end verbatim
6169@end example
6170
(You need to be careful to update your bookkeeping if you use @code{yymore()},
@code{yyless()}, @code{unput()}, or @code{input()}.)
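
Put together, a scanner that reports where each word starts might look
like the sketch below.  The word rule, the output format and the
@code{main()} driver are assumptions for illustration, and the example
does not use any of the functions listed above, so no extra bookkeeping
is needed.

@example
@verbatim
%option noyywrap
%{
#include <stdio.h>
static long num_chars = 0;   /* bytes covered by matches so far */
#define YY_USER_ACTION num_chars += yyleng;
%}
%%
[A-Za-z]+   {
                /* YY_USER_ACTION has already added this match, so
                   subtract its length to get the starting offset. */
                printf("word at offset %ld\n",
                       num_chars - (long) yyleng);
            }
.|\n        /* counted by YY_USER_ACTION, otherwise ignored */
%%
int main(void)
{
    yylex();
    return 0;
}
@end verbatim
@end example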
6173
6174@node How do I use my own I/O classes in a C++ scanner?
6175@section How do I use my own I/O classes in a C++ scanner?
6176
6177When the flex C++ scanning class rewrite finally happens, then this sort of thing should become much easier.
6178
6179@cindex LexerOutput, overriding
6180@cindex LexerInput, overriding
6181@cindex overriding LexerOutput
6182@cindex overriding LexerInput
6183@cindex customizing I/O in C++ scanners
6184@cindex C++ I/O, customizing
You can do this by passing NULL @code{iostream*}'s to the various functions
(such as @code{LexerInput()} and @code{LexerOutput()}), and then
dealing with your own I/O classes surreptitiously (i.e., stashing them in
special member variables).  This works because the only assumption the
lexer makes about the iostreams is that they're
ultimately passed to @code{LexerInput()} and @code{LexerOutput()}, which then do whatever
is necessary with them.
6192
6193@c faq edit stopped here
6194@node How do I skip as many chars as possible?
6195@unnumberedsec How do I skip as many chars as possible?
6196
6197How do I skip as many chars as possible -- without interfering with the other
6198patterns?
6199
In the example below, we want to skip over characters until we see the phrase
``endskip''. The following will @emph{NOT} work correctly (do you see why not?):
6202
6203@example
6204@verbatim
6205/* INCORRECT SCANNER */
6206%x SKIP
6207%%
6208<INITIAL>startskip   BEGIN(SKIP);
6209...
6210<SKIP>"endskip"       BEGIN(INITIAL);
6211<SKIP>.*             ;
6212@end verbatim
6213@end example
6214
The problem is that the pattern @samp{.*} will eat up the word ``endskip''.
The simplest (but slow) fix is:
6217
6218@example
6219@verbatim
6220<SKIP>"endskip"      BEGIN(INITIAL);
6221<SKIP>.              ;
6222@end verbatim
6223@end example
6224
A faster fix involves making the second rule match more, without
making it match ``endskip'' plus something else.  So for example:
6227
6228@example
6229@verbatim
6230<SKIP>"endskip"     BEGIN(INITIAL);
6231<SKIP>[^e]+         ;
6232<SKIP>.		        ;/* so you eat up e's, too */
6233@end verbatim
6234@end example
6235
6236@c TODO: Evaluate this faq.
6237@node deleteme00
6238@unnumberedsec deleteme00
6239@example
6240@verbatim
6241QUESTION:
6242When was flex born?
6243
6244Vern Paxson took over
6245the Software Tools lex project from Jef Poskanzer in 1982.  At that point it
6246was written in Ratfor.  Around 1987 or so, Paxson translated it into C, and
6247a legend was born :-).
6248@end verbatim
6249@end example
6250
6251@c TODO: Evaluate this faq.
6252@node Are certain equivalent patterns faster than others?
6253@unnumberedsec Are certain equivalent patterns faster than others?
6254@example
6255@verbatim
6256To: Adoram Rogel <adoram@orna.hybridge.com>
6257Subject: Re: Flex 2.5.2 performance questions
6258In-reply-to: Your message of Wed, 18 Sep 96 11:12:17 EDT.
6259Date: Wed, 18 Sep 96 10:51:02 PDT
6260From: Vern Paxson <vern>
6261
6262[Note, the most recent flex release is 2.5.4, which you can get from
6263ftp.ee.lbl.gov.  It has bug fixes over 2.5.2 and 2.5.3.]
6264
6265> 1. Using the pattern
6266>    ([Ff](oot)?)?[Nn](ote)?(\.)?
6267>    instead of
6268>    (((F|f)oot(N|n)ote)|((N|n)ote)|((N|n)\.)|((F|f)(N|n)(\.)))
6269>    (in a very complicated flex program) caused the program to slow from
6270>    300K+/min to 100K/min (no other changes were done).
6271
6272These two are not equivalent.  For example, the first can match "footnote."
but the second can only match "footnote".  This is almost certainly the
cause of the discrepancy - the slower scanner run is matching more tokens,
6275and/or having to do more backing up.
6276
6277> 2. Which of these two are better: [Ff]oot or (F|f)oot ?
6278
6279From a performance point of view, they're equivalent (modulo presumably
6280minor effects such as memory cache hit rates; and the presence of trailing
6281context, see below).  From a space point of view, the first is slightly
6282preferable.
6283
6284> 3. I have a pattern that look like this:
6285>    pats {p1}|{p2}|{p3}|...|{p50}     (50 patterns ORd)
6286>
6287>    running yet another complicated program that includes the following rule:
6288>    <snext>{and}/{no4}{bb}{pats}
6289>
6290>    gets me to "too complicated - over 32,000 states"...
6291
6292I can't tell from this example whether the trailing context is variable-length
6293or fixed-length (it could be the latter if {and} is fixed-length).  If it's
6294variable length, which flex -p will tell you, then this reflects a basic
6295performance problem, and if you can eliminate it by restructuring your
6296scanner, you will see significant improvement.
6297
6298>    so I divided {pats} to {pats1}, {pats2},..., {pats5} each consists of about
6299>    10 patterns and changed the rule to be 5 rules.
6300>    This did compile, but what is the rule of thumb here ?
6301
The rule is to avoid trailing context other than fixed-length, in which for
a/b, either the 'a' pattern or the 'b' pattern has a fixed length.  Use
6304of the '|' operator automatically makes the pattern variable length, so in
6305this case '[Ff]oot' is preferred to '(F|f)oot'.
6306
6307> 4. I changed a rule that looked like this:
6308>    <snext8>{and}{bb}/{ROMAN}[^A-Za-z] { BEGIN...
6309>
6310>    to the next 2 rules:
6311>    <snext8>{and}{bb}/{ROMAN}[A-Za-z] { ECHO;}
6312>    <snext8>{and}{bb}/{ROMAN}         { BEGIN...
6313>
6314>    Again, I understand the using [^...] will cause a great performance loss
6315
6316Actually, it doesn't cause any sort of performance loss.  It's a surprising
6317fact about regular expressions that they always match in linear time
6318regardless of how complex they are.
6319
6320>    but are there any specific rules about it ?
6321
6322See the "Performance Considerations" section of the man page, and also
6323the example in MISC/fastwc/.
6324
6325		Vern
6326@end verbatim
6327@end example
6328
6329@c TODO: Evaluate this faq.
6330@node Is backing up a big deal?
6331@unnumberedsec Is backing up a big deal?
6332@example
6333@verbatim
6334To: Adoram Rogel <adoram@hybridge.com>
6335Subject: Re: Flex 2.5.2 performance questions
6336In-reply-to: Your message of Thu, 19 Sep 96 10:16:04 EDT.
6337Date: Thu, 19 Sep 96 09:58:00 PDT
6338From: Vern Paxson <vern>
6339
6340> a lot about the backing up problem.
6341> I believe that there lies my biggest problem, and I'll try to improve
6342> it.
6343
6344Since you have variable trailing context, this is a bigger performance
6345problem.  Fixing it is usually easier than fixing backing up, which in a
6346complicated scanner (yours seems to fit the bill) can be extremely
6347difficult to do correctly.
6348
6349You also don't mention what flags you are using for your scanner.
6350-f makes a large speed difference, and -Cfe buys you nearly as much
6351speed but the resulting scanner is considerably smaller.
6352
6353> I have an | operator in {and} and in {pats} so both of them are variable
6354> length.
6355
6356-p should have reported this.
6357
6358> Is changing one of them to fixed-length is enough ?
6359
6360Yes.
6361
6362> Is it possible to change the 32,000 states limit ?
6363
6364Yes.  I've appended instructions on how.  Before you make this change,
6365though, you should think about whether there are ways to fundamentally
6366simplify your scanner - those are certainly preferable!
6367
6368		Vern
6369
6370To increase the 32K limit (on a machine with 32 bit integers), you increase
6371the magnitude of the following in flexdef.h:
6372
6373#define JAMSTATE -32766 /* marks a reference to the state that always jams */
6374#define MAXIMUM_MNS 31999
6375#define BAD_SUBSCRIPT -32767
6376#define MAX_SHORT 32700
6377
6378Adding a 0 or two after each should do the trick.
6379@end verbatim
6380@end example
6381
6382@c TODO: Evaluate this faq.
6383@node Can I fake multi-byte character support?
6384@unnumberedsec Can I fake multi-byte character support?
6385@example
6386@verbatim
6387To: Heeman_Lee@hp.com
6388Subject: Re: flex - multi-byte support?
6389In-reply-to: Your message of Thu, 03 Oct 1996 17:24:04 PDT.
6390Date: Fri, 04 Oct 1996 11:42:18 PDT
6391From: Vern Paxson <vern>
6392
6393>      I assume as long as my *.l file defines the
6394>      range of expected character code values (in octal format), flex will
6395>      scan the file and read multi-byte characters correctly. But I have no
6396>      confidence in this assumption.
6397
6398Your lack of confidence is justified - this won't work.
6399
6400Flex has in it a widespread assumption that the input is processed
6401one byte at a time.  Fixing this is on the to-do list, but is involved,
6402so it won't happen any time soon.  In the interim, the best I can suggest
6403(unless you want to try fixing it yourself) is to write your rules in
6404terms of pairs of bytes, using definitions in the first section:
6405
6406	X	\xfe\xc2
6407	...
6408	%%
6409	foo{X}bar	found_foo_fe_c2_bar();
6410
6411etc.  Definitely a pain - sorry about that.
6412
6413By the way, the email address you used for me is ancient, indicating you
6414have a very old version of flex.  You can get the most recent, 2.5.4, from
6415ftp.ee.lbl.gov.
6416
6417		Vern
6418@end verbatim
6419@end example
6420
6421@c TODO: Evaluate this faq.
6422@node deleteme01
6423@unnumberedsec deleteme01
6424@example
6425@verbatim
6426To: moleary@primus.com
6427Subject: Re: Flex / Unicode compatibility question
6428In-reply-to: Your message of Tue, 22 Oct 1996 10:15:42 PDT.
6429Date: Tue, 22 Oct 1996 11:06:13 PDT
6430From: Vern Paxson <vern>
6431
6432Unfortunately flex at the moment has a widespread assumption within it
6433that characters are processed 8 bits at a time.  I don't see any easy
6434fix for this (other than writing your rules in terms of double characters -
6435a pain).  I also don't know of a wider lex, though you might try surfing
6436the Plan 9 stuff because I know it's a Unicode system, and also the PCCT
6437toolkit (try searching say Alta Vista for "Purdue Compiler Construction
6438Toolkit").
6439
6440Fixing flex to handle wider characters is on the long-term to-do list.
6441But since flex is a strictly spare-time project these days, this probably
6442won't happen for quite a while, unless someone else does it first.
6443
6444		Vern
6445@end verbatim
6446@end example
6447
6448@c TODO: Evaluate this faq.
6449@node Can you discuss some flex internals?
6450@unnumberedsec Can you discuss some flex internals?
6451@example
6452@verbatim
6453To: Johan Linde <jl@theophys.kth.se>
6454Subject: Re: translation of flex
6455In-reply-to: Your message of Sun, 10 Nov 1996 09:16:36 PST.
6456Date: Mon, 11 Nov 1996 10:33:50 PST
6457From: Vern Paxson <vern>
6458
6459> I'm working for the Swedish team translating GNU program, and I'm currently
6460> working with flex. I have a few questions about some of the messages which
6461> I hope you can answer.
6462
6463All of the things you're wondering about, by the way, concerning flex
6464internals - probably the only person who understands what they mean in
6465English is me!  So I wouldn't worry too much about getting them right.
6466That said ...
6467
6468> #: main.c:545
6469> msgid "  %d protos created\n"
6470>
6471> Does proto mean prototype?
6472
6473Yes - prototypes of state compression tables.
6474
6475> #: main.c:539
6476> msgid "  %d/%d (peak %d) template nxt-chk entries created\n"
6477>
6478> Here I'm mainly puzzled by 'nxt-chk'. I guess it means 'next-check'. (?)
6479> However, 'template next-check entries' doesn't make much sense to me. To be
6480> able to find a good translation I need to know a little bit more about it.
6481
6482There is a scheme in the Aho/Sethi/Ullman compiler book for compressing
6483scanner tables.  It involves creating two pairs of tables.  The first has
6484"base" and "default" entries, the second has "next" and "check" entries.
6485The "base" entry is indexed by the current state and yields an index into
6486the next/check table.  The "default" entry gives what to do if the state
6487transition isn't found in next/check.  The "next" entry gives the next
6488state to enter, but only if the "check" entry verifies that this entry is
6489correct for the current state.  Flex creates templates of series of
6490next/check entries and then encodes differences from these templates as a
6491way to compress the tables.
6492
6493> #: main.c:533
6494> msgid "  %d/%d base-def entries created\n"
6495>
6496> The same problem here for 'base-def'.
6497
6498See above.
6499
6500		Vern
6501@end verbatim
6502@end example
6503
6504@c TODO: Evaluate this faq.
6505@node unput() messes up yy_at_bol
6506@unnumberedsec unput() messes up yy_at_bol
6507@example
6508@verbatim
6509To: Xinying Li <xli@npac.syr.edu>
6510Subject: Re: FLEX ?
6511In-reply-to: Your message of Wed, 13 Nov 1996 17:28:38 PST.
6512Date: Wed, 13 Nov 1996 19:51:54 PST
6513From: Vern Paxson <vern>
6514
6515> "unput()" them to input flow, question occurs. If I do this after I scan
6516> a carriage, the variable "YY_CURRENT_BUFFER->yy_at_bol" is changed. That
6517> means the carriage flag has gone.
6518
6519You can control this by calling yy_set_bol().  It's described in the manual.
6520
6521>      And if in pre-reading it goes to the end of file, is anything done
6522> to control the end of curren buffer and end of file?
6523
6524No, there's no way to put back an end-of-file.
6525
6526>      By the way I am using flex 2.5.2 and using the "-l".
6527
6528The latest release is 2.5.4, by the way.  It fixes some bugs in 2.5.2 and
65292.5.3.  You can get it from ftp.ee.lbl.gov.
6530
6531		Vern
6532@end verbatim
6533@end example
6534
6535@c TODO: Evaluate this faq.
6536@node The | operator is not doing what I want
6537@unnumberedsec The | operator is not doing what I want
6538@example
6539@verbatim
6540To: Alain.ISSARD@st.com
6541Subject: Re: Start condition with FLEX
6542In-reply-to: Your message of Mon, 18 Nov 1996 09:45:02 PST.
6543Date: Mon, 18 Nov 1996 10:41:34 PST
6544From: Vern Paxson <vern>
6545
6546> I am not able to use the start condition scope and to use the | (OR) with
6547> rules having start conditions.
6548
6549The problem is that if you use '|' as a regular expression operator, for
6550example "a|b" meaning "match either 'a' or 'b'", then it must *not* have
6551any blanks around it.  If you instead want the special '|' *action* (which
6552from your scanner appears to be the case), which is a way of giving two
6553different rules the same action:
6554
6555	foo	|
6556	bar	matched_foo_or_bar();
6557
6558then '|' *must* be separated from the first rule by whitespace and *must*
6559be followed by a new line.  You *cannot* write it as:
6560
6561	foo | bar	matched_foo_or_bar();
6562
6563even though you might think you could because yacc supports this syntax.
The reason for this unfortunate incompatibility is historical, but it's
unlikely to be changed.
6566
6567Your problems with start condition scope are simply due to syntax errors
6568from your use of '|' later confusing flex.
6569
6570Let me know if you still have problems.
6571
6572		Vern
6573@end verbatim
6574@end example
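
The two uses of @samp{|} described in the message above can be seen side
by side in the following sketch.  The patterns and messages are invented
for the example.

@example
@verbatim
%option noyywrap
%{
#include <stdio.h>
%}
%%
cat|dog     printf("pet\n");      /* '|' inside a pattern: no blanks */
red         |
green       printf("colour\n");   /* '|' action: red shares this action */
.|\n        ;
%%
int main(void)
{
    yylex();
    return 0;
}
@end verbatim
@end example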
6575
6576@c TODO: Evaluate this faq.
6577@node Why can't flex understand this variable trailing context pattern?
6578@unnumberedsec Why can't flex understand this variable trailing context pattern?
6579@example
6580@verbatim
6581To: Gregory Margo <gmargo@newton.vip.best.com>
6582Subject: Re: flex-2.5.3 bug report
6583In-reply-to: Your message of Sat, 23 Nov 1996 16:50:09 PST.
6584Date: Sat, 23 Nov 1996 17:07:32 PST
6585From: Vern Paxson <vern>
6586
6587> Enclosed is a lex file that "real" lex will process, but I cannot get
6588> flex to process it.  Could you try it and maybe point me in the right direction?
6589
6590Your problem is that some of the definitions in the scanner use the '/'
6591trailing context operator, and have it enclosed in ()'s.  Flex does not
6592allow this operator to be enclosed in ()'s because doing so allows undefined
6593regular expressions such as "(a/b)+".  So the solution is to remove the
6594parentheses.  Note that you must also be building the scanner with the -l
6595option for AT&T lex compatibility.  Without this option, flex automatically
6596encloses the definitions in parentheses.
6597
6598		Vern
6599@end verbatim
6600@end example
6601
6602@c TODO: Evaluate this faq.
6603@node The ^ operator isn't working
6604@unnumberedsec The ^ operator isn't working
6605@example
6606@verbatim
6607To: Thomas Hadig <hadig@toots.physik.rwth-aachen.de>
6608Subject: Re: Flex Bug ?
6609In-reply-to: Your message of Tue, 26 Nov 1996 14:35:01 PST.
6610Date: Tue, 26 Nov 1996 11:15:05 PST
6611From: Vern Paxson <vern>
6612
6613> In my lexer code, i have the line :
6614> ^\*.*          { }
6615>
6616> Thus all lines starting with an astrix (*) are comment lines.
6617> This does not work !
6618
6619I can't get this problem to reproduce - it works fine for me.  Note
6620though that if what you have is slightly different:
6621
6622	COMMENT	^\*.*
6623	%%
6624	{COMMENT}	{ }
6625
6626then it won't work, because flex pushes back macro definitions enclosed
6627in ()'s, so the rule becomes
6628
6629	(^\*.*)		{ }
6630
6631and now that the '^' operator is not at the immediate beginning of the
6632line, it's interpreted as just a regular character.  You can avoid this
6633behavior by using the "-l" lex-compatibility flag, or "%option lex-compat".
6634
6635		Vern
6636@end verbatim
6637@end example
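
If you would rather keep the definition and stay in flex's default mode, one
option is to take the @samp{^} out of the definition and write it in the
rule, where it is still the first character of the pattern.  A minimal
sketch, assuming a hypothetical definition name @code{COMMENT_BODY}:

@example
@verbatim
%option main
COMMENT_BODY  \*.*
%%
^{COMMENT_BODY}   { /* discard lines that start with an asterisk */ }
.|\n              ECHO;
@end verbatim
@end example

Here the definition is expanded inside parentheses as usual, but the anchor
sits outside them at the start of the rule.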
6638
6639@c TODO: Evaluate this faq.
6640@node Trailing context is getting confused with trailing optional patterns
6641@unnumberedsec Trailing context is getting confused with trailing optional patterns
6642@example
6643@verbatim
6644To: Adoram Rogel <adoram@hybridge.com>
6645Subject: Re: Flex 2.5.4 BOF ???
6646In-reply-to: Your message of Tue, 26 Nov 1996 16:10:41 PST.
6647Date: Wed, 27 Nov 1996 10:56:25 PST
6648From: Vern Paxson <vern>
6649
6650>     Organization(s)?/[a-z]
6651>
6652> This matched "Organizations" (looking in debug mode, the trailing s
6653> was matched with trailing context instead of the optional (s) in the
6654> end of the word.
6655
6656That should only happen with lex.  Flex can properly match this pattern.
6657(That might be what you're saying, I'm just not sure.)
6658
6659> Is there a way to avoid this dangerous trailing context problem ?
6660
6661Unfortunately, there's no easy way.  On the other hand, I don't see why
6662it should be a problem.  Lex's matching is clearly wrong, and I'd hope
6663that usually the intent remains the same as expressed with the pattern,
6664so flex's matching will be correct.
6665
6666		Vern
6667@end verbatim
6668@end example
6669
6670@c TODO: Evaluate this faq.
6671@node Is flex GNU or not?
6672@unnumberedsec Is flex GNU or not?
6673@example
6674@verbatim
6675To: Cameron MacKinnon <mackin@interlog.com>
6676Subject: Re: Flex documentation bug
6677In-reply-to: Your message of Mon, 02 Dec 1996 00:07:08 PST.
6678Date: Sun, 01 Dec 1996 22:29:39 PST
6679From: Vern Paxson <vern>
6680
6681> I'm not sure how or where to submit bug reports (documentation or
6682> otherwise) for the GNU project stuff ...
6683
6684Well, strictly speaking flex isn't part of the GNU project.  They just
6685distribute it because no one's written a decent GPL'd lex replacement.
6686So you should send bugs directly to me.  Those sent to the GNU folks
sometimes find their way to me, but some may drop between the cracks.
6688
6689> In GNU Info, under the section 'Start Conditions', and also in the man
6690> page (mine's dated April '95) is a nice little snippet showing how to
6691> parse C quoted strings into a buffer, defined to be MAX_STR_CONST in
6692> size. Unfortunately, no overflow checking is ever done ...
6693
6694This is already mentioned in the manual:
6695
6696Finally, here's an example of how to  match  C-style  quoted
6697strings using exclusive start conditions, including expanded
6698escape sequences (but not including checking  for  a  string
6699that's too long):
6700
6701The reason for not doing the overflow checking is that it will needlessly
6702clutter up an example whose main purpose is just to demonstrate how to
6703use flex.
6704
6705The latest release is 2.5.4, by the way, available from ftp.ee.lbl.gov.
6706
6707		Vern
6708@end verbatim
6709@end example
6710
6711@c TODO: Evaluate this faq.
6712@node ERASEME53
6713@unnumberedsec ERASEME53
6714@example
6715@verbatim
6716To: tsv@cs.UManitoba.CA
6717Subject: Re: Flex (reg)..
6718In-reply-to: Your message of Thu, 06 Mar 1997 23:50:16 PST.
6719Date: Thu, 06 Mar 1997 15:54:19 PST
6720From: Vern Paxson <vern>
6721
6722> [:alpha:] ([:alnum:] | \\_)*
6723
6724If your rule really has embedded blanks as shown above, then it won't
6725work, as the first blank delimits the rule from the action.  (It wouldn't
6726even compile ...)  You need instead:
6727
6728[:alpha:]([:alnum:]|\\_)*
6729
6730and that should work fine - there's no restriction on what can go inside
6731of ()'s except for the trailing context operator, '/'.
6732
6733		Vern
6734@end verbatim
6735@end example
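
Two further points are worth checking in a pattern like this one.  Flex
recognizes character class expressions such as @code{[:alpha:]} only inside
a bracket expression, so they need a second pair of brackets.  Also,
@samp{\\_} matches a backslash followed by an underscore, which is probably
not what an identifier rule intends.  Assuming the goal was an identifier
that starts with a letter and continues with letters, digits or underscores,
a sketch might look like this:

@example
@verbatim
%option main
%%
[[:alpha:]]([[:alnum:]]|_)*   printf("identifier: %s\n", yytext);
.|\n                          ;  /* ignore everything else */
@end verbatim
@end example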
6736
6737@c TODO: Evaluate this faq.
6738@node I need to scan if-then-else blocks and while loops
6739@unnumberedsec I need to scan if-then-else blocks and while loops
6740@example
6741@verbatim
6742To: "Mike Stolnicki" <mstolnic@ford.com>
6743Subject: Re: FLEX help
6744In-reply-to: Your message of Fri, 30 May 1997 13:33:27 PDT.
6745Date: Fri, 30 May 1997 10:46:35 PDT
6746From: Vern Paxson <vern>
6747
6748> We'd like to add "if-then-else", "while", and "for" statements to our
6749> language ...
6750> We've investigated many possible solutions.  The one solution that seems
6751> the most reasonable involves knowing the position of a TOKEN in yyin.
6752
6753I strongly advise you to instead build a parse tree (abstract syntax tree)
6754and loop over that instead.  You'll find this has major benefits in keeping
6755your interpreter simple and extensible.
6756
6757That said, the functionality you mention for get_position and set_position
has been on the to-do list for a while.  As flex is a purely spare-time
6759project for me, no guarantees when this will be added (in particular, it
6760for sure won't be for many months to come).
6761
6762		Vern
6763@end verbatim
6764@end example
6765
6766@c TODO: Evaluate this faq.
6767@node ERASEME55
6768@unnumberedsec ERASEME55
6769@example
6770@verbatim
6771To: Colin Paul Adams <colin@colina.demon.co.uk>
6772Subject: Re: Flex C++ classes and Bison
6773In-reply-to: Your message of 09 Aug 1997 17:11:41 PDT.
6774Date: Fri, 15 Aug 1997 10:48:19 PDT
6775From: Vern Paxson <vern>
6776
6777> #define YY_DECL   int yylex (YYSTYPE *lvalp, struct parser_control
6778> *parm)
6779>
6780> I have been trying  to get this to work as a C++ scanner, but it does
6781> not appear to be possible (warning that it matches no declarations in
6782> yyFlexLexer, or something like that).
6783>
6784> Is this supposed to be possible, or is it being worked on (I DID
6785> notice the comment that scanner classes are still experimental, so I'm
6786> not too hopeful)?
6787
6788What you need to do is derive a subclass from yyFlexLexer that provides
6789the above yylex() method, squirrels away lvalp and parm into member
6790variables, and then invokes yyFlexLexer::yylex() to do the regular scanning.
6791
6792		Vern
6793@end verbatim
6794@end example
6795
6796@c TODO: Evaluate this faq.
6797@node ERASEME56
6798@unnumberedsec ERASEME56
6799@example
6800@verbatim
6801To: Mikael.Latvala@lmf.ericsson.se
6802Subject: Re: Possible mistake in Flex v2.5 document
6803In-reply-to: Your message of Fri, 05 Sep 1997 16:07:24 PDT.
6804Date: Fri, 05 Sep 1997 10:01:54 PDT
6805From: Vern Paxson <vern>
6806
6807> In that example you show how to count comment lines when using
6808> C style /* ... */ comments. My question is, shouldn't you take into
6809> account a scenario where end of a comment marker occurs inside
6810> character or string literals?
6811
6812The scanner certainly needs to also scan character and string literals.
6813However it does that (there's an example in the man page for strings), the
6814lexer will recognize the beginning of the literal before it runs across the
6815embedded "/*".  Consequently, it will finish scanning the literal before it
6816even considers the possibility of matching "/*".
6817
6818Example:
6819
6820	'([^']*|{ESCAPE_SEQUENCE})'
6821
6822will match all the text between the ''s (inclusive).  So the lexer
6823considers this as a token beginning at the first ', and doesn't even
6824attempt to match other tokens inside it.
6825
I think this subtlety is not worth putting in the manual, as I suspect
6827it would confuse more people than it would enlighten.
6828
6829		Vern
6830@end verbatim
6831@end example
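
To make the point concrete, here is a small sketch of a scanner that handles
both string literals and comments.  It is not the manual's example: the
start condition name @code{IN_COMMENT} and the string pattern (which does
not allow a string to span lines) are assumptions chosen for brevity.

@example
@verbatim
%option main
%x IN_COMMENT
%%
\"([^\"\\\n]|\\.)*\"   printf("string: %s\n", yytext);
"/*"                   BEGIN(IN_COMMENT);
<IN_COMMENT>"*/"       BEGIN(INITIAL);
<IN_COMMENT>.|\n       ;  /* skip comment text */
.|\n                   ECHO;
@end verbatim
@end example

Because the string rule consumes the whole literal in one match, a comment
delimiter inside the quotes is never offered to the comment rules.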
6832
6833@c TODO: Evaluate this faq.
6834@node ERASEME57
6835@unnumberedsec ERASEME57
6836@example
6837@verbatim
6838To: "Marty Leisner" <leisner@sdsp.mc.xerox.com>
6839Subject: Re: flex limitations
6840In-reply-to: Your message of Sat, 06 Sep 1997 11:27:21 PDT.
6841Date: Mon, 08 Sep 1997 11:38:08 PDT
6842From: Vern Paxson <vern>
6843
6844> %%
6845> [a-zA-Z]+       /* skip a line */
6846>                 {  printf("got %s\n", yytext); }
6847> %%
6848
6849What version of flex are you using?  If I feed this to 2.5.4, it complains:
6850
6851	"bug.l", line 5: EOF encountered inside an action
6852	"bug.l", line 5: unrecognized rule
6853	"bug.l", line 5: fatal parse error
6854
6855Not the world's greatest error message, but it manages to flag the problem.
6856
6857(With the introduction of start condition scopes, flex can't accommodate
6858an action on a separate line, since it's ambiguous with an indented rule.)
6859
6860You can get 2.5.4 from ftp.ee.lbl.gov.
6861
6862		Vern
6863@end verbatim
6864@end example
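
The usual way to lay out a multi-line action is to open its brace on the
same line as the pattern.  A minimal sketch of the rule from the report,
rewritten that way:

@example
@verbatim
%option main
%%
[a-zA-Z]+   {  /* the brace opens on the pattern's own line */
                printf("got %s\n", yytext);
            }
.|\n        ;
@end verbatim
@end example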
6865
6866@c TODO: Evaluate this faq.
6867@node Is there a repository for flex scanners?
6868@unnumberedsec Is there a repository for flex scanners?
6869
6870Not that we know of. You might try asking on comp.compilers.
6871
6872@c TODO: Evaluate this faq.
6873@node How can I conditionally compile or preprocess my flex input file?
6874@unnumberedsec How can I conditionally compile or preprocess my flex input file?
6875
6876
6877Flex doesn't have a preprocessor like C does.  You might try using m4, or the C
6878preprocessor plus a sed script to clean up the result.
6879
6880
6881@c TODO: Evaluate this faq.
6882@node Where can I find grammars for lex and yacc?
6883@unnumberedsec Where can I find grammars for lex and yacc?
6884
6885In the sources for flex and bison.
6886
6887@c TODO: Evaluate this faq.
6888@node I get an end-of-buffer message for each character scanned.
6889@unnumberedsec I get an end-of-buffer message for each character scanned.
6890
6891This will happen if your LexerInput() function returns only one character
at a time, which can happen either if your scanner is "interactive", or
6893if the streams library on your platform always returns 1 for yyin->gcount().
6894
6895Solution: override LexerInput() with a version that returns whole buffers.
6896
6897@c TODO: Evaluate this faq.
6898@node unnamed-faq-62
6899@unnumberedsec unnamed-faq-62
6900@example
6901@verbatim
6902To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE
6903Subject: Re: Flex maximums
6904In-reply-to: Your message of Mon, 17 Nov 1997 17:16:06 PST.
6905Date: Mon, 17 Nov 1997 17:16:15 PST
6906From: Vern Paxson <vern>
6907
6908> I took a quick look into the flex-sources and altered some #defines in
6909> flexdefs.h:
6910>
6911> 	#define INITIAL_MNS 64000
6912> 	#define MNS_INCREMENT 1024000
6913> 	#define MAXIMUM_MNS 64000
6914
6915The things to fix are to add a couple of zeroes to:
6916
6917#define JAMSTATE -32766 /* marks a reference to the state that always jams */
6918#define MAXIMUM_MNS 31999
6919#define BAD_SUBSCRIPT -32767
6920#define MAX_SHORT 32700
6921
6922and, if you get complaints about too many rules, make the following change too:
6923
6924	#define YY_TRAILING_MASK 0x200000
6925	#define YY_TRAILING_HEAD_MASK 0x400000
6926
6927- Vern
6928@end verbatim
6929@end example
6930
6931@c TODO: Evaluate this faq.
6932@node unnamed-faq-63
6933@unnumberedsec unnamed-faq-63
6934@example
6935@verbatim
6936To: jimmey@lexis-nexis.com (Jimmey Todd)
6937Subject: Re: FLEX question regarding istream vs ifstream
6938In-reply-to: Your message of Mon, 08 Dec 1997 15:54:15 PST.
6939Date: Mon, 15 Dec 1997 13:21:35 PST
6940From: Vern Paxson <vern>
6941
6942>         stdin_handle = YY_CURRENT_BUFFER;
6943>         ifstream fin( "aFile" );
6944>         yy_switch_to_buffer( yy_create_buffer( fin, YY_BUF_SIZE ) );
6945>
6946> What I'm wanting to do, is pass the contents of a file thru one set
6947> of rules and then pass stdin thru another set... It works great if, I
6948> don't use the C++ classes. But since everything else that I'm doing is
6949> in C++, I thought I'd be consistent.
6950>
> The problem is that 'yy_create_buffer' is expecting an istream* as its
6952> first argument (as stated in the man page). However, fin is a ifstream
6953> object. Any ideas on what I might be doing wrong? Any help would be
6954> appreciated. Thanks!!
6955
6956You need to pass &fin, to turn it into an ifstream* instead of an ifstream.
6957Then its type will be compatible with the expected istream*, because ifstream
6958is derived from istream.
6959
6960		Vern
6961@end verbatim
6962@end example
6963
6964@c TODO: Evaluate this faq.
6965@node unnamed-faq-64
6966@unnumberedsec unnamed-faq-64
6967@example
6968@verbatim
6969To: Enda Fadian <fadiane@piercom.ie>
6970Subject: Re: Question related to Flex man page?
6971In-reply-to: Your message of Tue, 16 Dec 1997 15:17:34 PST.
6972Date: Tue, 16 Dec 1997 14:17:09 PST
6973From: Vern Paxson <vern>
6974
> Can you explain to me what is meant by a long-jump in relation to flex?
6976
6977Using the longjmp() function while inside yylex() or a routine called by it.
6978
6979> what is the flex activation frame.
6980
6981Just yylex()'s stack frame.
6982
> As far as I can see yyrestart will bring me back to the start of the input
> file and using flex++ is not really an option!
6985
6986No, yyrestart() doesn't imply a rewind, even though its name might sound
6987like it does.  It tells the scanner to flush its internal buffers and
6988start reading from the given file at its present location.
6989
6990		Vern
6991@end verbatim
6992@end example
6993
6994@c TODO: Evaluate this faq.
6995@node unnamed-faq-65
6996@unnumberedsec unnamed-faq-65
6997@example
6998@verbatim
6999To: hassan@larc.info.uqam.ca (Hassan Alaoui)
7000Subject: Re: Need urgent Help
7001In-reply-to: Your message of Sat, 20 Dec 1997 19:38:19 PST.
7002Date: Sun, 21 Dec 1997 21:30:46 PST
7003From: Vern Paxson <vern>
7004
7005> /usr/lib/yaccpar: In function `int yyparse()':
7006> /usr/lib/yaccpar:184: warning: implicit declaration of function `int yylex(...)'
7007>
7008> ld: Undefined symbol
7009>    _yylex
7010>    _yyparse
7011>    _yyin
7012
7013This is a known problem with Solaris C++ (and/or Solaris yacc).  I believe
7014the fix is to explicitly insert some 'extern "C"' statements for the
7015corresponding routines/symbols.
7016
7017		Vern
7018@end verbatim
7019@end example
7020
7021@c TODO: Evaluate this faq.
7022@node unnamed-faq-66
7023@unnumberedsec unnamed-faq-66
7024@example
7025@verbatim
7026To: mc0307@mclink.it
7027Cc: gnu@prep.ai.mit.edu
7028Subject: Re: [mc0307@mclink.it: Help request]
7029In-reply-to: Your message of Fri, 12 Dec 1997 17:57:29 PST.
7030Date: Sun, 21 Dec 1997 22:33:37 PST
7031From: Vern Paxson <vern>
7032
7033> This is my definition for float and integer types:
7034> . . .
7035> NZD          [1-9]
7036> ...
7037> I've tested my program on other lex version (on UNIX Sun Solaris an HP
7038> UNIX) and it work well, so I think that my definitions are correct.
7039> There are any differences between Lex and Flex?
7040
7041There are indeed differences, as discussed in the man page.  The one
7042you are probably running into is that when flex expands a name definition,
7043it puts parentheses around the expansion, while lex does not.  There's
7044an example in the man page of how this can lead to different matching.
7045Flex's behavior complies with the POSIX standard (or at least with the
7046last POSIX draft I saw).
7047
7048		Vern
7049@end verbatim
7050@end example
7051
7052@c TODO: Evaluate this faq.
7053@node unnamed-faq-67
7054@unnumberedsec unnamed-faq-67
7055@example
7056@verbatim
7057To: hassan@larc.info.uqam.ca (Hassan Alaoui)
7058Subject: Re: Thanks
7059In-reply-to: Your message of Mon, 22 Dec 1997 16:06:35 PST.
7060Date: Mon, 22 Dec 1997 14:35:05 PST
7061From: Vern Paxson <vern>
7062
7063> Thank you very much for your help. I compile and link well with C++ while
7064> declaring 'yylex ...' extern, But a little problem remains. I get a
7065> segmentation default when executing ( I linked with lfl library) while it
7066> works well when using LEX instead of flex. Do you have some ideas about the
7067> reason for this ?
7068
7069The one possible reason for this that comes to mind is if you've defined
7070yytext as "extern char yytext[]" (which is what lex uses) instead of
7071"extern char *yytext" (which is what flex uses).  If it's not that, then
7072I'm afraid I don't know what the problem might be.
7073
7074		Vern
7075@end verbatim
7076@end example
7077
7078@c TODO: Evaluate this faq.
7079@node unnamed-faq-68
7080@unnumberedsec unnamed-faq-68
7081@example
7082@verbatim
7083To: "Bart Niswonger" <NISWONGR@almaden.ibm.com>
7084Subject: Re: flex 2.5: c++ scanners & start conditions
7085In-reply-to: Your message of Tue, 06 Jan 1998 10:34:21 PST.
7086Date: Tue, 06 Jan 1998 19:19:30 PST
7087From: Vern Paxson <vern>
7088
7089> The problem is that when I do this (using %option c++) start
7090> conditions seem to not apply.
7091
7092The BEGIN macro modifies the yy_start variable.  For C scanners, this
7093is a static with scope visible through the whole file.  For C++ scanners,
7094it's a member variable, so it only has visible scope within a member
7095function.  Your lexbegin() routine is not a member function when you
7096build a C++ scanner, so it's not modifying the correct yy_start.  The
7097diagnostic that indicates this is that you found you needed to add
7098a declaration of yy_start in order to get your scanner to compile when
7099using C++; instead, the correct fix is to make lexbegin() a member
7100function (by deriving from yyFlexLexer).
7101
7102		Vern
7103@end verbatim
7104@end example
7105
7106@c TODO: Evaluate this faq.
7107@node unnamed-faq-69
7108@unnumberedsec unnamed-faq-69
7109@example
7110@verbatim
7111To: "Boris Zinin" <boris@ippe.rssi.ru>
7112Subject: Re: current position in flex buffer
7113In-reply-to: Your message of Mon, 12 Jan 1998 18:58:23 PST.
7114Date: Mon, 12 Jan 1998 12:03:15 PST
7115From: Vern Paxson <vern>
7116
7117> The problem is how to determine the current position in flex active
7118> buffer when a rule is matched....
7119
7120You will need to keep track of this explicitly, such as by redefining
7121YY_USER_ACTION to count the number of characters matched.
7122
7123The latest flex release, by the way, is 2.5.4, available from ftp.ee.lbl.gov.
7124
7125		Vern
7126@end verbatim
7127@end example
7128
7129@c TODO: Evaluate this faq.
7130@node unnamed-faq-70
7131@unnumberedsec unnamed-faq-70
7132@example
7133@verbatim
7134To: Bik.Dhaliwal@bis.org
7135Subject: Re: Flex question
7136In-reply-to: Your message of Mon, 26 Jan 1998 13:05:35 PST.
7137Date: Tue, 27 Jan 1998 22:41:52 PST
7138From: Vern Paxson <vern>
7139
7140> That requirement involves knowing
7141> the character position at which a particular token was matched
7142> in the lexer.
7143
7144The way you have to do this is by explicitly keeping track of where
7145you are in the file, by counting the number of characters scanned
7146for each token (available in yyleng).  It may prove convenient to
7147do this by redefining YY_USER_ACTION, as described in the manual.
7148
7149		Vern
7150@end verbatim
7151@end example
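
The bookkeeping described above can be sketched as follows.  The variable
names @code{token_start} and @code{next_offset} are hypothetical, and the
sketch assumes that the actions do not use @code{yymore()},
@code{yyless()}, @code{REJECT}, @code{unput()} or @code{input()}, all of
which change how much input a token really consumes.

@example
@verbatim
%{
/* token_start and next_offset are hypothetical names for this sketch. */
static long token_start = 0;   /* offset of the current token */
static long next_offset = 0;   /* offset just past the current token */
#define YY_USER_ACTION \
    token_start = next_offset; next_offset += yyleng;
%}
%option main
%%
[a-zA-Z]+   printf("%ld: %s\n", token_start, yytext);
.|\n        ;  /* still counted by YY_USER_ACTION */
@end verbatim
@end example

Since @code{YY_USER_ACTION} runs before every matched rule's action, the
skipped characters are counted as well, so the offsets stay in step with the
input.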
7152
7153@c TODO: Evaluate this faq.
7154@node unnamed-faq-71
7155@unnumberedsec unnamed-faq-71
7156@example
7157@verbatim
7158To: Vladimir Alexiev <vladimir@cs.ualberta.ca>
7159Subject: Re: flex: how to control start condition from parser?
7160In-reply-to: Your message of Mon, 26 Jan 1998 05:50:16 PST.
7161Date: Tue, 27 Jan 1998 22:45:37 PST
7162From: Vern Paxson <vern>
7163
7164> It seems useful for the parser to be able to tell the lexer about such
7165> context dependencies, because then they don't have to be limited to
7166> local or sequential context.
7167
One way to do this is to have the parser call a stub routine that's
included in the scanner's .l file, and consequently that has access to
BEGIN.  The only ugliness is that the parser can't pass in the state
it wants, because those aren't visible - but if you don't have many
such states, then using a different set of names doesn't seem like
too much of a burden.

While generating a .h file like you suggest is certainly cleaner,
7176flex development has come to a virtual stand-still :-(, so a workaround
7177like the above is much more pragmatic than waiting for a new feature.
7178
7179		Vern
7180@end verbatim
7181@end example
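
A minimal sketch of such stub routines, for a non-reentrant C scanner, is
shown below.  The start condition @code{RAW} and the function names
@code{scanner_begin_raw} and @code{scanner_begin_normal} are hypothetical;
the parser would declare and call the two functions, and would supply
@code{main()}.

@example
@verbatim
%option noyywrap
%x RAW
%%
<RAW>[^\n]+   printf("raw: %s\n", yytext);
<RAW>\n       ;
[a-z]+        printf("word: %s\n", yytext);
.|\n          ;
%%
/* Stubs for the parser to call; it never sees RAW itself. */
void scanner_begin_raw(void)    { BEGIN(RAW); }
void scanner_begin_normal(void) { BEGIN(INITIAL); }
@end verbatim
@end example

The functions live in the user code section of the .l file, so
@code{BEGIN} and the start condition names are in scope there.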
7182
7183@c TODO: Evaluate this faq.
7184@node unnamed-faq-72
7185@unnumberedsec unnamed-faq-72
7186@example
7187@verbatim
7188To: Barbara Denny <denny@3com.com>
7189Subject: Re: freebsd flex bug?
7190In-reply-to: Your message of Fri, 30 Jan 1998 12:00:43 PST.
7191Date: Fri, 30 Jan 1998 12:42:32 PST
7192From: Vern Paxson <vern>
7193
7194> lex.yy.c:1996: parse error before `='
7195
7196This is the key, identifying this error.  (It may help to pinpoint
7197it by using flex -L, so it doesn't generate #line directives in its
7198output.)  I will bet you heavy money that you have a start condition
7199name that is also a variable name, or something like that; flex spits
7200out #define's for each start condition name, mapping them to a number,
7201so you can wind up with:
7202
7203	%x foo
7204	%%
7205		...
7206	%%
7207	void bar()
7208		{
7209		int foo = 3;
7210		}
7211
7212and the penultimate will turn into "int 1 = 3" after C preprocessing,
7213since flex will put "#define foo 1" in the generated scanner.
7214
7215		Vern
7216@end verbatim
7217@end example
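
One way to stay clear of this kind of clash is to give start conditions a
prefix that ordinary identifiers never use.  The @code{SC_} prefix below is
just a naming convention assumed for this sketch, not something flex
requires:

@example
@verbatim
%option main
%x SC_FOO
%%
"["             BEGIN(SC_FOO);
<SC_FOO>"]"     BEGIN(INITIAL);
<SC_FOO>.|\n    ;
.|\n            ECHO;
%%
void bar(void)
{
    int foo = 3;   /* fine: the generated #define is for SC_FOO, not foo */
    (void) foo;
}
@end verbatim
@end example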
7218
7219@c TODO: Evaluate this faq.
7220@node unnamed-faq-73
7221@unnumberedsec unnamed-faq-73
7222@example
7223@verbatim
7224To: Maurice Petrie <mpetrie@infoscigroup.com>
7225Subject: Re: Lost flex .l file
7226In-reply-to: Your message of Mon, 02 Feb 1998 14:10:01 PST.
7227Date: Mon, 02 Feb 1998 11:15:12 PST
7228From: Vern Paxson <vern>
7229
7230> I am curious as to
7231> whether there is a simple way to backtrack from the generated source to
7232> reproduce the lost list of tokens we are searching on.
7233
7234In theory, it's straight-forward to go from the DFA representation
7235back to a regular-expression representation - the two are isomorphic.
7236In practice, a huge headache, because you have to unpack all the tables
7237back into a single DFA representation, and then write a program to munch
7238on that and translate it into an RE.
7239
7240Sorry for the less-than-happy news ...
7241
7242		Vern
7243@end verbatim
7244@end example
7245
7246@c TODO: Evaluate this faq.
7247@node unnamed-faq-74
7248@unnumberedsec unnamed-faq-74
7249@example
7250@verbatim
7251To: jimmey@lexis-nexis.com (Jimmey Todd)
7252Subject: Re: Flex performance question
7253In-reply-to: Your message of Thu, 19 Feb 1998 11:01:17 PST.
7254Date: Thu, 19 Feb 1998 08:48:51 PST
7255From: Vern Paxson <vern>
7256
7257> What I have found, is that the smaller the data chunk, the faster the
7258> program executes. This is the opposite of what I expected. Should this be
7259> happening this way?
7260
7261This is exactly what will happen if your input file has embedded NULs.
7262From the man page:
7263
7264A final note: flex is slow when matching NUL's, particularly
7265when  a  token  contains multiple NUL's.  It's best to write
7266rules which match short amounts of text if it's  anticipated
7267that the text will often include NUL's.
7268
7269So that's the first thing to look for.
7270
7271		Vern
7272@end verbatim
7273@end example
7274
7275@c TODO: Evaluate this faq.
7276@node unnamed-faq-75
7277@unnumberedsec unnamed-faq-75
7278@example
7279@verbatim
7280To: jimmey@lexis-nexis.com (Jimmey Todd)
7281Subject: Re: Flex performance question
7282In-reply-to: Your message of Thu, 19 Feb 1998 11:01:17 PST.
7283Date: Thu, 19 Feb 1998 15:42:25 PST
7284From: Vern Paxson <vern>
7285
7286So there are several problems.
7287
7288First, to go fast, you want to match as much text as possible, which
7289your scanners don't in the case that what they're scanning is *not*
7290a <RN> tag.  So you want a rule like:
7291
7292	[^<]+
7293
7294Second, C++ scanners are particularly slow if they're interactive,
7295which they are by default.  Using -B speeds it up by a factor of 3-4
7296on my workstation.
7297
7298Third, C++ scanners that use the istream interface are slow, because
7299of how poorly implemented istream's are.  I built two versions of
7300the following scanner:
7301
7302	%%
7303	.*\n
7304	.*
7305	%%
7306
7307and the C version inhales a 2.5MB file on my workstation in 0.8 seconds.
7308The C++ istream version, using -B, takes 3.8 seconds.
7309
7310		Vern
7311@end verbatim
7312@end example
7313
7314@c TODO: Evaluate this faq.
7315@node unnamed-faq-76
7316@unnumberedsec unnamed-faq-76
7317@example
7318@verbatim
7319To: "Frescatore, David (CRD, TAD)" <frescatore@exc01crdge.crd.ge.com>
7320Subject: Re: FLEX 2.5 & THE YEAR 2000
7321In-reply-to: Your message of Wed, 03 Jun 1998 11:26:22 PDT.
7322Date: Wed, 03 Jun 1998 10:22:26 PDT
7323From: Vern Paxson <vern>
7324
7325> I am researching the Y2K problem with General Electric R&D
7326> and need to know if there are any known issues concerning
7327> the above mentioned software and Y2K regardless of version.
7328
7329There shouldn't be, all it ever does with the date is ask the system
7330for it and then print it out.
7331
7332		Vern
7333@end verbatim
7334@end example
7335
7336@c TODO: Evaluate this faq.
7337@node unnamed-faq-77
7338@unnumberedsec unnamed-faq-77
7339@example
7340@verbatim
7341To: "Hans Dermot Doran" <htd@ibhdoran.com>
7342Subject: Re: flex problem
7343In-reply-to: Your message of Wed, 15 Jul 1998 21:30:13 PDT.
7344Date: Tue, 21 Jul 1998 14:23:34 PDT
7345From: Vern Paxson <vern>
7346
7347> To overcome this, I gets() the stdin into a string and lex the string. The
7348> string is lexed OK except that the end of string isn't lexed properly
> (yy_scan_string()), that is the lexer doesn't recognise the end of string.
7350
7351Flex doesn't contain mechanisms for recognizing buffer endpoints.  But if
7352you use fgets instead (which you should anyway, to protect against buffer
7353overflows), then the final \n will be preserved in the string, and you can
7354scan that in order to find the end of the string.
7355
7356		Vern
7357@end verbatim
7358@end example
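
A minimal sketch of this approach is shown below.  The rules and the
256-byte line buffer are arbitrary choices for illustration; a longer input
line would simply be scanned in several pieces, and only the last piece
would end with a newline.

@example
@verbatim
%{
#include <stdio.h>
%}
%option noyywrap
%%
[0-9]+   printf("number: %s\n", yytext);
\n       printf("end of line\n");
.        ;
%%
int main(void)
{
    char line[256];   /* arbitrary size for this sketch */

    while (fgets(line, sizeof line, stdin) != NULL) {
        YY_BUFFER_STATE buf = yy_scan_string(line);
        yylex();                 /* returns 0 at the end of the string */
        yy_delete_buffer(buf);
    }
    return 0;
}
@end verbatim
@end example

Because @code{fgets()} keeps the trailing newline, the @samp{\n} rule fires
once per complete line and can serve as the end-of-string marker.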
7359
7360@c TODO: Evaluate this faq.
7361@node unnamed-faq-78
7362@unnumberedsec unnamed-faq-78
7363@example
7364@verbatim
7365To: soumen@almaden.ibm.com
7366Subject: Re: Flex++ 2.5.3 instance member vs. static member
7367In-reply-to: Your message of Mon, 27 Jul 1998 02:10:04 PDT.
7368Date: Tue, 28 Jul 1998 01:10:34 PDT
7369From: Vern Paxson <vern>
7370
7371> %{
7372> int mylineno = 0;
7373> %}
7374> ws      [ \t]+
7375> alpha   [A-Za-z]
7376> dig     [0-9]
7377> %%
7378>
7379> Now you'd expect mylineno to be a member of each instance of class
7380> yyFlexLexer, but is this the case?  A look at the lex.yy.cc file seems to
7381> indicate otherwise; unless I am missing something the declaration of
7382> mylineno seems to be outside any class scope.
7383>
7384> How will this work if I want to run a multi-threaded application with each
7385> thread creating a FlexLexer instance?
7386
7387Derive your own subclass and make mylineno a member variable of it.
7388
7389		Vern
7390@end verbatim
7391@end example
7392
7393@c TODO: Evaluate this faq.
7394@node unnamed-faq-79
7395@unnumberedsec unnamed-faq-79
7396@example
7397@verbatim
7398To: Adoram Rogel <adoram@hybridge.com>
7399Subject: Re: More than 32K states change hangs
7400In-reply-to: Your message of Tue, 04 Aug 1998 16:55:39 PDT.
7401Date: Tue, 04 Aug 1998 22:28:45 PDT
7402From: Vern Paxson <vern>
7403
7404> Vern Paxson,
7405>
> I followed your advice, posted on Usenet by you, and emailed to me
7407> personally by you, on how to overcome the 32K states limit. I'm running
7408> on Linux machines.
7409> I took the full source of version 2.5.4 and did the following changes in
7410> flexdef.h:
7411> #define JAMSTATE -327660
7412> #define MAXIMUM_MNS 319990
7413> #define BAD_SUBSCRIPT -327670
7414> #define MAX_SHORT 327000
7415>
7416> and compiled.
7417> All looked fine, including check and bigcheck, so I installed.
7418
7419Hmmm, you shouldn't increase MAX_SHORT, though looking through my email
7420archives I see that I did indeed recommend doing so.  Try setting it back
7421to 32700; that should suffice that you no longer need -Ca.  If it still
7422hangs, then the interesting question is - where?
7423
7424> Compiling the same hanged program with a out-of-the-box (RedHat 4.2
7425> distribution of Linux)
7426> flex 2.5.4 binary works.
7427
7428Since Linux comes with source code, you should diff it against what
7429you have to see what problems they missed.
7430
7431> Should I always compile with the -Ca option now ? even short and simple
7432> filters ?
7433
7434No, definitely not.  It's meant to be for those situations where you
7435absolutely must squeeze every last cycle out of your scanner.
7436
7437		Vern
7438@end verbatim
7439@end example
7440
7441@c TODO: Evaluate this faq.
7442@node unnamed-faq-80
7443@unnumberedsec unnamed-faq-80
7444@example
7445@verbatim
7446To: "Schmackpfeffer, Craig" <Craig.Schmackpfeffer@usa.xerox.com>
7447Subject: Re: flex output for static code portion
7448In-reply-to: Your message of Tue, 11 Aug 1998 11:55:30 PDT.
7449Date: Mon, 17 Aug 1998 23:57:42 PDT
7450From: Vern Paxson <vern>
7451
7452> I would like to use flex under the hood to generate a binary file
7453> containing the data structures that control the parse.
7454
7455This has been on the wish-list for a long time.  In principle it's
7456straight-forward - you redirect mkdata() et al's I/O to another file,
7457and modify the skeleton to have a start-up function that slurps these
7458into dynamic arrays.  The concerns are (1) the scanner generation code
7459is hairy and full of corner cases, so it's easy to get surprised when
7460going down this path :-( ; and (2) being careful about buffering so
7461that when the tables change you make sure the scanner starts in the
7462correct state and reading at the right point in the input file.
7463
7464> I was wondering if you know of anyone who has used flex in this way.
7465
7466I don't - but it seems like a reasonable project to undertake (unlike
7467numerous other flex tweaks :-).
7468
7469		Vern
7470@end verbatim
7471@end example
7472
7473@c TODO: Evaluate this faq.
7474@node unnamed-faq-81
7475@unnumberedsec unnamed-faq-81
7476@example
7477@verbatim
From: Georg Rehm <georg@hal.cl-ki.uni-osnabrueck.de>
Subject: "flex scanner push-back overflow"
To: vern@ee.lbl.gov
Date: Thu, 20 Aug 1998 09:47:54 +0200 (MEST)
7497
7498Hi Vern,
7499
7500Yesterday, I encountered a strange problem: I use the macro processor m4
7501to include some lengthy lists into a .l file. Following is a flex macro
7502definition that causes some serious pain in my neck:
7503
7504AUTHOR           ("A. Boucard / L. Boucard"|"A. Dastarac / M. Levent"|"A.Boucaud / L.Boucaud"|"Abderrahim Lamchichi"|"Achmat Dangor"|"Adeline Toullier"|"Adewale Maja-Pearce"|"Ahmed Ziri"|"Akram Ellyas"|"Alain Bihr"|"Alain Gresh"|"Alain Guillemoles"|"Alain Joxe"|"Alain Morice"|"Alain Renon"|"Alain Zecchini"|"Albert Memmi"|"Alberto Manguel"|"Alex De Waal"|"Alfonso Artico"| [...])
7505
7506The complete list contains about 10kB. When I try to "flex" this file
7507(on a Solaris 2.6 machine, using a modified flex 2.5.4 (I only increased
some of the predefined values in flexdef.h)) I get the error:
7509
7510myflex/flex -8  sentag.tmp.l
7511flex scanner push-back overflow
7512
7513When I remove the slashes in the macro definition everything works fine.
7514As I understand it, the double quotes escape the slash-character so it
7515really means "/" and not "trailing context". Furthermore, I tried to
7516escape the slashes with backslashes, but with no use, the same error message
7517appeared when flexing the code.
7518
7519Do you have an idea what's going on here?
7520
7521Greetings from Germany,
7522	Georg
7523--
7524Georg Rehm                                     georg@cl-ki.uni-osnabrueck.de
7525Institute for Semantic Information Processing, University of Osnabrueck, FRG
7526@end verbatim
7527@end example
7528
7529@c TODO: Evaluate this faq.
7530@node unnamed-faq-82
7531@unnumberedsec unnamed-faq-82
7532@example
7533@verbatim
7534To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE
7535Subject: Re: "flex scanner push-back overflow"
7536In-reply-to: Your message of Thu, 20 Aug 1998 09:47:54 PDT.
7537Date: Thu, 20 Aug 1998 07:05:35 PDT
7538From: Vern Paxson <vern>
7539
7540> myflex/flex -8  sentag.tmp.l
7541> flex scanner push-back overflow
7542
7543Flex itself uses a flex scanner.  That scanner is running out of buffer
7544space when it tries to unput() the humongous macro you've defined.  When
7545you remove the '/'s, you make it small enough so that it fits in the buffer;
7546removing spaces would do the same thing.
7547
The fix is either to rethink why you're using such a big macro (perhaps
there's a better way to do it), or to rebuild flex's own
scan.c with a larger value for
7551
7552	#define YY_BUF_SIZE 16384
7553
7554- Vern
7555@end verbatim
7556@end example
7557
7558@c TODO: Evaluate this faq.
7559@node unnamed-faq-83
7560@unnumberedsec unnamed-faq-83
7561@example
7562@verbatim
7563To: Jan Kort <jan@research.techforce.nl>
7564Subject: Re: Flex
7565In-reply-to: Your message of Fri, 04 Sep 1998 12:18:43 +0200.
7566Date: Sat, 05 Sep 1998 00:59:49 PDT
7567From: Vern Paxson <vern>
7568
7569> %%
7570>
7571> "TEST1\n"       { fprintf(stderr, "TEST1\n"); yyless(5); }
7572> ^\n             { fprintf(stderr, "empty line\n"); }
7573> .               { }
7574> \n              { fprintf(stderr, "new line\n"); }
7575>
7576> %%
7577> -- input ---------------------------------------
7578> TEST1
7579> -- output --------------------------------------
7580> TEST1
7581> empty line
7582> ------------------------------------------------
7583
7584IMHO, it's not clear whether or not this is in fact a bug.  It depends
7585on whether you view yyless() as backing up in the input stream, or as
7586pushing new characters onto the beginning of the input stream.  Flex
7587interprets it as the latter (for implementation convenience, I'll admit),
7588and so considers the newline as in fact matching at the beginning of a
7589line, as after all the last token scanned an entire line and so the
7590scanner is now at the beginning of a new line.
7591
7592I agree that this is counter-intuitive for yyless(), given its
7593functional description (it's less so for unput(), depending on whether
7594you're unput()'ing new text or scanned text).  But I don't plan to
7595change it any time soon, as it's a pain to do so.  Consequently,
7596you do indeed need to use yy_set_bol() and YY_AT_BOL() to tweak
7597your scanner into the behavior you desire.
7598
7599Sorry for the less-than-completely-satisfactory answer.
7600
7601		Vern
7602@end verbatim
7603@end example
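
The behavior described in the reply above can be illustrated with a simplified
model. This is a sketch, not code taken from a generated scanner: the names
@code{model_match} and @code{model_yyless} are hypothetical, and the sketch
assumes, as the reply describes, that the beginning-of-line flag is derived
from the last character of the full match before the action runs, and that
@code{yyless()} does not recompute it.

@example
@verbatim
#include <stddef.h>

/* Simplified model of a scanner's position and beginning-of-line flag. */
struct model {
    const char *input;  /* the whole input */
    size_t pos;         /* index of the next character to scan */
    int at_bol;         /* nonzero if '^' rules may match next */
};

/* Consume len characters as one token.  The flag is derived from the
   last character of the full match. */
static void model_match(struct model *m, size_t len)
{
    m->pos += len;
    if (len > 0)
        m->at_bol = (m->input[m->pos - 1] == '\n');
}

/* Give back all but the first keep characters of a token of length len.
   The at_bol flag is deliberately left untouched. */
static void model_yyless(struct model *m, size_t len, size_t keep)
{
    m->pos -= (len - keep);
}
@end verbatim
@end example

In this model, matching the six characters of @samp{TEST1\n} and then giving
back the newline leaves the position at the newline with the flag still set,
which corresponds to the @samp{^\n} rule matching in the report above. In a
real scanner, calling @code{yy_set_bol(0)} in the action after
@code{yyless()} is the corresponding adjustment.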
7604
7605@c TODO: Evaluate this faq.
7606@node unnamed-faq-84
7607@unnumberedsec unnamed-faq-84
7608@example
7609@verbatim
7610To: Patrick Krusenotto <krusenot@mac-info-link.de>
7611Subject: Re: Problems with restarting flex-2.5.2-generated scanner
7612In-reply-to: Your message of Thu, 24 Sep 1998 10:14:07 PDT.
7613Date: Thu, 24 Sep 1998 23:28:43 PDT
7614From: Vern Paxson <vern>
7615
7616> I am using flex-2.5.2 and bison 1.25 for Solaris and I am desperately
7617> trying to make my scanner restart with a new file after my parser stops
7618> with a parse error. When my compiler restarts, the parser always
7619> receives the token after the token (in the old file!) that caused the
7620> parser error.
7621
7622I suspect the problem is that your parser has read ahead in order
7623to attempt to resolve an ambiguity, and when it's restarted it picks
7624up with that token rather than reading a fresh one.  If you're using
7625yacc, then the special "error" production can sometimes be used to
7626consume tokens in an attempt to get the parser into a consistent state.
7627
7628		Vern
7629@end verbatim
7630@end example
7631
7632@c TODO: Evaluate this faq.
7633@node unnamed-faq-85
7634@unnumberedsec unnamed-faq-85
7635@example
7636@verbatim
7637To: Henric Jungheim <junghelh@pe-nelson.com>
7638Subject: Re: flex 2.5.4a
7639In-reply-to: Your message of Tue, 27 Oct 1998 16:41:42 PST.
7640Date: Tue, 27 Oct 1998 16:50:14 PST
7641From: Vern Paxson <vern>
7642
7643> This brings up a feature request:  How about a command line
7644> option to specify the filename when reading from stdin?  That way one
7645> doesn't need to create a temporary file in order to get the "#line"
7646> directives to make sense.
7647
7648Use -o combined with -t (per the man page description of -o).
7649
7650> P.S., Is there any simple way to use non-blocking IO to parse multiple
7651> streams?
7652
7653Simple, no.
7654
7655One approach might be to return a magic character on EWOULDBLOCK and
7656have a rule
7657
7658	.*<magic-character>	// put back .*, eat magic character
7659
7660This is off the top of my head, not sure it'll work.
7661
7662		Vern
7663@end verbatim
7664@end example
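
The ``magic character'' idea in the reply above might be sketched as follows.
This is only an illustration under stated assumptions: the function name
@code{read_or_magic} and the marker value @code{MAGIC_CHAR} are hypothetical,
the descriptor is assumed to be in non-blocking mode already, and the marker
is assumed never to occur in real input.

@example
@verbatim
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

#define MAGIC_CHAR '\001'  /* hypothetical marker, assumed absent from input */

/* Read up to max bytes from fd into buf.  Returns the number of bytes
   stored, 0 on end of file (or if max is 0), or -1 on a real error.
   If the read would block, a single marker character is stored instead. */
static int read_or_magic(int fd, char *buf, size_t max)
{
    ssize_t n;

    if (max == 0)
        return 0;
    n = read(fd, buf, max);
    if (n >= 0)
        return (int) n;
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
        buf[0] = MAGIC_CHAR;
        return 1;
    }
    return -1;
}
@end verbatim
@end example

A scanner could call such a function from a redefinition of @code{YY_INPUT}
and pair it with a rule that recognizes the marker, as the reply suggests.
Whether that rule behaves well in practice is untested here, as it is in the
reply.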
7665
7666@c TODO: Evaluate this faq.
7667@node unnamed-faq-86
7668@unnumberedsec unnamed-faq-86
7669@example
7670@verbatim
7671To: "Repko, Billy D" <billy.d.repko@intel.com>
7672Subject: Re: Compiling scanners
7673In-reply-to: Your message of Wed, 13 Jan 1999 10:52:47 PST.
7674Date: Thu, 14 Jan 1999 00:25:30 PST
7675From: Vern Paxson <vern>
7676
7677> It appears that maybe it cannot find the lfl library.
7678
7679The Makefile in the distribution builds it, so you should have it.
7680It's exceedingly trivial, just a main() that calls yylex() and
a yywrap() that always returns 1.
7682
7683> %%
7684>       \n      ++num_lines; ++num_chars;
7685>       .       ++num_chars;
7686
7687You can't indent your rules like this - that's where the errors are coming
7688from.  Flex copies indented text to the output file, it's how you do things
7689like
7690
7691	int num_lines_seen = 0;
7692
7693to declare local variables.
7694
7695		Vern
7696@end verbatim
7697@end example
7698
7699@c TODO: Evaluate this faq.
7700@node unnamed-faq-87
7701@unnumberedsec unnamed-faq-87
7702@example
7703@verbatim
7704To: Erick Branderhorst <Erick.Branderhorst@asml.nl>
7705Subject: Re: flex input buffer
7706In-reply-to: Your message of Tue, 09 Feb 1999 13:53:46 PST.
7707Date: Tue, 09 Feb 1999 21:03:37 PST
7708From: Vern Paxson <vern>
7709
7710> In the flex.skl file the size of the default input buffers is set.  Can you
7711> explain why this size is set and why it is such a high number.
7712
7713It's large to optimize performance when scanning large files.  You can
7714safely make it a lot lower if needed.
7715
7716		Vern
7717@end verbatim
7718@end example
7719
7720@c TODO: Evaluate this faq.
7721@node unnamed-faq-88
7722@unnumberedsec unnamed-faq-88
7723@example
7724@verbatim
7725To: "Guido Minnen" <guidomi@cogs.susx.ac.uk>
7726Subject: Re: Flex error message
7727In-reply-to: Your message of Wed, 24 Feb 1999 15:31:46 PST.
7728Date: Thu, 25 Feb 1999 00:11:31 PST
7729From: Vern Paxson <vern>
7730
7731> I'm extending a larger scanner written in Flex and I keep running into
7732> problems. More specifically, I get the error message:
7733> "flex: input rules are too complicated (>= 32000 NFA states)"
7734
7735Increase the definitions in flexdef.h for:
7736
#define JAMSTATE -32766 /* marks a reference to the state that always jams */
7739#define MAXIMUM_MNS 31999
7740#define BAD_SUBSCRIPT -32767
7741
7742recompile everything, and it should all work.
7743
7744		Vern
7745@end verbatim
7746@end example
7747
7748@c TODO: Evaluate this faq.
7749@node unnamed-faq-90
7750@unnumberedsec unnamed-faq-90
7751@example
7752@verbatim
7753To: "Dmitriy Goldobin" <gold@ems.chel.su>
7754Subject: Re: FLEX trouble
7755In-reply-to: Your message of Mon, 31 May 1999 18:44:49 PDT.
7756Date: Tue, 01 Jun 1999 00:15:07 PDT
7757From: Vern Paxson <vern>
7758
>   I have a trouble with FLEX. Why rule "/*".*"*/" work properly,
7760> but rule "/*"(.|\n)*"*/" don't work ?
7761
7762The second of these will have to scan the entire input stream (because
7763"(.|\n)*" matches an arbitrary amount of any text) in order to see if
7764it ends with "*/", terminating the comment.  That potentially will overflow
7765the input buffer.
7766
7767>   More complex rule "/*"([^*]|(\*/[^/]))*"*/ give an error
7768> 'unrecognized rule'.
7769
7770You can't use the '/' operator inside parentheses.  It's not clear
7771what "(a/b)*" actually means.
7772
7773>   I now use workaround with state <comment>, but single-rule is
7774> better, i think.
7775
7776Single-rule is nice but will always have the problem of either setting
7777restrictions on comments (like not allowing multi-line comments) and/or
7778running the risk of consuming the entire input stream, as noted above.
7779
7780		Vern
7781@end verbatim
7782@end example
7783
7784@c TODO: Evaluate this faq.
7785@node unnamed-faq-91
7786@unnumberedsec unnamed-faq-91
7787@example
7788@verbatim
7793To: vern@ee.lbl.gov
7794Date: Tue, 15 Jun 1999 08:55:43 -0700
7795From: "Aki Niimura" <neko@my-deja.com>
Subject: A question on flex C++ scanner
7807
Dear Dr. Paxson,
7809
7810I have been using flex for years.
7811It works very well on many projects.
7812Most case, I used it to generate a scanner on C language.
7813However, one project I needed to generate  a scanner
on C++ language. Thanks to your enhancement, flex did
7815the job.
7816
7817Currently, I'm working on enhancing my previous project.
7818I need to deal with multiple input streams (recursive
7819inclusion) in this scanner (C++).
7820I did similar thing for another scanner (C) as you
7821explained in your documentation.
7822
7823The generated scanner (C++) has necessary methods:
7824- switch_to_buffer(struct yy_buffer_state *b)
7825- yy_create_buffer(istream *is, int sz)
7826- yy_delete_buffer(struct yy_buffer_state *b)
7827
7828However, I couldn't figure out how to access current
7829buffer (yy_current_buffer).
7830
7831yy_current_buffer is a protected member of yyFlexLexer.
7832I can't access it directly.
7833Then, I thought yy_create_buffer() with is = 0 might
7834return current stream buffer. But it seems not as far
7835as I checked the source. (flex 2.5.4)
7836
7837I went through the Web in addition to Flex documentation.
7838However, it hasn't been successful, so far.
7839
7840It is not my intention to bother you, but, can you
7841comment about how to obtain the current stream buffer?
7842
7843Your response would be highly appreciated.
7844
7845Best regards,
7846Aki Niimura
7847
7850@end verbatim
7851@end example
7852
7853@c TODO: Evaluate this faq.
7854@node unnamed-faq-92
7855@unnumberedsec unnamed-faq-92
7856@example
7857@verbatim
7858To: neko@my-deja.com
7859Subject: Re: A question on flex C++ scanner
7860In-reply-to: Your message of Tue, 15 Jun 1999 08:55:43 PDT.
7861Date: Tue, 15 Jun 1999 09:04:24 PDT
7862From: Vern Paxson <vern>
7863
7864> However, I couldn't figure out how to access current
7865> buffer (yy_current_buffer).
7866
7867Derive your own subclass from yyFlexLexer.
7868
7869		Vern
7870@end verbatim
7871@end example
7872
7873@c TODO: Evaluate this faq.
7874@node unnamed-faq-93
7875@unnumberedsec unnamed-faq-93
7876@example
7877@verbatim
7878To: "Stones, Darren" <Darren.Stones@nectech.co.uk>
7879Subject: Re: You're the man to see?
7880In-reply-to: Your message of Wed, 23 Jun 1999 11:10:29 PDT.
7881Date: Wed, 23 Jun 1999 09:01:40 PDT
7882From: Vern Paxson <vern>
7883
7884> I hope you can help me.  I am using Flex and Bison to produce an interpreted
7885> language.  However all goes well until I try to implement an IF statement or
7886> a WHILE.  I cannot get this to work as the parser parses all the conditions
> eg. the TRUE and FALSE conditions to check for a rule match.  So I cannot
7888> make a decision!!
7889
You need to use the parser to build a parse tree (= abstract syntax tree),
7891and when that's all done you recursively evaluate the tree, binding variables
7892to values at that time.
7893
7894		Vern
7895@end verbatim
7896@end example
7897
7898@c TODO: Evaluate this faq.
7899@node unnamed-faq-94
7900@unnumberedsec unnamed-faq-94
7901@example
7902@verbatim
7903To: Petr Danecek <petr@ics.cas.cz>
7904Subject: Re: flex - question
7905In-reply-to: Your message of Mon, 28 Jun 1999 19:21:41 PDT.
7906Date: Fri, 02 Jul 1999 16:52:13 PDT
7907From: Vern Paxson <vern>
7908
7909> file, it takes an enormous amount of time. It is funny, because the
> source code has only 12 rules!!! I think it looks like an exponential
7911> growth.
7912
7913Right, that's the problem - some patterns (those with a lot of
ambiguity, which yours has because at any given time the scanner can
7915be in the middle of all sorts of combinations of the different
7916rules) blow up exponentially.
7917
For your rules, there is an easy fix.  Change the ".*" that comes after
7919the directory name to "[^ ]*".  With that in place, the rules are no
7920longer nearly so ambiguous, because then once one of the directories
7921has been matched, no other can be matched (since they all require a
7922leading blank).
7923
7924If that's not an acceptable solution, then you can enter a start state
7925to pick up the .*\n after each directory is matched.
7926
Also note that for speed, you'll want to add a ".*" rule at the end,
otherwise input that doesn't match any of the patterns will be matched
very slowly, a character at a time.
7930
7931		Vern
7932@end verbatim
7933@end example
7934
7935@c TODO: Evaluate this faq.
7936@node unnamed-faq-95
7937@unnumberedsec unnamed-faq-95
7938@example
7939@verbatim
7940To: Tielman Koekemoer <tielman@spi.co.za>
7941Subject: Re: Please help.
7942In-reply-to: Your message of Thu, 08 Jul 1999 13:20:37 PDT.
7943Date: Thu, 08 Jul 1999 08:20:39 PDT
7944From: Vern Paxson <vern>
7945
7946> I was hoping you could help me with my problem.
7947>
7948> I tried compiling (gnu)flex on a Solaris 2.4 machine
7949> but when I ran make (after configure) I got an error.
7950>
7951> --------------------------------------------------------------
7952> gcc -c -I. -I. -g -O parse.c
7953> ./flex -t -p  ./scan.l >scan.c
7954> sh: ./flex: not found
7955> *** Error code 1
7956> make: Fatal error: Command failed for target `scan.c'
7957> -------------------------------------------------------------
7958>
7959> What's strange to me is that I'm only
> trying to install flex now. I then edited the Makefile
7961> and changed where it says "FLEX = flex" to "FLEX = lex"
7962> ( lex: the native Solaris one ) but then it complains about
7963> the "-p" option. Is there any way I can compile flex without
7964> using flex or lex?
7965>
7966> Thanks so much for your time.
7967
7968You managed to step on the bootstrap sequence, which first copies
7969initscan.c to scan.c in order to build flex.  Try fetching a fresh
7970distribution from ftp.ee.lbl.gov.  (Or you can first try removing
7971".bootstrap" and doing a make again.)
7972
7973		Vern
7974@end verbatim
7975@end example
7976
7977@c TODO: Evaluate this faq.
7978@node unnamed-faq-96
7979@unnumberedsec unnamed-faq-96
7980@example
7981@verbatim
7982To: Tielman Koekemoer <tielman@spi.co.za>
7983Subject: Re: Please help.
7984In-reply-to: Your message of Fri, 09 Jul 1999 09:16:14 PDT.
7985Date: Fri, 09 Jul 1999 00:27:20 PDT
7986From: Vern Paxson <vern>
7987
7988> First I removed .bootstrap (and ran make) - no luck. I downloaded the
7989> software but I still have the same problem. Is there anything else I
7990> could try.
7991
7992Try:
7993
7994	cp initscan.c scan.c
7995	touch scan.c
7996	make scan.o
7997
7998If this last tries to first build scan.c from scan.l using ./flex, then
7999your "make" is broken, in which case compile scan.c to scan.o by hand.
8000
8001		Vern
8002@end verbatim
8003@end example
8004
8005@c TODO: Evaluate this faq.
8006@node unnamed-faq-97
8007@unnumberedsec unnamed-faq-97
8008@example
8009@verbatim
8010To: Sumanth Kamenani <skamenan@crl.nmsu.edu>
8011Subject: Re: Error
8012In-reply-to: Your message of Mon, 19 Jul 1999 23:08:41 PDT.
8013Date: Tue, 20 Jul 1999 00:18:26 PDT
8014From: Vern Paxson <vern>
8015
8016> I am getting a compilation error. The error is given as "unknown symbol- yylex".
8017
8018The parser relies on calling yylex(), but you're instead using the C++ scanning
class, so you need to supply a yylex() "glue" function that calls yylex() on
an instance of the scanner (e.g., "scanner->yylex()").
8021
8022		Vern
8023@end verbatim
8024@end example
8025
8026@c TODO: Evaluate this faq.
8027@node unnamed-faq-98
8028@unnumberedsec unnamed-faq-98
8029@example
8030@verbatim
8031To: daniel@synchrods.synchrods.COM (Daniel Senderowicz)
8032Subject: Re: lex
8033In-reply-to: Your message of Mon, 22 Nov 1999 11:19:04 PST.
8034Date: Tue, 23 Nov 1999 15:54:30 PST
8035From: Vern Paxson <vern>
8036
8037Well, your problem is the
8038
8039switch (yybgin-yysvec-1) {      /* witchcraft */
8040
8041at the beginning of lex rules.  "witchcraft" == "non-portable".  It's
8042assuming knowledge of the AT&T lex's internal variables.
8043
8044For flex, you can probably do the equivalent using a switch on YYSTATE.
8045
8046		Vern
8047@end verbatim
8048@end example
8049
8050@c TODO: Evaluate this faq.
8051@node unnamed-faq-99
8052@unnumberedsec unnamed-faq-99
8053@example
8054@verbatim
8055To: archow@hss.hns.com
8056Subject: Re: Regarding distribution of flex and yacc based grammars
8057In-reply-to: Your message of Sun, 19 Dec 1999 17:50:24 +0530.
8058Date: Wed, 22 Dec 1999 01:56:24 PST
8059From: Vern Paxson <vern>
8060
8061> When we provide the customer with an object code distribution, is it
8062> necessary for us to provide source
8063> for the generated C files from flex and bison since they are generated by
8064> flex and bison ?
8065
8066For flex, no.  I don't know what the current state of this is for bison.
8067
> Also, is there any requirement for us to necessarily provide source for
8069> the grammar files which are fed into flex and bison ?
8070
8071Again, for flex, no.
8072
8073See the file "COPYING" in the flex distribution for the legalese.
8074
8075		Vern
8076@end verbatim
8077@end example
8078
8079@c TODO: Evaluate this faq.
8080@node unnamed-faq-100
8081@unnumberedsec unnamed-faq-100
8082@example
8083@verbatim
8084To: Martin Gallwey <gallweym@hyperion.moe.ul.ie>
8085Subject: Re: Flex, and self referencing rules
8086In-reply-to: Your message of Sun, 20 Feb 2000 01:01:21 PST.
8087Date: Sat, 19 Feb 2000 18:33:16 PST
8088From: Vern Paxson <vern>
8089
8090> However, I do not use unput anywhere. I do use self-referencing
8091> rules like this:
8092>
8093> UnaryExpr               ({UnionExpr})|("-"{UnaryExpr})
8094
8095You can't do this - flex is *not* a parser like yacc (which does indeed
8096allow recursion), it is a scanner that's confined to regular expressions.
8097
8098		Vern
8099@end verbatim
8100@end example
8101
8102@c TODO: Evaluate this faq.
8103@node unnamed-faq-101
8104@unnumberedsec unnamed-faq-101
8105@example
8106@verbatim
8107To: slg3@lehigh.edu (SAMUEL L. GULDEN)
8108Subject: Re: Flex problem
8109In-reply-to: Your message of Thu, 02 Mar 2000 12:29:04 PST.
8110Date: Thu, 02 Mar 2000 23:00:46 PST
8111From: Vern Paxson <vern>
8112
8113If this is exactly your program:
8114
8115> digit [0-9]
8116> digits {digit}+
8117> whitespace [ \t\n]+
8118>
8119> %%
8120> "[" { printf("open_brac\n");}
8121> "]" { printf("close_brac\n");}
8122> "+" { printf("addop\n");}
8123> "*" { printf("multop\n");}
8124> {digits} { printf("NUMBER = %s\n", yytext);}
8125> whitespace ;
8126
8127then the problem is that the last rule needs to be "{whitespace}" !
8128
8129		Vern
8130@end verbatim
8131@end example
8132
8133@node What is the difference between YYLEX_PARAM and YY_DECL?
8134@unnumberedsec What is the difference between YYLEX_PARAM and YY_DECL?
8135
@code{YYLEX_PARAM} is not a flex symbol; it belongs to Bison. It tells Bison to pass extra
parameters when it calls @code{yylex()} from the parser.

@code{YY_DECL} is the flex declaration of @code{yylex}. The default is similar to this:
8140
8141@example
8142@verbatim
#define YY_DECL int yylex ()
8144@end verbatim
8145@end example
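
The following sketch shows why a macro such as @code{YY_DECL} is useful: the
function header is taken from the macro, so redefining the macro changes the
signature of the scanning routine. The body here is a hand-written stand-in,
not generated code, and the parameter @code{call_count} is a hypothetical
extra argument chosen only for illustration.

@example
@verbatim
/* The function header comes from the macro, so redefining the macro
   changes the signature. */
#define YY_DECL int yylex(int *call_count)

/* Prototype for callers (for example a parser); it must match YY_DECL. */
YY_DECL;

/* Hand-written body standing in for the generated scanner. */
YY_DECL
{
    ++*call_count;  /* the extra argument is visible inside the body */
    return 0;       /* 0 conventionally means end of input */
}
@end verbatim
@end example

On the Bison side, the matching extra argument has to be supplied at the call
site, which is the job of @code{YYLEX_PARAM} (or of @code{%lex-param} in Bison
versions that provide it).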
8146
8147
8148@node Why do I get "conflicting types for yylex" error?
8149@unnumberedsec Why do I get "conflicting types for yylex" error?
8150
This is a compiler error regarding a generated Bison parser, not a Flex scanner.
It means you need a prototype of @code{yylex()} at the top of the Bison file.
Be sure the prototype matches @code{YY_DECL}.
8154
8155@node How do I access the values set in a Flex action from within a Bison action?
8156@unnumberedsec How do I access the values set in a Flex action from within a Bison action?
8157
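Store the value in @code{yylval} in the Flex action, then read it in the Bison
action with @code{$1}, @code{$2}, @code{$3}, etc. These are called ``Semantic
Values'' in the Bison manual.
See @ref{Top, , , bison, the GNU Bison Manual}.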
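
As a rough illustration of the hand-off, the sketch below uses hand-written
stand-ins for what Bison would normally generate from @code{%union} and
@code{%token}. The function @code{scan_number} is hypothetical and plays the
role of a scanner action; the token number 258 is likewise only an assumption
for the example.

@example
@verbatim
#include <stdlib.h>

/* Stand-ins for what Bison would generate from %union and %token. */
typedef union { int num; const char *str; } YYSTYPE;
enum { NUMBER = 258 };

YYSTYPE yylval;  /* shared between scanner and parser */

/* Stand-in for a scanner action that converts the matched text, stores
   the result in yylval, and returns the token type. */
static int scan_number(const char *text)
{
    yylval.num = atoi(text);
    return NUMBER;
}
@end verbatim
@end example

In a grammar where @code{NUMBER} is declared with the @code{num} member of the
union, a rule that has @code{NUMBER} as its first symbol would then see the
stored value as @code{$1}.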
8160
8161@node Appendices, Indices, FAQ, Top
8162@appendix Appendices
8163
8164@menu
8165* Makefiles and Flex::          
8166* Bison Bridge::                
8167* M4 Dependency::               
8168* Common Patterns::               
8169@end menu
8170
8171@node Makefiles and Flex, Bison Bridge, Appendices, Appendices
8172@appendixsec Makefiles and Flex
8173
8174@cindex Makefile, syntax
8175
8176In this appendix, we provide tips for writing Makefiles to build your scanners.
8177
8178In a traditional build environment, we say that the @file{.c} files are the
8179sources, and the @file{.o} files are the intermediate files. When using
8180@code{flex}, however, the @file{.l} files are the sources, and the generated
8181@file{.c} files (along with the @file{.o} files) are the intermediate files.
8182This requires you to carefully plan your Makefile.
8183
Modern @command{make} programs understand that @file{foo.l} is intended to
generate @file{lex.yy.c} or @file{foo.c}, and will behave
accordingly@footnote{GNU @command{make} and GNU @command{automake} are two such
programs that provide implicit rules for flex-generated scanners.}@footnote{GNU @command{automake}
may generate code to execute flex in lex-compatible mode, or to stdout. If this is not what you want,
then you should provide an explicit rule in your @file{Makefile.am}.}.  The
following Makefile does not explicitly instruct @command{make} how to build
@file{scan.c} from @file{scan.l}. Instead, it relies on the implicit rules of the
@command{make} program to build the intermediate file, @file{scan.c}:
8193
8194@cindex Makefile, example of implicit rules
8195@example
8196@verbatim
8197    # Basic Makefile -- relies on implicit rules
8198    # Creates "myprogram" from "scan.l" and "myprogram.c"
8199    #
8200    LEX=flex
8201    myprogram: scan.o myprogram.o
8202    scan.o: scan.l
8203
8204@end verbatim
8205@end example
8206
8207
8208For simple cases, the above may be sufficient. For other cases,
8209you may have to explicitly instruct @command{make} how to build your scanner.
8210The following is an example of a Makefile containing explicit rules:
8211
8212@cindex Makefile, explicit example
8213@example
8214@verbatim
8215    # Basic Makefile -- provides explicit rules
8216    # Creates "myprogram" from "scan.l" and "myprogram.c"
8217    #
8218    LEX=flex
8219    myprogram: scan.o myprogram.o
8220            $(CC) -o $@  $(LDFLAGS) $^
8221
8222    myprogram.o: myprogram.c
8223            $(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^
8224
8225    scan.o: scan.c
8226            $(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^
8227
8228    scan.c: scan.l
8229            $(LEX) $(LFLAGS) -o $@ $^
8230
8231    clean:
8232            $(RM) *.o scan.c
8233
8234@end verbatim
8235@end example
8236
8237Notice in the above example that @file{scan.c} is in the @code{clean} target.
8238This is because we consider the file @file{scan.c} to be an intermediate file.
8239
8240Finally, we provide a realistic example of a @code{flex} scanner used with a
8241@code{bison} parser@footnote{This example also applies to yacc parsers.}.
8242There is a tricky problem we have to deal with. Since a @code{flex} scanner
8243will typically include a header file (e.g., @file{y.tab.h}) generated by the
8244parser, we need to be sure that the header file is generated BEFORE the scanner
8245is compiled. We handle this case in the following example:
8246
8247@example
8248@verbatim
8249    # Makefile example -- scanner and parser.
8250    # Creates "myprogram" from "scan.l", "parse.y", and "myprogram.c"
8251    #
8252    LEX     = flex
8253    YACC    = bison -y
8254    YFLAGS  = -d
8255    objects = scan.o parse.o myprogram.o
8256
8257    myprogram: $(objects)
8258    scan.o: scan.l parse.c
8259    parse.o: parse.y
8260    myprogram.o: myprogram.c
8261
8262@end verbatim
8263@end example
8264
8265In the above example, notice the line,
8266
8267@example
8268@verbatim
8269    scan.o: scan.l parse.c
8270@end verbatim
8271@end example
8272
This line lists the file @file{parse.c} (the generated parser) as a dependency of
@file{scan.o}. Because the header file is generated in the same step as
@file{parse.c}, this ensures that the header exists before the scanner
is compiled. Implicit rules differ between implementations of @command{make},
so check the behavior of the one you use.
8277
8278
8279For more details on writing Makefiles, see @ref{Top, , , make, The
8280GNU Make Manual}.
8281
8282@node Bison Bridge, M4 Dependency, Makefiles and Flex, Appendices
8283@section C Scanners with Bison Parsers
8284
8285@cindex bison, bridging with flex
8286@vindex yylval
8287@vindex yylloc
8288@tindex YYLTYPE
8289@tindex YYSTYPE
8290
8291This section describes the @code{flex} features useful when integrating
8292@code{flex} with @code{GNU bison}@footnote{The features described here are
8293purely optional, and are by no means the only way to use flex with bison.
8294We merely provide some glue to ease development of your parser-scanner pair.}.
8295Skip this section if you are not using
8296@code{bison} with your scanner.  Here we discuss only the @code{flex}
8297half of the @code{flex} and @code{bison} pair.  We do not discuss
8298@code{bison} in any detail.  For more information about generating
8299@code{bison} parsers, see @ref{Top, , , bison, the GNU Bison Manual}.
8300
A @code{bison}-compatible scanner is generated by declaring @samp{%option
bison-bridge} or by supplying @samp{--bison-bridge} when invoking @code{flex}
from the command line.  This instructs @code{flex} that the macro
@code{yylval} may be used. The data type for
@code{yylval}, @code{YYSTYPE},
is typically defined in a header file, included in section 1 of the
@code{flex} input file.  For a list of functions and macros
available, see @ref{bison-functions}.
8309
The declaration of @code{yylex} becomes,
8311
8312@findex yylex (reentrant version)
8313@example
8314@verbatim
8315      int yylex ( YYSTYPE * lvalp, yyscan_t scanner );
8316@end verbatim
8317@end example
8318
8319If @code{%option bison-locations} is specified, then the declaration
8320becomes,
8321
8322@findex yylex (reentrant version)
8323@example
8324@verbatim
8325      int yylex ( YYSTYPE * lvalp, YYLTYPE * llocp, yyscan_t scanner );
8326@end verbatim
8327@end example
8328
8329Note that the macros @code{yylval} and @code{yylloc} evaluate to pointers.
8330Support for @code{yylloc} is optional in @code{bison}, so it is optional in
8331@code{flex} as well. The following is an example of a @code{flex} scanner that
8332is compatible with @code{bison}.
8333
8334@cindex bison, scanner to be called from bison
8335@example
8336@verbatim
8337    /* Scanner for "C" assignment statements... sort of. */
8338    %{
    #include <stdlib.h>  /* for atoi() */
    #include <string.h>  /* for strdup() */
    #include "y.tab.h"  /* Generated by bison. */
8340    %}
8341
8342    %option bison-bridge bison-locations
    %%
8344
8345    [[:digit:]]+  { yylval->num = atoi(yytext);   return NUMBER;}
8346    [[:alnum:]]+  { yylval->str = strdup(yytext); return STRING;}
8347    "="|";"       { return yytext[0];}
8348    .  {}
    %%
8350@end verbatim
8351@end example
8352
8353As you can see, there really is no magic here. We just use
8354@code{yylval} as we would any other variable. The data type of
8355@code{yylval} is generated by @code{bison}, and included in the file
8356@file{y.tab.h}. Here is the corresponding @code{bison} parser:
8357
8358@cindex bison, parser
8359@example
8360@verbatim
8361    /* Parser to convert "C" assignments to lisp. */
    %{
    #include <stdio.h>  /* for printf() */
8363    /* Pass the argument to yyparse through to yylex. */
8364    #define YYPARSE_PARAM scanner
8365    #define YYLEX_PARAM   scanner
8366    %}
8367    %locations
8368    %pure_parser
8369    %union {
8370        int num;
8371        char* str;
8372    }
8373    %token <str> STRING
8374    %token <num> NUMBER
8375    %%
8376    assignment:
8377        STRING '=' NUMBER ';' {
8378            printf( "(setf %s %d)", $1, $3 );
8379       }
8380    ;
8381@end verbatim
8382@end example
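
The practical consequence of the bridge is that, inside the scanner,
@code{yylval} and @code{yylloc} are pointers supplied by the caller rather
than global variables. The sketch below imitates that calling convention with
a hand-written function. It is not generated code: the name @code{toy_yylex},
the token numbers, and the use of a plain string in place of the
@code{yyscan_t} argument are all assumptions made for illustration.

@example
@verbatim
#include <stdlib.h>
#include <string.h>

/* Stand-ins for declarations Bison would put in its generated header. */
typedef union { int num; char *str; } YYSTYPE;
typedef struct { int first_line, first_column, last_line, last_column; } YYLTYPE;
enum { NUMBER = 258, STRING = 259 };

/* Classify one already-isolated token and report its value and location
   through the pointers supplied by the caller. */
static int toy_yylex(YYSTYPE *yylval, YYLTYPE *yylloc, const char *text)
{
    size_t len = strlen(text);

    yylloc->first_line = yylloc->last_line = 1;
    yylloc->first_column = 1;
    yylloc->last_column = (int) len;

    if (len > 0 && strspn(text, "0123456789") == len) {
        yylval->num = atoi(text);  /* note "->": yylval is a pointer here */
        return NUMBER;
    }
    yylval->str = malloc(len + 1);
    if (yylval->str == NULL)
        abort();
    memcpy(yylval->str, text, len + 1);
    return STRING;
}
@end verbatim
@end example

The caller owns the storage for both objects, which is what allows a pure
parser to keep them on its own stack.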
8383
8384@node M4 Dependency, Common Patterns, Bison Bridge, Appendices
8385@section M4 Dependency
8386@cindex m4
8387The macro processor @code{m4}@footnote{The use of m4 is subject to change in
8388future revisions of flex. It is not part of the public API of flex. Do not depend on it.}
8389must be installed wherever flex is installed.
8390@code{flex} invokes @samp{m4}, found by searching the directories in the
8391@code{PATH} environment variable. Any code you place in section 1 or in the
8392actions will be sent through m4. Please follow these rules to protect your
8393code from unwanted @code{m4} processing.
8394
8395@itemize
8396
@item Do not use symbols that begin with @samp{m4_}, such as @samp{m4_define}
or @samp{m4_include}, since those are reserved for @code{m4} macro names. If for
some reason you need @samp{m4_} as a prefix, use a preprocessor @code{#define} to get your
symbol past @code{m4} unmangled.
8401
8402@item Do not use the strings @samp{[[} or @samp{]]} anywhere in your code. The
8403former is not valid in C, except within comments and strings, but the latter is valid in
8404code such as @code{x[y[z]]}. The solution is simple. To get the literal string 
8405@code{"]]"}, use @code{"]""]"}. To get the array notation @code{x[y[z]]},
8406use @code{x[y[z] ]}. Flex will attempt to detect these sequences in user code, and
8407escape them. However, it's best to avoid this complexity where possible, by
8408removing such sequences from your code.
8409
8410@end itemize
8411
8412@code{m4} is only required at the time you run @code{flex}. The generated
8413scanner is ordinary C or C++, and does @emph{not} require @code{m4}.
8414
8415@node Common Patterns, ,M4 Dependency, Appendices
8416@section Common Patterns
8417@cindex patterns, common
8418
8419This appendix provides examples of common regular expressions you might use
8420in your scanner.
8421
8422@menu
8423* Numbers::         
8424* Identifiers::         
8425* Quoted Constructs::       
8426* Addresses::       
8427@end menu
8428
8429
8430@node Numbers, Identifiers, ,Common Patterns
8431@subsection Numbers
8432
8433@table @asis
8434
8435@item C99 decimal constant
8436@code{([[:digit:]]@{-@}[0])[[:digit:]]*}
8437
8438@item C99 hexadecimal constant
8439@code{0[xX][[:xdigit:]]+}
8440
8441@item C99 octal constant
8442@code{0[01234567]*}
8443
8444@item C99 floating point constant
8445@verbatim
dseq      ([[:digit:]]+)
dseq_opt  ([[:digit:]]*)
frac      (({dseq_opt}"."{dseq})|{dseq}".")
exp       ([eE][+-]?{dseq})
exp_opt   ({exp}?)
fsuff     [flFL]
fsuff_opt ({fsuff}?)
hpref     (0[xX])
hdseq     ([[:xdigit:]]+)
hdseq_opt ([[:xdigit:]]*)
hfrac     (({hdseq_opt}"."{hdseq})|({hdseq}"."))
bexp      ([pP][+-]?{dseq})
dfc       (({frac}{exp_opt}{fsuff_opt})|({dseq}{exp}{fsuff_opt}))
hfc       (({hpref}{hfrac}{bexp}{fsuff_opt})|({hpref}{hdseq}{bexp}{fsuff_opt}))

c99_floating_point_constant  ({dfc}|{hfc})
8462@end verbatim
8463
8464See C99 section 6.4.4.2 for the gory details.
8465
8466@end table
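
When testing a scanner, it can help to have a hand-written check for the
three integer patterns above to compare against. The following is a minimal
sketch, not part of flex; the names @code{int_kind} and @code{classify_int}
are hypothetical, and integer suffixes such as @samp{u} or @samp{L} are
deliberately not handled, since the patterns above do not include them.

@example
@verbatim
```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum int_kind { INT_NONE, INT_DECIMAL, INT_OCTAL, INT_HEX };

/* Classify s as an unsuffixed C99 integer constant, mirroring the
   three patterns above.  The whole string must match. */
enum int_kind classify_int(const char *s)
{
    size_t i;

    assert(s != NULL);

    if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')) {
        /* 0[xX][[:xdigit:]]+ : at least one hex digit is required */
        if (s[2] == '\0')
            return INT_NONE;
        for (i = 2; s[i] != '\0'; i++)
            if (!isxdigit((unsigned char)s[i]))
                return INT_NONE;
        return INT_HEX;
    }
    if (s[0] == '0') {
        /* 0[01234567]* : a lone "0" is an octal constant */
        for (i = 1; s[i] != '\0'; i++)
            if (s[i] < '0' || s[i] > '7')
                return INT_NONE;
        return INT_OCTAL;
    }
    if (s[0] >= '1' && s[0] <= '9') {
        /* ([[:digit:]]{-}[0])[[:digit:]]* : nonzero digit, then digits */
        for (i = 1; s[i] != '\0'; i++)
            if (!isdigit((unsigned char)s[i]))
                return INT_NONE;
        return INT_DECIMAL;
    }
    return INT_NONE;
}
```
@end verbatim
@end example

Note that @code{classify_int("0")} reports an octal constant, which is what
the patterns above imply: the decimal pattern cannot start with a zero.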
8467
8468@node Identifiers, Quoted Constructs, Numbers, Common Patterns
8469@subsection Identifiers
8470
8471@table @asis
8472
8473@item C99 Identifier
8474@verbatim
8475ucn        ((\\u([[:xdigit:]]{4}))|(\\U([[:xdigit:]]{8})))
8476nondigit    [_[:alpha:]]
8477c99_id     ([_[:alpha:]]|{ucn})([_[:alnum:]]|{ucn})*
8478@end verbatim
8479
Technically, the above pattern does not encompass all possible C99 identifiers, since C99 allows for
``implementation-defined'' characters. In practice, many C compilers follow the above pattern, with the
addition of the @samp{$} character.
8483
8484@item UTF-8 Encoded Unicode Code Point
8485@verbatim
8486[\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF]([\x80-\xBF]{2})|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF]([\x80-\xBF]{2})|[\xF1-\xF3]([\x80-\xBF]{3})|\xF4[\x80-\x8F]([\x80-\xBF]{2})
8487@end verbatim
8488
8489@end table
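
The alternatives in the UTF-8 pattern above differ only in the lead byte and
in the range allowed for the second byte. As a rough sketch of the same
logic in plain C (the names @code{in_range} and @code{utf8_seq_len} are
hypothetical and not part of flex):

@example
@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* True if lo <= b <= hi. */
static int in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

/* Return the length (1-4) of the sequence at s[0..n-1] accepted by the
   pattern above, or 0 if no alternative matches. */
size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF;   /* bounds for the second byte */
    size_t len, i;

    assert(s != NULL);
    if (n == 0)
        return 0;

    if (s[0] == 0x09 || s[0] == 0x0A || s[0] == 0x0D ||
        in_range(s[0], 0x20, 0x7E)) {
        return 1;
    } else if (in_range(s[0], 0xC2, 0xDF)) {
        len = 2;
    } else if (in_range(s[0], 0xE0, 0xEF)) {
        len = 3;
        if (s[0] == 0xE0) lo = 0xA0;      /* reject overlong forms */
        if (s[0] == 0xED) hi = 0x9F;      /* reject UTF-16 surrogates */
    } else if (in_range(s[0], 0xF0, 0xF4)) {
        len = 4;
        if (s[0] == 0xF0) lo = 0x90;      /* reject overlong forms */
        if (s[0] == 0xF4) hi = 0x8F;      /* stay at or below U+10FFFF */
    } else {
        return 0;
    }

    if (n < len || !in_range(s[1], lo, hi))
        return 0;
    for (i = 2; i < len; i++)
        if (!in_range(s[i], 0x80, 0xBF))
            return 0;
    return len;
}
```
@end verbatim
@end example

Like the pattern, this sketch accepts only tab, newline, carriage return and
the printable range among the single-byte characters, so a NUL byte or other
control character yields 0.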
8490
8491@node Quoted Constructs, Addresses, Identifiers, Common Patterns
8492@subsection Quoted Constructs
8493
8494@table @asis
8495@item C99 String Literal
@code{L?\"([^\"\\\n]|(\\['\"?\\abfnrtv])|(\\([0-7]@{1,3@}))|(\\x[[:xdigit:]]+)|(\\u([[:xdigit:]]@{4@}))|(\\U([[:xdigit:]]@{8@})))*\"}
8497
8498@item C99 Comment
@code{("/*"([^*]|"*"+[^*/])*"*"+"/")|("/"(\\\n)*"/"([^\\\n]|\\(.|\n))*)}

Note that in C99, a @samp{//}-style comment may be split across lines with a
backslash-newline and, contrary to popular belief, does not include the
trailing @samp{\n} character.
8503
8504A better way to scan @samp{/* */} comments is by line, rather than matching
8505possibly huge comments all at once. This will allow you to scan comments of
8506unlimited length, as long as line breaks appear at sane intervals. This is also
8507more efficient when used with automatic line number processing. @xref{option-yylineno}.
8508
@verbatim
%x COMMENT
%%
<INITIAL>{
    "/*"          BEGIN(COMMENT);
}
<COMMENT>{
    "*"+"/"       BEGIN(INITIAL);
    [^*\n]+       ;
    "*"+[^*/\n]*  ;
    \n            ;
}
@end verbatim
8520
8521@end table
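
The cases worth testing in any @samp{/* */} comment scanner are runs of
asterisks just before the closing slash, and comments that never close. The
following hand-written C routine is a minimal sketch of a reference to compare
scanner output against; @code{skip_block_comment} is a hypothetical name and
the routine is not part of flex.

@example
@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* If s starts with a complete block comment, return its length in bytes
   and add the number of newlines inside it to *lines.  Return 0 if s does
   not start with the opening delimiter or the comment is unterminated. */
size_t skip_block_comment(const char *s, unsigned *lines)
{
    size_t i = 2;
    unsigned seen = 0;

    assert(s != NULL && lines != NULL);
    if (s[0] != '/' || s[1] != '*')
        return 0;

    while (s[i] != '\0') {
        if (s[i] == '*' && s[i + 1] == '/') {
            *lines += seen;
            return i + 2;
        }
        if (s[i] == '\n')
            seen++;
        i++;
    }
    return 0;   /* reached end of input inside the comment */
}
```
@end verbatim
@end example

The line counter is updated only when the comment is complete, so a caller
can report an unterminated comment at the line where it started.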
8522
8523@node Addresses, ,Quoted Constructs, Common Patterns
8524@subsection Addresses
8525
8526@table @asis
8527
8528@item IPv4 Address
8529@verbatim
8530dec-octet     [0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]
8531IPv4address   {dec-octet}\.{dec-octet}\.{dec-octet}\.{dec-octet}
8532@end verbatim
8533
8534@item IPv6 Address
8535@verbatim
8536h16           [0-9A-Fa-f]{1,4}
8537ls32          {h16}:{h16}|{IPv4address}
8538IPv6address   ({h16}:){6}{ls32}|
8539              ::({h16}:){5}{ls32}|
8540              ({h16})?::({h16}:){4}{ls32}|
8541              (({h16}:){0,1}{h16})?::({h16}:){3}{ls32}|
8542              (({h16}:){0,2}{h16})?::({h16}:){2}{ls32}|
8543              (({h16}:){0,3}{h16})?::{h16}:{ls32}|
8544              (({h16}:){0,4}{h16})?::{ls32}|
8545              (({h16}:){0,5}{h16})?::{h16}|
8546              (({h16}:){0,6}{h16})?::
8547@end verbatim
8548
See @uref{http://www.ietf.org/rfc/rfc2373.txt, RFC 2373} for details.
Because a flex definition must fit on one line, you have to fold the
definition of @code{IPv6address} into a single line before using it.
Note also that it matches the ``unspecified address'' @samp{::}.
8552
8553@item URI
8554@code{(([^:/?#]+):)?("//"([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?}
8555
This pattern is of little use for validation, since it allows just about any
character to appear in a URI, including spaces and control characters.  The
RFC offers it as a way to split a URI reference into its components.  See
@uref{http://www.ietf.org/rfc/rfc2396.txt, RFC 2396} for details.
8559
8560@end table
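
The @code{dec-octet} alternatives above amount to ``one to three digits, a
value of at most 255, and no leading zero except for the single digit 0''.
As a minimal sketch of the same check in plain C (the names
@code{match_dec_octet} and @code{is_ipv4} are hypothetical and not part of
flex):

@example
@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* Match one dec-octet at *p and advance past it.  Returns 1 on success.
   As in the pattern, a leading zero is allowed only for the single digit 0. */
static int match_dec_octet(const char **p)
{
    const char *s = *p;
    int value = 0, digits = 0;

    while (digits < 3 && s[digits] >= '0' && s[digits] <= '9') {
        value = value * 10 + (s[digits] - '0');
        digits++;
    }
    if (digits == 0 || value > 255)
        return 0;
    if (digits > 1 && s[0] == '0')
        return 0;
    *p = s + digits;
    return 1;
}

/* Return 1 if the whole string s is an IPv4address per the pattern above. */
int is_ipv4(const char *s)
{
    int i;

    assert(s != NULL);
    for (i = 0; i < 4; i++) {
        if (i > 0 && *s++ != '.')
            return 0;
        if (!match_dec_octet(&s))
            return 0;
    }
    return *s == '\0';
}
```
@end verbatim
@end example

Unlike a flex rule, which matches the longest prefix of the input,
@code{is_ipv4} requires the whole string to be an address, so
@code{is_ipv4("1.2.3.4.")} returns 0.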
8561
8562
8563@node Indices,  , Appendices, Top
8564@unnumbered Indices
8565
8566@menu
8567* Concept Index::               
8568* Index of Functions and Macros::  
8569* Index of Variables::          
8570* Index of Data Types::         
8571* Index of Hooks::              
8572* Index of Scanner Options::    
8573@end menu
8574
8575@node Concept Index, Index of Functions and Macros, Indices, Indices
8576@unnumberedsec Concept Index
8577
8578@printindex cp
8579
8580@node Index of Functions and Macros, Index of Variables, Concept Index, Indices
8581@unnumberedsec Index of Functions and Macros
8582
8583This is an index of functions and preprocessor macros that look like functions.
8584For macros that expand to variables or constants, see @ref{Index of Variables}.
8585
8586@printindex fn
8587
8588@node Index of Variables, Index of Data Types, Index of Functions and Macros, Indices
8589@unnumberedsec Index of Variables
8590
8591This is an index of variables, constants, and preprocessor macros
8592that expand to variables or constants.
8593
8594@printindex vr
8595
8596@node Index of Data Types, Index of Hooks, Index of Variables, Indices
8597@unnumberedsec Index of Data Types
8598@printindex tp
8599
8600@node Index of Hooks, Index of Scanner Options, Index of Data Types, Indices
8601@unnumberedsec Index of Hooks
8602
This is an index of ``hooks'' that the user may define. These hooks typically correspond
to specific locations in the generated scanner, and may be used to insert arbitrary code.
8605
8606@printindex hk
8607
8608@node Index of Scanner Options,  , Index of Hooks, Indices
8609@unnumberedsec Index of Scanner Options
8610
8611@printindex op
8612
8613@c A vim script to name the faq entries. delete this when faqs are no longer
8614@c named "unnamed-faq-XXX".
8615@c
8616@c fu! Faq2 () range abort
8617@c     let @r=input("Rename to: ")
8618@c     exe "%s/" . @w . "/" . @r . "/g"
8619@c     normal 'f
8620@c endf
8621@c nnoremap <F5>  1G/@node\s\+unnamed-faq-\d\+<cr>mfww"wy5ezt:call Faq2()<cr>
8622
8623@bye
8624