1\input texinfo
2@c %**start of header
3@setfilename flex.info
4@settitle Flex - a scanner generator
5@c @finalout
6@c @setchapternewpage odd
7@c %**end of header
8
9@set EDITION 2.5
10@set UPDATED March 1995
11@set VERSION 2.5
12
13@c FIXME - Reread a printed copy with a red pen and patience.
14@c FIXME - Modify all "See ..." references and replace with @xref's.
15
16@ifinfo
17@format
18START-INFO-DIR-ENTRY
19* Flex: (flex).         A fast scanner generator.
20END-INFO-DIR-ENTRY
21@end format
22@end ifinfo
23
24@c Define new indices for commands, filenames, and options.
25@c @defcodeindex cm
26@c @defcodeindex fl
27@c @defcodeindex op
28
29@c Put everything in one index (arbitrarily chosen to be the concept index).
30@c @syncodeindex cm cp
31@c @syncodeindex fl cp
32@syncodeindex fn cp
33@syncodeindex ky cp
34@c @syncodeindex op cp
35@syncodeindex pg cp
36@syncodeindex vr cp
37
38@ifinfo
39This file documents Flex.
40
41Copyright (c) 1990 The Regents of the University of California.
42All rights reserved.
43
44This code is derived from software contributed to Berkeley by
45Vern Paxson.
46
47The United States Government has rights in this work pursuant
48to contract no. DE-AC03-76SF00098 between the United States
49Department of Energy and the University of California.
50
51Redistribution and use in source and binary forms with or without
52modification are permitted provided that: (1) source distributions
53retain this entire copyright notice and comment, and (2)
54distributions including binaries display the following
55acknowledgement:  ``This product includes software developed by the
56University of California, Berkeley and its contributors'' in the
57documentation or other materials provided with the distribution and
58in all advertising materials mentioning features or use of this
59software.  Neither the name of the University nor the names of its
60contributors may be used to endorse or promote products derived
61from this software without specific prior written permission.
62
63THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
64IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
65WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
66PURPOSE.
67
68@ignore
69Permission is granted to process this file through TeX and print the
70results, provided the printed document carries copying permission
71notice identical to this one except for the removal of this paragraph
72(this paragraph not being relevant to the printed manual).
73
74@end ignore
75@end ifinfo
76
77@titlepage
78@title Flex, version @value{VERSION}
79@subtitle A fast scanner generator
80@subtitle Edition @value{EDITION}, @value{UPDATED}
81@author Vern Paxson
82
83@page
84@vskip 0pt plus 1filll
85Copyright @copyright{} 1990 The Regents of the University of California.
86All rights reserved.
87
88This code is derived from software contributed to Berkeley by
89Vern Paxson.
90
91The United States Government has rights in this work pursuant
92to contract no. DE-AC03-76SF00098 between the United States
93Department of Energy and the University of California.
94
95Redistribution and use in source and binary forms with or without
96modification are permitted provided that: (1) source distributions
97retain this entire copyright notice and comment, and (2)
98distributions including binaries display the following
99acknowledgement:  ``This product includes software developed by the
100University of California, Berkeley and its contributors'' in the
101documentation or other materials provided with the distribution and
102in all advertising materials mentioning features or use of this
103software.  Neither the name of the University nor the names of its
104contributors may be used to endorse or promote products derived
105from this software without specific prior written permission.
106
107THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
108IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
109WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
110PURPOSE.
111@end titlepage
112
113@ifinfo
114
115@node Top, Name, (dir), (dir)
116@top flex
117
118@cindex scanner generator
119
120This manual documents @code{flex}.  It covers release @value{VERSION}.
121
122@menu
123* Name::                        Name
124* Synopsis::                    Synopsis
125* Overview::                    Overview
126* Description::                 Description
127* Examples::                    Some simple examples
128* Format::                      Format of the input file
129* Patterns::                    Patterns
130* Matching::                    How the input is matched
131* Actions::                     Actions
132* Generated scanner::           The generated scanner
133* Start conditions::            Start conditions
134* Multiple buffers::            Multiple input buffers
135* End-of-file rules::           End-of-file rules
136* Miscellaneous::               Miscellaneous macros
137* User variables::              Values available to the user
138* YACC interface::              Interfacing with @code{yacc}
139* Options::                     Options
140* Performance::                 Performance considerations
141* C++::                         Generating C++ scanners
142* Incompatibilities::           Incompatibilities with @code{lex} and POSIX
143* Diagnostics::                 Diagnostics
144* Files::                       Files
145* Deficiencies::                Deficiencies / Bugs
146* See also::                    See also
147* Author::                      Author
148@c * Index::                       Index
149@end menu
150
151@end ifinfo
152
153@node Name, Synopsis, Top, Top
154@section Name
155
156flex - fast lexical analyzer generator
157
158@node Synopsis, Overview, Name, Top
159@section Synopsis
160
161@example
162flex [-bcdfhilnpstvwBFILTV78+? -C[aefFmr] -ooutput -Pprefix -Sskeleton]
163[--help --version] [@var{filename} @dots{}]
164@end example
165
166@node Overview, Description, Synopsis, Top
167@section Overview
168
169This manual describes @code{flex}, a tool for generating programs
170that perform pattern-matching on text.  The manual
171includes both tutorial and reference sections:
172
173@table @asis
174@item Description
175a brief overview of the tool
176
177@item Some Simple Examples
178
179@item Format Of The Input File
180
181@item Patterns
182the extended regular expressions used by flex
183
184@item How The Input Is Matched
185the rules for determining what has been matched
186
187@item Actions
188how to specify what to do when a pattern is matched
189
190@item The Generated Scanner
191details regarding the scanner that flex produces;
192how to control the input source
193
194@item Start Conditions
195introducing context into your scanners, and
196managing "mini-scanners"
197
198@item Multiple Input Buffers
199how to manipulate multiple input sources; how to
200scan from strings instead of files
201
202@item End-of-file Rules
203special rules for matching the end of the input
204
205@item Miscellaneous Macros
206a summary of macros available to the actions
207
208@item Values Available To The User
209a summary of values available to the actions
210
211@item Interfacing With Yacc
212connecting flex scanners together with yacc parsers
213
214@item Options
215flex command-line options, and the "%option"
216directive
217
218@item Performance Considerations
219how to make your scanner go as fast as possible
220
221@item Generating C++ Scanners
222the (experimental) facility for generating C++
223scanner classes
224
225@item Incompatibilities With Lex And POSIX
226how flex differs from AT&T lex and the POSIX lex
227standard
228
229@item Diagnostics
230those error messages produced by flex (or scanners
231it generates) whose meanings might not be apparent
232
233@item Files
234files used by flex
235
236@item Deficiencies / Bugs
237known problems with flex
238
239@item See Also
240other documentation, related tools
241
242@item Author
243includes contact information
244@end table
245
246@node Description, Examples, Overview, Top
247@section Description
248
@code{flex} is a tool for generating @dfn{scanners}: programs which
recognize lexical patterns in text.  @code{flex} reads the given
251input files, or its standard input if no file names are
252given, for a description of a scanner to generate.  The
253description is in the form of pairs of regular expressions
254and C code, called @dfn{rules}. @code{flex} generates as output a C
255source file, @file{lex.yy.c}, which defines a routine @samp{yylex()}.
256This file is compiled and linked with the @samp{-lfl} library to
257produce an executable.  When the executable is run, it
258analyzes its input for occurrences of the regular
259expressions.  Whenever it finds one, it executes the
260corresponding C code.
261
262@node Examples, Format, Description, Top
263@section Some simple examples
264
First some simple examples to get the flavor of how one
uses @code{flex}.  The following @code{flex} input specifies a scanner
which, whenever it encounters the string "username", replaces
it with the user's login name:
269
270@example
271%%
272username    printf( "%s", getlogin() );
273@end example
274
275By default, any text not matched by a @code{flex} scanner is
276copied to the output, so the net effect of this scanner is
277to copy its input file to its output with each occurrence
278of "username" expanded.  In this input, there is just one
279rule.  "username" is the @var{pattern} and the "printf" is the
280@var{action}.  The "%%" marks the beginning of the rules.
281
282Here's another simple example:
283
284@example
285        int num_lines = 0, num_chars = 0;
286
287%%
288\n      ++num_lines; ++num_chars;
289.       ++num_chars;
290
291%%
int main( void )
        @{
        yylex();
        printf( "# of lines = %d, # of chars = %d\n",
                num_lines, num_chars );
        return 0;
        @}
298@end example
299
300This scanner counts the number of characters and the
301number of lines in its input (it produces no output other
302than the final report on the counts).  The first line
303declares two globals, "num_lines" and "num_chars", which
304are accessible both inside @samp{yylex()} and in the @samp{main()}
305routine declared after the second "%%".  There are two rules,
306one which matches a newline ("\n") and increments both the
307line count and the character count, and one which matches
308any character other than a newline (indicated by the "."
309regular expression).
310
311A somewhat more complicated example:
312
313@example
314/* scanner for a toy Pascal-like language */
315
316%@{
/* need this for the calls to atoi() and atof() below */
#include <stdlib.h>
319%@}
320
321DIGIT    [0-9]
322ID       [a-z][a-z0-9]*
323
324%%
325
326@{DIGIT@}+    @{
327            printf( "An integer: %s (%d)\n", yytext,
328                    atoi( yytext ) );
329            @}
330
331@{DIGIT@}+"."@{DIGIT@}*        @{
332            printf( "A float: %s (%g)\n", yytext,
333                    atof( yytext ) );
334            @}
335
336if|then|begin|end|procedure|function        @{
337            printf( "A keyword: %s\n", yytext );
338            @}
339
340@{ID@}        printf( "An identifier: %s\n", yytext );
341
342"+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );
343
344"@{"[^@}\n]*"@}"     /* eat up one-line comments */
345
346[ \t\n]+          /* eat up whitespace */
347
348.           printf( "Unrecognized character: %s\n", yytext );
349
350%%
351
int main( int argc, char **argv )
    @{
    ++argv, --argc;  /* skip over program name */
    if ( argc > 0 )
            yyin = fopen( argv[0], "r" );
    else
            yyin = stdin;

    yylex();
    return 0;
    @}
364@end example
365
This is the beginning of a simple scanner for a language
like Pascal.  It identifies different types of @var{tokens} and
reports on what it has seen.
369
370The details of this example will be explained in the
371following sections.
372
373@node Format, Patterns, Examples, Top
374@section Format of the input file
375
376The @code{flex} input file consists of three sections, separated
377by a line with just @samp{%%} in it:
378
379@example
380definitions
381%%
382rules
383%%
384user code
385@end example
386
387The @dfn{definitions} section contains declarations of simple
388@dfn{name} definitions to simplify the scanner specification,
389and declarations of @dfn{start conditions}, which are explained
390in a later section.
391Name definitions have the form:
392
393@example
394name definition
395@end example
396
397The "name" is a word beginning with a letter or an
398underscore ('_') followed by zero or more letters, digits, '_',
or '-' (dash).  The definition is taken to begin at the
first non-white-space character following the name and
to continue to the end of the line.  The definition can
subsequently be referred to using "@{name@}", which will
expand to "(definition)".  For example,
404
405@example
406DIGIT    [0-9]
407ID       [a-z][a-z0-9]*
408@end example
409
410@noindent
411defines "DIGIT" to be a regular expression which matches a
412single digit, and "ID" to be a regular expression which
413matches a letter followed by zero-or-more
414letters-or-digits.  A subsequent reference to
415
416@example
417@{DIGIT@}+"."@{DIGIT@}*
418@end example
419
420@noindent
421is identical to
422
423@example
424([0-9])+"."([0-9])*
425@end example
426
427@noindent
428and matches one-or-more digits followed by a '.' followed
429by zero-or-more digits.
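
The parentheses added around an expanded definition matter when the
definition contains an alternation.  As a small sketch (the name
@code{ALT} and the rule below are illustrative assumptions, not part
of the examples above):

@example
ALT      foo|bar

%%
x@{ALT@}y    printf( "matched: %s\n", yytext );
@end example

@noindent
Because @samp{@{ALT@}} expands to @samp{(foo|bar)}, the rule should
match "xfooy" and "xbary", rather than being read as
@samp{xfoo|bary}.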
430
431The @var{rules} section of the @code{flex} input contains a series of
432rules of the form:
433
434@example
435pattern   action
436@end example
437
438@noindent
439where the pattern must be unindented and the action must
440begin on the same line.
441
442See below for a further description of patterns and
443actions.
444
445Finally, the user code section is simply copied to
446@file{lex.yy.c} verbatim.  It is used for companion routines
447which call or are called by the scanner.  The presence of
448this section is optional; if it is missing, the second @samp{%%}
449in the input file may be skipped, too.
450
451In the definitions and rules sections, any @emph{indented} text or
452text enclosed in @samp{%@{} and @samp{%@}} is copied verbatim to the
453output (with the @samp{%@{@}}'s removed).  The @samp{%@{@}}'s must
454appear unindented on lines by themselves.
455
456In the rules section, any indented or %@{@} text appearing
457before the first rule may be used to declare variables
458which are local to the scanning routine and (after the
459declarations) code which is to be executed whenever the
460scanning routine is entered.  Other indented or %@{@} text
461in the rule section is still copied to the output, but its
462meaning is not well-defined and it may well cause
463compile-time errors (this feature is present for @code{POSIX} compliance;
464see below for other such features).
465
466In the definitions section (but not in the rules section),
467an unindented comment (i.e., a line beginning with "/*")
468is also copied verbatim to the output up to the next "*/".
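
To see how these pieces fit together, here is a minimal sketch of a
complete input file.  The names @code{total_calls} and
@code{words_this_line} are hypothetical, and the sketch assumes the
scanner is linked with @samp{-lfl} as described earlier:

@example
%@{
/* copied verbatim to the output */
#include <stdio.h>
int total_calls = 0;
%@}

WORD     [A-Za-z]+

%%
        /* indented text before the first rule: a declaration
           local to yylex(), then code run on every entry */
        int words_this_line = 0;
        ++total_calls;

@{WORD@}   ++words_this_line;
\n       @{
         printf( "%d words\n", words_this_line );
         return 1;
         @}
.        /* ignore everything else */

%%
int main( void )
    @{
    while ( yylex() != 0 )
        ;
    printf( "yylex() was entered %d times\n", total_calls );
    return 0;
    @}
@end example

@noindent
Each @code{return} leaves @samp{yylex()}, so the next call re-enters
it, resets @code{words_this_line}, and increments @code{total_calls}.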
469
470@node Patterns, Matching, Format, Top
471@section Patterns
472
473The patterns in the input are written using an extended
474set of regular expressions.  These are:
475
476@table @samp
477@item x
478match the character @samp{x}
479@item .
480any character (byte) except newline
481@item [xyz]
482a "character class"; in this case, the pattern
483matches either an @samp{x}, a @samp{y}, or a @samp{z}
484@item [abj-oZ]
485a "character class" with a range in it; matches
486an @samp{a}, a @samp{b}, any letter from @samp{j} through @samp{o},
487or a @samp{Z}
488@item [^A-Z]
489a "negated character class", i.e., any character
490but those in the class.  In this case, any
491character EXCEPT an uppercase letter.
492@item [^A-Z\n]
493any character EXCEPT an uppercase letter or
494a newline
495@item @var{r}*
496zero or more @var{r}'s, where @var{r} is any regular expression
497@item @var{r}+
498one or more @var{r}'s
499@item @var{r}?
500zero or one @var{r}'s (that is, "an optional @var{r}")
501@item @var{r}@{2,5@}
502anywhere from two to five @var{r}'s
503@item @var{r}@{2,@}
504two or more @var{r}'s
505@item @var{r}@{4@}
506exactly 4 @var{r}'s
507@item @{@var{name}@}
508the expansion of the "@var{name}" definition
509(see above)
510@item "[xyz]\"foo"
511the literal string: @samp{[xyz]"foo}
512@item \@var{x}
513if @var{x} is an @samp{a}, @samp{b}, @samp{f}, @samp{n}, @samp{r}, @samp{t}, or @samp{v},
514then the ANSI-C interpretation of \@var{x}.
515Otherwise, a literal @samp{@var{x}} (used to escape
516operators such as @samp{*})
517@item \0
518a NUL character (ASCII code 0)
519@item \123
520the character with octal value 123
521@item \x2a
522the character with hexadecimal value @code{2a}
523@item (@var{r})
524match an @var{r}; parentheses are used to override
525precedence (see below)
526@item @var{r}@var{s}
527the regular expression @var{r} followed by the
528regular expression @var{s}; called "concatenation"
529@item @var{r}|@var{s}
530either an @var{r} or an @var{s}
531@item @var{r}/@var{s}
532an @var{r} but only if it is followed by an @var{s}.  The text
533matched by @var{s} is included when determining whether this rule is
534the @dfn{longest match}, but is then returned to the input before
535the action is executed.  So the action only sees the text matched
536by @var{r}.  This type of pattern is called @dfn{trailing context}.
537(There are some combinations of @samp{@var{r}/@var{s}} that @code{flex}
538cannot match correctly; see notes in the Deficiencies / Bugs section
539below regarding "dangerous trailing context".)
540@item ^@var{r}
an @var{r}, but only at the beginning of a line (i.e.,
when just starting to scan, or right after a
newline has been scanned).
544@item @var{r}$
545an @var{r}, but only at the end of a line (i.e., just
546before a newline).  Equivalent to "@var{r}/\n".
547
548Note that flex's notion of "newline" is exactly
549whatever the C compiler used to compile flex
550interprets '\n' as; in particular, on some DOS
551systems you must either filter out \r's in the
552input yourself, or explicitly use @var{r}/\r\n for "r$".
553@item <@var{s}>@var{r}
554an @var{r}, but only in start condition @var{s} (see
below for discussion of start conditions)
@item <@var{s1},@var{s2},@var{s3}>@var{r}
557same, but in any of start conditions @var{s1},
558@var{s2}, or @var{s3}
559@item <*>@var{r}
560an @var{r} in any start condition, even an exclusive one.
561@item <<EOF>>
an end-of-file
@item <@var{s1},@var{s2}><<EOF>>
564an end-of-file when in start condition @var{s1} or @var{s2}
565@end table
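
As a small illustration of trailing context (the unit suffix
@samp{px} is just an assumed example), the following rule reports a
number only when it is immediately followed by "px":

@example
%%
[0-9]+/px    printf( "pixels: %s\n", yytext );
@end example

@noindent
On the input "12px", @code{yytext} holds only "12" when the action
runs; the "px" is returned to the input and scanned again, here by
the default rule.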
566
567Note that inside of a character class, all regular
568expression operators lose their special meaning except escape
569('\') and the character class operators, '-', ']', and, at
570the beginning of the class, '^'.
571
572The regular expressions listed above are grouped according
573to precedence, from highest precedence at the top to
574lowest at the bottom.  Those grouped together have equal
575precedence.  For example,
576
577@example
578foo|bar*
579@end example
580
581@noindent
582is the same as
583
584@example
585(foo)|(ba(r*))
586@end example
587
588@noindent
589since the '*' operator has higher precedence than
590concatenation, and concatenation higher than alternation ('|').
591This pattern therefore matches @emph{either} the string "foo" @emph{or}
592the string "ba" followed by zero-or-more r's.  To match
593"foo" or zero-or-more "bar"'s, use:
594
595@example
596foo|(bar)*
597@end example
598
599@noindent
600and to match zero-or-more "foo"'s-or-"bar"'s:
601
602@example
603(foo|bar)*
604@end example
605
606In addition to characters and ranges of characters,
607character classes can also contain character class
@dfn{expressions}.  These are expressions enclosed inside @samp{[:} and @samp{:]}
609delimiters (which themselves must appear between the '['
610and ']' of the character class; other elements may occur
611inside the character class, too).  The valid expressions
612are:
613
614@example
615[:alnum:] [:alpha:] [:blank:]
616[:cntrl:] [:digit:] [:graph:]
617[:lower:] [:print:] [:punct:]
618[:space:] [:upper:] [:xdigit:]
619@end example
620
621These expressions all designate a set of characters
622equivalent to the corresponding standard C @samp{isXXX} function.  For
623example, @samp{[:alnum:]} designates those characters for which
624@samp{isalnum()} returns true - i.e., any alphabetic or numeric.
625Some systems don't provide @samp{isblank()}, so flex defines
626@samp{[:blank:]} as a blank or a tab.
627
628For example, the following character classes are all
629equivalent:
630
631@example
632[[:alnum:]]
[[:alpha:][:digit:]]
634[[:alpha:]0-9]
635[a-zA-Z0-9]
636@end example
637
638If your scanner is case-insensitive (the @samp{-i} flag), then
639@samp{[:upper:]} and @samp{[:lower:]} are equivalent to @samp{[:alpha:]}.
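
A short sketch of these expressions in use (the rules and messages
are illustrative only):

@example
%%
[[:alpha:]_][[:alnum:]_]*    printf( "identifier: %s\n", yytext );
[[:digit:]]+                 printf( "number: %s\n", yytext );
[[:blank:]]+                 /* skip blanks and tabs */
@end example

@noindent
Note that each expression appears inside an enclosing character
class, and that the first class also lists @samp{_} as an ordinary
element.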
640
641Some notes on patterns:
642
643@itemize -
644@item
645A negated character class such as the example
646"[^A-Z]" above @emph{will match a newline} unless "\n" (or an
647equivalent escape sequence) is one of the
648characters explicitly present in the negated character
649class (e.g., "[^A-Z\n]").  This is unlike how many
650other regular expression tools treat negated
651character classes, but unfortunately the inconsistency
652is historically entrenched.  Matching newlines
653means that a pattern like [^"]* can match the
654entire input unless there's another quote in the
655input.
656
657@item
658A rule can have at most one instance of trailing
659context (the '/' operator or the '$' operator).
660The start condition, '^', and "<<EOF>>" patterns
661can only occur at the beginning of a pattern, and,
662as well as with '/' and '$', cannot be grouped
663inside parentheses.  A '^' which does not occur at
664the beginning of a rule or a '$' which does not
665occur at the end of a rule loses its special
666properties and is treated as a normal character.
667
668The following are illegal:
669
670@example
671foo/bar$
672<sc1>foo<sc2>bar
673@end example
674
Note that the first of these can be written
"foo/bar\n".
677
678The following will result in '$' or '^' being
679treated as a normal character:
680
681@example
682foo|(bar$)
683foo|^bar
684@end example
685
686If what's wanted is a "foo" or a
687bar-followed-by-a-newline, the following could be used (the special
688'|' action is explained below):
689
690@example
691foo      |
692bar$     /* action goes here */
693@end example
694
695A similar trick will work for matching a foo or a
696bar-at-the-beginning-of-a-line.
697@end itemize
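
The "similar trick" for a "foo" or a bar-at-the-beginning-of-a-line
might be sketched as follows, again using the special '|' action:

@example
foo      |
^bar     /* action goes here */
@end example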
698
699@node Matching, Actions, Patterns, Top
700@section How the input is matched
701
702When the generated scanner is run, it analyzes its input
703looking for strings which match any of its patterns.  If
704it finds more than one match, it takes the one matching
705the most text (for trailing context rules, this includes
706the length of the trailing part, even though it will then
707be returned to the input).  If it finds two or more
708matches of the same length, the rule listed first in the
709@code{flex} input file is chosen.
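
For instance, consider this sketch of a two-rule scanner (the rules
are chosen only to illustrate the point):

@example
%%
begin      printf( "keyword" );
[a-z]+     printf( "identifier" );
@end example

@noindent
On the input "beginning", only the second rule can match all nine
characters, so it wins.  On the input "begin", both rules match five
characters, and the first rule is chosen because it is listed first.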
710
711Once the match is determined, the text corresponding to
712the match (called the @var{token}) is made available in the
713global character pointer @code{yytext}, and its length in the
714global integer @code{yyleng}.  The @var{action} corresponding to the
715matched pattern is then executed (a more detailed
716description of actions follows), and then the remaining input is
717scanned for another match.
718
719If no match is found, then the @dfn{default rule} is executed:
720the next character in the input is considered matched and
721copied to the standard output.  Thus, the simplest legal
722@code{flex} input is:
723
724@example
725%%
726@end example
727
728which generates a scanner that simply copies its input
729(one character at a time) to its output.
730
731Note that @code{yytext} can be defined in two different ways:
732either as a character @emph{pointer} or as a character @emph{array}.
733You can control which definition @code{flex} uses by including
734one of the special directives @samp{%pointer} or @samp{%array} in the
735first (definitions) section of your flex input.  The
736default is @samp{%pointer}, unless you use the @samp{-l} lex
737compatibility option, in which case @code{yytext} will be an array.  The
738advantage of using @samp{%pointer} is substantially faster
739scanning and no buffer overflow when matching very large
740tokens (unless you run out of dynamic memory).  The
741disadvantage is that you are restricted in how your actions can
742modify @code{yytext} (see the next section), and calls to the
@samp{unput()} function destroy the present contents of @code{yytext},
744which can be a considerable porting headache when moving
745between different @code{lex} versions.
746
747The advantage of @samp{%array} is that you can then modify @code{yytext}
748to your heart's content, and calls to @samp{unput()} do not
749destroy @code{yytext} (see below).  Furthermore, existing @code{lex}
750programs sometimes access @code{yytext} externally using
751declarations of the form:
752@example
753extern char yytext[];
754@end example
755This definition is erroneous when used with @samp{%pointer}, but
756correct for @samp{%array}.
757
758@samp{%array} defines @code{yytext} to be an array of @code{YYLMAX} characters,
759which defaults to a fairly large value.  You can change
760the size by simply #define'ing @code{YYLMAX} to a different value
761in the first section of your @code{flex} input.  As mentioned
above, with @samp{%pointer} @code{yytext} grows dynamically to
763accommodate large tokens.  While this means your @samp{%pointer} scanner
764can accommodate very large tokens (such as matching entire
765blocks of comments), bear in mind that each time the
766scanner must resize @code{yytext} it also must rescan the entire
767token from the beginning, so matching such tokens can
768prove slow.  @code{yytext} presently does @emph{not} dynamically grow if
769a call to @samp{unput()} results in too much text being pushed
770back; instead, a run-time error results.
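
A minimal sketch of @samp{%array} in use, assuming a limit of 1024
characters is adequate (the value is arbitrary, and the rule is
purely illustrative):

@example
%@{
#define YYLMAX 1024
%@}
%array

%%
[a-z]+    @{
          /* lengthening yytext is safe only with %array,
             and only while it still fits in YYLMAX */
          if ( yyleng + 1 < YYLMAX )
              @{
              yytext[yyleng] = '!';
              yytext[yyleng + 1] = '\0';
              @}
          printf( "%s", yytext );
          @}
@end example

@noindent
The action prints with @samp{printf()} rather than @samp{ECHO}
because it leaves @code{yyleng} unchanged after lengthening the text.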
771
772Also note that you cannot use @samp{%array} with C++ scanner
773classes (the @code{c++} option; see below).
774
775@node Actions, Generated scanner, Matching, Top
776@section Actions
777
778Each pattern in a rule has a corresponding action, which
779can be any arbitrary C statement.  The pattern ends at the
780first non-escaped whitespace character; the remainder of
781the line is its action.  If the action is empty, then when
782the pattern is matched the input token is simply
783discarded.  For example, here is the specification for a
784program which deletes all occurrences of "zap me" from its
785input:
786
787@example
788%%
789"zap me"
790@end example
791
792(It will copy all other characters in the input to the
793output since they will be matched by the default rule.)
794
795Here is a program which compresses multiple blanks and
796tabs down to a single blank, and throws away whitespace
797found at the end of a line:
798
799@example
800%%
801[ \t]+        putchar( ' ' );
802[ \t]+$       /* ignore this token */
803@end example
804
805If the action contains a '@{', then the action spans till
806the balancing '@}' is found, and the action may cross
807multiple lines.  @code{flex} knows about C strings and comments and
808won't be fooled by braces found within them, but also
809allows actions to begin with @samp{%@{} and will consider the
810action to be all the text up to the next @samp{%@}} (regardless of
811ordinary braces inside the action).
812
813An action consisting solely of a vertical bar ('|') means
814"same as the action for the next rule." See below for an
815illustration.
816
817Actions can include arbitrary C code, including @code{return}
818statements to return a value to whatever routine called
819@samp{yylex()}.  Each time @samp{yylex()} is called it continues
820processing tokens from where it last left off until it either
821reaches the end of the file or executes a return.
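
For example, a scanner that hands token codes back to its caller
might be sketched like this (the codes @code{NUMBER} and @code{WORD}
are hypothetical):

@example
%@{
#define NUMBER 257
#define WORD   258
%@}

%%
[0-9]+      return NUMBER;
[a-z]+      return WORD;
[ \t\n]+    /* skip whitespace; no return, so keep scanning */
.           return (unsigned char) yytext[0];

%%
int main( void )
    @{
    int tok;
    while ( (tok = yylex()) != 0 )
        printf( "%d: %s\n", tok, yytext );
    return 0;
    @}
@end example

@noindent
Here @samp{yylex()} returns 0 at end of file, which ends the loop.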
822
823Actions are free to modify @code{yytext} except for lengthening
824it (adding characters to its end--these will overwrite
825later characters in the input stream).  This however does
826not apply when using @samp{%array} (see above); in that case,
827@code{yytext} may be freely modified in any way.
828
829Actions are free to modify @code{yyleng} except they should not
830do so if the action also includes use of @samp{yymore()} (see
831below).
832
833There are a number of special directives which can be
834included within an action:
835
836@itemize -
837@item
@samp{ECHO} copies @code{yytext} to the scanner's output.
839
840@item
841@code{BEGIN} followed by the name of a start condition
842places the scanner in the corresponding start
843condition (see below).
844
845@item
846@code{REJECT} directs the scanner to proceed on to the
847"second best" rule which matched the input (or a
848prefix of the input).  The rule is chosen as
849described above in "How the Input is Matched", and
850@code{yytext} and @code{yyleng} set up appropriately.  It may
851either be one which matched as much text as the
852originally chosen rule but came later in the @code{flex}
853input file, or one which matched less text.  For
854example, the following will both count the words in
855the input and call the routine special() whenever
856"frob" is seen:
857
858@example
859        int word_count = 0;
860%%
861
862frob        special(); REJECT;
863[^ \t\n]+   ++word_count;
864@end example
865
866Without the @code{REJECT}, any "frob"'s in the input would
867not be counted as words, since the scanner normally
868executes only one action per token.  Multiple
@code{REJECT}s are allowed, each one finding the next
870best choice to the currently active rule.  For
871example, when the following scanner scans the token
872"abcd", it will write "abcdabcaba" to the output:
873
874@example
875%%
876a        |
877ab       |
878abc      |
879abcd     ECHO; REJECT;
880.|\n     /* eat up any unmatched character */
881@end example
882
883(The first three rules share the fourth's action
884since they use the special '|' action.)  @code{REJECT} is
885a particularly expensive feature in terms of
886scanner performance; if it is used in @emph{any} of the
887scanner's actions it will slow down @emph{all} of the
888scanner's matching.  Furthermore, @code{REJECT} cannot be used
889with the @samp{-Cf} or @samp{-CF} options (see below).
890
891Note also that unlike the other special actions,
892@code{REJECT} is a @emph{branch}; code immediately following it
893in the action will @emph{not} be executed.
894
895@item
896@samp{yymore()} tells the scanner that the next time it
897matches a rule, the corresponding token should be
898@emph{appended} onto the current value of @code{yytext} rather
899than replacing it.  For example, given the input
900"mega-kludge" the following will write
901"mega-mega-kludge" to the output:
902
903@example
904%%
905mega-    ECHO; yymore();
906kludge   ECHO;
907@end example
908
909First "mega-" is matched and echoed to the output.
910Then "kludge" is matched, but the previous "mega-"
911is still hanging around at the beginning of @code{yytext}
912so the @samp{ECHO} for the "kludge" rule will actually
913write "mega-kludge".
914@end itemize
915
916Two notes regarding use of @samp{yymore()}.  First, @samp{yymore()}
917depends on the value of @code{yyleng} correctly reflecting the
918size of the current token, so you must not modify @code{yyleng}
919if you are using @samp{yymore()}.  Second, the presence of
920@samp{yymore()} in the scanner's action entails a minor
921performance penalty in the scanner's matching speed.
922
923@itemize -
924@item
925@samp{yyless(n)} returns all but the first @var{n} characters of
926the current token back to the input stream, where
927they will be rescanned when the scanner looks for
928the next match.  @code{yytext} and @code{yyleng} are adjusted
appropriately (e.g., @code{yyleng} will now be equal to @var{n}).
For example, on the input "foobar" the
931following will write out "foobarbar":
932
933@example
934%%
935foobar    ECHO; yyless(3);
936[a-z]+    ECHO;
937@end example
938
939An argument of 0 to @code{yyless} will cause the entire
940current input string to be scanned again.  Unless
941you've changed how the scanner will subsequently
942process its input (using @code{BEGIN}, for example), this
943will result in an endless loop.
944
945Note that @code{yyless} is a macro and can only be used in the
946flex input file, not from other source files.
947
948@item
949@samp{unput(c)} puts the character @code{c} back onto the input
950stream.  It will be the next character scanned.
951The following action will take the current token
952and cause it to be rescanned enclosed in
953parentheses.
954
955@example
956@{
957int i;
958/* Copy yytext because unput() trashes yytext */
959char *yycopy = strdup( yytext );
960unput( ')' );
961for ( i = yyleng - 1; i >= 0; --i )
962    unput( yycopy[i] );
963unput( '(' );
964free( yycopy );
965@}
966@end example
967
968Note that since each @samp{unput()} puts the given
969character back at the @emph{beginning} of the input stream,
pushing back strings must be done back-to-front.

An important potential problem when using @samp{unput()} is that
if you are using @samp{%pointer} (the default), a call to @samp{unput()}
@emph{destroys} the contents of @code{yytext}, starting with its
rightmost character and devouring one character to the left
with each call.  If you need the value of @code{yytext} preserved
after a call to @samp{unput()} (as in the above example), you
must either first copy it elsewhere, or build your scanner
using @samp{%array} instead (see How The Input Is Matched).
979
980Finally, note that you cannot put back @code{EOF} to attempt to
981mark the input stream with an end-of-file.
982
983@item
984@samp{input()} reads the next character from the input
985stream.  For example, the following is one way to
986eat up C comments:
987
988@example
989%%
990"/*"        @{
991            register int c;
992
993            for ( ; ; )
994                @{
995                while ( (c = input()) != '*' &&
996                        c != EOF )
997                    ;    /* eat up text of comment */
998
999                if ( c == '*' )
1000                    @{
1001                    while ( (c = input()) == '*' )
1002                        ;
1003                    if ( c == '/' )
1004                        break;    /* found the end */
1005                    @}
1006
1007                if ( c == EOF )
1008                    @{
1009                    error( "EOF in comment" );
1010                    break;
1011                    @}
1012                @}
1013            @}
1014@end example
1015
1016(Note that if the scanner is compiled using @samp{C++},
1017then @samp{input()} is instead referred to as @samp{yyinput()},
1018in order to avoid a name clash with the @samp{C++} stream
1019by the name of @code{input}.)
1020
@item
@code{YY_FLUSH_BUFFER} flushes the scanner's internal buffer so that the
next time the scanner attempts to match a token, it will first refill
the buffer using
1024@code{YY_INPUT} (see The Generated Scanner, below).  This action is
1025a special case of the more general @samp{yy_flush_buffer()} function,
1026described below in the section Multiple Input Buffers.
1027
1028@item
1029@samp{yyterminate()} can be used in lieu of a return
1030statement in an action.  It terminates the scanner
1031and returns a 0 to the scanner's caller, indicating
1032"all done".  By default, @samp{yyterminate()} is also
1033called when an end-of-file is encountered.  It is a
1034macro and may be redefined.
1035@end itemize
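
The back-to-front rule for @samp{unput()} follows from the pushed-back
characters behaving like a stack.  The sketch below is a stand-alone
model of that behaviour, not the scanner's real buffer handling;
@code{struct pushback}, @code{pb_unput}, @code{pb_input} and
@code{pb_unput_parenthesized} are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PUSHBACK_MAX 64         /* hypothetical capacity for this sketch */

struct pushback {
    char   data[PUSHBACK_MAX];
    size_t count;
};

/* Model of unput(c): c becomes the next character to be read. */
void pb_unput(struct pushback *pb, char c)
{
    assert(pb != NULL && pb->count < PUSHBACK_MAX);
    pb->data[pb->count++] = c;
}

/* Model of input() limited to pushed-back characters; -1 when empty. */
int pb_input(struct pushback *pb)
{
    assert(pb != NULL);
    return pb->count ? (unsigned char) pb->data[--pb->count] : -1;
}

/* Push token back wrapped in parentheses.  The closing parenthesis goes
 * first and the opening one last, because the last character pushed is
 * the first one read. */
void pb_unput_parenthesized(struct pushback *pb, const char *token)
{
    size_t i;

    assert(token != NULL);
    i = strlen(token);
    pb_unput(pb, ')');
    while (i > 0)
        pb_unput(pb, token[--i]);
    pb_unput(pb, '(');
}
```

Reading the model back with @code{pb_input()} after pushing "abc" yields
the characters of "(abc)" in order.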
1036
1037@node Generated scanner, Start conditions, Actions, Top
1038@section The generated scanner
1039
1040The output of @code{flex} is the file @file{lex.yy.c}, which contains
1041the scanning routine @samp{yylex()}, a number of tables used by
1042it for matching tokens, and a number of auxiliary routines
1043and macros.  By default, @samp{yylex()} is declared as follows:
1044
1045@example
1046int yylex()
1047    @{
1048    @dots{} various definitions and the actions in here @dots{}
1049    @}
1050@end example
1051
(If your environment supports function prototypes, then it
will be @samp{int yylex( void )}.)  This definition may be
changed by defining the @code{YY_DECL} macro.  For example, you
could use:
1056
1057@example
1058#define YY_DECL float lexscan( a, b ) float a, b;
1059@end example
1060
1061to give the scanning routine the name @code{lexscan}, returning a
1062float, and taking two floats as arguments.  Note that if
1063you give arguments to the scanning routine using a
1064K&R-style/non-prototyped function declaration, you must
1065terminate the definition with a semi-colon (@samp{;}).
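
For comparison, here is a prototype-style form of the same declaration,
which should need no trailing semi-colon because the macro is followed
directly by the function body.  This is a sketch that compiles on its
own: in a real scanner @code{flex} supplies the body after
@code{YY_DECL}, so the one-line body below is only a stand-in.

```c
/* Prototype-style counterpart of the K&R-style definition. */
#define YY_DECL float lexscan(float a, float b)

/* Stand-in body: it only shows that YY_DECL expands to the function
 * header of a routine named lexscan taking two floats. */
YY_DECL
{
    return a + b;
}
```

The caller then invokes @code{lexscan( x, y )} wherever it would
otherwise have called @samp{yylex()}.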
1066
1067Whenever @samp{yylex()} is called, it scans tokens from the
1068global input file @code{yyin} (which defaults to stdin).  It
1069continues until it either reaches an end-of-file (at which
1070point it returns the value 0) or one of its actions
1071executes a @code{return} statement.
1072
1073If the scanner reaches an end-of-file, subsequent calls are undefined
1074unless either @code{yyin} is pointed at a new input file (in which case
1075scanning continues from that file), or @samp{yyrestart()} is called.
1076@samp{yyrestart()} takes one argument, a @samp{FILE *} pointer (which
1077can be nil, if you've set up @code{YY_INPUT} to scan from a source
1078other than @code{yyin}), and initializes @code{yyin} for scanning from
1079that file.  Essentially there is no difference between just assigning
1080@code{yyin} to a new input file or using @samp{yyrestart()} to do so;
1081the latter is available for compatibility with previous versions of
1082@code{flex}, and because it can be used to switch input files in the
1083middle of scanning.  It can also be used to throw away the current
1084input buffer, by calling it with an argument of @code{yyin}; but
1085better is to use @code{YY_FLUSH_BUFFER} (see above).  Note that
1086@samp{yyrestart()} does @emph{not} reset the start condition to
1087@code{INITIAL} (see Start Conditions, below).
1088
1089
1090If @samp{yylex()} stops scanning due to executing a @code{return}
1091statement in one of the actions, the scanner may then be called
1092again and it will resume scanning where it left off.
1093
1094By default (and for purposes of efficiency), the scanner
1095uses block-reads rather than simple @samp{getc()} calls to read
1096characters from @code{yyin}.  The nature of how it gets its input
1097can be controlled by defining the @code{YY_INPUT} macro.
The calling sequence of @code{YY_INPUT} is
@samp{YY_INPUT(buf,result,max_size)}.  Its action is to place
up to @var{max_size} characters in the character array @var{buf} and
return in the integer variable @var{result} either the number of
characters read or the constant @code{YY_NULL} (0 on Unix
systems) to indicate EOF.  The default @code{YY_INPUT} reads from
the global file-pointer @code{yyin}.
1105
1106A sample definition of YY_INPUT (in the definitions
1107section of the input file):
1108
1109@example
1110%@{
1111#define YY_INPUT(buf,result,max_size) \
1112    @{ \
1113    int c = getchar(); \
1114    result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
1115    @}
1116%@}
1117@end example
1118
1119This definition will change the input processing to occur
1120one character at a time.
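
A @code{YY_INPUT} definition can also deliver several characters per call.
The following sketch reads from an in-memory string instead of a file and
can be compiled outside a scanner.  The names @code{set_source},
@code{src_ptr}, @code{src_left} and @code{refill} are hypothetical, and
@code{YY_NULL} is defined locally only because no generated scanner is
present to define it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifndef YY_NULL
#define YY_NULL 0   /* normally provided by the generated scanner */
#endif

/* Hypothetical in-memory input source; not part of flex. */
static const char *src_ptr = "";
static size_t src_left = 0;

void set_source(const char *text)
{
    assert(text != NULL);
    src_ptr = text;
    src_left = strlen(text);
}

/* Copy up to max_size pending bytes into buf.  result receives the number
 * of bytes copied, or YY_NULL once the source is exhausted. */
#define YY_INPUT(buf, result, max_size) \
    { \
    size_t n_ = src_left < (size_t) (max_size) ? src_left \
                                               : (size_t) (max_size); \
    memcpy( (buf), src_ptr, n_ ); \
    src_ptr += n_; \
    src_left -= n_; \
    (result) = n_ ? (int) n_ : YY_NULL; \
    }

/* Stand-in for the scanner's refill step, so the macro can be exercised
 * without a generated scanner. */
int refill(char *buf, int max_size)
{
    int result;

    assert(buf != NULL && max_size > 0);
    YY_INPUT(buf, result, max_size);
    return result;
}
```

With a four-byte buffer and the source "hello", successive calls to
@code{refill()} report 4, then 1, then @code{YY_NULL}.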
1121
1122When the scanner receives an end-of-file indication from
1123YY_INPUT, it then checks the @samp{yywrap()} function.  If
1124@samp{yywrap()} returns false (zero), then it is assumed that the
1125function has gone ahead and set up @code{yyin} to point to
1126another input file, and scanning continues.  If it returns
1127true (non-zero), then the scanner terminates, returning 0
1128to its caller.  Note that in either case, the start
1129condition remains unchanged; it does @emph{not} revert to @code{INITIAL}.
1130
1131If you do not supply your own version of @samp{yywrap()}, then you
1132must either use @samp{%option noyywrap} (in which case the scanner
1133behaves as though @samp{yywrap()} returned 1), or you must link with
1134@samp{-lfl} to obtain the default version of the routine, which always
1135returns 1.
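
As an illustration of supplying your own version, the sketch below walks
through a list of file names.  The list and the helper
@code{set_file_list} are hypothetical, and @code{yyin} is defined here
only so the fragment compiles on its own; in a real scanner it comes from
the generated code.

```c
#include <assert.h>
#include <stdio.h>

/* Defined by the generated scanner in practice; defined here so that the
 * sketch stands alone. */
FILE *yyin;

/* Hypothetical NULL-terminated list of input files still to be read. */
static const char *const *next_file;

void set_file_list(const char *const *names)
{
    assert(names != NULL);
    next_file = names;
}

/* Return 0 after pointing yyin at the next readable file, or 1 when no
 * files remain, in which case the scanner terminates. */
int yywrap(void)
{
    while (next_file != NULL && *next_file != NULL) {
        FILE *f = fopen(*next_file++, "r");

        if (f != NULL) {
            if (yyin != NULL && yyin != stdin)
                fclose(yyin);       /* finished with the previous file */
            yyin = f;
            return 0;
        }
        /* Unreadable file: skip it and try the next name. */
    }
    return 1;
}
```

Skipping unreadable files silently is a design choice made for brevity; a
real program would probably report them.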
1136
1137Three routines are available for scanning from in-memory
1138buffers rather than files: @samp{yy_scan_string()},
1139@samp{yy_scan_bytes()}, and @samp{yy_scan_buffer()}.  See the discussion
1140of them below in the section Multiple Input Buffers.
1141
1142The scanner writes its @samp{ECHO} output to the @code{yyout} global
1143(default, stdout), which may be redefined by the user
1144simply by assigning it to some other @code{FILE} pointer.
1145
1146@node Start conditions, Multiple buffers, Generated scanner, Top
1147@section Start conditions
1148
1149@code{flex} provides a mechanism for conditionally activating
1150rules.  Any rule whose pattern is prefixed with "<sc>"
1151will only be active when the scanner is in the start
1152condition named "sc".  For example,
1153
1154@example
1155<STRING>[^"]*        @{ /* eat up the string body ... */
1156            @dots{}
1157            @}
1158@end example
1159
1160@noindent
1161will be active only when the scanner is in the "STRING"
1162start condition, and
1163
1164@example
1165<INITIAL,STRING,QUOTE>\.        @{ /* handle an escape ... */
1166            @dots{}
1167            @}
1168@end example
1169
1170@noindent
1171will be active only when the current start condition is
1172either "INITIAL", "STRING", or "QUOTE".
1173
1174Start conditions are declared in the definitions (first)
1175section of the input using unindented lines beginning with
1176either @samp{%s} or @samp{%x} followed by a list of names.  The former
1177declares @emph{inclusive} start conditions, the latter @emph{exclusive}
1178start conditions.  A start condition is activated using
1179the @code{BEGIN} action.  Until the next @code{BEGIN} action is
1180executed, rules with the given start condition will be active
1181and rules with other start conditions will be inactive.
1182If the start condition is @emph{inclusive}, then rules with no
1183start conditions at all will also be active.  If it is
1184@emph{exclusive}, then @emph{only} rules qualified with the start
condition will be active.  A set of rules contingent on the
same exclusive start condition describes a scanner which is
1187independent of any of the other rules in the @code{flex} input.
1188Because of this, exclusive start conditions make it easy
1189to specify "mini-scanners" which scan portions of the
1190input that are syntactically different from the rest
1191(e.g., comments).
1192
1193If the distinction between inclusive and exclusive start
1194conditions is still a little vague, here's a simple
1195example illustrating the connection between the two.  The set
1196of rules:
1197
1198@example
1199%s example
1200%%
1201
1202<example>foo   do_something();
1203
1204bar            something_else();
1205@end example
1206
1207@noindent
1208is equivalent to
1209
1210@example
1211%x example
1212%%
1213
1214<example>foo   do_something();
1215
1216<INITIAL,example>bar    something_else();
1217@end example
1218
1219Without the @samp{<INITIAL,example>} qualifier, the @samp{bar} pattern
1220in the second example wouldn't be active (i.e., couldn't match) when
1221in start condition @samp{example}.  If we just used @samp{<example>}
1222to qualify @samp{bar}, though, then it would only be active in
1223@samp{example} and not in @code{INITIAL}, while in the first example
1224it's active in both, because in the first example the @samp{example}
1225starting condition is an @emph{inclusive} (@samp{%s}) start condition.
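
The rule for when a pattern is active can be written down as a small
predicate.  This is a model of the behaviour described above, not code
taken from @code{flex}; the numbering of the start conditions, the
bit-mask representation and the name @code{rule_is_active} are all
assumptions made for the sketch.

```c
#include <assert.h>

/* Hypothetical numbering of the start conditions used in the examples. */
enum { INITIAL = 0, EXAMPLE = 1, NUM_SC = 2 };

/* One bit per start condition named in a rule's <...> prefix; a mask of
 * zero stands for a rule with no prefix at all. */
typedef unsigned sc_mask;
#define SC_BIT(sc) (1u << (sc))

/* exclusive[sc] is non-zero if sc was declared with %x rather than %s.
 * INITIAL is always treated as inclusive. */
int rule_is_active(sc_mask prefix, int current, const int exclusive[NUM_SC])
{
    assert(current >= 0 && current < NUM_SC);

    if (prefix != 0)                        /* qualified rule */
        return (prefix & SC_BIT(current)) != 0;

    return !exclusive[current];             /* unqualified rule */
}
```

With @code{example} declared inclusive, an unqualified @samp{bar} is
active in both conditions.  With it declared exclusive, only a mask naming
both @code{INITIAL} and @code{example} gives the same result.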
1226
1227Also note that the special start-condition specifier @samp{<*>}
1228matches every start condition.  Thus, the above example
could also have been written:
1230
1231@example
1232%x example
1233%%
1234
1235<example>foo   do_something();
1236
1237<*>bar    something_else();
1238@end example
1239
1240The default rule (to @samp{ECHO} any unmatched character) remains
1241active in start conditions.  It is equivalent to:
1242
1243@example
<*>.|\n     ECHO;
1245@end example
1246
1247@samp{BEGIN(0)} returns to the original state where only the
1248rules with no start conditions are active.  This state can
1249also be referred to as the start-condition "INITIAL", so
1250@samp{BEGIN(INITIAL)} is equivalent to @samp{BEGIN(0)}.  (The
1251parentheses around the start condition name are not required but
1252are considered good style.)
1253
1254@code{BEGIN} actions can also be given as indented code at the
1255beginning of the rules section.  For example, the
1256following will cause the scanner to enter the "SPECIAL" start
1257condition whenever @samp{yylex()} is called and the global
1258variable @code{enter_special} is true:
1259
1260@example
1261        int enter_special;
1262
1263%x SPECIAL
1264%%
1265        if ( enter_special )
1266            BEGIN(SPECIAL);
1267
1268<SPECIAL>blahblahblah
1269@dots{}more rules follow@dots{}
1270@end example
1271
1272To illustrate the uses of start conditions, here is a
1273scanner which provides two different interpretations of a
string like "123.456".  By default it will treat it as
1275three tokens, the integer "123", a dot ('.'), and the
1276integer "456".  But if the string is preceded earlier in
1277the line by the string "expect-floats" it will treat it as
1278a single token, the floating-point number 123.456:
1279
1280@example
1281%@{
1282#include <math.h>
1283%@}
1284%s expect
1285
1286%%
1287expect-floats        BEGIN(expect);
1288
1289<expect>[0-9]+"."[0-9]+      @{
1290            printf( "found a float, = %f\n",
1291                    atof( yytext ) );
1292            @}
1293<expect>\n           @{
1294            /* that's the end of the line, so
             * we need another "expect-floats"
1296             * before we'll recognize any more
1297             * numbers
1298             */
1299            BEGIN(INITIAL);
1300            @}
1301
1302[0-9]+      @{
1303
1306            printf( "found an integer, = %d\n",
1307                    atoi( yytext ) );
1308            @}
1309
1310"."         printf( "found a dot\n" );
1311@end example
1312
1313Here is a scanner which recognizes (and discards) C
1314comments while maintaining a count of the current input line.
1315
1316@example
1317%x comment
1318%%
1319        int line_num = 1;
1320
1321"/*"         BEGIN(comment);
1322
1323<comment>[^*\n]*        /* eat anything that's not a '*' */
1324<comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
1325<comment>\n             ++line_num;
1326<comment>"*"+"/"        BEGIN(INITIAL);
1327@end example
1328
1329This scanner goes to a bit of trouble to match as much
1330text as possible with each rule.  In general, when
attempting to write a high-speed scanner, try to match as
much as possible in each rule, as it's a big win.
1333
Note that start condition names are really integer values
1335and can be stored as such.  Thus, the above could be
1336extended in the following fashion:
1337
1338@example
1339%x comment foo
1340%%
1341        int line_num = 1;
1342        int comment_caller;
1343
1344"/*"         @{
1345             comment_caller = INITIAL;
1346             BEGIN(comment);
1347             @}
1348
1349@dots{}
1350
1351<foo>"/*"    @{
1352             comment_caller = foo;
1353             BEGIN(comment);
1354             @}
1355
1356<comment>[^*\n]*        /* eat anything that's not a '*' */
1357<comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
1358<comment>\n             ++line_num;
1359<comment>"*"+"/"        BEGIN(comment_caller);
1360@end example
1361
1362Furthermore, you can access the current start condition
1363using the integer-valued @code{YY_START} macro.  For example, the
1364above assignments to @code{comment_caller} could instead be
1365written
1366
1367@example
1368comment_caller = YY_START;
1369@end example
1370
1371Flex provides @code{YYSTATE} as an alias for @code{YY_START} (since that
1372is what's used by AT&T @code{lex}).
1373
1374Note that start conditions do not have their own
1375name-space; %s's and %x's declare names in the same fashion as
1376#define's.
1377
1378Finally, here's an example of how to match C-style quoted
1379strings using exclusive start conditions, including
1380expanded escape sequences (but not including checking for
1381a string that's too long):
1382
1383@example
1384%x str
1385
1386%%
1387        char string_buf[MAX_STR_CONST];
1388        char *string_buf_ptr;
1389
1390\"      string_buf_ptr = string_buf; BEGIN(str);
1391
1392<str>\"        @{ /* saw closing quote - all done */
1393        BEGIN(INITIAL);
1394        *string_buf_ptr = '\0';
1395        /* return string constant token type and
1396         * value to parser
1397         */
1398        @}
1399
1400<str>\n        @{
1401        /* error - unterminated string constant */
1402        /* generate error message */
1403        @}
1404
1405<str>\\[0-7]@{1,3@} @{
1406        /* octal escape sequence */
        unsigned int result;   /* "%o" stores an unsigned int */

        (void) sscanf( yytext + 1, "%o", &result );

        if ( result > 0xff )
                @{
                /* error, constant is out-of-bounds */
                @}

        *string_buf_ptr++ = result;
1415        @}
1416
1417<str>\\[0-9]+ @{
1418        /* generate error - bad escape sequence; something
1419         * like '\48' or '\0777777'
1420         */
1421        @}
1422
1423<str>\\n  *string_buf_ptr++ = '\n';
1424<str>\\t  *string_buf_ptr++ = '\t';
1425<str>\\r  *string_buf_ptr++ = '\r';
1426<str>\\b  *string_buf_ptr++ = '\b';
1427<str>\\f  *string_buf_ptr++ = '\f';
1428
1429<str>\\(.|\n)  *string_buf_ptr++ = yytext[1];
1430
1431<str>[^\\\n\"]+        @{
1432        char *yptr = yytext;
1433
1434        while ( *yptr )
1435                *string_buf_ptr++ = *yptr++;
1436        @}
1437@end example
1438
1439Often, such as in some of the examples above, you wind up
1440writing a whole bunch of rules all preceded by the same
1441start condition(s).  Flex makes this a little easier and
1442cleaner by introducing a notion of start condition @dfn{scope}.
1443A start condition scope is begun with:
1444
1445@example
1446<SCs>@{
1447@end example
1448
1449@noindent
1450where SCs is a list of one or more start conditions.
1451Inside the start condition scope, every rule automatically
1452has the prefix @samp{<SCs>} applied to it, until a @samp{@}} which
1453matches the initial @samp{@{}.  So, for example,
1454
1455@example
1456<ESC>@{
1457    "\\n"   return '\n';
1458    "\\r"   return '\r';
1459    "\\f"   return '\f';
1460    "\\0"   return '\0';
1461@}
1462@end example
1463
1464@noindent
1465is equivalent to:
1466
1467@example
1468<ESC>"\\n"  return '\n';
1469<ESC>"\\r"  return '\r';
1470<ESC>"\\f"  return '\f';
1471<ESC>"\\0"  return '\0';
1472@end example
1473
1474Start condition scopes may be nested.
1475
1476Three routines are available for manipulating stacks of
1477start conditions:
1478
1479@table @samp
1480@item void yy_push_state(int new_state)
1481pushes the current start condition onto the top of
1482the start condition stack and switches to @var{new_state}
1483as though you had used @samp{BEGIN new_state} (recall that
1484start condition names are also integers).
1485
1486@item void yy_pop_state()
1487pops the top of the stack and switches to it via
1488@code{BEGIN}.
1489
1490@item int yy_top_state()
1491returns the top of the stack without altering the
1492stack's contents.
1493@end table
1494
1495The start condition stack grows dynamically and so has no
1496built-in size limitation.  If memory is exhausted, program
1497execution aborts.
1498
1499To use start condition stacks, your scanner must include a
1500@samp{%option stack} directive (see Options below).
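
The behaviour of the three routines can be modelled in a few lines of C.
This is a sketch of the semantics described above, not the code that
@code{flex} generates: the @code{sc_} names are hypothetical, and
@code{sc_current} merely stands in for the scanner's current start
condition.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical model state.  sc_current plays the role of YY_START;
 * 0 corresponds to INITIAL. */
static int    sc_current = 0;
static int   *sc_stack   = NULL;
static size_t sc_depth   = 0;
static size_t sc_cap     = 0;

int sc_get_state(void)
{
    return sc_current;
}

/* Model of yy_push_state(): save the current condition, then switch. */
void sc_push_state(int new_state)
{
    if (sc_depth == sc_cap) {
        size_t cap = sc_cap ? sc_cap * 2 : 8;   /* grow on demand */
        int *p = realloc(sc_stack, cap * sizeof *p);

        if (p == NULL) {
            fputs("out of memory expanding start-condition stack\n",
                  stderr);
            abort();
        }
        sc_stack = p;
        sc_cap = cap;
    }
    sc_stack[sc_depth++] = sc_current;
    sc_current = new_state;                     /* the BEGIN step */
}

/* Model of yy_pop_state(): switch back to the saved condition. */
void sc_pop_state(void)
{
    assert(sc_depth > 0);
    sc_current = sc_stack[--sc_depth];
}

/* Model of yy_top_state(): inspect the saved condition without popping. */
int sc_top_state(void)
{
    assert(sc_depth > 0);
    return sc_stack[sc_depth - 1];
}
```

Note that the top of the stack is the condition that was active
@emph{before} the most recent push, not the condition currently in force.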
1501
1502@node Multiple buffers, End-of-file rules, Start conditions, Top
1503@section Multiple input buffers
1504
1505Some scanners (such as those which support "include"
1506files) require reading from several input streams.  As
1507@code{flex} scanners do a large amount of buffering, one cannot
1508control where the next input will be read from by simply
1509writing a @code{YY_INPUT} which is sensitive to the scanning
1510context.  @code{YY_INPUT} is only called when the scanner reaches
1511the end of its buffer, which may be a long time after
1512scanning a statement such as an "include" which requires
1513switching the input source.
1514
1515To negotiate these sorts of problems, @code{flex} provides a
1516mechanism for creating and switching between multiple
1517input buffers.  An input buffer is created by using:
1518
1519@example
1520YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
1521@end example
1522
1523@noindent
1524which takes a @code{FILE} pointer and a size and creates a buffer
1525associated with the given file and large enough to hold
1526@var{size} characters (when in doubt, use @code{YY_BUF_SIZE} for the
1527size).  It returns a @code{YY_BUFFER_STATE} handle, which may
1528then be passed to other routines (see below).  The
1529@code{YY_BUFFER_STATE} type is a pointer to an opaque @code{struct}
1530@code{yy_buffer_state} structure, so you may safely initialize
1531YY_BUFFER_STATE variables to @samp{((YY_BUFFER_STATE) 0)} if you
1532wish, and also refer to the opaque structure in order to
1533correctly declare input buffers in source files other than
1534that of your scanner.  Note that the @code{FILE} pointer in the
1535call to @code{yy_create_buffer} is only used as the value of @code{yyin}
1536seen by @code{YY_INPUT}; if you redefine @code{YY_INPUT} so it no longer
1537uses @code{yyin}, then you can safely pass a nil @code{FILE} pointer to
1538@code{yy_create_buffer}.  You select a particular buffer to scan
1539from using:
1540
1541@example
1542void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
1543@end example
1544
This routine switches the scanner's input buffer so subsequent tokens
1546will come from @var{new_buffer}.  Note that
1547@samp{yy_switch_to_buffer()} may be used by @samp{yywrap()} to set
1548things up for continued scanning, instead of opening a new
1549file and pointing @code{yyin} at it.  Note also that switching
1550input sources via either @samp{yy_switch_to_buffer()} or @samp{yywrap()}
1551does @emph{not} change the start condition.
1552
1553@example
1554void yy_delete_buffer( YY_BUFFER_STATE buffer )
1555@end example
1556
1557@noindent
1558is used to reclaim the storage associated with a buffer.
1559You can also clear the current contents of a buffer using:
1560
1561@example
1562void yy_flush_buffer( YY_BUFFER_STATE buffer )
1563@end example
1564
1565This function discards the buffer's contents, so the next time the
1566scanner attempts to match a token from the buffer, it will first fill
1567the buffer anew using @code{YY_INPUT}.
1568
1569@samp{yy_new_buffer()} is an alias for @samp{yy_create_buffer()},
1570provided for compatibility with the C++ use of @code{new} and @code{delete}
1571for creating and destroying dynamic objects.
1572
1573Finally, the @code{YY_CURRENT_BUFFER} macro returns a
1574@code{YY_BUFFER_STATE} handle to the current buffer.
1575
1576Here is an example of using these features for writing a
1577scanner which expands include files (the @samp{<<EOF>>} feature
1578is discussed below):
1579
1580@example
1581/* the "incl" state is used for picking up the name
1582 * of an include file
1583 */
1584%x incl
1585
1586%@{
1587#define MAX_INCLUDE_DEPTH 10
1588YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1589int include_stack_ptr = 0;
1590%@}
1591
1592%%
1593include             BEGIN(incl);
1594
1595[a-z]+              ECHO;
1596[^a-z\n]*\n?        ECHO;
1597
1598<incl>[ \t]*      /* eat the whitespace */
1599<incl>[^ \t\n]+   @{ /* got the include file name */
1600        if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
1601            @{
1602            fprintf( stderr, "Includes nested too deeply" );
1603            exit( 1 );
1604            @}
1605
1606        include_stack[include_stack_ptr++] =
1607            YY_CURRENT_BUFFER;
1608
1609        yyin = fopen( yytext, "r" );
1610
1611        if ( ! yyin )
1612            error( @dots{} );
1613
1614        yy_switch_to_buffer(
1615            yy_create_buffer( yyin, YY_BUF_SIZE ) );
1616
1617        BEGIN(INITIAL);
1618        @}
1619
1620<<EOF>> @{
1621        if ( --include_stack_ptr < 0 )
1622            @{
1623            yyterminate();
1624            @}
1625
1626        else
1627            @{
1628            yy_delete_buffer( YY_CURRENT_BUFFER );
1629            yy_switch_to_buffer(
1630                 include_stack[include_stack_ptr] );
1631            @}
1632        @}
1633@end example
1634
1635Three routines are available for setting up input buffers
1636for scanning in-memory strings instead of files.  All of
1637them create a new input buffer for scanning the string,
1638and return a corresponding @code{YY_BUFFER_STATE} handle (which
1639you should delete with @samp{yy_delete_buffer()} when done with
1640it).  They also switch to the new buffer using
1641@samp{yy_switch_to_buffer()}, so the next call to @samp{yylex()} will
1642start scanning the string.
1643
1644@table @samp
1645@item yy_scan_string(const char *str)
1646scans a NUL-terminated string.
1647
1648@item yy_scan_bytes(const char *bytes, int len)
1649scans @code{len} bytes (including possibly NUL's) starting
1650at location @var{bytes}.
1651@end table
1652
1653Note that both of these functions create and scan a @emph{copy}
1654of the string or bytes.  (This may be desirable, since
1655@samp{yylex()} modifies the contents of the buffer it is
1656scanning.) You can avoid the copy by using:
1657
1658@table @samp
1659@item yy_scan_buffer(char *base, yy_size_t size)
1660which scans in place the buffer starting at @var{base},
1661consisting of @var{size} bytes, the last two bytes of
1662which @emph{must} be @code{YY_END_OF_BUFFER_CHAR} (ASCII NUL).
1663These last two bytes are not scanned; thus,
scanning consists of @samp{base[0]} through @samp{base[size-3]},
1665inclusive.
1666
1667If you fail to set up @var{base} in this manner (i.e.,
1668forget the final two @code{YY_END_OF_BUFFER_CHAR} bytes),
1669then @samp{yy_scan_buffer()} returns a nil pointer instead
1670of creating a new input buffer.
1671
1672The type @code{yy_size_t} is an integral type to which you
1673can cast an integer expression reflecting the size
1674of the buffer.
1675@end table
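
Preparing a buffer in the layout that @samp{yy_scan_buffer()} expects is
mostly a matter of reserving the two trailing bytes.  The helper below is
a sketch under that reading of the description above;
@code{make_scan_buffer} is a hypothetical name, and the literal NUL bytes
stand for @code{YY_END_OF_BUFFER_CHAR}, which is only available inside a
generated scanner.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a heap copy of text followed by two NUL bytes, storing the total
 * size (text length plus 2) in *size_out.  Returns NULL if allocation
 * fails.  The caller owns the result and must free() it. */
char *make_scan_buffer(const char *text, size_t *size_out)
{
    size_t len;
    char *base;

    assert(text != NULL && size_out != NULL);
    len = strlen(text);
    base = malloc(len + 2);
    if (base == NULL)
        return NULL;

    memcpy(base, text, len);
    base[len]     = '\0';   /* first end-of-buffer byte  */
    base[len + 1] = '\0';   /* second end-of-buffer byte */
    *size_out = len + 2;
    return base;
}
```

Inside a scanner, the returned pointer and size would then be passed to
@samp{yy_scan_buffer()}, and the memory released only after the
corresponding buffer has been deleted.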
1676
1677@node End-of-file rules, Miscellaneous, Multiple buffers, Top
1678@section End-of-file rules
1679
The special rule @samp{<<EOF>>} indicates actions which are to
be taken when an end-of-file is encountered and @samp{yywrap()}
1682returns non-zero (i.e., indicates no further files to
1683process).  The action must finish by doing one of four
1684things:
1685
1686@itemize -
1687@item
1688assigning @code{yyin} to a new input file (in previous
1689versions of flex, after doing the assignment you
1690had to call the special action @code{YY_NEW_FILE}; this is
1691no longer necessary);
1692
1693@item
1694executing a @code{return} statement;
1695
1696@item
1697executing the special @samp{yyterminate()} action;
1698
1699@item
1700or, switching to a new buffer using
1701@samp{yy_switch_to_buffer()} as shown in the example
1702above.
1703@end itemize
1704
1705<<EOF>> rules may not be used with other patterns; they
1706may only be qualified with a list of start conditions.  If
1707an unqualified <<EOF>> rule is given, it applies to @emph{all}
1708start conditions which do not already have <<EOF>>
1709actions.  To specify an <<EOF>> rule for only the initial
1710start condition, use
1711
1712@example
1713<INITIAL><<EOF>>
1714@end example
1715
1716These rules are useful for catching things like unclosed
1717comments.  An example:
1718
1719@example
1720%x quote
1721%%
1722
1723@dots{}other rules for dealing with quotes@dots{}
1724
1725<quote><<EOF>>   @{
1726         error( "unterminated quote" );
1727         yyterminate();
1728         @}
1729<<EOF>>  @{
1730         if ( *++filelist )
1731             yyin = fopen( *filelist, "r" );
1732         else
1733            yyterminate();
1734         @}
1735@end example
1736
1737@node Miscellaneous, User variables, End-of-file rules, Top
1738@section Miscellaneous macros
1739
1740The macro @code{YY_USER_ACTION} can be defined to provide an
1741action which is always executed prior to the matched
1742rule's action.  For example, it could be #define'd to call
1743a routine to convert yytext to lower-case.  When
1744@code{YY_USER_ACTION} is invoked, the variable @code{yy_act} gives the
1745number of the matched rule (rules are numbered starting
1746with 1).  Suppose you want to profile how often each of
1747your rules is matched.  The following would do the trick:
1748
1749@example
1750#define YY_USER_ACTION ++ctr[yy_act]
1751@end example
1752
1753where @code{ctr} is an array to hold the counts for the different
1754rules.  Note that the macro @code{YY_NUM_RULES} gives the total number
of rules (including the default rule, even if you use @samp{-s}), so
1756a correct declaration for @code{ctr} is:
1757
1758@example
1759int ctr[YY_NUM_RULES];
1760@end example
1761
1762The macro @code{YY_USER_INIT} may be defined to provide an action
1763which is always executed before the first scan (and before
1764the scanner's internal initializations are done).  For
1765example, it could be used to call a routine to read in a
1766data table or open a logging file.
1767
1768The macro @samp{yy_set_interactive(is_interactive)} can be used
1769to control whether the current buffer is considered
1770@emph{interactive}.  An interactive buffer is processed more slowly,
1771but must be used when the scanner's input source is indeed
1772interactive to avoid problems due to waiting to fill
1773buffers (see the discussion of the @samp{-I} flag below).  A
1774non-zero value in the macro invocation marks the buffer as
1775interactive, a zero value as non-interactive.  Note that
1776use of this macro overrides @samp{%option always-interactive} or
1777@samp{%option never-interactive} (see Options below).
1778@samp{yy_set_interactive()} must be invoked prior to beginning to
1779scan the buffer that is (or is not) to be considered
1780interactive.
1781
1782The macro @samp{yy_set_bol(at_bol)} can be used to control
1783whether the current buffer's scanning context for the next
1784token match is done as though at the beginning of a line.
A non-zero macro argument makes rules anchored with
'^' active, while a zero argument makes '^' rules inactive.
1786
1787The macro @samp{YY_AT_BOL()} returns true if the next token
1788scanned from the current buffer will have '^' rules
1789active, false otherwise.
1790
1791In the generated scanner, the actions are all gathered in
1792one large switch statement and separated using @code{YY_BREAK},
1793which may be redefined.  By default, it is simply a
1794"break", to separate each rule's action from the following
1795rule's.  Redefining @code{YY_BREAK} allows, for example, C++
1796users to #define YY_BREAK to do nothing (while being very
1797careful that every rule ends with a "break" or a
"return"!) to avoid unreachable-statement warnings: when a
rule's action ends with "return", the @code{YY_BREAK} that
follows it can never be reached.
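
The effect is easiest to see in a reduced version of the switch.  The
function below is a hypothetical two-rule dispatch written for
illustration; it assumes the default definition of @code{YY_BREAK} is a
plain @code{break;} and does not reproduce the generated scanner.

```c
#include <assert.h>

/* Assumed default-style definition.  A scanner whose actions all end in
 * "return" could define it as empty instead. */
#define YY_BREAK break;

/* Hypothetical dispatch over two rule actions. */
int run_action(int yy_act)
{
    int echoed = 0;

    assert(yy_act > 0);
    switch (yy_act) {
    case 1:
        echoed = 1;     /* action runs off its end: YY_BREAK is needed */
        YY_BREAK
    case 2:
        return 42;      /* action returns: the YY_BREAK below is dead code */
        YY_BREAK
    }
    return echoed;
}
```

Some compilers warn about the statement after @code{return 42;}, which is
the warning the empty redefinition is meant to silence.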
1801
1802@node User variables, YACC interface, Miscellaneous, Top
1803@section Values available to the user
1804
1805This section summarizes the various values available to
1806the user in the rule actions.
1807
1808@itemize -
1809@item
1810@samp{char *yytext} holds the text of the current token.
1811It may be modified but not lengthened (you cannot
1812append characters to the end).
1813
1814If the special directive @samp{%array} appears in the
1815first section of the scanner description, then
1816@code{yytext} is instead declared @samp{char yytext[YYLMAX]},
1817where @code{YYLMAX} is a macro definition that you can
1818redefine in the first section if you don't like the
1819default value (generally 8KB).  Using @samp{%array}
1820results in somewhat slower scanners, but the value
1821of @code{yytext} becomes immune to calls to @samp{input()} and
1822@samp{unput()}, which potentially destroy its value when
1823@code{yytext} is a character pointer.  The opposite of
1824@samp{%array} is @samp{%pointer}, which is the default.
1825
1826You cannot use @samp{%array} when generating C++ scanner
1827classes (the @samp{-+} flag).
1828
1829@item
1830@samp{int yyleng} holds the length of the current token.
1831
1832@item
1833@samp{FILE *yyin} is the file which by default @code{flex} reads
1834from.  It may be redefined but doing so only makes
1835sense before scanning begins or after an EOF has
1836been encountered.  Changing it in the midst of
1837scanning will have unexpected results since @code{flex}
1838buffers its input; use @samp{yyrestart()} instead.  Once
1839scanning terminates because an end-of-file has been
seen, you can point @code{yyin} at the new input file and
1841then call the scanner again to continue scanning.
1842
1843@item
1844@samp{void yyrestart( FILE *new_file )} may be called to
1845point @code{yyin} at the new input file.  The switch-over
1846to the new file is immediate (any previously
1847buffered-up input is lost).  Note that calling
1848@samp{yyrestart()} with @code{yyin} as an argument thus throws
1849away the current input buffer and continues
1850scanning the same input file.
1851
1852@item
1853@samp{FILE *yyout} is the file to which @samp{ECHO} actions are
1854done.  It can be reassigned by the user.
1855
1856@item
1857@code{YY_CURRENT_BUFFER} returns a @code{YY_BUFFER_STATE} handle
1858to the current buffer.
1859
1860@item
1861@code{YY_START} returns an integer value corresponding to
1862the current start condition.  You can subsequently
1863use this value with @code{BEGIN} to return to that start
1864condition.
1865@end itemize
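
Several of these values can be seen working together in the
following sketch, which scans each file named on the command line in
turn.  The rules, the output format, and the error handling are
assumptions made for illustration; @samp{%option noyywrap} is used so
that @samp{yylex()} simply returns at each end-of-file.

@example
%option noyywrap
%%
[a-z]+    printf( "%d: %s\n", yyleng, yytext );
.|\n      ;
%%
int main( int argc, char **argv )
@{
    int i;

    for ( i = 1; i < argc; ++i )
        @{
        FILE *fp = fopen( argv[i], "r" );

        if ( ! fp )
            continue;        /* skip files that cannot be opened */

        yyrestart( fp );     /* point yyin at the new file */
        yylex();             /* returns at end-of-file */
        fclose( fp );
        @}

    return 0;
@}
@end example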
1866
1867@node YACC interface, Options, User variables, Top
1868@section Interfacing with @code{yacc}
1869
1870One of the main uses of @code{flex} is as a companion to the @code{yacc}
1871parser-generator.  @code{yacc} parsers expect to call a routine
1872named @samp{yylex()} to find the next input token.  The routine
1873is supposed to return the type of the next token as well
1874as putting any associated value in the global @code{yylval}.  To
1875use @code{flex} with @code{yacc}, one specifies the @samp{-d} option to @code{yacc} to
1876instruct it to generate the file @file{y.tab.h} containing
1877definitions of all the @samp{%tokens} appearing in the @code{yacc} input.
1878This file is then included in the @code{flex} scanner.  For
1879example, if one of the tokens is "TOK_NUMBER", part of the
1880scanner might look like:
1881
1882@example
1883%@{
1884#include "y.tab.h"
1885%@}
1886
1887%%
1888
1889[0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;
1890@end example
1891
1892@node Options, Performance, YACC interface, Top
1893@section Options
1894@code{flex} has the following options:
1895
1896@table @samp
1897@item -b
1898Generate backing-up information to @file{lex.backup}.
1899This is a list of scanner states which require
1900backing up and the input characters on which they
1901do so.  By adding rules one can remove backing-up
1902states.  If @emph{all} backing-up states are eliminated
1903and @samp{-Cf} or @samp{-CF} is used, the generated scanner will
1904run faster (see the @samp{-p} flag).  Only users who wish
1905to squeeze every last cycle out of their scanners
1906need worry about this option.  (See the section on
1907Performance Considerations below.)
1908
1909@item -c
1910is a do-nothing, deprecated option included for
1911POSIX compliance.
1912
1913@item -d
1914makes the generated scanner run in @dfn{debug} mode.
1915Whenever a pattern is recognized and the global
1916@code{yy_flex_debug} is non-zero (which is the default),
1917the scanner will write to @code{stderr} a line of the
1918form:
1919
1920@example
1921--accepting rule at line 53 ("the matched text")
1922@end example
1923
1924The line number refers to the location of the rule
1925in the file defining the scanner (i.e., the file
1926that was fed to flex).  Messages are also generated
1927when the scanner backs up, accepts the default
1928rule, reaches the end of its input buffer (or
1929encounters a NUL; at this point, the two look the
1930same as far as the scanner's concerned), or reaches
1931an end-of-file.
1932
1933@item -f
1934specifies @dfn{fast scanner}.  No table compression is
1935done and stdio is bypassed.  The result is large
1936but fast.  This option is equivalent to @samp{-Cfr} (see
1937below).
1938
1939@item -h
generates a "help" summary of @code{flex}'s options to
1941@code{stdout} and then exits.  @samp{-?} and @samp{--help} are synonyms
1942for @samp{-h}.
1943
1944@item -i
1945instructs @code{flex} to generate a @emph{case-insensitive}
1946scanner.  The case of letters given in the @code{flex} input
1947patterns will be ignored, and tokens in the input
1948will be matched regardless of case.  The matched
1949text given in @code{yytext} will have the preserved case
1950(i.e., it will not be folded).
1951
1952@item -l
1953turns on maximum compatibility with the original
1954AT&T @code{lex} implementation.  Note that this does not
1955mean @emph{full} compatibility.  Use of this option costs
1956a considerable amount of performance, and it cannot
be used with the @samp{-+}, @samp{-f}, @samp{-F}, @samp{-Cf}, or @samp{-CF} options.
1958For details on the compatibilities it provides, see
1959the section "Incompatibilities With Lex And POSIX"
1960below.  This option also results in the name
1961@code{YY_FLEX_LEX_COMPAT} being #define'd in the generated
1962scanner.
1963
1964@item -n
1965is another do-nothing, deprecated option included
1966only for POSIX compliance.
1967
1968@item -p
1969generates a performance report to stderr.  The
1970report consists of comments regarding features of
1971the @code{flex} input file which will cause a serious loss
1972of performance in the resulting scanner.  If you
1973give the flag twice, you will also get comments
1974regarding features that lead to minor performance
1975losses.
1976
1977Note that the use of @code{REJECT}, @samp{%option yylineno} and
1978variable trailing context (see the Deficiencies / Bugs section below)
1979entails a substantial performance penalty; use of @samp{yymore()},
1980the @samp{^} operator, and the @samp{-I} flag entail minor performance
1981penalties.
1982
1983@item -s
1984causes the @dfn{default rule} (that unmatched scanner
1985input is echoed to @code{stdout}) to be suppressed.  If
1986the scanner encounters input that does not match
1987any of its rules, it aborts with an error.  This
1988option is useful for finding holes in a scanner's
1989rule set.
1990
1991@item -t
1992instructs @code{flex} to write the scanner it generates to
1993standard output instead of @file{lex.yy.c}.
1994
1995@item -v
1996specifies that @code{flex} should write to @code{stderr} a
1997summary of statistics regarding the scanner it
1998generates.  Most of the statistics are meaningless to
1999the casual @code{flex} user, but the first line identifies
2000the version of @code{flex} (same as reported by @samp{-V}), and
2001the next line the flags used when generating the
2002scanner, including those that are on by default.
2003
2004@item -w
2005suppresses warning messages.
2006
2007@item -B
2008instructs @code{flex} to generate a @emph{batch} scanner, the
2009opposite of @emph{interactive} scanners generated by @samp{-I}
2010(see below).  In general, you use @samp{-B} when you are
2011@emph{certain} that your scanner will never be used
2012interactively, and you want to squeeze a @emph{little} more
2013performance out of it.  If your goal is instead to
2014squeeze out a @emph{lot} more performance, you should be
2015using the @samp{-Cf} or @samp{-CF} options (discussed below),
2016which turn on @samp{-B} automatically anyway.
2017
2018@item -F
2019specifies that the @dfn{fast} scanner table
2020representation should be used (and stdio bypassed).  This
2021representation is about as fast as the full table
representation (@samp{-f}), and for some sets of patterns
2023will be considerably smaller (and for others,
2024larger).  In general, if the pattern set contains
2025both "keywords" and a catch-all, "identifier" rule,
2026such as in the set:
2027
2028@example
2029"case"    return TOK_CASE;
2030"switch"  return TOK_SWITCH;
2031...
2032"default" return TOK_DEFAULT;
2033[a-z]+    return TOK_ID;
2034@end example
2035
2036@noindent
2037then you're better off using the full table
2038representation.  If only the "identifier" rule is
2039present and you then use a hash table or some such to
2040detect the keywords, you're better off using @samp{-F}.
2041
2042This option is equivalent to @samp{-CFr} (see below).  It
2043cannot be used with @samp{-+}.
2044
2045@item -I
2046instructs @code{flex} to generate an @emph{interactive} scanner.
2047An interactive scanner is one that only looks ahead
2048to decide what token has been matched if it
2049absolutely must.  It turns out that always looking one
2050extra character ahead, even if the scanner has
2051already seen enough text to disambiguate the
2052current token, is a bit faster than only looking ahead
2053when necessary.  But scanners that always look
2054ahead give dreadful interactive performance; for
2055example, when a user types a newline, it is not
2056recognized as a newline token until they enter
2057@emph{another} token, which often means typing in another
2058whole line.
2059
2060@code{Flex} scanners default to @emph{interactive} unless you use
2061the @samp{-Cf} or @samp{-CF} table-compression options (see
2062below).  That's because if you're looking for
2063high-performance you should be using one of these
2064options, so if you didn't, @code{flex} assumes you'd
2065rather trade off a bit of run-time performance for
2066intuitive interactive behavior.  Note also that you
2067@emph{cannot} use @samp{-I} in conjunction with @samp{-Cf} or @samp{-CF}.
2068Thus, this option is not really needed; it is on by
2069default for all those cases in which it is allowed.
2070
2071You can force a scanner to @emph{not} be interactive by
2072using @samp{-B} (see above).
2073
2074@item -L
2075instructs @code{flex} not to generate @samp{#line} directives.
2076Without this option, @code{flex} peppers the generated
2077scanner with #line directives so error messages in
2078the actions will be correctly located with respect
2079to either the original @code{flex} input file (if the
2080errors are due to code in the input file), or
@file{lex.yy.c} (if the errors are @code{flex}'s fault -- you
2082should report these sorts of errors to the email
2083address given below).
2084
2085@item -T
2086makes @code{flex} run in @code{trace} mode.  It will generate a
2087lot of messages to @code{stderr} concerning the form of
2088the input and the resultant non-deterministic and
2089deterministic finite automata.  This option is
2090mostly for use in maintaining @code{flex}.
2091
2092@item -V
2093prints the version number to @code{stdout} and exits.
2094@samp{--version} is a synonym for @samp{-V}.
2095
2096@item -7
2097instructs @code{flex} to generate a 7-bit scanner, i.e.,
one which can only recognize 7-bit characters in
2099its input.  The advantage of using @samp{-7} is that the
2100scanner's tables can be up to half the size of
2101those generated using the @samp{-8} option (see below).
2102The disadvantage is that such scanners often hang
2103or crash if their input contains an 8-bit
2104character.
2105
2106Note, however, that unless you generate your
2107scanner using the @samp{-Cf} or @samp{-CF} table compression options,
2108use of @samp{-7} will save only a small amount of table
2109space, and make your scanner considerably less
portable.  @code{Flex}'s default behavior is to generate
an 8-bit scanner unless you use @samp{-Cf} or @samp{-CF}, in
2112which case @code{flex} defaults to generating 7-bit
2113scanners unless your site was always configured to
2114generate 8-bit scanners (as will often be the case
2115with non-USA sites).  You can tell whether flex
2116generated a 7-bit or an 8-bit scanner by inspecting
2117the flag summary in the @samp{-v} output as described
2118above.
2119
2120Note that if you use @samp{-Cfe} or @samp{-CFe} (those table
2121compression options, but also using equivalence
classes as discussed below), flex still
2123defaults to generating an 8-bit scanner, since
2124usually with these compression options full 8-bit
2125tables are not much more expensive than 7-bit
2126tables.
2127
2128@item -8
2129instructs @code{flex} to generate an 8-bit scanner, i.e.,
2130one which can recognize 8-bit characters.  This
2131flag is only needed for scanners generated using
2132@samp{-Cf} or @samp{-CF}, as otherwise flex defaults to
2133generating an 8-bit scanner anyway.
2134
2135See the discussion of @samp{-7} above for flex's default
2136behavior and the tradeoffs between 7-bit and 8-bit
2137scanners.
2138
2139@item -+
2140specifies that you want flex to generate a C++
2141scanner class.  See the section on Generating C++
2142Scanners below for details.
2143
2144@item -C[aefFmr]
2145controls the degree of table compression and, more
2146generally, trade-offs between small scanners and
2147fast scanners.
2148
2149@samp{-Ca} ("align") instructs flex to trade off larger
2150tables in the generated scanner for faster
2151performance because the elements of the tables are better
2152aligned for memory access and computation.  On some
2153RISC architectures, fetching and manipulating
2154long-words is more efficient than with smaller-sized
2155units such as shortwords.  This option can double
2156the size of the tables used by your scanner.
2157
2158@samp{-Ce} directs @code{flex} to construct @dfn{equivalence classes},
2159i.e., sets of characters which have identical
2160lexical properties (for example, if the only appearance
2161of digits in the @code{flex} input is in the character
2162class "[0-9]" then the digits '0', '1', @dots{}, '9'
2163will all be put in the same equivalence class).
2164Equivalence classes usually give dramatic
2165reductions in the final table/object file sizes
2166(typically a factor of 2-5) and are pretty cheap
2167performance-wise (one array look-up per character
2168scanned).
2169
2170@samp{-Cf} specifies that the @emph{full} scanner tables should
2171be generated - @code{flex} should not compress the tables
by taking advantage of similar transition
2173functions for different states.
2174
2175@samp{-CF} specifies that the alternate fast scanner
2176representation (described above under the @samp{-F} flag)
2177should be used.  This option cannot be used with
2178@samp{-+}.
2179
2180@samp{-Cm} directs @code{flex} to construct @dfn{meta-equivalence
2181classes}, which are sets of equivalence classes (or
2182characters, if equivalence classes are not being
2183used) that are commonly used together.
2184Meta-equivalence classes are often a big win when using
2185compressed tables, but they have a moderate
2186performance impact (one or two "if" tests and one array
2187look-up per character scanned).
2188
2189@samp{-Cr} causes the generated scanner to @emph{bypass} use of
2190the standard I/O library (stdio) for input.
2191Instead of calling @samp{fread()} or @samp{getc()}, the scanner
2192will use the @samp{read()} system call, resulting in a
2193performance gain which varies from system to
2194system, but in general is probably negligible unless
2195you are also using @samp{-Cf} or @samp{-CF}.  Using @samp{-Cr} can cause
2196strange behavior if, for example, you read from
2197@code{yyin} using stdio prior to calling the scanner
2198(because the scanner will miss whatever text your
2199previous reads left in the stdio input buffer).
2200
2201@samp{-Cr} has no effect if you define @code{YY_INPUT} (see The
2202Generated Scanner above).
2203
2204A lone @samp{-C} specifies that the scanner tables should
2205be compressed but neither equivalence classes nor
2206meta-equivalence classes should be used.
2207
2208The options @samp{-Cf} or @samp{-CF} and @samp{-Cm} do not make sense
2209together - there is no opportunity for
2210meta-equivalence classes if the table is not being
2211compressed.  Otherwise the options may be freely
2212mixed, and are cumulative.
2213
2214The default setting is @samp{-Cem}, which specifies that
2215@code{flex} should generate equivalence classes and
2216meta-equivalence classes.  This setting provides the
2217highest degree of table compression.  You can trade
2218off faster-executing scanners at the cost of larger
2219tables with the following generally being true:
2220
2221@example
2222slowest & smallest
2223      -Cem
2224      -Cm
2225      -Ce
2226      -C
2227      -C@{f,F@}e
2228      -C@{f,F@}
2229      -C@{f,F@}a
2230fastest & largest
2231@end example
2232
2233Note that scanners with the smallest tables are
2234usually generated and compiled the quickest, so
2235during development you will usually want to use the
2236default, maximal compression.
2237
2238@samp{-Cfe} is often a good compromise between speed and
2239size for production scanners.
2240
2241@item -ooutput
directs flex to write the scanner to the file @file{output}
instead of @file{lex.yy.c}.  If you combine @samp{-o} with
2244the @samp{-t} option, then the scanner is written to
2245@code{stdout} but its @samp{#line} directives (see the @samp{-L} option
2246above) refer to the file @code{output}.
2247
2248@item -Pprefix
2249changes the default @samp{yy} prefix used by @code{flex} for all
2250globally-visible variable and function names to
2251instead be @var{prefix}.  For example, @samp{-Pfoo} changes the
name of @code{yytext} to @code{footext}.  It also changes the
2253name of the default output file from @file{lex.yy.c} to
2254@file{lex.foo.c}.  Here are all of the names affected:
2255
2256@example
2257yy_create_buffer
2258yy_delete_buffer
2259yy_flex_debug
2260yy_init_buffer
2261yy_flush_buffer
2262yy_load_buffer_state
2263yy_switch_to_buffer
2264yyin
2265yyleng
2266yylex
2267yylineno
2268yyout
2269yyrestart
2270yytext
2271yywrap
2272@end example
2273
2274(If you are using a C++ scanner, then only @code{yywrap}
2275and @code{yyFlexLexer} are affected.) Within your scanner
2276itself, you can still refer to the global variables
2277and functions using either version of their name;
2278but externally, they have the modified name.
2279
2280This option lets you easily link together multiple
2281@code{flex} programs into the same executable.  Note,
2282though, that using this option also renames
2283@samp{yywrap()}, so you now @emph{must} either provide your own
2284(appropriately-named) version of the routine for
2285your scanner, or use @samp{%option noyywrap}, as linking
2286with @samp{-lfl} no longer provides one for you by
2287default.
2288
2289@item -Sskeleton_file
2290overrides the default skeleton file from which @code{flex}
2291constructs its scanners.  You'll never need this
2292option unless you are doing @code{flex} maintenance or
2293development.
2294@end table
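
As a sketch of what the @samp{-Pfoo} case described above implies for
the scanner source, the specification below supplies its own wrap
routine under the new name.  The rules are placeholders; a separate
source file would then call @samp{foolex()} rather than
@samp{yylex()}.

@example
%%
[0-9]+    printf( "number: %s\n", yytext );
.|\n      ;
%%
/* Built with "flex -Pfoo": yywrap() is renamed too, so the
   scanner defines foowrap() itself. */
int foowrap()
@{
    return 1;    /* no further input files */
@}
@end example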
2295
2296@code{flex} also provides a mechanism for controlling options
2297within the scanner specification itself, rather than from
2298the flex command-line.  This is done by including @samp{%option}
2299directives in the first section of the scanner
2300specification.  You can specify multiple options with a single
2301@samp{%option} directive, and multiple directives in the first
2302section of your flex input file.  Most options are given
2303simply as names, optionally preceded by the word "no"
2304(with no intervening whitespace) to negate their meaning.
2305A number are equivalent to flex flags or their negation:
2306
2307@example
23087bit            -7 option
23098bit            -8 option
2310align           -Ca option
2311backup          -b option
2312batch           -B option
2313c++             -+ option
2314
2315caseful or
2316case-sensitive  opposite of -i (default)
2317
2318case-insensitive or
2319caseless        -i option
2320
2321debug           -d option
2322default         opposite of -s option
2323ecs             -Ce option
2324fast            -F option
2325full            -f option
2326interactive     -I option
2327lex-compat      -l option
2328meta-ecs        -Cm option
2329perf-report     -p option
2330read            -Cr option
2331stdout          -t option
2332verbose         -v option
2333warn            opposite of -w option
2334                (use "%option nowarn" for -w)
2335
2336array           equivalent to "%array"
2337pointer         equivalent to "%pointer" (default)
2338@end example
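
For example, the first section of a scanner might contain directives
such as the following, which use only names from the list above and
should correspond to running @code{flex} with the @samp{-8}, @samp{-i},
@samp{-s}, and @samp{-w} flags:

@example
%option 8bit caseless
%option nodefault nowarn
@end example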
2339
Some @samp{%option}s provide features otherwise not available:
2341
2342@table @samp
2343@item always-interactive
2344instructs flex to generate a scanner which always
2345considers its input "interactive".  Normally, on
2346each new input file the scanner calls @samp{isatty()} in
2347an attempt to determine whether the scanner's input
2348source is interactive and thus should be read a
2349character at a time.  When this option is used,
2350however, then no such call is made.
2351
2352@item main
2353directs flex to provide a default @samp{main()} program
2354for the scanner, which simply calls @samp{yylex()}.  This
2355option implies @code{noyywrap} (see below).
2356
2357@item never-interactive
2358instructs flex to generate a scanner which never
2359considers its input "interactive" (again, no call
made to @samp{isatty()}).  This is the opposite of
@samp{always-interactive}.
2362
2363@item stack
2364enables the use of start condition stacks (see
2365Start Conditions above).
2366
2367@item stdinit
2368if unset (i.e., @samp{%option nostdinit}) initializes @code{yyin}
2369and @code{yyout} to nil @code{FILE} pointers, instead of @code{stdin}
2370and @code{stdout}.
2371
2372@item yylineno
2373directs @code{flex} to generate a scanner that maintains the number
2374of the current line read from its input in the global variable
2375@code{yylineno}.  This option is implied by @samp{%option lex-compat}.
2376
2377@item yywrap
2378if unset (i.e., @samp{%option noyywrap}), makes the
2379scanner not call @samp{yywrap()} upon an end-of-file, but
2380simply assume that there are no more files to scan
2381(until the user points @code{yyin} at a new file and calls
2382@samp{yylex()} again).
2383@end table
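
A complete program can therefore be as short as the following sketch,
which relies on @samp{%option main} for both @samp{main()} and the
implied @code{noyywrap}.  The substitution rule is a hypothetical
example; input it does not match is copied to the output by the
default rule.

@example
%option main
%%
"colour"    printf( "color" );
@end example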
2384
2385@code{flex} scans your rule actions to determine whether you use
2386the @code{REJECT} or @samp{yymore()} features.  The @code{reject} and @code{yymore}
options are available to override its decision as to
whether you use these features, either by setting them (e.g.,
2389@samp{%option reject}) to indicate the feature is indeed used, or
2390unsetting them to indicate it actually is not used (e.g.,
2391@samp{%option noyymore}).
2392
2393Three options take string-delimited values, offset with '=':
2394
2395@example
2396%option outfile="ABC"
2397@end example
2398
2399@noindent
2400is equivalent to @samp{-oABC}, and
2401
2402@example
2403%option prefix="XYZ"
2404@end example
2405
2406@noindent
2407is equivalent to @samp{-PXYZ}.
2408
2409Finally,
2410
2411@example
2412%option yyclass="foo"
2413@end example
2414
2415@noindent
2416only applies when generating a C++ scanner (@samp{-+} option).  It
2417informs @code{flex} that you have derived @samp{foo} as a subclass of
2418@code{yyFlexLexer} so @code{flex} will place your actions in the member
2419function @samp{foo::yylex()} instead of @samp{yyFlexLexer::yylex()}.
2420It also generates a @samp{yyFlexLexer::yylex()} member function that
2421emits a run-time error (by invoking @samp{yyFlexLexer::LexerError()})
2422if called.  See Generating C++ Scanners, below, for additional
2423information.
2424
2425A number of options are available for lint purists who
2426want to suppress the appearance of unneeded routines in
2427the generated scanner.  Each of the following, if unset,
2428results in the corresponding routine not appearing in the
2429generated scanner:
2430
2431@example
2432input, unput
2433yy_push_state, yy_pop_state, yy_top_state
2434yy_scan_buffer, yy_scan_bytes, yy_scan_string
2435@end example
2436
2437@noindent
2438(though @samp{yy_push_state()} and friends won't appear anyway
2439unless you use @samp{%option stack}).
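
For example, a scanner whose actions never call @samp{input()} or
@samp{unput()} could presumably unset both with a directive like
this one in its first section:

@example
%option noinput nounput
@end example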
2440
2441@node Performance, C++, Options, Top
2442@section Performance considerations
2443
2444The main design goal of @code{flex} is that it generate
2445high-performance scanners.  It has been optimized for dealing
2446well with large sets of rules.  Aside from the effects on
2447scanner speed of the table compression @samp{-C} options outlined
2448above, there are a number of options/actions which degrade
2449performance.  These are, from most expensive to least:
2450
2451@example
2452REJECT
2453%option yylineno
2454arbitrary trailing context
2455
2456pattern sets that require backing up
2457%array
2458%option interactive
2459%option always-interactive
2460
2461'^' beginning-of-line operator
2462yymore()
2463@end example
2464
@noindent
with the first three all being quite expensive and the
2466last two being quite cheap.  Note also that @samp{unput()} is
2467implemented as a routine call that potentially does quite
2468a bit of work, while @samp{yyless()} is a quite-cheap macro; so
2469if just putting back some excess text you scanned, use
2470@samp{yyless()}.
2471
2472@code{REJECT} should be avoided at all costs when performance is
2473important.  It is a particularly expensive option.
2474
2475Getting rid of backing up is messy and often may be an
2476enormous amount of work for a complicated scanner.  In
principle, one begins by using the @samp{-b} flag to generate a
2478@file{lex.backup} file.  For example, on the input
2479
2480@example
2481%%
2482foo        return TOK_KEYWORD;
2483foobar     return TOK_KEYWORD;
2484@end example
2485
2486@noindent
2487the file looks like:
2488
2489@example
2490State #6 is non-accepting -
2491 associated rule line numbers:
2492       2       3
2493 out-transitions: [ o ]
2494 jam-transitions: EOF [ \001-n  p-\177 ]
2495
2496State #8 is non-accepting -
2497 associated rule line numbers:
2498       3
2499 out-transitions: [ a ]
2500 jam-transitions: EOF [ \001-`  b-\177 ]
2501
2502State #9 is non-accepting -
2503 associated rule line numbers:
2504       3
2505 out-transitions: [ r ]
2506 jam-transitions: EOF [ \001-q  s-\177 ]
2507
2508Compressed tables always back up.
2509@end example
2510
2511The first few lines tell us that there's a scanner state
2512in which it can make a transition on an 'o' but not on any
2513other character, and that in that state the currently
2514scanned text does not match any rule.  The state occurs
2515when trying to match the rules found at lines 2 and 3 in
2516the input file.  If the scanner is in that state and then
2517reads something other than an 'o', it will have to back up
2518to find a rule which is matched.  With a bit of
2519head-scratching one can see that this must be the state it's in
2520when it has seen "fo".  When this has happened, if
2521anything other than another 'o' is seen, the scanner will
2522have to back up to simply match the 'f' (by the default
2523rule).
2524
2525The comment regarding State #8 indicates there's a problem
2526when "foob" has been scanned.  Indeed, on any character
2527other than an 'a', the scanner will have to back up to
2528accept "foo".  Similarly, the comment for State #9
2529concerns when "fooba" has been scanned and an 'r' does not
2530follow.
2531
2532The final comment reminds us that there's no point going
2533to all the trouble of removing backing up from the rules
2534unless we're using @samp{-Cf} or @samp{-CF}, since there's no
2535performance gain doing so with compressed scanners.
2536
2537The way to remove the backing up is to add "error" rules:
2538
2539@example
2540%%
2541foo         return TOK_KEYWORD;
2542foobar      return TOK_KEYWORD;
2543
2544fooba       |
2545foob        |
2546fo          @{
2547            /* false alarm, not really a keyword */
2548            return TOK_ID;
2549            @}
2550@end example
2551
2552Eliminating backing up among a list of keywords can also
2553be done using a "catch-all" rule:
2554
2555@example
2556%%
2557foo         return TOK_KEYWORD;
2558foobar      return TOK_KEYWORD;
2559
2560[a-z]+      return TOK_ID;
2561@end example
2562
2563This is usually the best solution when appropriate.
2564
Backing-up messages tend to cascade.  With a complicated
set of rules it's not uncommon to get hundreds of
messages.  If one can decipher them, though, it often only
takes a dozen or so rules to eliminate the backing up,
though it's easy to make a mistake and have an error rule
accidentally match a valid token.  (A possible future @code{flex}
feature will be to automatically add rules to eliminate
backing up.)
2573
2574It's important to keep in mind that you gain the benefits
2575of eliminating backing up only if you eliminate @emph{every}
2576instance of backing up.  Leaving just one means you gain
2577nothing.
2578
@emph{Variable} trailing context (where both the leading and
2580trailing parts do not have a fixed length) entails almost
2581the same performance loss as @code{REJECT} (i.e., substantial).
2582So when possible a rule like:
2583
2584@example
2585%%
2586mouse|rat/(cat|dog)   run();
2587@end example
2588
2589@noindent
2590is better written:
2591
2592@example
2593%%
2594mouse/cat|dog         run();
2595rat/cat|dog           run();
2596@end example
2597
2598@noindent
2599or as
2600
2601@example
2602%%
2603mouse|rat/cat         run();
2604mouse|rat/dog         run();
2605@end example
2606
2607Note that here the special '|' action does @emph{not} provide any
2608savings, and can even make things worse (see Deficiencies
2609/ Bugs below).
2610
2611Another area where the user can increase a scanner's
2612performance (and one that's easier to implement) arises from
2613the fact that the longer the tokens matched, the faster
2614the scanner will run.  This is because with long tokens
2615the processing of most input characters takes place in the
2616(short) inner scanning loop, and does not often have to go
2617through the additional work of setting up the scanning
2618environment (e.g., @code{yytext}) for the action.  Recall the
2619scanner for C comments:
2620
2621@example
2622%x comment
2623%%
2624        int line_num = 1;
2625
2626"/*"         BEGIN(comment);
2627
2628<comment>[^*\n]*
2629<comment>"*"+[^*/\n]*
2630<comment>\n             ++line_num;
2631<comment>"*"+"/"        BEGIN(INITIAL);
2632@end example
2633
2634This could be sped up by writing it as:
2635
2636@example
2637%x comment
2638%%
2639        int line_num = 1;
2640
2641"/*"         BEGIN(comment);
2642
2643<comment>[^*\n]*
2644<comment>[^*\n]*\n      ++line_num;
2645<comment>"*"+[^*/\n]*
2646<comment>"*"+[^*/\n]*\n ++line_num;
2647<comment>"*"+"/"        BEGIN(INITIAL);
2648@end example
2649
2650Now instead of each newline requiring the processing of
2651another action, recognizing the newlines is "distributed"
2652over the other rules to keep the matched text as long as
2653possible.  Note that @emph{adding} rules does @emph{not} slow down the
2654scanner!  The speed of the scanner is independent of the
2655number of rules or (modulo the considerations given at the
2656beginning of this section) how complicated the rules are
2657with regard to operators such as '*' and '|'.
2658
2659A final example in speeding up a scanner: suppose you want
2660to scan through a file containing identifiers and
2661keywords, one per line and with no other extraneous
2662characters, and recognize all the keywords.  A natural first
2663approach is:
2664
2665@example
2666%%
2667asm      |
2668auto     |
2669break    |
2670@dots{} etc @dots{}
2671volatile |
2672while    /* it's a keyword */
2673
2674.|\n     /* it's not a keyword */
2675@end example
2676
This scanner has to back up: after scanning "aut", for example, it
cannot know whether a keyword will follow.  To eliminate the backing
up, introduce a catch-all rule:
2679
2680@example
2681%%
2682asm      |
2683auto     |
2684break    |
@dots{} etc @dots{}
2686volatile |
2687while    /* it's a keyword */
2688
2689[a-z]+   |
2690.|\n     /* it's not a keyword */
2691@end example
2692
Now, if it's guaranteed that there's exactly one word per
line, then we can halve the total number of matches by
merging the recognition of newlines into that of the
other tokens:
2697
2698@example
2699%%
2700asm\n    |
2701auto\n   |
2702break\n  |
2703@dots{} etc @dots{}
2704volatile\n |
2705while\n  /* it's a keyword */
2706
2707[a-z]+\n |
2708.|\n     /* it's not a keyword */
2709@end example
2710
2711One has to be careful here, as we have now reintroduced
2712backing up into the scanner.  In particular, while @emph{we} know
2713that there will never be any characters in the input
2714stream other than letters or newlines, @code{flex} can't figure
2715this out, and it will plan for possibly needing to back up
2716when it has scanned a token like "auto" and then the next
2717character is something other than a newline or a letter.
2718Previously it would then just match the "auto" rule and be
done, but now it has no "auto" rule, only an "auto\n" rule.
2720To eliminate the possibility of backing up, we could
2721either duplicate all rules but without final newlines, or,
since we never expect to encounter such an input and
therefore don't care how it's classified, we can introduce one
more catch-all rule, this one without a trailing
newline:
2726
2727@example
2728%%
2729asm\n    |
2730auto\n   |
2731break\n  |
2732@dots{} etc @dots{}
2733volatile\n |
2734while\n  /* it's a keyword */
2735
2736[a-z]+\n |
2737[a-z]+   |
2738.|\n     /* it's not a keyword */
2739@end example
2740
2741Compiled with @samp{-Cf}, this is about as fast as one can get a
2742@code{flex} scanner to go for this particular problem.
2743
2744A final note: @code{flex} is slow when matching NUL's,
2745particularly when a token contains multiple NUL's.  It's best to
2746write rules which match @emph{short} amounts of text if it's
2747anticipated that the text will often include NUL's.
2748
2749Another final note regarding performance: as mentioned
2750above in the section How the Input is Matched, dynamically
2751resizing @code{yytext} to accommodate huge tokens is a slow
2752process because it presently requires that the (huge) token
2753be rescanned from the beginning.  Thus if performance is
2754vital, you should attempt to match "large" quantities of
2755text but not "huge" quantities, where the cutoff between
2756the two is at about 8K characters/token.
2757
2758@node C++, Incompatibilities, Performance, Top
2759@section Generating C++ scanners
2760
2761@code{flex} provides two different ways to generate scanners for
2762use with C++.  The first way is to simply compile a
2763scanner generated by @code{flex} using a C++ compiler instead of a C
compiler.  You should not encounter any compilation
errors (please report any you find to the email address
given in the Author section below).  You can then use C++
2767code in your rule actions instead of C code.  Note that
2768the default input source for your scanner remains @code{yyin},
2769and default echoing is still done to @code{yyout}.  Both of these
2770remain @samp{FILE *} variables and not C++ @code{streams}.
2771
You can also use @code{flex} to generate a C++ scanner class, using
the @samp{-+} option (or, equivalently, @samp{%option c++}), which
is automatically specified if the name of the flex executable ends
in a @samp{+}, such as @code{flex++}.  When using this option, flex
2776defaults to generating the scanner to the file @file{lex.yy.cc} instead
2777of @file{lex.yy.c}.  The generated scanner includes the header file
2778@file{FlexLexer.h}, which defines the interface to two C++ classes.
2779
2780The first class, @code{FlexLexer}, provides an abstract base
2781class defining the general scanner class interface.  It
2782provides the following member functions:
2783
2784@table @samp
2785@item const char* YYText()
2786returns the text of the most recently matched
2787token, the equivalent of @code{yytext}.
2788
2789@item int YYLeng()
2790returns the length of the most recently matched
2791token, the equivalent of @code{yyleng}.
2792
2793@item int lineno() const
2794returns the current input line number (see @samp{%option yylineno}),
2795or 1 if @samp{%option yylineno} was not used.
2796
2797@item void set_debug( int flag )
2798sets the debugging flag for the scanner, equivalent to assigning to
2799@code{yy_flex_debug} (see the Options section above).  Note that you
2800must build the scanner using @samp{%option debug} to include debugging
2801information in it.
2802
2803@item int debug() const
2804returns the current setting of the debugging flag.
2805@end table
2806
Also provided are member functions equivalent to
@samp{yy_switch_to_buffer()}, @samp{yy_create_buffer()} (though the
first argument is an @samp{istream*} object pointer and not a
@samp{FILE*}), @samp{yy_flush_buffer()}, @samp{yy_delete_buffer()},
and @samp{yyrestart()} (again, the first argument is an @samp{istream*}
object pointer).
2813
2814The second class defined in @file{FlexLexer.h} is @code{yyFlexLexer},
2815which is derived from @code{FlexLexer}.  It defines the following
2816additional member functions:
2817
2818@table @samp
2819@item yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )
2820constructs a @code{yyFlexLexer} object using the given
2821streams for input and output.  If not specified,
2822the streams default to @code{cin} and @code{cout}, respectively.
2823
2824@item virtual int yylex()
performs the same role as @samp{yylex()} does for ordinary
2826flex scanners: it scans the input stream, consuming
2827tokens, until a rule's action returns a value.  If you derive a subclass
2828@var{S}
2829from @code{yyFlexLexer}
2830and want to access the member functions and variables of
2831@var{S}
2832inside @samp{yylex()},
2833then you need to use @samp{%option yyclass="@var{S}"}
2834to inform @code{flex}
2835that you will be using that subclass instead of @code{yyFlexLexer}.
2836In this case, rather than generating @samp{yyFlexLexer::yylex()},
2837@code{flex} generates @samp{@var{S}::yylex()}
2838(and also generates a dummy @samp{yyFlexLexer::yylex()}
2839that calls @samp{yyFlexLexer::LexerError()}
2840if called).
2841
2842@item virtual void switch_streams(istream* new_in = 0, ostream* new_out = 0)
2843reassigns @code{yyin} to @code{new_in}
2844(if non-nil)
2845and @code{yyout} to @code{new_out}
2846(ditto), deleting the previous input buffer if @code{yyin}
2847is reassigned.
2848
2849@item int yylex( istream* new_in = 0, ostream* new_out = 0 )
2850first switches the input streams via @samp{switch_streams( new_in, new_out )}
2851and then returns the value of @samp{yylex()}.
2852@end table
2853
2854In addition, @code{yyFlexLexer} defines the following protected
2855virtual functions which you can redefine in derived
2856classes to tailor the scanner:
2857
2858@table @samp
2859@item virtual int LexerInput( char* buf, int max_size )
2860reads up to @samp{max_size} characters into @var{buf} and
2861returns the number of characters read.  To indicate
2862end-of-input, return 0 characters.  Note that
2863"interactive" scanners (see the @samp{-B} and @samp{-I} flags)
2864define the macro @code{YY_INTERACTIVE}.  If you redefine
2865@code{LexerInput()} and need to take different actions
2866depending on whether or not the scanner might be
2867scanning an interactive input source, you can test
2868for the presence of this name via @samp{#ifdef}.
2869
2870@item virtual void LexerOutput( const char* buf, int size )
2871writes out @var{size} characters from the buffer @var{buf},
2872which, while NUL-terminated, may also contain
2873"internal" NUL's if the scanner's rules can match
2874text with NUL's in them.
2875
2876@item virtual void LexerError( const char* msg )
2877reports a fatal error message.  The default version
2878of this function writes the message to the stream
2879@code{cerr} and exits.
2880@end table
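
As a hedged illustration of the @code{LexerInput()} contract described
above, the following sketch serves characters from an in-memory string.
The class @code{StringSource} and its @code{read()} member are
hypothetical names that are not part of @code{flex}.  A subclass of
@code{yyFlexLexer} could forward its @code{LexerInput()} override to
such a helper, but that wiring is omitted here so that the block
compiles without a generated scanner.

@verbatim
```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical helper (not part of flex).  read() follows the contract the
// manual gives for LexerInput(): copy at most max_size characters into buf,
// return the number copied, and return 0 to signal end-of-input.
class StringSource {
public:
    explicit StringSource(std::string text)
        : text_(std::move(text)), pos_(0) {}

    int read(char* buf, int max_size) {
        if (max_size <= 0)
            return 0;   // nothing can be stored in an empty buffer
        const std::size_t remaining = text_.size() - pos_;
        const std::size_t n =
            std::min(remaining, static_cast<std::size_t>(max_size));
        if (n > 0)
            std::memcpy(buf, text_.data() + pos_, n);
        pos_ += n;
        return static_cast<int>(n);
    }

private:
    std::string text_;   // the whole input
    std::size_t pos_;    // index of the next unread character
};
```
@end verbatim

@noindent
Constructed from the string "hello" and read with @code{max_size} set
to 3, successive calls to @code{read()} return 3, then 2, then 0, the
last of which is what a scanner would treat as end-of-input.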
2881
2882Note that a @code{yyFlexLexer} object contains its @emph{entire}
2883scanning state.  Thus you can use such objects to create
2884reentrant scanners.  You can instantiate multiple instances of
2885the same @code{yyFlexLexer} class, and you can also combine
2886multiple C++ scanner classes together in the same program
2887using the @samp{-P} option discussed above.
2888Finally, note that the @samp{%array} feature is not available to
2889C++ scanner classes; you must use @samp{%pointer} (the default).
2890
2891Here is an example of a simple C++ scanner:
2892
2893@example
2894    // An example of using the flex C++ scanner class.
2895
2896%@{
2897int mylineno = 0;
2898%@}
2899
2900string  \"[^\n"]+\"
2901
2902ws      [ \t]+
2903
2904alpha   [A-Za-z]
2905dig     [0-9]
2906name    (@{alpha@}|@{dig@}|\$)(@{alpha@}|@{dig@}|[_.\-/$])*
2907num1    [-+]?@{dig@}+\.?([eE][-+]?@{dig@}+)?
2908num2    [-+]?@{dig@}*\.@{dig@}+([eE][-+]?@{dig@}+)?
2909number  @{num1@}|@{num2@}
2910
2911%%
2912
2913@{ws@}    /* skip blanks and tabs */
2914
2915"/*"    @{
2916        int c;
2917
2918        while((c = yyinput()) != 0)
2919            @{
2920            if(c == '\n')
2921                ++mylineno;
2922
2923            else if(c == '*')
2924                @{
2925                if((c = yyinput()) == '/')
2926                    break;
2927                else
2928                    unput(c);
2929                @}
2930            @}
2931        @}
2932
2933@{number@}  cout << "number " << YYText() << '\n';
2934
2935\n        mylineno++;
2936
2937@{name@}    cout << "name " << YYText() << '\n';
2938
2939@{string@}  cout << "string " << YYText() << '\n';
2940
2941%%
2942
2945int main( int /* argc */, char** /* argv */ )
2946    @{
2947    FlexLexer* lexer = new yyFlexLexer;
2948    while(lexer->yylex() != 0)
2949        ;
2950    return 0;
2951    @}
2952@end example
2953
2954If you want to create multiple (different) lexer classes,
2955you use the @samp{-P} flag (or the @samp{prefix=} option) to rename each
2956@code{yyFlexLexer} to some other @code{xxFlexLexer}.  You then can
2957include @samp{<FlexLexer.h>} in your other sources once per lexer
2958class, first renaming @code{yyFlexLexer} as follows:
2959
2960@example
2961#undef yyFlexLexer
2962#define yyFlexLexer xxFlexLexer
2963#include <FlexLexer.h>
2964
2965#undef yyFlexLexer
2966#define yyFlexLexer zzFlexLexer
2967#include <FlexLexer.h>
2968@end example
2969
2970if, for example, you used @samp{%option prefix="xx"} for one of
2971your scanners and @samp{%option prefix="zz"} for the other.
2972
2973IMPORTANT: the present form of the scanning class is
2974@emph{experimental} and may change considerably between major
2975releases.
2976
2977@node Incompatibilities, Diagnostics, C++, Top
2978@section Incompatibilities with @code{lex} and POSIX
2979
2980@code{flex} is a rewrite of the AT&T Unix @code{lex} tool (the two
2981implementations do not share any code, though), with some
2982extensions and incompatibilities, both of which are of
2983concern to those who wish to write scanners acceptable to
2984either implementation.  Flex is fully compliant with the
2985POSIX @code{lex} specification, except that when using @samp{%pointer}
2986(the default), a call to @samp{unput()} destroys the contents of
2987@code{yytext}, which is counter to the POSIX specification.
2988
2989In this section we discuss all of the known areas of
2990incompatibility between flex, AT&T lex, and the POSIX
2991specification.
2992
@code{flex}'s @samp{-l} option turns on maximum compatibility with the
2994original AT&T @code{lex} implementation, at the cost of a major
2995loss in the generated scanner's performance.  We note
2996below which incompatibilities can be overcome using the @samp{-l}
2997option.
2998
2999@code{flex} is fully compatible with @code{lex} with the following
3000exceptions:
3001
3002@itemize -
3003@item
3004The undocumented @code{lex} scanner internal variable @code{yylineno}
3005is not supported unless @samp{-l} or @samp{%option yylineno} is used.
3006@code{yylineno} should be maintained on a per-buffer basis, rather
3007than a per-scanner (single global variable) basis.  @code{yylineno} is
3008not part of the POSIX specification.
3009
3010@item
3011The @samp{input()} routine is not redefinable, though it
3012may be called to read characters following whatever
3013has been matched by a rule.  If @samp{input()} encounters
3014an end-of-file the normal @samp{yywrap()} processing is
3015done.  A ``real'' end-of-file is returned by
3016@samp{input()} as @code{EOF}.
3017
3018Input is instead controlled by defining the
3019@code{YY_INPUT} macro.
3020
3021The @code{flex} restriction that @samp{input()} cannot be
3022redefined is in accordance with the POSIX
3023specification, which simply does not specify any way of
3024controlling the scanner's input other than by making
3025an initial assignment to @code{yyin}.
3026
3027@item
3028The @samp{unput()} routine is not redefinable.  This
3029restriction is in accordance with POSIX.
3030
3031@item
3032@code{flex} scanners are not as reentrant as @code{lex} scanners.
3033In particular, if you have an interactive scanner
3034and an interrupt handler which long-jumps out of
3035the scanner, and the scanner is subsequently called
3036again, you may get the following message:
3037
3038@example
3039fatal flex scanner internal error--end of buffer missed
3040@end example
3041
3042To reenter the scanner, first use
3043
3044@example
3045yyrestart( yyin );
3046@end example
3047
3048Note that this call will throw away any buffered
3049input; usually this isn't a problem with an
3050interactive scanner.
3051
3052Also note that flex C++ scanner classes @emph{are}
3053reentrant, so if using C++ is an option for you, you
3054should use them instead.  See "Generating C++
3055Scanners" above for details.
3056
3057@item
3058@samp{output()} is not supported.  Output from the @samp{ECHO}
3059macro is done to the file-pointer @code{yyout} (default
3060@code{stdout}).
3061
3062@samp{output()} is not part of the POSIX specification.
3063
3064@item
@code{lex} does not support exclusive start conditions
(@samp{%x}), though they are in the POSIX specification.
3067
3068@item
3069When definitions are expanded, @code{flex} encloses them
3070in parentheses.  With lex, the following:
3071
3072@example
3073NAME    [A-Z][A-Z0-9]*
3074%%
3075foo@{NAME@}?      printf( "Found it\n" );
3076%%
3077@end example
3078
3079will not match the string "foo" because when the
3080macro is expanded the rule is equivalent to
3081"foo[A-Z][A-Z0-9]*?" and the precedence is such that the
3082'?' is associated with "[A-Z0-9]*".  With @code{flex}, the
3083rule will be expanded to "foo([A-Z][A-Z0-9]*)?" and
3084so the string "foo" will match.
3085
3086Note that if the definition begins with @samp{^} or ends
3087with @samp{$} then it is @emph{not} expanded with parentheses, to
3088allow these operators to appear in definitions
3089without losing their special meanings.  But the
@samp{<s>}, @samp{/}, and @samp{<<EOF>>} operators cannot be used in a
@code{flex} definition.
3092
3093Using @samp{-l} results in the @code{lex} behavior of no
3094parentheses around the definition.
3095
3096The POSIX specification is that the definition be enclosed in
3097parentheses.
3098
3099@item
3100Some implementations of @code{lex} allow a rule's action to begin on
3101a separate line, if the rule's pattern has trailing whitespace:
3102
3103@example
3104%%
3105foo|bar<space here>
3106  @{ foobar_action(); @}
3107@end example
3108
3109@code{flex} does not support this feature.
3110
3111@item
3112The @code{lex} @samp{%r} (generate a Ratfor scanner) option is
3113not supported.  It is not part of the POSIX
3114specification.
3115
3116@item
3117After a call to @samp{unput()}, @code{yytext} is undefined until
3118the next token is matched, unless the scanner was
3119built using @samp{%array}.  This is not the case with @code{lex}
3120or the POSIX specification.  The @samp{-l} option does
3121away with this incompatibility.
3122
3123@item
3124The precedence of the @samp{@{@}} (numeric range) operator
3125is different.  @code{lex} interprets "abc@{1,3@}" as "match
3126one, two, or three occurrences of 'abc'", whereas
3127@code{flex} interprets it as "match 'ab' followed by one,
3128two, or three occurrences of 'c'".  The latter is
3129in agreement with the POSIX specification.
3130
3131@item
3132The precedence of the @samp{^} operator is different.  @code{lex}
3133interprets "^foo|bar" as "match either 'foo' at the
3134beginning of a line, or 'bar' anywhere", whereas
3135@code{flex} interprets it as "match either 'foo' or 'bar'
3136if they come at the beginning of a line".  The
3137latter is in agreement with the POSIX specification.
3138
3139@item
3140The special table-size declarations such as @samp{%a}
3141supported by @code{lex} are not required by @code{flex} scanners;
3142@code{flex} ignores them.
3143
3144@item
The name @code{FLEX_SCANNER} is #define'd so scanners may
3146be written for use with either @code{flex} or @code{lex}.
3147Scanners also include @code{YY_FLEX_MAJOR_VERSION} and
3148@code{YY_FLEX_MINOR_VERSION} indicating which version of
3149@code{flex} generated the scanner (for example, for the
31502.5 release, these defines would be 2 and 5
3151respectively).
3152@end itemize
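
The feature-test macros in the last item can be used along the lines
of the following sketch.  The function name
@code{scanner_generator_name} is hypothetical; only
@code{FLEX_SCANNER}, @code{YY_FLEX_MAJOR_VERSION}, and
@code{YY_FLEX_MINOR_VERSION} come from the text above.  The function is
meant for the user code of a scanner specification; compiled on its
own, outside a @code{flex}-generated scanner, it takes the
@code{#else} branch.

@verbatim
```c
#include <stdio.h>

/* Hypothetical helper: describe which tool generated the enclosing
   scanner.  The caller supplies the buffer, which is also returned.  */
static const char *scanner_generator_name(char *buf, size_t size)
{
#ifdef FLEX_SCANNER
    /* Defined only in scanners generated by flex.  */
    snprintf(buf, size, "flex %d.%d",
             YY_FLEX_MAJOR_VERSION, YY_FLEX_MINOR_VERSION);
#else
    /* lex, or code compiled outside any generated scanner.  */
    snprintf(buf, size, "not flex");
#endif
    return buf;
}
```
@end verbatim

@noindent
The same @code{#ifdef FLEX_SCANNER} test can guard any construct that
only one of the two tools accepts.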
3153
3154The following @code{flex} features are not included in @code{lex} or the
3155POSIX specification:
3156
3157@example
3158C++ scanners
3159%option
3160start condition scopes
3161start condition stacks
3162interactive/non-interactive scanners
3163yy_scan_string() and friends
3164yyterminate()
3165yy_set_interactive()
3166yy_set_bol()
3167YY_AT_BOL()
3168<<EOF>>
3169<*>
3170YY_DECL
3171YY_START
3172YY_USER_ACTION
3173YY_USER_INIT
3174#line directives
3175%@{@}'s around actions
3176multiple actions on a line
3177@end example
3178
3179@noindent
3180plus almost all of the flex flags.  The last feature in
3181the list refers to the fact that with @code{flex} you can put
3182multiple actions on the same line, separated with
3183semicolons, while with @code{lex}, the following
3184
3185@example
3186foo    handle_foo(); ++num_foos_seen;
3187@end example
3188
3189@noindent
3190is (rather surprisingly) truncated to
3191
3192@example
3193foo    handle_foo();
3194@end example
3195
3196@code{flex} does not truncate the action.  Actions that are not
3197enclosed in braces are simply terminated at the end of the
3198line.
3199
3200@node Diagnostics, Files, Incompatibilities, Top
3201@section Diagnostics
3202
3203@table @samp
3204@item warning, rule cannot be matched
3205indicates that the given
3206rule cannot be matched because it follows other rules that
3207will always match the same text as it.  For example, in
3208the following "foo" cannot be matched because it comes
3209after an identifier "catch-all" rule:
3210
3211@example
3212[a-z]+    got_identifier();
3213foo       got_foo();
3214@end example
3215
3216Using @code{REJECT} in a scanner suppresses this warning.
3217
3218@item warning, -s option given but default rule can be matched
3219means that it is possible (perhaps only in a particular
3220start condition) that the default rule (match any single
3221character) is the only one that will match a particular
3222input.  Since @samp{-s} was given, presumably this is not
3223intended.
3224
3225@item reject_used_but_not_detected undefined
3226@itemx yymore_used_but_not_detected undefined
3227These errors can
3228occur at compile time.  They indicate that the scanner
3229uses @code{REJECT} or @samp{yymore()} but that @code{flex} failed to notice the
3230fact, meaning that @code{flex} scanned the first two sections
3231looking for occurrences of these actions and failed to
3232find any, but somehow you snuck some in (via a #include
3233file, for example).  Use @samp{%option reject} or @samp{%option yymore}
3234to indicate to flex that you really do use these features.
3235
3236@item flex scanner jammed
3237a scanner compiled with @samp{-s} has
3238encountered an input string which wasn't matched by any of
3239its rules.  This error can also occur due to internal
3240problems.
3241
3242@item token too large, exceeds YYLMAX
3243your scanner uses @samp{%array}
and one of its rules matched a string longer than the
@code{YYLMAX} constant (8K bytes by default).  You can increase the
3246value by #define'ing @code{YYLMAX} in the definitions section of
3247your @code{flex} input.
3248
3249@item scanner requires -8 flag to use the character '@var{x}'
3250Your
3251scanner specification includes recognizing the 8-bit
character @var{x} and you did not specify the @samp{-8} flag, and your
3253scanner defaulted to 7-bit because you used the @samp{-Cf} or @samp{-CF}
3254table compression options.  See the discussion of the @samp{-7}
3255flag for details.
3256
3257@item flex scanner push-back overflow
3258you used @samp{unput()} to push
3259back so much text that the scanner's buffer could not hold
3260both the pushed-back text and the current token in @code{yytext}.
3261Ideally the scanner should dynamically resize the buffer
3262in this case, but at present it does not.
3263
3264@item input buffer overflow, can't enlarge buffer because scanner uses REJECT
3265the scanner was working on matching an
3266extremely large token and needed to expand the input
3267buffer.  This doesn't work with scanners that use @code{REJECT}.
3268
3269@item fatal flex scanner internal error--end of buffer missed
This can occur in a scanner which is reentered after a
3271long-jump has jumped out (or over) the scanner's
3272activation frame.  Before reentering the scanner, use:
3273
3274@example
3275yyrestart( yyin );
3276@end example
3277
3278@noindent
3279or, as noted above, switch to using the C++ scanner class.
3280
3281@item too many start conditions in <> construct!
3282you listed
3283more start conditions in a <> construct than exist (so you
3284must have listed at least one of them twice).
3285@end table
3286
3287@node Files, Deficiencies, Diagnostics, Top
3288@section Files
3289
3290@table @file
3291@item -lfl
3292library with which scanners must be linked.
3293
3294@item lex.yy.c
3295generated scanner (called @file{lexyy.c} on some systems).
3296
3297@item lex.yy.cc
3298generated C++ scanner class, when using @samp{-+}.
3299
3300@item <FlexLexer.h>
3301header file defining the C++ scanner base class,
3302@code{FlexLexer}, and its derived class, @code{yyFlexLexer}.
3303
3304@item flex.skl
3305skeleton scanner.  This file is only used when
3306building flex, not when flex executes.
3307
3308@item lex.backup
3309backing-up information for @samp{-b} flag (called @file{lex.bck}
3310on some systems).
3311@end table
3312
3313@node Deficiencies, See also, Files, Top
3314@section Deficiencies / Bugs
3315
3316Some trailing context patterns cannot be properly matched
3317and generate warning messages ("dangerous trailing
3318context").  These are patterns where the ending of the first
3319part of the rule matches the beginning of the second part,
3320such as "zx*/xy*", where the 'x*' matches the 'x' at the
3321beginning of the trailing context.  (Note that the POSIX
3322draft states that the text matched by such patterns is
3323undefined.)
3324
3325For some trailing context rules, parts which are actually
3326fixed-length are not recognized as such, leading to the
3327abovementioned performance loss.  In particular, parts
3328using '|' or @{n@} (such as "foo@{3@}") are always considered
3329variable-length.
3330
Combining trailing context with the special '|' action can
result in @emph{fixed} trailing context being turned into the
more expensive @emph{variable} trailing context.  For example, this
happens in the following:
3335
3336@example
3337%%
3338abc      |
3339xyz/def
3340@end example
3341
3342Use of @samp{unput()} invalidates yytext and yyleng, unless the
3343@samp{%array} directive or the @samp{-l} option has been used.
3344
3345Pattern-matching of NUL's is substantially slower than
3346matching other characters.
3347
3348Dynamic resizing of the input buffer is slow, as it
3349entails rescanning all the text matched so far by the
3350current (generally huge) token.
3351
3352Due to both buffering of input and read-ahead, you cannot
3353intermix calls to <stdio.h> routines, such as, for
3354example, @samp{getchar()}, with @code{flex} rules and expect it to work.
3355Call @samp{input()} instead.
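
The read-ahead problem can be reproduced without @code{flex}.  The
sketch below is only a stand-in built on stated assumptions: the
hypothetical function @code{next_char_after_readahead} plays the role
of a scanner that pulls a block of input into a private buffer with
@code{fread()} although it has matched only the first character.  A
later @code{getc()} on the same stream then skips everything that was
read ahead.

@verbatim
```c
#include <stdio.h>

/* Hypothetical stand-in for a scanner's read-ahead (not flex code).
   Writes "abcdef" to a temporary file, reads the first `ahead' bytes
   into a private buffer the way a scanner fills its input buffer, and
   returns what a direct getc() on the same stream sees next.  Returns
   EOF if the temporary file cannot be set up or is too short.  */
static int next_char_after_readahead(size_t ahead)
{
    char private_buf[8];
    FILE *fp = tmpfile();
    int c;

    if (fp == NULL)
        return EOF;
    if (ahead > sizeof private_buf)
        ahead = sizeof private_buf;

    fputs("abcdef", fp);
    rewind(fp);

    /* The "scanner" buffers `ahead' bytes even if it matched only 'a'. */
    if (fread(private_buf, 1, ahead, fp) != ahead) {
        fclose(fp);
        return EOF;
    }

    /* A stdio call resumes after the buffered bytes, not after 'a'. */
    c = getc(fp);
    fclose(fp);
    return c;
}
```
@end verbatim

@noindent
With @code{ahead} set to 4 the function returns @samp{e} rather than
@samp{b}: the characters in between sit in the private buffer, out of
reach of @code{<stdio.h>} routines.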
3356
The total number of table entries listed by the @samp{-v} flag
excludes the table entries needed to determine what rule has
been matched.  The number of these excluded entries is equal to the
number of DFA states if the scanner does not use @code{REJECT}, and
somewhat greater than the number of states if it does.
3362
3363@code{REJECT} cannot be used with the @samp{-f} or @samp{-F} options.
3364
3365The @code{flex} internal algorithms need documentation.
3366
3367@node See also, Author, Deficiencies, Top
3368@section See also
3369
3370@code{lex}(1), @code{yacc}(1), @code{sed}(1), @code{awk}(1).
3371
3372John Levine, Tony Mason, and Doug Brown: Lex & Yacc;
3373O'Reilly and Associates.  Be sure to get the 2nd edition.
3374
3375M. E. Lesk and E. Schmidt, LEX - Lexical Analyzer Generator.
3376
3377Alfred Aho, Ravi Sethi and Jeffrey Ullman: Compilers:
3378Principles, Techniques and Tools; Addison-Wesley (1986).
3379Describes the pattern-matching techniques used by @code{flex}
3380(deterministic finite automata).
3381
3382@node Author,  , See also, Top
3383@section Author
3384
3385Vern Paxson, with the help of many ideas and much inspiration from
3386Van Jacobson.  Original version by Jef Poskanzer.  The fast table
3387representation is a partial implementation of a design done by Van
3388Jacobson.  The implementation was done by Kevin Gong and Vern Paxson.
3389
3390Thanks to the many @code{flex} beta-testers, feedbackers, and
3391contributors, especially Francois Pinard, Casey Leedom, Stan
3392Adermann, Terry Allen, David Barker-Plummer, John Basrai, Nelson
3393H.F. Beebe, @samp{benson@@odi.com}, Karl Berry, Peter A. Bigot,
3394Simon Blanchard, Keith Bostic, Frederic Brehm, Ian Brockbank, Kin
3395Cho, Nick Christopher, Brian Clapper, J.T. Conklin, Jason Coughlin,
3396Bill Cox, Nick Cropper, Dave Curtis, Scott David Daniels, Chris
3397G. Demetriou, Theo Deraadt, Mike Donahue, Chuck Doucette, Tom Epperly,
3398Leo Eskin, Chris Faylor, Chris Flatters, Jon Forrest, Joe Gayda, Kaveh
3399R. Ghazi, Eric Goldman, Christopher M.  Gould, Ulrich Grepel, Peer
3400Griebel, Jan Hajic, Charles Hemphill, NORO Hideo, Jarkko Hietaniemi,
3401Scott Hofmann, Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
3402Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
3403Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
3404Amir Katz, @samp{ken@@ken.hilco.com}, Kevin B. Kenny, Steve Kirsch,
3405Winfried Koenig, Marq Kole, Ronald Lamprecht, Greg Lee, Rohan Lenard,
3406Craig Leres, John Levine, Steve Liddle, Mike Long, Mohamed el Lozy,
3407Brian Madsen, Malte, Joe Marshall, Bengt Martensson, Chris Metcalf,
3408Luke Mewburn, Jim Meyering, R.  Alexander Milowski, Erik Naggum,
3409G.T. Nicol, Landon Noll, James Nordby, Marc Nozell, Richard Ohnemus,
3410Karsten Pahnke, Sven Panne, Roland Pesch, Walter Pelissero, Gaumond
3411Pierre, Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha, Frederic
3412Raimbault, Pat Rankin, Rick Richardson, Kevin Rodgers, Kai Uwe Rommel,
3413Jim Roskind, Alberto Santini, Andreas Scherer, Darrell Schiebel, Raf
3414Schietekat, Doug Schmidt, Philippe Schnoebelen, Andreas Schwab, Alex
3415Siegel, Eckehard Stolz, Jan-Erik Strvmquist, Mike Stump, Paul Stuart,
3416Dave Tallman, Ian Lance Taylor, Chris Thewalt, Richard M. Timoney,
3417Jodi Tsai, Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms,
3418Kent Williams, Ken Yap, Ron Zellar, Nathan Zelle, David Zuhn, and
3419those whose names have slipped my marginal mail-archiving skills but
3420whose contributions are appreciated all the same.
3421
3422Thanks to Keith Bostic, Jon Forrest, Noah Friedman, John Gilmore,
3423Craig Leres, John Levine, Bob Mulcahy, G.T.  Nicol, Francois Pinard,
3424Rich Salz, and Richard Stallman for help with various distribution
3425headaches.
3426
3427Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
3428to Benson Margulies and Fred Burke for C++ support; to Kent Williams
3429and Tom Epperly for C++ class support; to Ove Ewerlid for support of
3430NUL's; and to Eric Hughes for support of multiple buffers.
3431
3432This work was primarily done when I was with the Real Time Systems
3433Group at the Lawrence Berkeley Laboratory in Berkeley, CA.  Many thanks
3434to all there for the support I received.
3435
3436Send comments to @samp{vern@@ee.lbl.gov}.
3437
3438@c @node Index,  , Top, Top
3439@c @unnumbered Index
3440@c
3441@c @printindex cp
3442
3443@contents
3444@bye
3445
3446@c Local variables:
3447@c texinfo-column-for-description: 32
3448@c End:
3449