1.TH FLEX 1 "April 1995" "Version 2.5"
2.SH NAME
3flex \- fast lexical analyzer generator
4.SH SYNOPSIS
5.B flex
6.B [\-bcdfhilnpstvwBFILTV78+? \-C[aefFmr] \-ooutput \-Pprefix \-Sskeleton]
7.B [\-\-help \-\-version]
8.I [filename ...]
9.SH OVERVIEW
10This manual describes
11.I flex,
12a tool for generating programs that perform pattern-matching on text. The
13manual includes both tutorial and reference sections:
14.nf
15
16 Description
17 a brief overview of the tool
18
19 Some Simple Examples
20
21 Format Of The Input File
22
23 Patterns
24 the extended regular expressions used by flex
25
26 How The Input Is Matched
27 the rules for determining what has been matched
28
29 Actions
30 how to specify what to do when a pattern is matched
31
32 The Generated Scanner
33 details regarding the scanner that flex produces;
34 how to control the input source
35
36 Start Conditions
37 introducing context into your scanners, and
38 managing "mini-scanners"
39
40 Multiple Input Buffers
41 how to manipulate multiple input sources; how to
42 scan from strings instead of files
43
44 End-of-file Rules
45 special rules for matching the end of the input
46
47 Miscellaneous Macros
48 a summary of macros available to the actions
49
50 Values Available To The User
51 a summary of values available to the actions
52
53 Interfacing With Yacc
54 connecting flex scanners together with yacc parsers
55
56 Options
57 flex command-line options, and the "%option"
58 directive
59
60 Performance Considerations
61 how to make your scanner go as fast as possible
62
63 Generating C++ Scanners
64 the (experimental) facility for generating C++
65 scanner classes
66
67 Incompatibilities With Lex And POSIX
68 how flex differs from AT&T lex and the POSIX lex
69 standard
70
71 Diagnostics
72 those error messages produced by flex (or scanners
73 it generates) whose meanings might not be apparent
74
75 Files
76 files used by flex
77
78 Deficiencies / Bugs
79 known problems with flex
80
81 See Also
82 other documentation, related tools
83
84 Author
85 includes contact information
86
87.fi
88.SH DESCRIPTION
89.I flex
90is a tool for generating
91.I scanners:
92programs which recognize lexical patterns in text.
93.I flex
94reads
95the given input files, or its standard input if no file names are given,

[6 unchanged lines not shown]

102which defines a routine
103.B yylex().
104This file is compiled and linked with the
105.B \-ll
106library to produce an executable. When the executable is run,
107it analyzes its input for occurrences
108of the regular expressions. Whenever it finds one, it executes
109the corresponding C code.
110.SH SOME SIMPLE EXAMPLES
111.PP
112First some simple examples to get the flavor of how one uses
113.I flex.
114The following
115.I flex
116input specifies a scanner which, whenever it encounters the string
117"username", will replace it with the user's login name:
118.nf
119
120 %%
121 username printf( "%s", getlogin() );
122
123.fi
124By default, any text not matched by a
125.I flex
126scanner
127is copied to the output, so the net effect of this scanner is
128to copy its input file to its output with each occurrence
129of "username" expanded.
130In this input, there is just one rule. "username" is the
131.I pattern
132and the "printf" is the
133.I action.
134The "%%" marks the beginning of the rules.
135.PP
136Here's another simple example:
137.nf
138
139 int num_lines = 0, num_chars = 0;
140
141 %%
142 \\n ++num_lines; ++num_chars;
143 . ++num_chars;
144
145 %%
146 main()
147 {
148 yylex();
149 printf( "# of lines = %d, # of chars = %d\\n",
150 num_lines, num_chars );
151 }
152
153.fi
154This scanner counts the number of characters and the number
155of lines in its input (it produces no output other than the
156final report on the counts). The first line
157declares two globals, "num_lines" and "num_chars", which are accessible
158both inside
159.B yylex()
160and in the
161.B main()
162routine declared after the second "%%". There are two rules, one
163which matches a newline ("\\n") and increments both the line count and
164the character count, and one which matches any character other than
165a newline (indicated by the "." regular expression).
166.PP
167A somewhat more complicated example:
168.nf
169
170 /* scanner for a toy Pascal-like language */
171
172 %{
173 /* need this for the calls to atoi() and atof() below */
174 #include <stdlib.h>
175 %}
176
177 DIGIT [0-9]
178 ID [a-z][a-z0-9]*
179
180 %%
181
182 {DIGIT}+ {
183 printf( "An integer: %s (%d)\\n", yytext,
184 atoi( yytext ) );
185 }
186
187 {DIGIT}+"."{DIGIT}* {
188 printf( "A float: %s (%g)\\n", yytext,
189 atof( yytext ) );
190 }
191
192 if|then|begin|end|procedure|function {
193 printf( "A keyword: %s\\n", yytext );
194 }
195
196 {ID} printf( "An identifier: %s\\n", yytext );
197
198 "+"|"-"|"*"|"/" printf( "An operator: %s\\n", yytext );
199
200 "{"[^}\\n]*"}" /* eat up one-line comments */
201
202 [ \\t\\n]+ /* eat up whitespace */
203
204 . printf( "Unrecognized character: %s\\n", yytext );
205
206 %%
207
208 main( argc, argv )
209 int argc;
210 char **argv;
211 {
212 ++argv, --argc; /* skip over program name */
213 if ( argc > 0 )
214 yyin = fopen( argv[0], "r" );
215 else
216 yyin = stdin;
217
218 yylex();
219 }
220
221.fi
222This is the beginning of a simple scanner for a language like
223Pascal. It identifies different types of
224.I tokens
225and reports on what it has seen.
226.PP
227The details of this example will be explained in the following
228sections.
229.SH FORMAT OF THE INPUT FILE
230The
231.I flex
232input file consists of three sections, separated by a line with just
233.B %%
234in it:
235.nf
236
237 definitions
238 %%
239 rules
240 %%
241 user code
242
243.fi
244The
245.I definitions
246section contains declarations of simple
247.I name
248definitions to simplify the scanner specification, and declarations of
249.I start conditions,
250which are explained in a later section.
251.PP
252Name definitions have the form:
253.nf
254
255 name definition
256
257.fi
258The "name" is a word beginning with a letter or an underscore ('_')
259followed by zero or more letters, digits, '_', or '-' (dash).
260The definition is taken to begin at the first non-white-space character
261following the name and continuing to the end of the line.
262The definition can subsequently be referred to using "{name}", which
263will expand to "(definition)". For example,
264.nf
265
266 DIGIT [0-9]
267 ID [a-z][a-z0-9]*
268
269.fi
270defines "DIGIT" to be a regular expression which matches a
271single digit, and
272"ID" to be a regular expression which matches a letter
273followed by zero-or-more letters-or-digits.
274A subsequent reference to
275.nf
276
277 {DIGIT}+"."{DIGIT}*
278
279.fi
280is identical to
281.nf
282
283 ([0-9])+"."([0-9])*
284
285.fi
286and matches one-or-more digits followed by a '.' followed
287by zero-or-more digits.
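.PP
The expansion described above can be sketched in plain C. This is only an
illustration, not how flex implements it: the type name_def and the function
expand_names() are hypothetical, and the sketch ignores quoting, so a '{'
inside a quoted string would be treated like any other.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One name definition from the definitions section (hypothetical type). */
struct name_def {
    const char *name;
    const char *definition;
};

/*
 * Expand every "{name}" in pattern into "(definition)", writing the
 * result to out (capacity out_size). Unknown names and brace forms such
 * as "{2,5}" are copied unchanged. Returns 0 on success, -1 if out is
 * too small.
 */
static int expand_names(const char *pattern, const struct name_def *defs,
                        size_t ndefs, char *out, size_t out_size)
{
    size_t used = 0;

    assert(pattern != NULL && out != NULL && out_size > 0);
    while (*pattern != '\0') {
        const char *close = (*pattern == '{') ? strchr(pattern, '}') : NULL;
        const char *def = NULL;
        size_t i;

        if (close != NULL) {
            size_t len = (size_t)(close - pattern - 1);

            for (i = 0; i < ndefs; ++i)
                if (strlen(defs[i].name) == len &&
                    strncmp(defs[i].name, pattern + 1, len) == 0)
                    def = defs[i].definition;
        }
        if (def != NULL) {
            /* Wrap the definition in parentheses, as "{name}" does. */
            int n = snprintf(out + used, out_size - used, "(%s)", def);

            if (n < 0 || (size_t)n >= out_size - used)
                return -1;
            used += (size_t)n;
            pattern = close + 1;
        } else {
            if (used + 1 >= out_size)
                return -1;
            out[used++] = *pattern++;
        }
    }
    out[used] = '\0';
    return 0;
}
```

The parentheses matter: without them, a definition containing '|' would
combine with neighboring operators in unintended ways.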
288.PP
289The
290.I rules
291section of the
292.I flex
293input contains a series of rules of the form:
294.nf
295
296 pattern action
297
298.fi
299where the pattern must be unindented and the action must begin
300on the same line.
301.PP
302See below for a further description of patterns and actions.
303.PP
304Finally, the user code section is simply copied to
305.B lex.yy.c
306verbatim.
307It is used for companion routines which call or are called
308by the scanner. The presence of this section is optional;
309if it is missing, the second
310.B %%
311in the input file may be skipped, too.
312.PP
313In the definitions and rules sections, any
314.I indented
315text or text enclosed in
316.B %{
317and
318.B %}
319is copied verbatim to the output (with the %{}'s removed).
320The %{}'s must appear unindented on lines by themselves.
321.PP
322In the rules section,
323any indented or %{} text appearing before the
324first rule may be used to declare variables
325which are local to the scanning routine and (after the declarations)
326code which is to be executed whenever the scanning routine is entered.
327Other indented or %{} text in the rule section is still copied to the output,
328but its meaning is not well-defined and it may well cause compile-time
329errors (this feature is present for
330.I POSIX
331compliance; see below for other such features).
332.PP
333In the definitions section (but not in the rules section),
334an unindented comment (i.e., a line
335beginning with "/*") is also copied verbatim to the output up
336to the next "*/".
337.SH PATTERNS
338The patterns in the input are written using an extended set of regular
339expressions. These are:
340.nf
341
342 x match the character 'x'
343 . any character (byte) except newline
344 [xyz] a "character class"; in this case, the pattern
345 matches either an 'x', a 'y', or a 'z'
346 [abj-oZ] a "character class" with a range in it; matches
347 an 'a', a 'b', any letter from 'j' through 'o',
348 or a 'Z'
349 [^A-Z] a "negated character class", i.e., any character
350 but those in the class. In this case, any
351 character EXCEPT an uppercase letter.
352 [^A-Z\\n] any character EXCEPT an uppercase letter or
353 a newline
354 r* zero or more r's, where r is any regular expression
355 r+ one or more r's
356 r? zero or one r's (that is, "an optional r")
357 r{2,5} anywhere from two to five r's
358 r{2,} two or more r's
359 r{4} exactly 4 r's
360 {name} the expansion of the "name" definition
361 (see above)
362 "[xyz]\\"foo"
363 the literal string: [xyz]"foo
364 \\X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
365 then the ANSI-C interpretation of \\x.
366 Otherwise, a literal 'X' (used to escape
367 operators such as '*')
368 \\0 a NUL character (ASCII code 0)
369 \\123 the character with octal value 123
370 \\x2a the character with hexadecimal value 2a
371 (r) match an r; parentheses are used to override
372 precedence (see below)
373
374
375 rs the regular expression r followed by the
376 regular expression s; called "concatenation"
377
378
379 r|s either an r or an s
380
381
382 r/s an r but only if it is followed by an s. The
383 text matched by s is included when determining
384 whether this rule is the "longest match",
385 but is then returned to the input before
386 the action is executed. So the action only
387 sees the text matched by r. This type
388 of pattern is called "trailing context".
389 (There are some combinations of r/s that flex
390 cannot match correctly; see notes in the
391 Deficiencies / Bugs section below regarding
392 "dangerous trailing context".)
393 ^r an r, but only at the beginning of a line (i.e.,
394 when just starting to scan, or right after a
395 newline has been scanned).
396 r$ an r, but only at the end of a line (i.e., just
397 before a newline). Equivalent to "r/\\n".
398
399 Note that flex's notion of "newline" is exactly
400 whatever the C compiler used to compile flex
401 interprets '\\n' as; in particular, on some DOS
402 systems you must either filter out \\r's in the
403 input yourself, or explicitly use r/\\r\\n for "r$".
404
405
406 <s>r an r, but only in start condition s (see
407 below for discussion of start conditions)
408 <s1,s2,s3>r
409 same, but in any of start conditions s1,
410 s2, or s3
411 <*>r an r in any start condition, even an exclusive one.
412
413
414 <<EOF>> an end-of-file
415 <s1,s2><<EOF>>
416 an end-of-file when in start condition s1 or s2
417
418.fi
419Note that inside of a character class, all regular expression operators
420lose their special meaning except escape ('\\') and the character class
421operators, '-', ']', and, at the beginning of the class, '^'.
422.PP
423The regular expressions listed above are grouped according to
424precedence, from highest precedence at the top to lowest at the bottom.
425Those grouped together have equal precedence. For example,
426.nf
427
428 foo|bar*
429
430.fi
431is the same as
432.nf
433
434 (foo)|(ba(r*))
435
436.fi
437since the '*' operator has higher precedence than concatenation,
438and concatenation higher than alternation ('|'). This pattern
439therefore matches
440.I either
441the string "foo"
442.I or
443the string "ba" followed by zero-or-more r's.
444To match "foo" or zero-or-more "bar"'s, use:
445.nf
446
447 foo|(bar)*
448
449.fi
450and to match zero-or-more "foo"'s-or-"bar"'s:
451.nf
452
453 (foo|bar)*
454
455.fi
456.PP
457In addition to characters and ranges of characters, character classes
458can also contain character class
459.I expressions.
460These are expressions enclosed inside
461.B [:
462and
463.B :]
464delimiters (which themselves must appear between the '[' and ']' of the
465character class; other elements may occur inside the character class, too).
466The valid expressions are:
467.nf
468
469 [:alnum:] [:alpha:] [:blank:]
470 [:cntrl:] [:digit:] [:graph:]
471 [:lower:] [:print:] [:punct:]
472 [:space:] [:upper:] [:xdigit:]
473
474.fi
475These expressions all designate a set of characters equivalent to
476the corresponding standard C
477.B isXXX
478function. For example,
479.B [:alnum:]
480designates those characters for which
481.B isalnum()
482returns true, i.e., any alphabetic or numeric character.
483Some systems don't provide
484.B isblank(),
485so flex defines
486.B [:blank:]
487as a blank or a tab.
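.PP
As a rough sketch of that correspondence, the following C function maps each
expression name to the matching isXXX test, with [:blank:] handled as a blank
or a tab. The function name in_class_expr() is hypothetical and is not part
of flex; the result also depends on the C locale in effect.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Return 1 if byte c belongs to the character class expression named by
 * expr (for example "alnum" for [:alnum:]), and 0 otherwise. Unknown
 * names match nothing.
 */
static int in_class_expr(const char *expr, int c)
{
    unsigned char uc = (unsigned char)c;

    assert(expr != NULL);
    if (strcmp(expr, "alnum") == 0)  return isalnum(uc) != 0;
    if (strcmp(expr, "alpha") == 0)  return isalpha(uc) != 0;
    if (strcmp(expr, "blank") == 0)  return uc == ' ' || uc == '\t';
    if (strcmp(expr, "cntrl") == 0)  return iscntrl(uc) != 0;
    if (strcmp(expr, "digit") == 0)  return isdigit(uc) != 0;
    if (strcmp(expr, "graph") == 0)  return isgraph(uc) != 0;
    if (strcmp(expr, "lower") == 0)  return islower(uc) != 0;
    if (strcmp(expr, "print") == 0)  return isprint(uc) != 0;
    if (strcmp(expr, "punct") == 0)  return ispunct(uc) != 0;
    if (strcmp(expr, "space") == 0)  return isspace(uc) != 0;
    if (strcmp(expr, "upper") == 0)  return isupper(uc) != 0;
    if (strcmp(expr, "xdigit") == 0) return isxdigit(uc) != 0;
    return 0;
}
```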
488.PP
489For example, the following character classes are all equivalent:
490.nf
491
492 [[:alnum:]]
493 [[:alpha:][:digit:]]
494 [[:alpha:]0-9]
495 [a-zA-Z0-9]
496
497.fi
498If your scanner is case-insensitive (the
499.B \-i
500flag), then
501.B [:upper:]
502and
503.B [:lower:]
504are equivalent to
505.B [:alpha:].
506.PP
507Some notes on patterns:
508.IP -
509A negated character class such as the example "[^A-Z]"
510above
511.I will match a newline
512unless "\\n" (or an equivalent escape sequence) is one of the
513characters explicitly present in the negated character class
514(e.g., "[^A-Z\\n]"). This is unlike how many other regular
515expression tools treat negated character classes, but unfortunately
516the inconsistency is historically entrenched.
517Matching newlines means that a pattern like [^"]* can match the entire
518input unless there's another quote in the input.
519.IP -
520A rule can have at most one instance of trailing context (the '/' operator
521or the '$' operator). The start condition, '^', and "<<EOF>>" patterns
522can only occur at the beginning of a pattern, and, as well as with '/' and '$',
523cannot be grouped inside parentheses. A '^' which does not occur at
524the beginning of a rule or a '$' which does not occur at the end of
525a rule loses its special properties and is treated as a normal character.
526.IP
527The following are illegal:
528.nf
529
530 foo/bar$
531 <sc1>foo<sc2>bar
532
533.fi
534Note that the first of these can be written "foo/bar\\n".
535.IP
536The following will result in '$' or '^' being treated as a normal character:
537.nf
538
539 foo|(bar$)
540 foo|^bar
541
542.fi
543If what's wanted is a "foo" or a bar-followed-by-a-newline, the following
544could be used (the special '|' action is explained below):
545.nf
546
547 foo |
548 bar$ /* action goes here */
549
550.fi
551A similar trick will work for matching a foo or a
552bar-at-the-beginning-of-a-line.
553.SH HOW THE INPUT IS MATCHED
554When the generated scanner is run, it analyzes its input looking
555for strings which match any of its patterns. If it finds more than
556one match, it takes the one matching the most text (for trailing
557context rules, this includes the length of the trailing part, even
558though it will then be returned to the input). If it finds two
559or more matches of the same length, the
560rule listed first in the
561.I flex
562input file is chosen.
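.PP
The selection rule can be illustrated with a small C sketch. It assumes,
purely for simplicity, that every rule is a literal string rather than a
regular expression; choose_rule() is a hypothetical name, and a real
flex scanner uses generated tables instead of a loop like this.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Pick the rule to fire at the start of input. Returns the index of the
 * winning rule and stores the match length in *match_len, or returns -1
 * (with *match_len set to 0) if no rule matches, in which case the
 * default rule would apply.
 */
static int choose_rule(const char *const *rules, size_t nrules,
                       const char *input, size_t *match_len)
{
    int best = -1;
    size_t best_len = 0;
    size_t i;

    assert(rules != NULL && input != NULL && match_len != NULL);
    for (i = 0; i < nrules; ++i) {
        size_t len = strlen(rules[i]);

        if (len == 0 || strncmp(input, rules[i], len) != 0)
            continue;           /* rule i does not match here */
        if (len > best_len) {   /* strictly longer only, so ties keep */
            best = (int)i;      /* the rule listed first */
            best_len = len;
        }
    }
    *match_len = best_len;
    return best;
}
```

With the rules "if", "iffy", "iffy" and the input "iffy!", the second rule
wins: it matches more text than the first, and it is listed before the third.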
563.PP
564Once the match is determined, the text corresponding to the match
565(called the
566.I token)
567is made available in the global character pointer
568.B yytext,
569and its length in the global integer
570.B yyleng.
571The
572.I action
573corresponding to the matched pattern is then executed (a more
574detailed description of actions follows), and then the remaining
575input is scanned for another match.
576.PP
577If no match is found, then the
578.I default rule
579is executed: the next character in the input is considered matched and
580copied to the standard output. Thus, the simplest legal
581.I flex
582input is:
583.nf
584
585 %%
586
587.fi
588which generates a scanner that simply copies its input (one character
589at a time) to its output.
590.PP
591Note that
592.B yytext
593can be defined in two different ways: either as a character
594.I pointer
595or as a character
596.I array.
597You can control which definition
598.I flex
599uses by including one of the special directives
600.B %pointer
601or
602.B %array
603in the first (definitions) section of your flex input. The default is
604.B %pointer,
605unless you use the
606.B -l
607lex compatibility option, in which case
608.B yytext
609will be an array.
610The advantage of using
611.B %pointer
612is substantially faster scanning and no buffer overflow when matching
613very large tokens (unless you run out of dynamic memory). The disadvantage
614is that you are restricted in how your actions can modify
615.B yytext
616(see the next section), and calls to the
617.B unput()
618function destroy the present contents of
619.B yytext,
620which can be a considerable porting headache when moving between different
621.I lex
622versions.
623.PP
624The advantage of
625.B %array
626is that you can then modify
627.B yytext
628to your heart's content, and calls to
629.B unput()
630do not destroy
631.B yytext
632(see below). Furthermore, existing
633.I lex
634programs sometimes access
635.B yytext
636externally using declarations of the form:
637.nf
638 extern char yytext[];
639.fi
640This definition is erroneous when used with
641.B %pointer,
642but correct for
643.B %array.
644.PP
645.B %array
646defines
647.B yytext
648to be an array of
649.B YYLMAX
650characters, which defaults to a fairly large value. You can change
651the size by simply #define'ing
652.B YYLMAX
653to a different value in the first section of your
654.I flex
655input. As mentioned above, with
656.B %pointer
657yytext grows dynamically to accommodate large tokens. While this means your
658.B %pointer
659scanner can accommodate very large tokens (such as matching entire blocks
660of comments), bear in mind that each time the scanner must resize
661.B yytext
662it also must rescan the entire token from the beginning, so matching such
663tokens can prove slow.
664.B yytext
665presently does
666.I not
667dynamically grow if a call to
668.B unput()
669results in too much text being pushed back; instead, a run-time error results.
670.PP
671Also note that you cannot use
672.B %array
673with C++ scanner classes
674(the
675.B c++
676option; see below).
677.SH ACTIONS
678Each pattern in a rule has a corresponding action, which can be any
679arbitrary C statement. The pattern ends at the first non-escaped
680whitespace character; the remainder of the line is its action. If the
681action is empty, then when the pattern is matched the input token
682is simply discarded. For example, here is the specification for a program
683which deletes all occurrences of "zap me" from its input:
684.nf
685
686 %%
687 "zap me"
688
689.fi
690(It will copy all other characters in the input to the output since
691they will be matched by the default rule.)
692.PP
693Here is a program which compresses multiple blanks and tabs down to
694a single blank, and throws away whitespace found at the end of a line:
695.nf
696
697 %%
698 [ \\t]+ putchar( ' ' );
699 [ \\t]+$ /* ignore this token */
700
701.fi
702.PP
703If the action contains a '{', then the action spans till the balancing '}'
704is found, and the action may cross multiple lines.
705.I flex
706knows about C strings and comments and won't be fooled by braces found
707within them, but also allows actions to begin with
708.B %{
709and will consider the action to be all the text up to the next
710.B %}
711(regardless of ordinary braces inside the action).
712.PP
713An action consisting solely of a vertical bar ('|') means "same as
714the action for the next rule." See below for an illustration.
715.PP
716Actions can include arbitrary C code, including
717.B return
718statements to return a value to whatever routine called
719.B yylex().
720Each time
721.B yylex()
722is called it continues processing tokens from where it last left
723off until it either reaches
724the end of the file or executes a return.
725.PP
726Actions are free to modify
727.B yytext
728except for lengthening it (adding
729characters to its end--these will overwrite later characters in the
730input stream). This however does not apply when using
731.B %array
732(see above); in that case,
733.B yytext
734may be freely modified in any way.
735.PP
736Actions are free to modify
737.B yyleng
738except they should not do so if the action also includes use of
739.B yymore()
740(see below).
741.PP
742There are a number of special directives which can be included within
743an action:
744.IP -
745.B ECHO
746copies yytext to the scanner's output.
747.IP -
748.B BEGIN
749followed by the name of a start condition places the scanner in the
750corresponding start condition (see below).
751.IP -
752.B REJECT
753directs the scanner to proceed on to the "second best" rule which matched the
754input (or a prefix of the input). The rule is chosen as described
755above in "How the Input is Matched", and
756.B yytext
757and
758.B yyleng
759set up appropriately.
760It may either be one which matched as much text
761as the originally chosen rule but came later in the
762.I flex
763input file, or one which matched less text.
764For example, the following will both count the
765words in the input and call the routine special() whenever "frob" is seen:
766.nf
767
768 int word_count = 0;
769 %%
770
771 frob special(); REJECT;
772 [^ \\t\\n]+ ++word_count;
773
774.fi
775Without the
776.B REJECT,
777any "frob"'s in the input would not be counted as words, since the
778scanner normally executes only one action per token.
779Multiple
780.B REJECT's
781are allowed, each one finding the next best choice to the currently
782active rule. For example, when the following scanner scans the token
783"abcd", it will write "abcdabcaba" to the output:
784.nf
785
786 %%
787 a |
788 ab |
789 abc |
790 abcd ECHO; REJECT;
791 .|\\n /* eat up any unmatched character */
792
793.fi
794(The first three rules share the fourth's action since they use
795the special '|' action.)
796.B REJECT
797is a particularly expensive feature in terms of scanner performance;
798if it is used in
799.I any
800of the scanner's actions it will slow down
801.I all
802of the scanner's matching. Furthermore,
803.B REJECT
804cannot be used with the
805.I -Cf
806or
807.I -CF
808options (see below).
809.IP
810Note also that unlike the other special actions,
811.B REJECT
812is a
813.I branch;
814code immediately following it in the action will
815.I not
816be executed.
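.IP
The "abcd" example above can be traced with a small C simulation. This is a
sketch under simplifying assumptions: rules are literal strings, every rule's
action is "ECHO; REJECT;", and the final catch-all rule is left out because
it echoes nothing. The name echo_and_reject() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Simulate "ECHO; REJECT;" at one input position. Starting from the best
 * match, each REJECT falls through to the next best one: a shorter match,
 * or an equally long match listed later. Echoed text is appended to out
 * (capacity out_size). Returns the number of rules that fired.
 */
static int echo_and_reject(const char *const *rules, size_t nrules,
                           const char *input, char *out, size_t out_size)
{
    size_t prev_len = 0, prev_idx = 0;
    int fired = 0;

    assert(rules != NULL && input != NULL && out != NULL && out_size > 0);
    out[0] = '\0';
    for (;;) {
        size_t best_len = 0, best_idx = 0, i;
        int found = 0;

        for (i = 0; i < nrules; ++i) {
            size_t len = strlen(rules[i]);

            if (len == 0 || strncmp(input, rules[i], len) != 0)
                continue;       /* rule i does not match here */
            /* After a REJECT, only worse choices remain eligible. */
            if (fired && !(len < prev_len ||
                           (len == prev_len && i > prev_idx)))
                continue;
            if (!found || len > best_len) {
                found = 1;
                best_len = len;
                best_idx = i;
            }
        }
        if (!found || strlen(out) + best_len + 1 > out_size)
            break;
        strncat(out, input, best_len);      /* the ECHO */
        prev_len = best_len;                /* the REJECT */
        prev_idx = best_idx;
        ++fired;
    }
    return fired;
}
```

Re-examining every candidate after each REJECT, as this loop does, hints at
why the feature is costly for the scanner as a whole.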
817.IP -
818.B yymore()
819tells the scanner that the next time it matches a rule, the corresponding
820token should be
821.I appended
822onto the current value of
823.B yytext
824rather than replacing it. For example, given the input "mega-kludge"
825the following will write "mega-mega-kludge" to the output:
826.nf
827
828 %%
829 mega- ECHO; yymore();
830 kludge ECHO;
831
832.fi
833First "mega-" is matched and echoed to the output. Then "kludge"
834is matched, but the previous "mega-" is still hanging around at the
835beginning of
836.B yytext
837so the
838.B ECHO
839for the "kludge" rule will actually write "mega-kludge".
840.PP
841Two notes regarding use of
842.B yymore().
843First,
844.B yymore()
845depends on the value of
846.I yyleng
847correctly reflecting the size of the current token, so you must not
848modify
849.I yyleng
850if you are using
851.B yymore().
852Second, the presence of
853.B yymore()
854in the scanner's action entails a minor performance penalty in the
855scanner's matching speed.
856.IP -
857.B yyless(n)
858returns all but the first
859.I n
860characters of the current token back to the input stream, where they
861will be rescanned when the scanner looks for the next match.
862.B yytext
863and
864.B yyleng
865are adjusted appropriately (e.g.,
866.B yyleng
867will now be equal to
868.I n
869). For example, on the input "foobar" the following will write out
870"foobarbar":
871.nf
872
873 %%
874 foobar ECHO; yyless(3);
875 [a-z]+ ECHO;
876
877.fi
878An argument of 0 to
879.B yyless
880will cause the entire current input string to be scanned again. Unless you've
881changed how the scanner will subsequently process its input (using
882.B BEGIN,
883for example), this will result in an endless loop.
884.PP
885Note that
886.B yyless
887is a macro and can only be used in the flex input file, not from
888other source files.
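.IP
The bookkeeping behind this can be pictured with a few lines of C. The
structure token_state and the function keep_first() are hypothetical
stand-ins for the scanner's internal state, which is organized differently
in the generated code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the part of the scanner state that matters. */
struct token_state {
    size_t start;   /* offset of the current token in the input */
    size_t leng;    /* its length, in the role of yyleng */
    size_t next;    /* offset where the next match will begin */
};

/* Like yyless(n): keep the first n characters, return the rest to the input. */
static void keep_first(struct token_state *st, size_t n)
{
    assert(st != NULL && n <= st->leng);
    st->leng = n;
    st->next = st->start + n;
}
```

For the token "foobar" at offset 0, keep_first(&st, 3) leaves a three-character
token and resumes scanning at "bar". With n equal to 0 the next match begins
where the current one did, which is the endless-loop case described above.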
889.IP -
890.B unput(c)
891puts the character
892.I c
893back onto the input stream. It will be the next character scanned.
894The following action will take the current token and cause it
895to be rescanned enclosed in parentheses.
896.nf
897
898 {
899 int i;
900 /* Copy yytext because unput() trashes yytext */
901 char *yycopy = strdup( yytext );
902 unput( ')' );
903 for ( i = yyleng - 1; i >= 0; --i )
904 unput( yycopy[i] );
905 unput( '(' );
906 free( yycopy );
907 }
908
909.fi
910Note that since each
911.B unput()
912puts the given character back at the
913.I beginning
914of the input stream, pushing back strings must be done back-to-front.
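.IP
A minimal C model of a pushback stack shows why the order matters. The names
pushback, push_char(), next_char() and push_string() are hypothetical; they
only mimic the behavior described here and are not flex routines.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A tiny pushback stack standing in for the scanner's input stream. */
struct pushback {
    char buf[64];
    size_t top;     /* number of characters currently pushed back */
};

/* Like unput(c): c becomes the next character to be read. */
static void push_char(struct pushback *pb, char c)
{
    assert(pb != NULL && pb->top < sizeof pb->buf);
    pb->buf[pb->top++] = c;
}

/* Read the next character, or -1 when nothing is pushed back. */
static int next_char(struct pushback *pb)
{
    assert(pb != NULL);
    return pb->top > 0 ? (unsigned char)pb->buf[--pb->top] : -1;
}

/* Push a whole string so that it is read back in its original order. */
static void push_string(struct pushback *pb, const char *s)
{
    size_t i = strlen(s);

    while (i > 0)
        push_char(pb, s[--i]);  /* last character first */
}
```

Pushing ')' first, then the token back-to-front, then '(' makes the
characters come back out as "(token)".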
915.PP
916An important potential problem when using
917.B unput()
918is that if you are using
919.B %pointer
920(the default), a call to
921.B unput()
922.I destroys
923the contents of
924.I yytext,
925starting with its rightmost character and devouring one character to
926the left with each call. If you need the value of yytext preserved
927after a call to
928.B unput()
929(as in the above example),
930you must either first copy it elsewhere, or build your scanner using
931.B %array
932instead (see How The Input Is Matched).
933.PP
934Finally, note that you cannot put back
935.B EOF
936to attempt to mark the input stream with an end-of-file.
937.IP -
938.B input()
939reads the next character from the input stream. For example,
940the following is one way to eat up C comments:
941.nf
942
943 %%
944 "/*" {
945 register int c;
946
947 for ( ; ; )
948 {
949 while ( (c = input()) != '*' &&
950 c != EOF )
951 ; /* eat up text of comment */
952
953 if ( c == '*' )
954 {
955 while ( (c = input()) == '*' )
956 ;
957 if ( c == '/' )
958 break; /* found the end */
959 }
960
961 if ( c == EOF )
962 {
963 error( "EOF in comment" );
964 break;
965 }
966 }
967 }
968
969.fi
970(Note that if the scanner is compiled using
971.B C++,
972then
973.B input()
974is instead referred to as
975.B yyinput(),
976in order to avoid a name clash with the
977.B C++
978stream by the name of
979.I input.)
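.IP
The same loop can be tried outside a scanner. The sketch below walks a
NUL-terminated string instead of calling input(), treating the end of the
string as end-of-file; skip_comment_body() is a hypothetical helper, not
something flex provides.

```c
#include <assert.h>
#include <stddef.h>

/*
 * s points just past the opening delimiter of a C comment. Return a
 * pointer just past the closing delimiter, or NULL if the string ends
 * before the comment does.
 */
static const char *skip_comment_body(const char *s)
{
    assert(s != NULL);
    for (;;) {
        while (*s != '*' && *s != '\0')
            ++s;                /* eat up text of comment */
        if (*s == '\0')
            return NULL;        /* end of input inside the comment */
        while (*s == '*')
            ++s;                /* skip a run of stars */
        if (*s == '/')
            return s + 1;       /* found the end */
        if (*s == '\0')
            return NULL;
    }
}
```

The inner loop over repeated stars is what lets a comment ending in several
stars followed by a slash terminate correctly.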
980.IP -
981.B YY_FLUSH_BUFFER
982flushes the scanner's internal buffer
983so that the next time the scanner attempts to match a token, it will
984first refill the buffer using
985.B YY_INPUT
986(see The Generated Scanner, below). This action is a special case
987of the more general
988.B yy_flush_buffer()
989function, described below in the section Multiple Input Buffers.
990.IP -
991.B yyterminate()
992can be used in lieu of a return statement in an action. It terminates
993the scanner and returns a 0 to the scanner's caller, indicating "all done".
994By default,
995.B yyterminate()
996is also called when an end-of-file is encountered. It is a macro and
997may be redefined.
998.SH THE GENERATED SCANNER
999The output of
1000.I flex
1001is the file
1002.B lex.yy.c,
1003which contains the scanning routine
1004.B yylex(),
1005a number of tables used by it for matching tokens, and a number
1006of auxiliary routines and macros. By default,
1007.B yylex()
1008is declared as follows:
1009.nf
1010
1011 int yylex()
1012 {
1013 ... various definitions and the actions in here ...
1014 }
1015
1016.fi
1017(If your environment supports function prototypes, then it will
1018be "int yylex( void )".) This definition may be changed by defining
1019the "YY_DECL" macro. For example, you could use:
1020.nf
1021
1022 #define YY_DECL float lexscan( a, b ) float a, b;
1023
1024.fi
1025to give the scanning routine the name
1026.I lexscan,
1027returning a float, and taking two floats as arguments. Note that
1028if you give arguments to the scanning routine using a
1029K&R-style/non-prototyped function declaration, you must terminate
1030the definition with a semi-colon (;).
1031.PP
1032Whenever
1033.B yylex()
1034is called, it scans tokens from the global input file
1035.I yyin
1036(which defaults to stdin). It continues until it either reaches
1037an end-of-file (at which point it returns the value 0) or
1038one of its actions executes a
1039.I return
1040statement.
1041.PP
1042If the scanner reaches an end-of-file, subsequent calls are undefined
1043unless either
1044.I yyin
1045is pointed at a new input file (in which case scanning continues from
1046that file), or
1047.B yyrestart()
1048is called.
1049.B yyrestart()
1050takes one argument, a
1051.B FILE *
1052pointer (which can be NULL, if you've set up
1053.B YY_INPUT
1054to scan from a source other than
1055.I yyin),
1056and initializes
1057.I yyin
1058for scanning from that file. Essentially there is no difference between
1059just assigning
1060.I yyin
1061to a new input file or using
1062.B yyrestart()
1063to do so; the latter is available for compatibility with previous versions
1064of
1065.I flex,
1066and because it can be used to switch input files in the middle of scanning.
1067It can also be used to throw away the current input buffer, by calling
1068it with an argument of
1069.I yyin;
1070but it is better to use
1071.B YY_FLUSH_BUFFER
1072(see above).
1073Note that
1074.B yyrestart()
1075does
1076.I not
1077reset the start condition to
1078.B INITIAL
1079(see Start Conditions, below).
1080.PP
1081If
1082.B yylex()
1083stops scanning due to executing a
1084.I return
1085statement in one of the actions, the scanner may then be called again and it
1086will resume scanning where it left off.
1087.PP
1088By default (and for purposes of efficiency), the scanner uses
1089block-reads rather than simple
1090.I getc()
1091calls to read characters from
1092.I yyin.
1093The nature of how it gets its input can be controlled by defining the
1094.B YY_INPUT
1095macro.
1096YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its
1097action is to place up to
1098.I max_size
1099characters in the character array
1100.I buf
1101and return in the integer variable
1102.I result
1103either the
1104number of characters read or the constant YY_NULL (0 on Unix systems)
1105to indicate EOF. The default YY_INPUT reads from the
1106global file-pointer "yyin".
1107.PP
1108A sample definition of YY_INPUT (in the definitions
1109section of the input file):
1110.nf
1111
1112 %{
1113 #define YY_INPUT(buf,result,max_size) \\
1114 { \\
1115 int c = getchar(); \\
1116 result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \\
1117 }
1118 %}
1119
1120.fi
1121This definition will change the input processing to occur
1122one character at a time.
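.PP
The calling sequence can also be modeled outside of a scanner.
The following is a minimal sketch, not code taken from flex: the names
mem_source and mem_input are hypothetical, and a return value of 0 stands
in for YY_NULL. It delivers input in blocks of up to max_size characters
from an in-memory string:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory input source; not part of flex. */
struct mem_source {
    const char *data;   /* bytes still to be delivered */
    size_t      left;   /* number of bytes remaining in data */
};

/* Mimics the YY_INPUT contract: copy at most max_size bytes into buf
 * and return how many were copied; 0 plays the role of YY_NULL (EOF).
 */
int mem_input(struct mem_source *src, char *buf, int max_size)
{
    size_t n;

    assert(src != NULL && buf != NULL);
    if (max_size <= 0 || src->left == 0)
        return 0;
    n = src->left < (size_t) max_size ? src->left : (size_t) max_size;
    memcpy(buf, src->data, n);
    src->data += n;
    src->left -= n;
    return (int) n;
}
```

Assuming a variable src of type struct mem_source, a scanner could then
define YY_INPUT(buf,result,max_size) along the lines of
{ result = mem_input(&src, buf, max_size); }.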
1123.PP
1124When the scanner receives an end-of-file indication from YY_INPUT,
1125it then checks the
1126.B yywrap()
1127function. If
1128.B yywrap()
1129returns false (zero), then it is assumed that the
1130function has gone ahead and set up
1131.I yyin
1132to point to another input file, and scanning continues. If it returns
1133true (non-zero), then the scanner terminates, returning 0 to its
1134caller. Note that in either case, the start condition remains unchanged;
1135it does
1136.I not
1137revert to
1138.B INITIAL.
1139.PP
1140If you do not supply your own version of
1141.B yywrap(),
1142then you must either use
1143.B %option noyywrap
1144(in which case the scanner behaves as though
1145.B yywrap()
1146returned 1), or you must link with
1147.B \-ll
1148to obtain the default version of the routine, which always returns 1.
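.PP
A user-supplied yywrap() that steps through a list of files might look
roughly like the sketch below. It is written as a standalone fragment, so
it declares its own yyin (in a real scanner, flex defines it), and the
variable remaining_files is a hypothetical name for a NULL-terminated list
that the application would provide:

```c
#include <assert.h>
#include <stdio.h>

/* Stand-ins for this sketch only: in a real scanner, flex itself
 * defines yyin, and the file list would come from the application.
 */
FILE *yyin = NULL;
const char **remaining_files = NULL;  /* hypothetical NULL-terminated list */

/* Returns 0 after pointing yyin at the next readable file,
 * or 1 when no files are left, which ends the scan.
 */
int yywrap(void)
{
    if (yyin != NULL && yyin != stdin) {
        fclose(yyin);
        yyin = NULL;
    }
    /* Skip over files that cannot be opened. */
    while (remaining_files != NULL && *remaining_files != NULL) {
        yyin = fopen(*remaining_files++, "r");
        if (yyin != NULL)
            return 0;
    }
    return 1;
}
```

Silently skipping unreadable files is a choice made for this sketch; a
real program would probably report them.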
1149.PP
1150Three routines are available for scanning from in-memory buffers rather
1151than files:
1152.B yy_scan_string(), yy_scan_bytes(),
1153and
1154.B yy_scan_buffer().
1155See the discussion of them below in the section Multiple Input Buffers.
1156.PP
1157The scanner writes its
1158.B ECHO
1159output to the
1160.I yyout
1161global (default, stdout), which may be redefined by the user simply
1162by assigning it to some other
1163.B FILE
1164pointer.
1165.SH START CONDITIONS
1166.I flex
1167provides a mechanism for conditionally activating rules. Any rule
1168whose pattern is prefixed with "<sc>" will only be active when
1169the scanner is in the start condition named "sc". For example,
1170.nf
1171
1172 <STRING>[^"]* { /* eat up the string body ... */
1173 ...
1174 }
1175
1176.fi
1177will be active only when the scanner is in the "STRING" start
1178condition, and
1179.nf
1180
1181 <INITIAL,STRING,QUOTE>\\. { /* handle an escape ... */
1182 ...
1183 }
1184
1185.fi
1186will be active only when the current start condition is
1187either "INITIAL", "STRING", or "QUOTE".
1188.PP
1189Start conditions
1190are declared in the definitions (first) section of the input
1191using unindented lines beginning with either
1192.B %s
1193or
1194.B %x
1195followed by a list of names.
1196The former declares
1197.I inclusive
1198start conditions, the latter
1199.I exclusive
1200start conditions. A start condition is activated using the
1201.B BEGIN
1202action. Until the next
1203.B BEGIN
1204action is executed, rules with the given start
1205condition will be active and
1206rules with other start conditions will be inactive.
1207If the start condition is
1208.I inclusive,
1209then rules with no start conditions at all will also be active.
1210If it is
1211.I exclusive,
1212then
1213.I only
1214rules qualified with the start condition will be active.
1215A set of rules contingent on the same exclusive start condition
1216describes a scanner which is independent of any of the other rules in the
1217.I flex
1218input. Because of this,
1219exclusive start conditions make it easy to specify "mini-scanners"
1220which scan portions of the input that are syntactically different
1221from the rest (e.g., comments).
1222.PP
1223If the distinction between inclusive and exclusive start conditions
1224is still a little vague, here's a simple example illustrating the
1225connection between the two. The set of rules:
1226.nf
1227
1228 %s example
1229 %%
1230
1231 <example>foo do_something();
1232
1233 bar something_else();
1234
1235.fi
1236is equivalent to
1237.nf
1238
1239 %x example
1240 %%
1241
1242 <example>foo do_something();
1243
1244 <INITIAL,example>bar something_else();
1245
1246.fi
1247Without the
1248.B <INITIAL,example>
1249qualifier, the
1250.I bar
1251pattern in the second example wouldn't be active (i.e., couldn't match)
1252when in start condition
1253.B example.
1254If we just used
1255.B <example>
1256to qualify
1257.I bar,
1258though, then it would only be active in
1259.B example
1260and not in
1261.B INITIAL,
1262while in the first example it's active in both, because in the first
1263example the
1264.B example
1265start condition is an
1266.I inclusive
1267.B (%s)
1268start condition.
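.PP
The activation rule for inclusive and exclusive start conditions can be
summarized as a small C model. This is only an illustration of the
behavior described above, not how flex implements it; struct rule,
rule_is_active and SC_INITIAL are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

/* Start conditions are small integers; 0 plays the role of INITIAL. */
enum { SC_INITIAL = 0 };

struct rule {
    const int *scs;    /* start conditions named in <...>, or NULL if none */
    size_t     n_scs;  /* number of entries in scs */
};

/* Returns 1 if the rule can match while the scanner is in start
 * condition `current`. `current_is_exclusive` is 1 if `current` was
 * declared with %x and 0 if it was declared with %s or is INITIAL.
 */
int rule_is_active(const struct rule *r, int current, int current_is_exclusive)
{
    size_t i;

    assert(r != NULL);
    if (r->n_scs == 0)               /* rule with no start conditions */
        return !current_is_exclusive;
    for (i = 0; i < r->n_scs; i++)
        if (r->scs[i] == current)
            return 1;
    return 0;
}
```

With this model, an unqualified rule such as the one for "bar" is active
in an inclusive condition and inactive in an exclusive one, unless the
exclusive condition is listed explicitly.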
1269.PP
1270Also note that the special start-condition specifier
1271.B <*>
1272matches every start condition. Thus, the above example could also
1273have been written:
1274.nf
1275
1276 %x example
1277 %%
1278
1279 <example>foo do_something();
1280
1281 <*>bar something_else();
1282
1283.fi
1284.PP
1285The default rule (to
1286.B ECHO
1287any unmatched character) remains active in start conditions. It
1288is equivalent to:
1289.nf
1290
1291 <*>.|\\n ECHO;
1292
1293.fi
1294.PP
1295.B BEGIN(0)
1296returns to the original state where only the rules with
1297no start conditions are active. This state can also be
1298referred to as the start-condition "INITIAL", so
1299.B BEGIN(INITIAL)
1300is equivalent to
1301.B BEGIN(0).
1302(The parentheses around the start condition name are not required but
1303are considered good style.)
1304.PP
1305.B BEGIN
1306actions can also be given as indented code at the beginning
1307of the rules section. For example, the following will cause
1308the scanner to enter the "SPECIAL" start condition whenever
1309.B yylex()
1310is called and the global variable
1311.I enter_special
1312is true:
1313.nf
1314
1315 int enter_special;
1316
1317 %x SPECIAL
1318 %%
1319 if ( enter_special )
1320 BEGIN(SPECIAL);
1321
1322 <SPECIAL>blahblahblah
1323 ...more rules follow...
1324
1325.fi
1326.PP
1327To illustrate the uses of start conditions,
1328here is a scanner which provides two different interpretations
1329of a string like "123.456". By default it will treat it as
1330three tokens, the integer "123", a dot ('.'), and the integer "456".
1331But if the string is preceded earlier in the line by the string
1332"expect-floats"
1333it will treat it as a single token, the floating-point number
1334123.456:
1335.nf
1336
1337 %{
1338 #include <stdlib.h>
1339 %}
1340 %s expect
1341
1342 %%
1343 expect-floats BEGIN(expect);
1344
1345 <expect>[0-9]+"."[0-9]+ {
1346 printf( "found a float, = %f\\n",
1347 atof( yytext ) );
1348 }
1349 <expect>\\n {
1350 /* that's the end of the line, so
1351 * we need another "expect-floats"
1352 * before we'll recognize any more
1353 * numbers
1354 */
1355 BEGIN(INITIAL);
1356 }
1357
1358 [0-9]+ {
1359 printf( "found an integer, = %d\\n",
1360 atoi( yytext ) );
1361 }
1362
1363 "." printf( "found a dot\\n" );
1364
1365.fi
1366Here is a scanner which recognizes (and discards) C comments while
1367maintaining a count of the current input line.
1368.nf
1369
1370 %x comment
1371 %%
1372 int line_num = 1;
1373
1374 "/*" BEGIN(comment);
1375
1376 <comment>[^*\\n]* /* eat anything that's not a '*' */
1377 <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */
1378 <comment>\\n ++line_num;
1379 <comment>"*"+"/" BEGIN(INITIAL);
1380
1381.fi
1382This scanner goes to a bit of trouble to match as much
1383text as possible with each rule. In general, when attempting to write
1384a high-speed scanner, try to match as much as possible in each rule, as
1385it's a big win.
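.PP
For comparison, the two-state logic of the comment scanner can be written
by hand in C. The sketch below is not generated by flex and the name
strip_c_comments is hypothetical. Unlike the rules shown above, it also
counts newlines outside of comments and copies all non-comment text to an
output buffer:

```c
#include <assert.h>
#include <stddef.h>

enum state { ST_INITIAL, ST_COMMENT };

/* Copies `in` to `out` without C comments and returns the line number
 * reached at the end of the input (lines are counted from 1).
 * `out` must have room for strlen(in) + 1 bytes.
 */
int strip_c_comments(const char *in, char *out)
{
    enum state st = ST_INITIAL;
    int line_num = 1;

    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        if (*in == '\n')
            ++line_num;
        if (st == ST_INITIAL) {
            if (in[0] == '/' && in[1] == '*') {
                st = ST_COMMENT;     /* like BEGIN(comment) */
                in += 2;
                continue;
            }
            *out++ = *in++;          /* keep text outside comments */
        } else {
            if (in[0] == '*' && in[1] == '/') {
                st = ST_INITIAL;     /* like BEGIN(INITIAL) */
                in += 2;
                continue;
            }
            ++in;                    /* discard comment text */
        }
    }
    *out = '\0';
    return line_num;
}
```

The hand-written loop examines one character per iteration, whereas the
flex rules consume long runs of comment text with a single match.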
1386.PP
1387Note that start-condition names are really integer values and
1388can be stored as such. Thus, the above could be extended in the
1389following fashion:
1390.nf
1391
1392 %x comment foo
1393 %%
1394 int line_num = 1;
1395 int comment_caller;
1396
1397 "/*" {
1398 comment_caller = INITIAL;
1399 BEGIN(comment);
1400 }
1401
1402 ...
1403
1404 <foo>"/*" {
1405 comment_caller = foo;
1406 BEGIN(comment);
1407 }
1408
1409 <comment>[^*\\n]* /* eat anything that's not a '*' */
1410 <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */
1411 <comment>\\n ++line_num;
1412 <comment>"*"+"/" BEGIN(comment_caller);
1413
1414.fi
1415Furthermore, you can access the current start condition using
1416the integer-valued
1417.B YY_START
1418macro. For example, the above assignments to
1419.I comment_caller
1420could instead be written
1421.nf
1422
1423 comment_caller = YY_START;
1424
1425.fi
1426Flex provides
1427.B YYSTATE
1428as an alias for
1429.B YY_START
1430(since that is what's used by AT&T
1431.I lex).
1432.PP
1433Note that start conditions do not have their own name-space; %s's and %x's
1434declare names in the same fashion as #define's.
1435.PP
1436Finally, here's an example of how to match C-style quoted strings using
1437exclusive start conditions, including expanded escape sequences (but
1438not including checking for a string that's too long):
1439.nf
1440
1441 %x str
1442
1443 %%
1444 char string_buf[MAX_STR_CONST];
1445 char *string_buf_ptr;
1446
1447
1448 \\" string_buf_ptr = string_buf; BEGIN(str);
1449
1450 <str>\\" { /* saw closing quote - all done */
1451 BEGIN(INITIAL);
1452 *string_buf_ptr = '\\0';
1453 /* return string constant token type and
1454 * value to parser
1455 */
1456 }
1457
1458 <str>\\n {
1459 /* error - unterminated string constant */
1460 /* generate error message */
1461 }
1462
1463 <str>\\\\[0-7]{1,3} {
1464 /* octal escape sequence */
1465 int result;
1466
1467 (void) sscanf( yytext + 1, "%o", &result );
1468
1469 if ( result > 0xff )
1470 { /* error, constant is out-of-bounds */ }
1471
1472 *string_buf_ptr++ = result;
1473 }
1474
1475 <str>\\\\[0-9]+ {
1476 /* generate error - bad escape sequence; something
1477 * like '\\48' or '\\0777777'
1478 */
1479 }
1480
1481 <str>\\\\n *string_buf_ptr++ = '\\n';
1482 <str>\\\\t *string_buf_ptr++ = '\\t';
1483 <str>\\\\r *string_buf_ptr++ = '\\r';
1484 <str>\\\\b *string_buf_ptr++ = '\\b';
1485 <str>\\\\f *string_buf_ptr++ = '\\f';
1486
1487 <str>\\\\(.|\\n) *string_buf_ptr++ = yytext[1];
1488
1489 <str>[^\\\\\\n\\"]+ {
1490 char *yptr = yytext;
1491
1492 while ( *yptr )
1493 *string_buf_ptr++ = *yptr++;
1494 }
1495
1496.fi
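.PP
The example above leaves out the check for a string that is too long.
One possible way to add it is a small helper that refuses to write past
the end of the buffer. This is a sketch only: string_buf_append is a
hypothetical name, and the value of MAX_STR_CONST is chosen here purely
for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_STR_CONST 8   /* assumption: small value for illustration */

/* Hypothetical helper: store c at *ptr and advance it, always keeping
 * one byte of buf free for the terminating '\0'. Returns 1 on success,
 * 0 if the string constant would not fit.
 */
int string_buf_append(char *buf, char **ptr, int c)
{
    assert(buf != NULL && ptr != NULL && *ptr >= buf);
    if (*ptr - buf >= MAX_STR_CONST - 1)
        return 0;
    *(*ptr)++ = (char) c;
    return 1;
}
```

Each action that stores a character could then call
string_buf_append( string_buf, &string_buf_ptr, c ) and report an error
when it returns 0.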
1497.PP
1498Often, such as in some of the examples above, you wind up writing a
1499whole bunch of rules all preceded by the same start condition(s). Flex
1500makes this a little easier and cleaner by introducing a notion of
1501start condition
1502.I scope.
1503A start condition scope is begun with:
1504.nf
1505
1506 <SCs>{
1507
1508.fi
1509where
1510.I SCs
1511is a list of one or more start conditions. Inside the start condition
1512scope, every rule automatically has the prefix
1513.I <SCs>
1514applied to it, until a
1515.I '}'
1516which matches the initial
1517.I '{'.
1518So, for example,
1519.nf
1520
1521 <ESC>{
1522 "\\\\n" return '\\n';
1523 "\\\\r" return '\\r';
1524 "\\\\f" return '\\f';
1525 "\\\\0" return '\\0';
1526 }
1527
1528.fi
1529is equivalent to:
1530.nf
1531
1532 <ESC>"\\\\n" return '\\n';
1533 <ESC>"\\\\r" return '\\r';
1534 <ESC>"\\\\f" return '\\f';
1535 <ESC>"\\\\0" return '\\0';
1536
1537.fi
1538Start condition scopes may be nested.
1539.PP
1540Three routines are available for manipulating stacks of start conditions:
1541.TP
1542.B void yy_push_state(int new_state)
1543pushes the current start condition onto the top of the start condition
1544stack and switches to
1545.I new_state
1546as though you had used
1547.B BEGIN new_state
1548(recall that start condition names are also integers).
1549.TP
1550.B void yy_pop_state()
1551pops the top of the stack and switches to it via
1552.B BEGIN.
1553.TP
1554.B int yy_top_state()
1555returns the top of the stack without altering the stack's contents.
1556.PP
1557The start condition stack grows dynamically and so has no built-in
1558size limitation. If memory is exhausted, program execution aborts.
1559.PP
1560To use start condition stacks, your scanner must include a
1561.B %option stack
1562directive (see Options below).
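.PP
The behavior of the three stack routines can be modeled in plain C as
follows. This is a hypothetical model for illustration, and flex's
generated code differs in detail; the names sc_stack, sc_push, sc_pop and
sc_top are not part of flex, and the field current plays the role of
YY_START:

```c
#include <assert.h>
#include <stdlib.h>

struct sc_stack {
    int   *items;    /* saved start conditions */
    size_t depth;    /* number of saved entries */
    size_t cap;      /* allocated capacity of items */
    int    current;  /* current start condition */
};

/* Like yy_push_state(): save the current condition, then switch. */
void sc_push(struct sc_stack *s, int new_state)
{
    assert(s != NULL);
    if (s->depth == s->cap) {
        size_t new_cap = s->cap ? s->cap * 2 : 8;
        int *p = realloc(s->items, new_cap * sizeof *p);
        if (p == NULL)
            abort();             /* out of memory: execution aborts */
        s->items = p;
        s->cap = new_cap;
    }
    s->items[s->depth++] = s->current;
    s->current = new_state;
}

/* Like yy_pop_state(): switch back to the most recently saved condition. */
void sc_pop(struct sc_stack *s)
{
    if (s->depth == 0)
        abort();                 /* this sketch treats underflow as fatal */
    s->current = s->items[--s->depth];
}

/* Like yy_top_state(): peek at the saved condition without popping. */
int sc_top(const struct sc_stack *s)
{
    if (s->depth == 0)
        abort();
    return s->items[s->depth - 1];
}
```

Note that the top of the stack is the condition that was current before
the most recent push, not the condition the scanner is in now.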
1563.SH MULTIPLE INPUT BUFFERS
1564Some scanners (such as those which support "include" files)
1565require reading from several input streams. As
1566.I flex
1567scanners do a large amount of buffering, one cannot control
1568where the next input will be read from by simply writing a
1569.B YY_INPUT
1570which is sensitive to the scanning context.
1571.B YY_INPUT
1572is only called when the scanner reaches the end of its buffer, which
1573may be a long time after scanning a statement such as an "include"
1574which requires switching the input source.
1575.PP
1576To negotiate these sorts of problems,
1577.I flex
1578provides a mechanism for creating and switching between multiple
1579input buffers. An input buffer is created by using:
1580.nf
1581
1582 YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
1583
1584.fi
1585which takes a
1586.I FILE
1587pointer and a size and creates a buffer associated with the given
1588file and large enough to hold
1589.I size
1590characters (when in doubt, use
1591.B YY_BUF_SIZE
1592for the size). It returns a
1593.B YY_BUFFER_STATE
1594handle, which may then be passed to other routines (see below). The
1595.B YY_BUFFER_STATE
1596type is a pointer to an opaque
1597.B struct yy_buffer_state
1598structure, so you may safely initialize YY_BUFFER_STATE variables to
1599.B ((YY_BUFFER_STATE) 0)
1600if you wish, and also refer to the opaque structure in order to
1601correctly declare input buffers in source files other than that
1602of your scanner. Note that the
1603.I FILE
1604pointer in the call to
1605.B yy_create_buffer
1606is only used as the value of
1607.I yyin
1608seen by
1609.B YY_INPUT;
1610if you redefine
1611.B YY_INPUT
1612so it no longer uses
1613.I yyin,
1614then you can safely pass a nil
1615.I FILE
1616pointer to
1617.B yy_create_buffer.
1618You select a particular buffer to scan from using:
1619.nf
1620
1621 void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
1622
1623.fi
1624This switches the scanner's input buffer so subsequent tokens will
1625come from
1626.I new_buffer.
1627Note that
1628.B yy_switch_to_buffer()
1629may be used by yywrap() to set things up for continued scanning, instead
1630of opening a new file and pointing
1631.I yyin
1632at it. Note also that switching input sources via either
1633.B yy_switch_to_buffer()
1634or
1635.B yywrap()
1636does
1637.I not
1638change the start condition.
1639.nf
1640
1641 void yy_delete_buffer( YY_BUFFER_STATE buffer )
1642
1643.fi
1644is used to reclaim the storage associated with a buffer. (
1645.B buffer
1646can be nil, in which case the routine does nothing.)
1647You can also clear the current contents of a buffer using:
1648.nf
1649
1650 void yy_flush_buffer( YY_BUFFER_STATE buffer )
1651
1652.fi
1653This function discards the buffer's contents,
1654so the next time the scanner attempts to match a token from the
1655buffer, it will first fill the buffer anew using
1656.B YY_INPUT.
1657.PP
1658.B yy_new_buffer()
1659is an alias for
1660.B yy_create_buffer(),
1661provided for compatibility with the C++ use of
1662.I new
1663and
1664.I delete
1665for creating and destroying dynamic objects.
1666.PP
1667Finally, the
1668.B YY_CURRENT_BUFFER
1669macro returns a
1670.B YY_BUFFER_STATE
1671handle to the current buffer.
1672.PP
1673Here is an example of using these features for writing a scanner
1674which expands include files (the
1675.B <<EOF>>
1676feature is discussed below):
1677.nf
1678
1679 /* the "incl" state is used for picking up the name
1680 * of an include file
1681 */
1682 %x incl
1683
1684 %{
1685 #define MAX_INCLUDE_DEPTH 10
1686 YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1687 int include_stack_ptr = 0;
1688 %}
1689
1690 %%
1691 include BEGIN(incl);
1692
1693 [a-z]+ ECHO;
1694 [^a-z\\n]*\\n? ECHO;
1695
1696 <incl>[ \\t]* /* eat the whitespace */
1697 <incl>[^ \\t\\n]+ { /* got the include file name */
1698 if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
1699 {
1700 fprintf( stderr, "Includes nested too deeply\\n" );
1701 exit( 1 );
1702 }
1703
1704 include_stack[include_stack_ptr++] =
1705 YY_CURRENT_BUFFER;
1706
1707 yyin = fopen( yytext, "r" );
1708
1709 if ( ! yyin )
1710 error( ... );
1711
1712 yy_switch_to_buffer(
1713 yy_create_buffer( yyin, YY_BUF_SIZE ) );
1714
1715 BEGIN(INITIAL);
1716 }
1717
1718 <<EOF>> {
1719 if ( --include_stack_ptr < 0 )
1720 {
1721 yyterminate();
1722 }
1723
1724 else
1725 {
1726 yy_delete_buffer( YY_CURRENT_BUFFER );
1727 yy_switch_to_buffer(
1728 include_stack[include_stack_ptr] );
1729 }
1730 }
1731
1732.fi
1733Three routines are available for setting up input buffers for
1734scanning in-memory strings instead of files. All of them create
1735a new input buffer for scanning the string, and return a corresponding
1736.B YY_BUFFER_STATE
1737handle (which you should delete with
1738.B yy_delete_buffer()
1739when done with it). They also switch to the new buffer using
1740.B yy_switch_to_buffer(),
1741so the next call to
1742.B yylex()
1743will start scanning the string.
1744.TP
1745.B yy_scan_string(const char *str)
1746scans a NUL-terminated string.
1747.TP
1748.B yy_scan_bytes(const char *bytes, int len)
1749scans
1750.I len
1751bytes (including possibly NUL's)
1752starting at location
1753.I bytes.
1754.PP
1755Note that both of these functions create and scan a
1756.I copy
1757of the string or bytes. (This may be desirable, since
1758.B yylex()
1759modifies the contents of the buffer it is scanning.) You can avoid the
1760copy by using:
1761.TP
1762.B yy_scan_buffer(char *base, yy_size_t size)
1763which scans in place the buffer starting at
1764.I base,
1765consisting of
1766.I size
1767bytes, the last two bytes of which
1768.I must
1769be
1770.B YY_END_OF_BUFFER_CHAR
1771(ASCII NUL).
1772These last two bytes are not scanned; thus, scanning
1773consists of
1774.B base[0]
1775through
1776.B base[size-3],
1777inclusive.
1778.IP
1779If you fail to set up
1780.I base
1781in this manner (i.e., forget the final two
1782.B YY_END_OF_BUFFER_CHAR
1783bytes), then
1784.B yy_scan_buffer()
1785returns a nil pointer instead of creating a new input buffer.
1786.IP
1787The type
1788.B yy_size_t
1789is an integral type to which you can cast an integer expression
1790reflecting the size of the buffer.
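.PP
Preparing a buffer in the layout that yy_scan_buffer() expects can be
sketched as follows. The helper make_scan_buffer is a hypothetical name,
not part of flex; it copies the text once so that the two trailing NUL
bytes can be appended:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies len bytes of text into a freshly
 * allocated buffer and appends the two NUL bytes
 * (YY_END_OF_BUFFER_CHAR) that yy_scan_buffer() requires. The total
 * size to pass to yy_scan_buffer() is stored in *size. Returns NULL
 * on allocation failure; the caller frees the result.
 */
char *make_scan_buffer(const char *text, size_t len, size_t *size)
{
    char *base;

    assert(text != NULL && size != NULL);
    base = malloc(len + 2);
    if (base == NULL)
        return NULL;
    memcpy(base, text, len);
    base[len] = '\0';
    base[len + 1] = '\0';
    *size = len + 2;
    return base;
}
```

The result could then be handed to the scanner with a call such as
yy_scan_buffer( base, (yy_size_t) size ).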
1791.SH END-OF-FILE RULES
1792The special rule "<<EOF>>" indicates
1793actions which are to be taken when an end-of-file is
1794encountered and yywrap() returns non-zero (i.e., indicates
1795no further files to process). The action must finish
1796by doing one of four things:
1797.IP -
1798assigning
1799.I yyin
1800to a new input file (in previous versions of flex, after doing the
1801assignment you had to call the special action
1802.B YY_NEW_FILE;
1803this is no longer necessary);
1804.IP -
1805executing a
1806.I return
1807statement;
1808.IP -
1809executing the special
1810.B yyterminate()
1811action;
1812.IP -
1813or, switching to a new buffer using
1814.B yy_switch_to_buffer()
1815as shown in the example above.
1816.PP
1817<<EOF>> rules may not be used with other
1818patterns; they may only be qualified with a list of start
1819conditions. If an unqualified <<EOF>> rule is given, it
1820applies to
1821.I all
1822start conditions which do not already have <<EOF>> actions. To
1823specify an <<EOF>> rule for only the initial start condition, use
1824.nf
1825
1826 <INITIAL><<EOF>>
1827
1828.fi
1829.PP
1830These rules are useful for catching things like unclosed comments.
1831An example:
1832.nf
1833
1834 %x quote
1835 %%
1836
1837 ...other rules for dealing with quotes...
1838
1839 <quote><<EOF>> {
1840 error( "unterminated quote" );
1841 yyterminate();
1842 }
1843 <<EOF>> {
1844 if ( *++filelist )
1845 yyin = fopen( *filelist, "r" );
1846 else
1847 yyterminate();
1848 }
1849
1850.fi
1851.SH MISCELLANEOUS MACROS
1852The macro
1853.B YY_USER_ACTION
1854can be defined to provide an action
1855which is always executed prior to the matched rule's action. For example,
1856it could be #define'd to call a routine to convert yytext to lower-case.
1857When
1858.B YY_USER_ACTION
1859is invoked, the variable
1860.I yy_act
1861gives the number of the matched rule (rules are numbered starting with 1).
1862Suppose you want to profile how often each of your rules is matched. The
1863following would do the trick:
1864.nf
1865
1866 #define YY_USER_ACTION ++ctr[yy_act]
1867
1868.fi
1869where
1870.I ctr
1871is an array to hold the counts for the different rules. Note that
1872the macro
1873.B YY_NUM_RULES
1874gives the total number of rules (including the default rule, even if
1875you use
1876.B \-s),
1877so a correct declaration for
1878.I ctr
1879is:
1880.nf
1881
1882 int ctr[YY_NUM_RULES];
1883
1884.fi
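.PP
The counting scheme can be tried out in a standalone C fragment. In the
sketch below, YY_NUM_RULES and yy_act are ordinary stand-ins for what
flex provides in a real scanner, and simulate_match and report_counts
are hypothetical helpers. Because rule numbers start at 1, this sketch
cautiously allocates one extra array slot so that an index equal to
YY_NUM_RULES is valid:

```c
#include <assert.h>
#include <stdio.h>

/* Assumptions for this standalone sketch: in a real scanner flex
 * defines YY_NUM_RULES and sets yy_act before each action.
 */
#define YY_NUM_RULES 3
int yy_act;

/* One extra slot, because rule numbers start at 1. */
int ctr[YY_NUM_RULES + 1];

#define YY_USER_ACTION ++ctr[yy_act]

/* Stand-in for the scanner matching rule number `rule`. */
void simulate_match(int rule)
{
    assert(rule >= 1 && rule <= YY_NUM_RULES);
    yy_act = rule;
    YY_USER_ACTION;
}

/* Print one line per rule with its match count. */
void report_counts(FILE *fp)
{
    int i;

    for (i = 1; i <= YY_NUM_RULES; i++)
        fprintf(fp, "rule %d matched %d times\n", i, ctr[i]);
}
```

A program could call report_counts( stderr ) after scanning finishes to
see which rules do most of the work.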
1885.PP
1886The macro
1887.B YY_USER_INIT
1888may be defined to provide an action which is always executed before
1889the first scan (and before the scanner's internal initializations are done).
1890For example, it could be used to call a routine to read
1891in a data table or open a logging file.
1892.PP
1893The macro
1894.B yy_set_interactive(is_interactive)
1895can be used to control whether the current buffer is considered
1896.I interactive.
1897An interactive buffer is processed more slowly,
1898but must be used when the scanner's input source is indeed
1899interactive to avoid problems due to waiting to fill buffers
1900(see the discussion of the
1901.B \-I
1902flag below). A non-zero value
1903in the macro invocation marks the buffer as interactive, a zero
1904value as non-interactive. Note that use of this macro overrides
1905.B %option always-interactive
1906or
1907.B %option never-interactive
1908(see Options below).
1909.B yy_set_interactive()
1910must be invoked prior to beginning to scan the buffer that is
1911(or is not) to be considered interactive.
1912.PP
1913The macro
1914.B yy_set_bol(at_bol)
1915can be used to control whether the current buffer's scanning
1916context for the next token match is done as though at the
1917beginning of a line. A non-zero macro argument makes rules anchored with
1918'^' active, while a zero argument makes '^' rules inactive.
1919.PP
1920The macro
1921.B YY_AT_BOL()
1922returns true if the next token scanned from the current buffer
1923will have '^' rules active, false otherwise.
1924.PP
1925In the generated scanner, the actions are all gathered in one large
1926switch statement and separated using
1927.B YY_BREAK,
1928which may be redefined. By default, it is simply a "break", to separate
1929each rule's action from the following rule's.
1930Redefining
1931.B YY_BREAK
1932allows, for example, C++ users to
1933#define YY_BREAK to do nothing (while being very careful that every
1934rule ends with a "break" or a "return"!) to avoid suffering from
1935unreachable statement warnings where, because a rule's action ends with
1936"return", the
1937.B YY_BREAK
1938is inaccessible.
1939.SH VALUES AVAILABLE TO THE USER
1940This section summarizes the various values available to the user
1941in the rule actions.
1942.IP -
1943.B char *yytext
1944holds the text of the current token. It may be modified but not lengthened
1945(you cannot append characters to the end).
1946.IP
1947If the special directive
1948.B %array
1949appears in the first section of the scanner description, then
1950.B yytext
1951is instead declared
1952.B char yytext[YYLMAX],
1953where
1954.B YYLMAX
1955is a macro definition that you can redefine in the first section
1956if you don't like the default value (generally 8KB). Using
1957.B %array
1958results in somewhat slower scanners, but the value of
1959.B yytext
1960becomes immune to calls to
1961.I input()
1962and
1963.I unput(),
1964which potentially destroy its value when
1965.B yytext
1966is a character pointer. The opposite of
1967.B %array
1968is
1969.B %pointer,
1970which is the default.
1971.IP
1972You cannot use
1973.B %array
1974when generating C++ scanner classes
1975(the
1976.B \-+
1977flag).
1978.IP -
1979.B int yyleng
1980holds the length of the current token.
1981.IP -
1982.B FILE *yyin
1983is the file which by default
1984.I flex
1985reads from. It may be redefined but doing so only makes sense before
1986scanning begins or after an EOF has been encountered. Changing it in
1987the midst of scanning will have unexpected results since
1988.I flex
1989buffers its input; use
1990.B yyrestart()
1991instead.
1992Once scanning terminates because an end-of-file
1993has been seen, you can point
1994.I yyin
1995at the new input file and then call the scanner again to continue scanning.
1996.IP -
1997.B void yyrestart( FILE *new_file )
1998may be called to point
1999.I yyin
2000at the new input file. The switch-over to the new file is immediate
2001(any previously buffered-up input is lost). Note that calling
2002.B yyrestart()
2003with
2004.I yyin
2005as an argument thus throws away the current input buffer and continues
2006scanning the same input file.
2007.IP -
2008.B FILE *yyout
2009is the file to which
2010.B ECHO
2011actions are done. It can be reassigned by the user.
2012.IP -
2013.B YY_CURRENT_BUFFER
2014returns a
2015.B YY_BUFFER_STATE
2016handle to the current buffer.
2017.IP -
2018.B YY_START
2019returns an integer value corresponding to the current start
2020condition. You can subsequently use this value with
2021.B BEGIN
2022to return to that start condition.
2023.SH INTERFACING WITH YACC
2024One of the main uses of
2025.I flex
2026is as a companion to the
2027.I yacc
2028parser-generator.
2029.I yacc
2030parsers expect to call a routine named
2031.B yylex()
2032to find the next input token. The routine is supposed to
2033return the type of the next token as well as putting any associated
2034value in the global
2035.B yylval.
2036To use
2037.I flex
2038with
2039.I yacc,
2040one specifies the
2041.B \-d
2042option to
2043.I yacc
2044to instruct it to generate the file
2045.B y.tab.h
2046containing definitions of all the
2047.B %tokens
2048appearing in the
2049.I yacc
2050input. This file is then included in the
2051.I flex
2052scanner. For example, if one of the tokens is "TOK_NUMBER",
2053part of the scanner might look like:
2054.nf
2055
2056 %{
2057 #include "y.tab.h"
2058 %}
2059
2060 %%
2061
2062 [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER;
2063
2064.fi
34.SH OPTIONS
35.I flex
36has the following options:
37.TP
38.B \-b
2065.SH OPTIONS
2066.I flex
2067has the following options:
2068.TP
2069.B \-b
39generate backing-up information to
2070Generate backing-up information to
40.I lex.backup.
2071.I lex.backup.
41This is a list of scanner states which require backing up and the input
42characters on which they do so. By adding rules one can remove
43backing-up states. If all backing-up states are eliminated and
2072This is a list of scanner states which require backing up
2073and the input characters on which they do so. By adding rules one
2074can remove backing-up states. If
2075.I all
2076backing-up states are eliminated and
44.B \-Cf
45or
46.B \-CF
2077.B \-Cf
2078or
2079.B \-CF
47is used, the generated scanner will run faster.
2080is used, the generated scanner will run faster (see the
2081.B \-p
2082flag). Only users who wish to squeeze every last cycle out of their
2083scanners need worry about this option. (See the section on Performance
2084Considerations below.)
48.TP
49.B \-c
50is a do-nothing, deprecated option included for POSIX compliance.
2085.TP
2086.B \-c
2087is a do-nothing, deprecated option included for POSIX compliance.
51.IP
52.B NOTE:
53in previous releases of
54.I flex
55.B \-c
56specified table-compression options. This functionality is
57now given by the
58.B \-C
59flag. To ease the the impact of this change, when
60.I flex
61encounters
62.B \-c,
63it currently issues a warning message and assumes that
64.B \-C
65was desired instead. In the future this "promotion" of
66.B \-c
67to
68.B \-C
69will go away in the name of full POSIX compliance (unless
70the POSIX meaning is removed first).
71.TP
72.B \-d
73makes the generated scanner run in
74.I debug
75mode. Whenever a pattern is recognized and the global
76.B yy_flex_debug
2088.TP
2089.B \-d
2090makes the generated scanner run in
2091.I debug
2092mode. Whenever a pattern is recognized and the global
2093.B yy_flex_debug
77is non-zero (which is the default), the scanner will
78write to
2094is non-zero (which is the default),
2095the scanner will write to
79.I stderr
80a line of the form:
81.nf
82
83 --accepting rule at line 53 ("the matched text")
84
85.fi
86The line number refers to the location of the rule in the file
87defining the scanner (i.e., the file that was fed to flex). Messages
88are also generated when the scanner backs up, accepts the
89default rule, reaches the end of its input buffer (or encounters
2096.I stderr
2097a line of the form:
2098.nf
2099
2100 --accepting rule at line 53 ("the matched text")
2101
2102.fi
2103The line number refers to the location of the rule in the file
2104defining the scanner (i.e., the file that was fed to flex). Messages
2105are also generated when the scanner backs up, accepts the
2106default rule, reaches the end of its input buffer (or encounters
90a NUL; the two look the same as far as the scanner's concerned),
2107a NUL; at this point, the two look the same as far as the scanner's concerned),
91or reaches an end-of-file.
92.TP
93.B \-f
94specifies
95.I fast scanner.
96No table compression is done and stdio is bypassed.
97The result is large but fast. This option is equivalent to
98.B \-Cfr
99(see below).
100.TP
101.B \-h
102generates a "help" summary of
103.I flex's
104options to
2108or reaches an end-of-file.
2109.TP
2110.B \-f
2111specifies
2112.I fast scanner.
2113No table compression is done and stdio is bypassed.
2114The result is large but fast. This option is equivalent to
2115.B \-Cfr
2116(see below).
2117.TP
2118.B \-h
2119generates a "help" summary of
2120.I flex's
2121options to
105.I stderr
2122.I stdout
106and then exits.
2123and then exits.
2124.B \-?
2125and
2126.B \-\-help
2127are synonyms for
2128.B \-h.
107.TP
108.B \-i
109instructs
110.I flex
111to generate a
112.I case-insensitive
113scanner. The case of letters given in the
114.I flex
115input patterns will
116be ignored, and tokens in the input will be matched regardless of case. The
117matched text given in
118.I yytext
119will have the preserved case (i.e., it will not be folded).
120.TP
121.B \-l
2129.TP
2130.B \-i
2131instructs
2132.I flex
2133to generate a
2134.I case-insensitive
2135scanner. The case of letters given in the
2136.I flex
2137input patterns will
2138be ignored, and tokens in the input will be matched regardless of case. The
2139matched text given in
2140.I yytext
2141will have the preserved case (i.e., it will not be folded).
2142.TP
2143.B \-l
122turns on maximum compatibility with the original AT&T lex implementation,
123at a considerable performance cost. This option is incompatible with
124.B \-+, \-f, \-F, \-Cf,
2144turns on maximum compatibility with the original AT&T
2145.I lex
2146implementation. Note that this does not mean
2147.I full
2148compatibility. Use of this option costs a considerable amount of
2149performance, and it cannot be used with the
2150.B \-+, -f, -F, -Cf,
125or
2151or
126.B \-CF.
127See
128.I lexdoc(1)
129for details.
2152.B -CF
2153options. For details on the compatibilities it provides, see the section
2154"Incompatibilities With Lex And POSIX" below. This option also results
2155in the name
2156.B YY_FLEX_LEX_COMPAT
2157being #define'd in the generated scanner.
130.TP
131.B \-n
132is another do-nothing, deprecated option included only for
133POSIX compliance.
134.TP
135.B \-p
136generates a performance report to stderr. The report
137consists of comments regarding features of the
138.I flex
2158.TP
2159.B \-n
2160is another do-nothing, deprecated option included only for
2161POSIX compliance.
2162.TP
2163.B \-p
2164generates a performance report to stderr. The report
2165consists of comments regarding features of the
2166.I flex
139input file which will cause a loss of performance in the resulting scanner.
140If you give the flag twice, you will also get comments regarding
2167input file which will cause a serious loss of performance in the resulting
2168scanner. If you give the flag twice, you will also get comments regarding
141features that lead to minor performance losses.
2169features that lead to minor performance losses.
2170.IP
2171Note that the use of
2172.B REJECT,
2173.B %option yylineno,
2174and variable trailing context (see the Deficiencies / Bugs section below)
2175entails a substantial performance penalty; use of
2176.I yymore(),
2177the
2178.B ^
2179operator,
2180and the
2181.B \-I
2182flag entail minor performance penalties.
142.TP
143.B \-s
144causes the
145.I default rule
146(that unmatched scanner input is echoed to
147.I stdout)
148to be suppressed. If the scanner encounters input that does not
2183.TP
2184.B \-s
2185causes the
2186.I default rule
2187(that unmatched scanner input is echoed to
2188.I stdout)
2189to be suppressed. If the scanner encounters input that does not
149match any of its rules, it aborts with an error.
2190match any of its rules, it aborts with an error. This option is
2191useful for finding holes in a scanner's rule set.
150.TP
151.B \-t
152instructs
153.I flex
154to write the scanner it generates to standard output instead
155of
156.B lex.yy.c.
157.TP
158.B \-v
159specifies that
160.I flex
161should write to
162.I stderr
163a summary of statistics regarding the scanner it generates.
2192.TP
2193.B \-t
2194instructs
2195.I flex
2196to write the scanner it generates to standard output instead
2197of
2198.B lex.yy.c.
2199.TP
2200.B \-v
2201specifies that
2202.I flex
2203should write to
2204.I stderr
2205a summary of statistics regarding the scanner it generates.
2206Most of the statistics are meaningless to the casual
2207.I flex
2208user, but the first line identifies the version of
2209.I flex
2210(same as reported by
2211.B \-V),
2212and the next line the flags used when generating the scanner, including
2213those that are on by default.
164.TP
165.B \-w
166suppresses warning messages.
167.TP
168.B \-B
169instructs
170.I flex
171to generate a
172.I batch
2214.TP
2215.B \-w
2216suppresses warning messages.
2217.TP
2218.B \-B
2219instructs
2220.I flex
2221to generate a
2222.I batch
173scanner instead of an
2223scanner, the opposite of
174.I interactive
2224.I interactive
175scanner (see
2225scanners generated by
176.B \-I
2226.B \-I
177below). See
178.I lexdoc(1)
179for details. Scanners using
2227(see below). In general, you use
2228.B \-B
2229when you are
2230.I certain
2231that your scanner will never be used interactively, and you want to
2232squeeze a
2233.I little
2234more performance out of it. If your goal is instead to squeeze out a
2235.I lot
2236more performance, you should be using the
180.B \-Cf
181or
182.B \-CF
2237.B \-Cf
2238or
2239.B \-CF
183compression options automatically specify this option, too.
2240options (discussed below), which turn on
2241.B \-B
2242automatically anyway.
184.TP
185.B \-F
186specifies that the
187.ul
188fast
2243.TP
2244.B \-F
2245specifies that the
2246.ul
2247fast
189scanner table representation should be used (and stdio bypassed).
190This representation is about as fast as the full table representation
2248scanner table representation should be used (and stdio
2249bypassed). This representation is
2250about as fast as the full table representation
191.B (-f),
192and for some sets of patterns will be considerably smaller (and for
2251.B (-f),
2252and for some sets of patterns will be considerably smaller (and for
193others, larger). It cannot be used with the
194.B \-+
195option. See
196.B lexdoc(1)
197for more details.
2253others, larger). In general, if the pattern set contains both "keywords"
2254and a catch-all, "identifier" rule, such as in the set:
2255.nf
2256
2257 "case" return TOK_CASE;
2258 "switch" return TOK_SWITCH;
2259 ...
2260 "default" return TOK_DEFAULT;
2261 [a-z]+ return TOK_ID;
2262
2263.fi
2264then you're better off using the full table representation. If only
2265the "identifier" rule is present and you then use a hash table or some such
2266to detect the keywords, you're better off using
2267.B -F.
198.IP
199This option is equivalent to
200.B \-CFr
2268.IP
2269This option is equivalent to
2270.B \-CFr
201(see below).
2271(see below). It cannot be used with
2272.B \-+.
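The keyword lookup mentioned above (only the "identifier" rule in the scanner, with keywords detected afterwards) can be sketched in C. This is a minimal illustration, not part of flex: the token codes, the keywords table, and classify_word() are hypothetical names, and a sorted table searched with bsearch() stands in for the hash table the text suggests.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes; a real scanner would share these with its parser. */
enum { TOK_ID = 1, TOK_CASE, TOK_DEFAULT, TOK_SWITCH };

struct keyword {
    const char *name;
    int token;
};

/* Must stay sorted by name, because bsearch() requires sorted input. */
static const struct keyword keywords[] = {
    { "case",    TOK_CASE    },
    { "default", TOK_DEFAULT },
    { "switch",  TOK_SWITCH  },
};

/* bsearch() comparison: the key is a C string, the element a table entry. */
static int compare_keyword(const void *key, const void *elem)
{
    const struct keyword *kw = elem;
    return strcmp((const char *)key, kw->name);
}

/* Return the keyword's token code, or TOK_ID if the word is not a keyword. */
int classify_word(const char *word)
{
    assert(word != NULL);
    const struct keyword *hit =
        bsearch(word, keywords, sizeof keywords / sizeof keywords[0],
                sizeof keywords[0], compare_keyword);
    return hit ? hit->token : TOK_ID;
}
```

With such a helper, the single scanner rule would have an action along the lines of "return classify_word(yytext);", so the scanner tables only need to describe [a-z]+.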
202.TP
203.B \-I
204instructs
205.I flex
206to generate an
207.I interactive
2273.TP
2274.B \-I
2275instructs
2276.I flex
2277to generate an
2278.I interactive
208scanner, that is, a scanner which stops immediately rather than
209looking ahead if it knows
210that the currently scanned text cannot be part of a longer rule's match.
211This is the opposite of
212.I batch
213scanners (see
214.B \-B
215above). See
216.B lexdoc(1)
217for details.
2279scanner. An interactive scanner is one that only looks ahead to decide
2280what token has been matched if it absolutely must. It turns out that
2281always looking one extra character ahead, even if the scanner has already
2282seen enough text to disambiguate the current token, is a bit faster than
2283only looking ahead when necessary. But scanners that always look ahead
2284give dreadful interactive performance; for example, when a user types
2285a newline, it is not recognized as a newline token until they enter
2286.I another
2287token, which often means typing in another whole line.
218.IP
2288.IP
219Note,
220.B \-I
221cannot be used in conjunction with
222.I full
2289.I Flex
2290scanners default to
2291.I interactive
2292unless you use the
2293.B \-Cf
223or
2294or
224.I fast tables,
225i.e., the
226.B \-f, \-F, \-Cf,
227or
228.B \-CF
2295.B \-CF
229flags. For other table compression options,
2296table-compression options (see below). That's because if you're looking
2297for high-performance you should be using one of these options, so if you
2298didn't,
2299.I flex
2300assumes you'd rather trade off a bit of run-time performance for intuitive
2301interactive behavior. Note also that you
2302.I cannot
2303use
230.B \-I
2304.B \-I
231is the default.
2305in conjunction with
2306.B \-Cf
2307or
2308.B \-CF.
2309Thus, this option is not really needed; it is on by default for all those
2310cases in which it is allowed.
2311.IP
2312You can force a scanner to
2313.I not
2314be interactive by using
2315.B \-B
2316(see above).
232.TP
233.B \-L
234instructs
235.I flex
236not to generate
237.B #line
2317.TP
2318.B \-L
2319instructs
2320.I flex
2321not to generate
2322.B #line
238directives in
239.B lex.yy.c.
240The default is to generate such directives so error
241messages in the actions will be correctly
242located with respect to the original
2323directives. Without this option,
243.I flex
2324.I flex
244input file, and not to
245the fairly meaningless line numbers of
246.B lex.yy.c.
2325peppers the generated scanner
2326with #line directives so error messages in the actions will be correctly
2327located with respect to either the original
2328.I flex
2329input file (if the errors are due to code in the input file), or
2330.B lex.yy.c
2331(if the errors are
2332.I flex's
2333fault -- you should report these sorts of errors to the email address
2334given below).
247.TP
248.B \-T
249makes
250.I flex
251run in
252.I trace
253mode. It will generate a lot of messages to
254.I stderr
255concerning
256the form of the input and the resultant non-deterministic and deterministic
257finite automata. This option is mostly for use in maintaining
258.I flex.
259.TP
260.B \-V
261prints the version number to
2335.TP
2336.B \-T
2337makes
2338.I flex
2339run in
2340.I trace
2341mode. It will generate a lot of messages to
2342.I stderr
2343concerning
2344the form of the input and the resultant non-deterministic and deterministic
2345finite automata. This option is mostly for use in maintaining
2346.I flex.
2347.TP
2348.B \-V
2349prints the version number to
262.I stderr
2350.I stdout
263and exits.
2351and exits.
2352.B \-\-version
2353is a synonym for
2354.B \-V.
264.TP
265.B \-7
266instructs
267.I flex
2355.TP
2356.B \-7
2357instructs
2358.I flex
268to generate a 7-bit scanner, which can save considerable table space,
269especially when using
2359to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
2360characters in its input. The advantage of using
2361.B \-7
2362is that the scanner's tables can be up to half the size of those generated
2363using the
2364.B \-8
2365option (see below). The disadvantage is that such scanners often hang
2366or crash if their input contains an 8-bit character.
2367.IP
2368Note, however, that unless you generate your scanner using the
270.B \-Cf
271or
272.B \-CF
2369.B \-Cf
2370or
2371.B \-CF
273(and, at most sites,
2372table compression options, use of
274.B \-7
2373.B \-7
275is on by default for these options. To see if this is the case, use the
276.B -v
277verbose flag and check the flag summary it reports).
2374will save only a small amount of table space, and make your scanner
2375considerably less portable.
2376.I Flex's
2377default behavior is to generate an 8-bit scanner unless you use the
2378.B \-Cf
2379or
2380.B \-CF,
2381in which case
2382.I flex
2383defaults to generating 7-bit scanners unless your site was always
2384configured to generate 8-bit scanners (as will often be the case
2385with non-USA sites). You can tell whether flex generated a 7-bit
2386or an 8-bit scanner by inspecting the flag summary in the
2387.B \-v
2388output as described above.
2389.IP
2390Note that if you use
2391.B \-Cfe
2392or
2393.B \-CFe
2394(those table compression options, but also using equivalence classes as
2395discussed below), flex still defaults to generating an 8-bit
2396scanner, since usually with these compression options full 8-bit tables
2397are not much more expensive than 7-bit tables.
278.TP
279.B \-8
280instructs
281.I flex
2398.TP
2399.B \-8
2400instructs
2401.I flex
282to generate an 8-bit scanner. This is the default except for the
2402to generate an 8-bit scanner, i.e., one which can recognize 8-bit
2403characters. This flag is only needed for scanners generated using
283.B \-Cf
2404.B \-Cf
284and
285.B \-CF
286compression options, for which the default is site-dependent, and
287can be checked by inspecting the flag summary generated by the
288.B \-v
289option.
2405or
2406.B \-CF,
2407as otherwise flex defaults to generating an 8-bit scanner anyway.
2408.IP
2409See the discussion of
2410.B \-7
2411above for flex's default behavior and the tradeoffs between 7-bit
2412and 8-bit scanners.
290.TP
291.B \-+
292specifies that you want flex to generate a C++
2413.TP
2414.B \-+
2415specifies that you want flex to generate a C++
293scanner class. See the section on Generating C++ Scanners in
294.I lexdoc(1)
295for details.
2416scanner class. See the section on Generating C++ Scanners below for
2417details.
296.TP
297.B \-C[aefFmr]
2418.TP
2419.B \-C[aefFmr]
298controls the degree of table compression and scanner optimization.
2420controls the degree of table compression and, more generally, trade-offs
2421between small scanners and fast scanners.
299.IP
300.B \-Ca
2422.IP
2423.B \-Ca
301trade off larger tables in the generated scanner for faster performance
302because the elements of the tables are better aligned for memory access
303and computation. This option can double the size of the tables used by
304your scanner.
2424("align") instructs flex to trade off larger tables in the
2425generated scanner for faster performance because the elements of
2426the tables are better aligned for memory access and computation. On some
2427RISC architectures, fetching and manipulating longwords is more efficient
2428than with smaller-sized units such as shortwords. This option can
2429double the size of the tables used by your scanner.
305.IP
306.B \-Ce
307directs
308.I flex
309to construct
310.I equivalence classes,
311i.e., sets of characters
2430.IP
2431.B \-Ce
2432directs
2433.I flex
2434to construct
2435.I equivalence classes,
2436i.e., sets of characters
312which have identical lexical properties.
313Equivalence classes usually give
2437which have identical lexical properties (for example, if the only
2438appearance of digits in the
2439.I flex
2440input is in the character class
2441"[0-9]" then the digits '0', '1', ..., '9' will all be put
2442in the same equivalence class). Equivalence classes usually give
314dramatic reductions in the final table/object file sizes (typically
315a factor of 2-5) and are pretty cheap performance-wise (one array
316look-up per character scanned).
317.IP
318.B \-Cf
319specifies that the
320.I full
321scanner tables should be generated -
322.I flex
323should not compress the
324tables by taking advantage of similar transition functions for
325different states.
326.IP
327.B \-CF
2443dramatic reductions in the final table/object file sizes (typically
2444a factor of 2-5) and are pretty cheap performance-wise (one array
2445look-up per character scanned).
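To make the "one array look-up per character" cost concrete, here is a small C sketch of the idea behind equivalence classes. It is not flex's generated code: the class numbers, ec_table, build_ec_table(), and char_class() are hypothetical, and the sketch assumes an ASCII-style character set and a pattern set whose only character classes are [0-9] and [a-z].

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical class numbers for patterns that use only [0-9] and [a-z]. */
enum { EC_OTHER = 0, EC_DIGIT = 1, EC_LOWER = 2, EC_COUNT = 3 };

/* Maps every possible input byte to its equivalence class. */
static unsigned char ec_table[UCHAR_MAX + 1];

void build_ec_table(void)
{
    for (int c = 0; c <= UCHAR_MAX; c++) {
        if (c >= '0' && c <= '9')
            ec_table[c] = EC_DIGIT;   /* all digits behave identically */
        else if (c >= 'a' && c <= 'z')
            ec_table[c] = EC_LOWER;   /* assumes contiguous lowercase letters */
        else
            ec_table[c] = EC_OTHER;
    }
    assert(ec_table['0'] == ec_table['9']);
}

/* The per-character cost of equivalence classes: one array look-up. */
int char_class(unsigned char c)
{
    return ec_table[c];
}
```

A transition table indexed by class then needs only EC_COUNT columns per state instead of one column per character, which is where the size reduction comes from.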
2446.IP
2447.B \-Cf
2448specifies that the
2449.I full
2450scanner tables should be generated -
2451.I flex
2452should not compress the
2453tables by taking advantage of similar transition functions for
2454different states.
2455.IP
2456.B \-CF
328specifies that the alternate fast scanner representation (described in
329.B lexdoc(1))
2457specifies that the alternate fast scanner representation (described
2458above under the
2459.B \-F
2460flag)
330should be used. This option cannot be used with
331.B \-+.
332.IP
333.B \-Cm
334directs
335.I flex
336to construct
337.I meta-equivalence classes,
338which are sets of equivalence classes (or characters, if equivalence
339classes are not being used) that are commonly used together. Meta-equivalence
340classes are often a big win when using compressed tables, but they
341have a moderate performance impact (one or two "if" tests and one
342array look-up per character scanned).
343.IP
344.B \-Cr
345causes the generated scanner to
346.I bypass
2461should be used. This option cannot be used with
2462.B \-+.
2463.IP
2464.B \-Cm
2465directs
2466.I flex
2467to construct
2468.I meta-equivalence classes,
2469which are sets of equivalence classes (or characters, if equivalence
2470classes are not being used) that are commonly used together. Meta-equivalence
2471classes are often a big win when using compressed tables, but they
2472have a moderate performance impact (one or two "if" tests and one
2473array look-up per character scanned).
2474.IP
2475.B \-Cr
2476causes the generated scanner to
2477.I bypass
347using stdio for input. In general this option results in a minor
348performance gain only worthwhile if used in conjunction with
2478use of the standard I/O library (stdio) for input. Instead of calling
2479.B fread()
2480or
2481.B getc(),
2482the scanner will use the
2483.B read()
2484system call, resulting in a performance gain which varies from system
2485to system, but in general is probably negligible unless you are also using
349.B \-Cf
350or
351.B \-CF.
2486.B \-Cf
2487or
2488.B \-CF.
352It can cause surprising behavior if you use stdio yourself to
353read from
2489Using
2490.B \-Cr
2491can cause strange behavior if, for example, you read from
354.I yyin
2492.I yyin
355prior to calling the scanner.
2493using stdio prior to calling the scanner (because the scanner will miss
2494whatever text your previous reads left in the stdio input buffer).
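The stdio buffering effect described above can be reproduced outside of flex. The following C sketch assumes a POSIX system; bytes_left_for_read() is a hypothetical helper written only for this illustration. It reads a single character through stdio and then checks how much input a direct read() on the same descriptor can still see.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Write `text` into a pipe, consume one character through stdio, then
 * return how many bytes a direct read() on the same descriptor still gets.
 * Returns -1 if the pipe or stream cannot be set up.
 */
long bytes_left_for_read(const char *text)
{
    assert(text != NULL);
    int fds[2];
    size_t len = strlen(text);

    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], text, len) != (ssize_t)len) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    close(fds[1]);  /* so read() reports end-of-file instead of blocking */

    FILE *in = fdopen(fds[0], "r");
    if (in == NULL) {
        close(fds[0]);
        return -1;
    }
    setvbuf(in, NULL, _IOFBF, BUFSIZ);  /* request full buffering explicitly */
    (void)getc(in);  /* stdio fills its buffer, not just one character */

    char rest[256];
    long n = (long)read(fds[0], rest, sizeof rest);
    fclose(in);      /* also closes fds[0] */
    return n;
}
```

For a short input such as two lines of text, typical stdio implementations pull the whole pipe contents into their buffer on the first getc(), so the later read() finds nothing left. A scanner built with -Cr is in the position of that read() call.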
356.IP
2495.IP
2496.B \-Cr
2497has no effect if you define
2498.B YY_INPUT
2499(see The Generated Scanner above).
2500.IP
357A lone
358.B \-C
359specifies that the scanner tables should be compressed but neither
360equivalence classes nor meta-equivalence classes should be used.
361.IP
362The options
363.B \-Cf
364or
365.B \-CF
366and
367.B \-Cm
368do not make sense together - there is no opportunity for meta-equivalence
369classes if the table is not being compressed. Otherwise the options
2501A lone
2502.B \-C
2503specifies that the scanner tables should be compressed but neither
2504equivalence classes nor meta-equivalence classes should be used.
2505.IP
2506The options
2507.B \-Cf
2508or
2509.B \-CF
2510and
2511.B \-Cm
2512do not make sense together - there is no opportunity for meta-equivalence
2513classes if the table is not being compressed. Otherwise the options
370may be freely mixed.
2514may be freely mixed, and are cumulative.
371.IP
372The default setting is
373.B \-Cem,
374which specifies that
375.I flex
376should generate equivalence classes
377and meta-equivalence classes. This setting provides the highest
378degree of table compression. You can trade off

--- 7 unchanged lines hidden (view full) ---

386 -Ce
387 -C
388 -C{f,F}e
389 -C{f,F}
390 -C{f,F}a
391 fastest & largest
392
393.fi
2515.IP
2516The default setting is
2517.B \-Cem,
2518which specifies that
2519.I flex
2520should generate equivalence classes
2521and meta-equivalence classes. This setting provides the highest
2522degree of table compression. You can trade off

--- 7 unchanged lines hidden (view full) ---

2530 -Ce
2531 -C
2532 -C{f,F}e
2533 -C{f,F}
2534 -C{f,F}a
2535 fastest & largest
2536
2537.fi
2538Note that scanners with the smallest tables are usually generated and
2539compiled the quickest, so
2540during development you will usually want to use the default, maximal
2541compression.
394.IP
2542.IP
395.B \-C
396options are cumulative.
2543.B \-Cfe
2544is often a good compromise between speed and size for production
2545scanners.
397.TP
2546.TP
2547.B \-ooutput
2548directs flex to write the scanner to the file
2549.B output
2550instead of
2551.B lex.yy.c.
2552If you combine
2553.B \-o
2554with the
2555.B \-t
2556option, then the scanner is written to
2557.I stdout
2558but its
2559.B #line
2560directives (see the
2561.B \\-L
2562option above) refer to the file
2563.B output.
2564.TP
398.B \-Pprefix
399changes the default
400.I "yy"
401prefix used by
402.I flex
2565.B \-Pprefix
2566changes the default
2567.I "yy"
2568prefix used by
2569.I flex
403to be
404.I prefix
405instead. See
406.I lexdoc(1)
407for a description of all the global variables and file names that
408this affects.
2570for all globally-visible variable and function names to instead be
2571.I prefix.
2572For example,
2573.B \-Pfoo
2574changes the name of
2575.B yytext
2576to
2577.B footext.
2578It also changes the name of the default output file from
2579.B lex.yy.c
2580to
2581.B lex.foo.c.
2582Here are all of the names affected:
2583.nf
2584
2585 yy_create_buffer
2586 yy_delete_buffer
2587 yy_flex_debug
2588 yy_init_buffer
2589 yy_flush_buffer
2590 yy_load_buffer_state
2591 yy_switch_to_buffer
2592 yyin
2593 yyleng
2594 yylex
2595 yylineno
2596 yyout
2597 yyrestart
2598 yytext
2599 yywrap
2600
2601.fi
2602(If you are using a C++ scanner, then only
2603.B yywrap
2604and
2605.B yyFlexLexer
2606are affected.)
2607Within your scanner itself, you can still refer to the global variables
2608and functions using either version of their name; but externally, they
2609have the modified name.
2610.IP
2611This option lets you easily link together multiple
2612.I flex
2613programs into the same executable. Note, though, that using this
2614option also renames
2615.B yywrap(),
2616so you now
2617.I must
2618either
2619provide your own (appropriately-named) version of the routine for your
2620scanner, or use
2621.B %option noyywrap,
2622as linking with
2623.B \-ll
2624no longer provides one for you by default.
409.TP
410.B \-Sskeleton_file
411overrides the default skeleton file from which
412.I flex
413constructs its scanners. You'll never need this option unless you are doing
414.I flex
415maintenance or development.
2625.TP
2626.B \-Sskeleton_file
2627overrides the default skeleton file from which
2628.I flex
2629constructs its scanners. You'll never need this option unless you are doing
2630.I flex
2631maintenance or development.
416.SH SUMMARY OF FLEX REGULAR EXPRESSIONS
417The patterns in the input are written using an extended set of regular
418expressions. These are:
2632.PP
2633.I flex
2634also provides a mechanism for controlling options within the
2635scanner specification itself, rather than from the flex command-line.
2636This is done by including
2637.B %option
2638directives in the first section of the scanner specification.
2639You can specify multiple options with a single
2640.B %option
2641directive, and multiple directives in the first section of your flex input
2642file.
2643.PP
2644Most options are given simply as names, optionally preceded by the
2645word "no" (with no intervening whitespace) to negate their meaning.
2646A number are equivalent to flex flags or their negation:
419.nf
420
2647.nf
2648
421 x match the character 'x'
422 . any character except newline
423 [xyz] a "character class"; in this case, the pattern
424 matches either an 'x', a 'y', or a 'z'
425 [abj-oZ] a "character class" with a range in it; matches
426 an 'a', a 'b', any letter from 'j' through 'o',
427 or a 'Z'
428 [^A-Z] a "negated character class", i.e., any character
429 but those in the class. In this case, any
430 character EXCEPT an uppercase letter.
431 [^A-Z\\n] any character EXCEPT an uppercase letter or
432 a newline
433 r* zero or more r's, where r is any regular expression
434 r+ one or more r's
435 r? zero or one r's (that is, "an optional r")
436 r{2,5} anywhere from two to five r's
437 r{2,} two or more r's
438 r{4} exactly 4 r's
439 {name} the expansion of the "name" definition
440 (see above)
441 "[xyz]\\"foo"
442 the literal string: [xyz]"foo
443 \\X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
444 then the ANSI-C interpretation of \\x.
445 Otherwise, a literal 'X' (used to escape
446 operators such as '*')
447 \\123 the character with octal value 123
448 \\x2a the character with hexadecimal value 2a
449 (r) match an r; parentheses are used to override
450 precedence (see below)
2649 7bit -7 option
2650 8bit -8 option
2651 align -Ca option
2652 backup -b option
2653 batch -B option
2654 c++ -+ option
451
2655
2656 caseful or
2657 case-sensitive opposite of -i (default)
452
2658
453 rs the regular expression r followed by the
454 regular expression s; called "concatenation"
2659 case-insensitive or
2660 caseless -i option
455
2661
2662 debug -d option
2663 default opposite of -s option
2664 ecs -Ce option
2665 fast -F option
2666 full -f option
2667 interactive -I option
2668 lex-compat -l option
2669 meta-ecs -Cm option
2670 perf-report -p option
2671 read -Cr option
2672 stdout -t option
2673 verbose -v option
2674 warn opposite of -w option
2675 (use "%option nowarn" for -w)
456
2676
457 r|s either an r or an s
2677 array equivalent to "%array"
2678 pointer equivalent to "%pointer" (default)
458
2679
2680.fi
2681Some
2682.B %option's
2683provide features otherwise not available:
2684.TP
2685.B always-interactive
2686instructs flex to generate a scanner which always considers its input
2687"interactive". Normally, on each new input file the scanner calls
2688.B isatty()
2689in an attempt to determine whether
2690the scanner's input source is interactive and thus should be read a
2691character at a time. When this option is used, however, then no
2692such call is made.
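The kind of check that always-interactive and never-interactive bypass can be sketched in C as follows. This is an illustration under the assumption of a POSIX system, not the code flex generates; input_is_interactive() is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Return nonzero if the stream is attached to a terminal, zero otherwise. */
int input_is_interactive(FILE *fp)
{
    assert(fp != NULL);
    return isatty(fileno(fp)) != 0;
}
```

A regular file or a pipe yields zero here, while a terminal yields nonzero; the two options simply fix the answer in advance instead of asking.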
2693.TP
2694.B main
2695directs flex to provide a default
2696.B main()
2697program for the scanner, which simply calls
2698.B yylex().
2699This option implies
2700.B noyywrap
2701(see below).
2702.TP
2703.B never-interactive
2704instructs flex to generate a scanner which never considers its input
2705"interactive" (again, no call made to
2706.B isatty()).
2707This is the opposite of
2708.B always-interactive.
2709.TP
2710.B stack
2711enables the use of start condition stacks (see Start Conditions above).
2712.TP
2713.B stdinit
2714if set (i.e.,
2715.B %option stdinit)
2716initializes
2717.I yyin
2718and
2719.I yyout
2720to
2721.I stdin
2722and
2723.I stdout,
2724instead of the default of
2725.I nil.
2726Some existing
2727.I lex
2728programs depend on this behavior, even though it is not compliant with
2729ANSI C, which does not require
2730.I stdin
2731and
2732.I stdout
2733to be compile-time constant.
2734.TP
2735.B yylineno
2736directs
2737.I flex
2738to generate a scanner that maintains the number of the current line
2739read from its input in the global variable
2740.B yylineno.
2741This option is implied by
2742.B %option lex-compat.
2743.TP
2744.B yywrap
2745if unset (i.e.,
2746.B %option noyywrap),
2747makes the scanner not call
2748.B yywrap()
2749upon an end-of-file, but simply assume that there are no more
2750files to scan (until the user points
2751.I yyin
2752at a new file and calls
2753.B yylex()
2754again).
2755.PP
2756.I flex
2757scans your rule actions to determine whether you use the
2758.B REJECT
2759or
2760.B yymore()
2761features. The
2762.B reject
2763and
2764.B yymore
2765options are available to override its decision as to whether you use the
2766options, either by setting them (e.g.,
2767.B %option reject)
2768to indicate the feature is indeed used, or
2769unsetting them to indicate it actually is not used
2770(e.g.,
2771.B %option noyymore).
2772.PP
2773Three options take string-delimited values, offset with '=':
2774.nf
459
2775
460 r/s an r but only if it is followed by an s. The
461 s is not part of the matched text. This type
462 of pattern is called "trailing context".
463 ^r an r, but only at the beginning of a line
464 r$ an r, but only at the end of a line. Equivalent
465 to "r/\\n".
2776 %option outfile="ABC"
466
2777
2778.fi
2779is equivalent to
2780.B -oABC,
2781and
2782.nf
467
2783
468 <s>r an r, but only in start condition s (see
469 below for discussion of start conditions)
470 <s1,s2,s3>r
471 same, but in any of start conditions s1,
472 s2, or s3
473 <*>r an r in any start condition, even an exclusive one.
2784 %option prefix="XYZ"
474
2785
2786.fi
2787is equivalent to
2788.B -PXYZ.
2789Finally,
2790.nf
475
2791
476 <<EOF>> an end-of-file
477 <s1,s2><<EOF>>
478 an end-of-file when in start condition s1 or s2
2792 %option yyclass="foo"
479
480.fi
2793
2794.fi
481The regular expressions listed above are grouped according to
482precedence, from highest precedence at the top to lowest at the bottom.
483Those grouped together have equal precedence.
2795only applies when generating a C++ scanner (
2796.B \-+
2797option). It informs
2798.I flex
2799that you have derived
2800.B foo
2801as a subclass of
2802.B yyFlexLexer,
2803so
2804.I flex
2805will place your actions in the member function
2806.B foo::yylex()
2807instead of
2808.B yyFlexLexer::yylex().
2809It also generates a
2810.B yyFlexLexer::yylex()
2811member function that emits a run-time error (by invoking
2812.B yyFlexLexer::LexerError())
2813if called.
2814See Generating C++ Scanners, below, for additional information.
484.PP
2815.PP
485Some notes on patterns:
486.IP -
487Negated character classes
488.I match newlines
489unless "\\n" (or an equivalent escape sequence) is one of the
490characters explicitly present in the negated character class
491(e.g., "[^A-Z\\n]").
492.IP -
493A rule can have at most one instance of trailing context (the '/' operator
494or the '$' operator). The start condition, '^', and "<<EOF>>" patterns
495can only occur at the beginning of a pattern, and, as well as with '/' and '$',
496cannot be grouped inside parentheses. The following are all illegal:
2816A number of options are available for lint purists who want to suppress
2817the appearance of unneeded routines in the generated scanner. Each of the
2818following, if unset
2819(e.g.,
2820.B %option nounput
2821), results in the corresponding routine not appearing in
2822the generated scanner:
497.nf
498
2823.nf
2824
499 foo/bar$
500 foo|(bar$)
501 foo|^bar
502 <sc1>foo<sc2>bar
2825 input, unput
2826 yy_push_state, yy_pop_state, yy_top_state
2827 yy_scan_buffer, yy_scan_bytes, yy_scan_string
503
504.fi
2828
2829.fi
505.SH SUMMARY OF SPECIAL ACTIONS
506In addition to arbitrary C code, the following can appear in actions:
507.IP -
508.B ECHO
509copies yytext to the scanner's output.
510.IP -
511.B BEGIN
512followed by the name of a start condition places the scanner in the
513corresponding start condition.
514.IP -
2830(though
2831.B yy_push_state()
2832and friends won't appear anyway unless you use
2833.B %option stack).
2834.SH PERFORMANCE CONSIDERATIONS
2835The main design goal of
2836.I flex
2837is that it generate high-performance scanners. It has been optimized
2838for dealing well with large sets of rules. Aside from the effects on
2839scanner speed of the table compression
2840.B \-C
2841options outlined above,
2842there are a number of options/actions which degrade performance. These
2843are, from most expensive to least:
2844.nf
2845
2846 REJECT
2847 %option yylineno
2848 arbitrary trailing context
2849
2850 pattern sets that require backing up
2851 %array
2852 %option interactive
2853 %option always-interactive
2854
2855 '^' beginning-of-line operator
2856 yymore()
2857
2858.fi
2859with the first three all being quite expensive and the last two
2860being quite cheap. Note also that
2861.B unput()
2862is implemented as a routine call that potentially does quite a bit of
2863work, while
2864.B yyless()
2865is a quite-cheap macro; so if you are just putting back some excess text you
2866scanned, use
2867.B yyless().
2868.PP
515.B REJECT
2869.B REJECT
516directs the scanner to proceed on to the "second best" rule which matched the
517input (or a prefix of the input).
518.B yytext
519and
520.B yyleng
521are set up appropriately. Note that
522.B REJECT
523is a particularly expensive feature in terms of scanner performance;
524if it is used in
525.I any
526of the scanner's actions it will slow down
527.I all
528of the scanner's matching. Furthermore,
529.B REJECT
530cannot be used with the
531.B \-f
2870should be avoided at all costs when performance is important.
2871It is a particularly expensive option.
2872.PP
2873Getting rid of backing up is messy and often may be an enormous
2874amount of work for a complicated scanner. In principle, one begins
2875by using the
2876.B \-b
2877flag to generate a
2878.I lex.backup
2879file. For example, on the input
2880.nf
2881
2882 %%
2883 foo return TOK_KEYWORD;
2884 foobar return TOK_KEYWORD;
2885
2886.fi
2887the file looks like:
2888.nf
2889
2890 State #6 is non-accepting -
2891 associated rule line numbers:
2892 2 3
2893 out-transitions: [ o ]
2894 jam-transitions: EOF [ \\001-n p-\\177 ]
2895
2896 State #8 is non-accepting -
2897 associated rule line numbers:
2898 3
2899 out-transitions: [ a ]
2900 jam-transitions: EOF [ \\001-` b-\\177 ]
2901
2902 State #9 is non-accepting -
2903 associated rule line numbers:
2904 3
2905 out-transitions: [ r ]
2906 jam-transitions: EOF [ \\001-q s-\\177 ]
2907
2908 Compressed tables always back up.
2909
2910.fi
2911The first few lines tell us that there's a scanner state in
2912which it can make a transition on an 'o' but not on any other
2913character, and that in that state the currently scanned text does not match
2914any rule. The state occurs when trying to match the rules found
2915at lines 2 and 3 in the input file.
2916If the scanner is in that state and then reads
2917something other than an 'o', it will have to back up to find
2918a rule which is matched. With
2919a bit of headscratching one can see that this must be the
2920state it's in when it has seen "fo". When this has happened,
2921if anything other than another 'o' is seen, the scanner will
2922have to back up to simply match the 'f' (by the default rule).
2923.PP
2924The comment regarding State #8 indicates there's a problem
2925when "foob" has been scanned. Indeed, on any character other
2926than an 'a', the scanner will have to back up to accept "foo".
2927Similarly, the comment for State #9 concerns when "fooba" has
2928been scanned and an 'r' does not follow.
2929.PP
2930The final comment reminds us that there's no point going to
2931all the trouble of removing backing up from the rules unless
2932we're using
2933.B \-Cf
532or
2934or
533.B \-F
534options.
535.IP
536Note also that unlike the other special actions,
2935.B \-CF,
2936since there's no performance gain doing so with compressed scanners.
2937.PP
2938The way to remove the backing up is to add "error" rules:
2939.nf
2940
2941 %%
2942 foo return TOK_KEYWORD;
2943 foobar return TOK_KEYWORD;
2944
2945 fooba |
2946 foob |
2947 fo {
2948 /* false alarm, not really a keyword */
2949 return TOK_ID;
2950 }
2951
2952.fi
2953.PP
2954Eliminating backing up among a list of keywords can also be
2955done using a "catch-all" rule:
2956.nf
2957
2958 %%
2959 foo return TOK_KEYWORD;
2960 foobar return TOK_KEYWORD;
2961
2962 [a-z]+ return TOK_ID;
2963
2964.fi
2965This is usually the best solution when appropriate.
2966.PP
2967Backing up messages tend to cascade.
2968With a complicated set of rules it's not uncommon to get hundreds
2969of messages. If one can decipher them, though, it often
2970only takes a dozen or so rules to eliminate the backing up (though
2971it's easy to make a mistake and have an error rule accidentally match
2972a valid token). A possible future
2973.I flex
2974feature will be to automatically add rules to eliminate backing up.
2975.PP
2976It's important to keep in mind that you gain the benefits of eliminating
2977backing up only if you eliminate
2978.I every
2979instance of backing up. Leaving just one means you gain nothing.
2980.PP
2981.I Variable
2982trailing context (where both the leading and trailing parts do not have
2983a fixed length) entails almost the same performance loss as
537.B REJECT
2984.B REJECT
538is a
539.I branch;
540code immediately following it in the action will
2985(i.e., substantial). So when possible a rule like:
2986.nf
2987
2988 %%
2989 mouse|rat/(cat|dog) run();
2990
2991.fi
2992is better written:
2993.nf
2994
2995 %%
2996 mouse/cat|dog run();
2997 rat/cat|dog run();
2998
2999.fi
3000or as
3001.nf
3002
3003 %%
3004 mouse|rat/cat run();
3005 mouse|rat/dog run();
3006
3007.fi
3008Note that here the special '|' action does
541.I not
3009.I not
542be executed.
543.IP -
544.B yymore()
545tells the scanner that the next time it matches a rule, the corresponding
546token should be
547.I appended
548onto the current value of
3010provide any savings, and can even make things worse (see
3011Deficiencies / Bugs below).
3012.LP
3013Another area where the user can increase a scanner's performance
3014(and one that's easier to implement) arises from the fact that
3015the longer the tokens matched, the faster the scanner will run.
3016This is because with long tokens the processing of most input
3017characters takes place in the (short) inner scanning loop, and
3018does not often have to go through the additional work of setting up
3019the scanning environment (e.g.,
3020.B yytext)
3021for the action. Recall the scanner for C comments:
3022.nf
3023
3024 %x comment
3025 %%
3026 int line_num = 1;
3027
3028 "/*" BEGIN(comment);
3029
3030 <comment>[^*\\n]*
3031 <comment>"*"+[^*/\\n]*
3032 <comment>\\n ++line_num;
3033 <comment>"*"+"/" BEGIN(INITIAL);
3034
3035.fi
3036This could be sped up by writing it as:
3037.nf
3038
3039 %x comment
3040 %%
3041 int line_num = 1;
3042
3043 "/*" BEGIN(comment);
3044
3045 <comment>[^*\\n]*
3046 <comment>[^*\\n]*\\n ++line_num;
3047 <comment>"*"+[^*/\\n]*
3048 <comment>"*"+[^*/\\n]*\\n ++line_num;
3049 <comment>"*"+"/" BEGIN(INITIAL);
3050
3051.fi
3052Now instead of each newline requiring the processing of another
3053action, recognizing the newlines is "distributed" over the other rules
3054to keep the matched text as long as possible. Note that
3055.I adding
3056rules does
3057.I not
3058slow down the scanner! The speed of the scanner is independent
3059of the number of rules or (modulo the considerations given at the
3060beginning of this section) how complicated the rules are with
3061regard to operators such as '*' and '|'.
3062.PP
3063A final example in speeding up a scanner: suppose you want to scan
3064through a file containing identifiers and keywords, one per line
3065and with no other extraneous characters, and recognize all the
3066keywords. A natural first approach is:
3067.nf
3068
3069 %%
3070 asm |
3071 auto |
3072 break |
3073 ... etc ...
3074 volatile |
3075 while /* it's a keyword */
3076
3077 .|\\n /* it's not a keyword */
3078
3079.fi
3080To eliminate the backing up, introduce a catch-all rule:
3081.nf
3082
3083 %%
3084 asm |
3085 auto |
3086 break |
3087 ... etc ...
3088 volatile |
3089 while /* it's a keyword */
3090
3091 [a-z]+ |
3092 .|\\n /* it's not a keyword */
3093
3094.fi
3095Now, if it's guaranteed that there's exactly one word per line,
3096then we can halve the total number of matches by
3097merging in the recognition of newlines with that of the other
3098tokens:
3099.nf
3100
3101 %%
3102 asm\\n |
3103 auto\\n |
3104 break\\n |
3105 ... etc ...
3106 volatile\\n |
3107 while\\n /* it's a keyword */
3108
3109 [a-z]+\\n |
3110 .|\\n /* it's not a keyword */
3111
3112.fi
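The halving can be checked with a small simulation. The C sketch below is a stand-in for the scanner, not flex output: count_matches is a hypothetical helper that counts how many times some rule would fire, treating every run of lowercase letters as one token and optionally absorbing the newline that follows it.

```c
#include <assert.h>
#include <stddef.h>

/* Counts rule firings for input made of lowercase words and newlines.
 * merge_newline == 0 models the rules without trailing newlines;
 * merge_newline != 0 models the rules that end in a newline.
 * Hypothetical helper for illustration only. */
static size_t count_matches(const char *input, int merge_newline)
{
    size_t matches = 0;

    assert(input != NULL);

    while (*input != '\0') {
        if (*input >= 'a' && *input <= 'z') {
            /* A keyword or the [a-z]+ catch-all rule. */
            while (*input >= 'a' && *input <= 'z')
                ++input;
            /* With merged rules the newline belongs to the same token. */
            if (merge_newline && *input == '\n')
                ++input;
        } else {
            /* The single-character fallback rule. */
            ++input;
        }
        ++matches;
    }
    return matches;
}
```

With three words, one per line, the unmerged rules fire six times (three words and three newlines) while the merged rules fire three times.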
3113One has to be careful here, as we have now reintroduced backing up
3114into the scanner. In particular, while
3115.I we
3116know that there will never be any characters in the input stream
3117other than letters or newlines,
3118.I flex
3119can't figure this out, and it will plan for possibly needing to back up
3120when it has scanned a token like "auto" and then the next character
3121is something other than a newline or a letter. Previously it would
3122then just match the "auto" rule and be done, but now it has no "auto"
3123rule, only an "auto\\n" rule. To eliminate the possibility of backing up,
3124we could either duplicate all rules but without final newlines, or,
3125since we never expect to encounter such an input and therefore don't
3126care how it's classified, we can introduce one more catch-all rule,
3127one that doesn't include a newline:
3128.nf
3129
3130 %%
3131 asm\\n |
3132 auto\\n |
3133 break\\n |
3134 ... etc ...
3135 volatile\\n |
3136 while\\n /* it's a keyword */
3137
3138 [a-z]+\\n |
3139 [a-z]+ |
3140 .|\\n /* it's not a keyword */
3141
3142.fi
3143Compiled with
3144.B \-Cf,
3145this is about as fast as one can get a
3146.I flex
3147scanner to go for this particular problem.
3148.PP
3149A final note:
3150.I flex
3151is slow when matching NUL's, particularly when a token contains
3152multiple NUL's.
3153It's best to write rules which match
3154.I short
3155amounts of text if it's anticipated that the text will often include NUL's.
3156.PP
3157Another final note regarding performance: as mentioned above in the section
3158How the Input is Matched, dynamically resizing
549.B yytext
3159.B yytext
550rather than replacing it.
551.IP -
552.B yyless(n)
553returns all but the first
554.I n
555characters of the current token back to the input stream, where they
556will be rescanned when the scanner looks for the next match.
557.B yytext
3160to accommodate huge tokens is a slow process because it presently requires that
3161the (huge) token be rescanned from the beginning. Thus if performance is
3162vital, you should attempt to match "large" quantities of text but not
3163"huge" quantities, where the cutoff between the two is at about 8K
3164characters/token.
3165.SH GENERATING C++ SCANNERS
3166.I flex
3167provides two different ways to generate scanners for use with C++. The
3168first way is to simply compile a scanner generated by
3169.I flex
3170using a C++ compiler instead of a C compiler. You should not encounter
3171any compilation errors (please report any you find to the email address
3172given in the Author section below). You can then use C++ code in your
3173rule actions instead of C code. Note that the default input source for
3174your scanner remains
3175.I yyin,
3176and default echoing is still done to
3177.I yyout.
3178Both of these remain
3179.I FILE *
3180variables and not C++
3181.I streams.
3182.PP
3183You can also use
3184.I flex
3185to generate a C++ scanner class, using the
3186.B \-+
3187option (or, equivalently,
3188.B %option c++),
3189which is automatically specified if the name of the flex
3190executable ends in a '+', such as
3191.I flex++.
3192When using this option, flex defaults to generating the scanner to the file
3193.B lex.yy.cc
3194instead of
3195.B lex.yy.c.
3196The generated scanner includes the header file
3197.I FlexLexer.h,
3198which defines the interface to two C++ classes.
3199.PP
3200The first class,
3201.B FlexLexer,
3202provides an abstract base class defining the general scanner class
3203interface. It provides the following member functions:
3204.TP
3205.B const char* YYText()
3206returns the text of the most recently matched token, the equivalent of
3207.B yytext.
3208.TP
3209.B int YYLeng()
3210returns the length of the most recently matched token, the equivalent of
3211.B yyleng.
3212.TP
3213.B int lineno() const
3214returns the current input line number
3215(see
3216.B %option yylineno),
3217or
3218.B 1
3219if
3220.B %option yylineno
3221was not used.
3222.TP
3223.B void set_debug( int flag )
3224sets the debugging flag for the scanner, equivalent to assigning to
3225.B yy_flex_debug
3226(see the Options section above). Note that you must build the scanner
3227using
3228.B %option debug
3229to include debugging information in it.
3230.TP
3231.B int debug() const
3232returns the current setting of the debugging flag.
3233.PP
3234Also provided are member functions equivalent to
3235.B yy_switch_to_buffer(),
3236.B yy_create_buffer()
3237(though the first argument is an
3238.B istream*
3239object pointer and not a
3240.B FILE*),
3241.B yy_flush_buffer(),
3242.B yy_delete_buffer(),
558and
3243and
559.B yyleng
560are adjusted appropriately (e.g.,
561.B yyleng
562will now be equal to
563.I n
564).
3244.B yyrestart()
3245(again, the first argument is an
3246.B istream*
3247object pointer).
3248.PP
3249The second class defined in
3250.I FlexLexer.h
3251is
3252.B yyFlexLexer,
3253which is derived from
3254.B FlexLexer.
3255It defines the following additional member functions:
3256.TP
3257.B
3258yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )
3259constructs a
3260.B yyFlexLexer
3261object using the given streams for input and output. If not specified,
3262the streams default to
3263.B cin
3264and
3265.B cout,
3266respectively.
3267.TP
3268.B virtual int yylex()
3269performs the same role as
3270.B yylex()
3271does for ordinary flex scanners: it scans the input stream, consuming
3272tokens, until a rule's action returns a value. If you derive a subclass
3273.B S
3274from
3275.B yyFlexLexer
3276and want to access the member functions and variables of
3277.B S
3278inside
3279.B yylex(),
3280then you need to use
3281.B %option yyclass="S"
3282to inform
3283.I flex
3284that you will be using that subclass instead of
3285.B yyFlexLexer.
3286In this case, rather than generating
3287.B yyFlexLexer::yylex(),
3288.I flex
3289generates
3290.B S::yylex()
3291(and also generates a dummy
3292.B yyFlexLexer::yylex()
3293that calls
3294.B yyFlexLexer::LexerError()
3295if called).
3296.TP
3297.B
3298virtual void switch_streams(istream* new_in = 0,
3299.B
3300ostream* new_out = 0)
3301reassigns
3302.B yyin
3303to
3304.B new_in
3305(if non-nil)
3306and
3307.B yyout
3308to
3309.B new_out
3310(ditto), deleting the previous input buffer if
3311.B yyin
3312is reassigned.
3313.TP
3314.B
3315int yylex( istream* new_in, ostream* new_out = 0 )
3316first switches the input streams via
3317.B switch_streams( new_in, new_out )
3318and then returns the value of
3319.B yylex().
3320.PP
3321In addition,
3322.B yyFlexLexer
3323defines the following protected virtual functions which you can redefine
3324in derived classes to tailor the scanner:
3325.TP
3326.B
3327virtual int LexerInput( char* buf, int max_size )
3328reads up to
3329.B max_size
3330characters into
3331.B buf
3332and returns the number of characters read. To indicate end-of-input,
3333return 0 characters. Note that "interactive" scanners (see the
3334.B \-B
3335and
3336.B \-I
3337flags) define the macro
3338.B YY_INTERACTIVE.
3339If you redefine
3340.B LexerInput()
3341and need to take different actions depending on whether or not
3342the scanner might be scanning an interactive input source, you can
3343test for the presence of this name via
3344.B #ifdef.
3345.TP
3346.B
3347virtual void LexerOutput( const char* buf, int size )
3348writes out
3349.B size
3350characters from the buffer
3351.B buf,
3352which, while NUL-terminated, may also contain "internal" NUL's if
3353the scanner's rules can match text with NUL's in them.
3354.TP
3355.B
3356virtual void LexerError( const char* msg )
3357reports a fatal error message. The default version of this function
3358writes the message to the stream
3359.B cerr
3360and exits.
3361.PP
3362Note that a
3363.B yyFlexLexer
3364object contains its
3365.I entire
3366scanning state. Thus you can use such objects to create reentrant
3367scanners. You can instantiate multiple instances of the same
3368.B yyFlexLexer
3369class, and you can also combine multiple C++ scanner classes together
3370in the same program using the
3371.B \-P
3372option discussed above.
3373.PP
3374Finally, note that the
3375.B %array
3376feature is not available to C++ scanner classes; you must use
3377.B %pointer
3378(the default).
3379.PP
3380Here is an example of a simple C++ scanner:
3381.nf
3382
3383 // An example of using the flex C++ scanner class.
3384
3385 %{
3386 int mylineno = 0;
3387 %}
3388
3389 string \\"[^\\n"]+\\"
3390
3391 ws [ \\t]+
3392
3393 alpha [A-Za-z]
3394 dig [0-9]
3395 name ({alpha}|{dig}|\\$)({alpha}|{dig}|[_.\\-/$])*
3396 num1 [-+]?{dig}+\\.?([eE][-+]?{dig}+)?
3397 num2 [-+]?{dig}*\\.{dig}+([eE][-+]?{dig}+)?
3398 number {num1}|{num2}
3399
3400 %%
3401
3402 {ws} /* skip blanks and tabs */
3403
3404 "/*" {
3405 int c;
3406
3407 while((c = yyinput()) != 0)
3408 {
3409 if(c == '\\n')
3410 ++mylineno;
3411
3412 else if(c == '*')
3413 {
3414 if((c = yyinput()) == '/')
3415 break;
3416 else
3417 unput(c);
3418 }
3419 }
3420 }
3421
3422 {number} cout << "number " << YYText() << '\\n';
3423
3424 \\n mylineno++;
3425
3426 {name} cout << "name " << YYText() << '\\n';
3427
3428 {string} cout << "string " << YYText() << '\\n';
3429
3430 %%
3431
3432 int main( int /* argc */, char** /* argv */ )
3433 {
3434 FlexLexer* lexer = new yyFlexLexer;
3435 while(lexer->yylex() != 0)
3436 ;
3437 return 0;
3438 }
3439.fi
3440If you want to create multiple (different) lexer classes, you use the
3441.B \-P
3442flag (or the
3443.B prefix=
3444option) to rename each
3445.B yyFlexLexer
3446to some other
3447.B xxFlexLexer.
3448You then can include
3449.B <FlexLexer.h>
3450in your other sources once per lexer class, first renaming
3451.B yyFlexLexer
3452as follows:
3453.nf
3454
3455 #undef yyFlexLexer
3456 #define yyFlexLexer xxFlexLexer
3457 #include <FlexLexer.h>
3458
3459 #undef yyFlexLexer
3460 #define yyFlexLexer zzFlexLexer
3461 #include <FlexLexer.h>
3462
3463.fi
3464if, for example, you used
3465.B %option prefix="xx"
3466for one of your scanners and
3467.B %option prefix="zz"
3468for the other.
3469.PP
3470IMPORTANT: the present form of the scanning class is
3471.I experimental
3472and may change considerably between major releases.
3473.SH INCOMPATIBILITIES WITH LEX AND POSIX
3474.I flex
3475is a rewrite of the AT&T Unix
3476.I lex
3477tool (the two implementations do not share any code, though),
3478with some extensions and incompatibilities, both of which
3479are of concern to those who wish to write scanners acceptable
3480to either implementation. Flex is fully compliant with the POSIX
3481.I lex
3482specification, except that when using
3483.B %pointer
3484(the default), a call to
3485.B unput()
3486destroys the contents of
3487.B yytext,
3488which is counter to the POSIX specification.
3489.PP
3490In this section we discuss all of the known areas of incompatibility
3491between flex, AT&T lex, and the POSIX specification.
3492.PP
3493.I flex's
3494.B \-l
3495option turns on maximum compatibility with the original AT&T
3496.I lex
3497implementation, at the cost of a major loss in the generated scanner's
3498performance. We note below which incompatibilities can be overcome
3499using the
3500.B \-l
3501option.
3502.PP
3503.I flex
3504is fully compatible with
3505.I lex
3506with the following exceptions:
565.IP -
3507.IP -
566.B unput(c)
567puts the character
568.I c
569back onto the input stream. It will be the next character scanned.
3508The undocumented
3509.I lex
3510scanner internal variable
3511.B yylineno
3512is not supported unless
3513.B \-l
3514or
3515.B %option yylineno
3516is used.
3517.IP
3518.B yylineno
3519should be maintained on a per-buffer basis, rather than a per-scanner
3520(single global variable) basis.
3521.IP
3522.B yylineno
3523is not part of the POSIX specification.
570.IP -
3524.IP -
3525The
571.B input()
3526.B input()
572reads the next character from the input stream (this routine is called
573.B yyinput()
574if the scanner is compiled using
575.B C++).
576.IP -
577.B yyterminate()
578can be used in lieu of a return statement in an action. It terminates
579the scanner and returns a 0 to the scanner's caller, indicating "all done".
3527routine is not redefinable, though it may be called to read characters
3528following whatever has been matched by a rule. If
3529.B input()
3530encounters an end-of-file the normal
3531.B yywrap()
3532processing is done. A ``real'' end-of-file is returned by
3533.B input()
3534as
3535.I EOF.
580.IP
3536.IP
581By default,
582.B yyterminate()
583is also called when an end-of-file is encountered. It is a macro and
584may be redefined.
3537Input is instead controlled by defining the
3538.B YY_INPUT
3539macro.
3540.IP
3541The
3542.I flex
3543restriction that
3544.B input()
3545cannot be redefined is in accordance with the POSIX specification,
3546which simply does not specify any way of controlling the
3547scanner's input other than by making an initial assignment to
3548.I yyin.
585.IP -
3549.IP -
586.B YY_NEW_FILE
587is an action available only in <<EOF>> rules. It means "Okay, I've
588set up a new input file, continue scanning". It is no longer required;
589you can just assign
590.I yyin
591to point to a new file in the <<EOF>> action.
3550The
3551.B unput()
3552routine is not redefinable. This restriction is in accordance with POSIX.
592.IP -
3553.IP -
593.B yy_create_buffer( file, size )
594takes a
595.I FILE
596pointer and an integer
597.I size.
598It returns a YY_BUFFER_STATE
599handle to a new input buffer large enough to accommodate
600.I size
601characters and associated with the given file. When in doubt, use
602.B YY_BUF_SIZE
603for the size.
3554.I flex
3555scanners are not as reentrant as
3556.I lex
3557scanners. In particular, if you have an interactive scanner and
3558an interrupt handler which long-jumps out of the scanner, and
3559the scanner is subsequently called again, you may get the following
3560message:
3561.nf
3562
3563 fatal flex scanner internal error--end of buffer missed
3564
3565.fi
3566To reenter the scanner, first use
3567.nf
3568
3569 yyrestart( yyin );
3570
3571.fi
3572Note that this call will throw away any buffered input; usually this
3573isn't a problem with an interactive scanner.
3574.IP
3575Also note that flex C++ scanner classes
3576.I are
3577reentrant, so if using C++ is an option for you, you should use
3578them instead. See "Generating C++ Scanners" above for details.
604.IP -
3579.IP -
605.B yy_switch_to_buffer( new_buffer )
606switches the scanner's processing to scan for tokens from
607the given buffer, which must be a YY_BUFFER_STATE.
3580.B output()
3581is not supported.
3582Output from the
3583.B ECHO
3584macro is done to the file-pointer
3585.I yyout
3586(default
3587.I stdout).
3588.IP
3589.B output()
3590is not part of the POSIX specification.
608.IP -
3591.IP -
609.B yy_delete_buffer( buffer )
610deletes the given buffer.
611.SH VALUES AVAILABLE TO THE USER
3592.I lex
3593does not support exclusive start conditions (%x), though they
3594are in the POSIX specification.
612.IP -
3595.IP -
613.B char *yytext
614holds the text of the current token. It may be modified but not lengthened
615(you cannot append characters to the end). Modifying the last character
616may affect the activity of rules anchored using '^' during the next scan;
617see
618.B lexdoc(1)
619for details.
3596When definitions are expanded,
3597.I flex
3598encloses them in parentheses.
3599With lex, the following:
3600.nf
3601
3602 NAME [A-Z][A-Z0-9]*
3603 %%
3604 foo{NAME}? printf( "Found it\\n" );
3605 %%
3606
3607.fi
3608will not match the string "foo" because when the macro
3609is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
3610and the precedence is such that the '?' is associated with
3611"[A-Z0-9]*". With
3612.I flex,
3613the rule will be expanded to
3614"foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match.
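The difference between the two expansions is purely textual, which a short C sketch can make concrete. The helper expand_definition below is hypothetical (neither tool exposes such a function); it simply splices a definition's text into a rule, with or without the enclosing parentheses.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Splices `definition` between `before` and `after`, optionally wrapping it
 * in parentheses. Returns `out`, or NULL if the result does not fit.
 * Hypothetical helper for illustration only. */
static char *expand_definition(char *out, size_t out_size,
                               const char *before, const char *definition,
                               const char *after, int parenthesize)
{
    size_t need;

    assert(out != NULL && out_size > 0);

    need = strlen(before) + strlen(definition) + strlen(after)
         + (parenthesize ? 2 : 0) + 1;
    if (need > out_size)
        return NULL;

    if (parenthesize)
        snprintf(out, out_size, "%s(%s)%s", before, definition, after);
    else
        snprintf(out, out_size, "%s%s%s", before, definition, after);
    return out;
}
```

Called with the pieces of the rule above, the parenthesized form yields the text that flex scans and the bare form yields the text that lex scans, so the trailing '?' ends up governing different sub-expressions.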
620.IP
3615.IP
621If the special directive
622.B %array
623appears in the first section of the scanner description, then
624.B yytext
625is instead declared
626.B char yytext[YYLMAX],
627where
628.B YYLMAX
629is a macro definition that you can redefine in the first section
630if you don't like the default value (generally 8KB). Using
631.B %array
632results in somewhat slower scanners, but the value of
633.B yytext
634becomes immune to calls to
635.I input()
3616Note that if the definition begins with
3617.B ^
3618or ends with
3619.B $
3620then it is
3621.I not
3622expanded with parentheses, to allow these operators to appear in
3623definitions without losing their special meanings. But the
3624.B <s>, /,
636and
3625and
637.I unput(),
638which potentially destroy its value when
639.B yytext
640is a character pointer. The opposite of
641.B %array
642is
643.B %pointer,
644which is the default.
3626.B <<EOF>>
3627operators cannot be used in a
3628.I flex
3629definition.
645.IP
3630.IP
646You cannot use
647.B %array
648when generating C++ scanner classes
649(the
650.B \-+
651flag).
3631Using
3632.B \-l
3633results in the
3634.I lex
3635behavior of no parentheses around the definition.
3636.IP
3637The POSIX specification is that the definition be enclosed in parentheses.
652.IP -
3638.IP -
653.B int yyleng
654holds the length of the current token.
655.IP -
656.B FILE *yyin
657is the file which by default
3639Some implementations of
3640.I lex
3641allow a rule's action to begin on a separate line, if the rule's pattern
3642has trailing whitespace:
3643.nf
3644
3645 %%
3646 foo|bar<space here>
3647 { foobar_action(); }
3648
3649.fi
658.I flex
3650.I flex
659reads from. It may be redefined but doing so only makes sense before
660scanning begins or after an EOF has been encountered. Changing it in
661the midst of scanning will have unexpected results since
662.I flex
663buffers its input; use
664.B yyrestart()
665instead.
666Once scanning terminates because an end-of-file
667has been seen,
668.B
669you can assign
670.I yyin
671at the new input file and then call the scanner again to continue scanning.
3651does not support this feature.
672.IP -
3652.IP -
673.B void yyrestart( FILE *new_file )
674may be called to point
675.I yyin
676at the new input file. The switch-over to the new file is immediate
677(any previously buffered-up input is lost). Note that calling
678.B yyrestart()
679with
680.I yyin
681as an argument thus throws away the current input buffer and continues
682scanning the same input file.
3653The
3654.I lex
3655.B %r
3656(generate a Ratfor scanner) option is not supported. It is not part
3657of the POSIX specification.
683.IP -
3658.IP -
684.B FILE *yyout
685is the file to which
686.B ECHO
687actions are done. It can be reassigned by the user.
3659After a call to
3660.B unput(),
3661.I yytext
3662is undefined until the next token is matched, unless the scanner
3663was built using
3664.B %array.
3665This is not the case with
3666.I lex
3667or the POSIX specification. The
3668.B \-l
3669option does away with this incompatibility.
688.IP -
3670.IP -
689.B YY_CURRENT_BUFFER
690returns a
691.B YY_BUFFER_STATE
692handle to the current buffer.
3671The precedence of the
3672.B {}
3673(numeric range) operator is different.
3674.I lex
3675interprets "abc{1,3}" as "match one, two, or
3676three occurrences of 'abc'", whereas
3677.I flex
3678interprets it as "match 'ab'
3679followed by one, two, or three occurrences of 'c'". The latter is
3680in agreement with the POSIX specification.
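The two readings can be tried out with the POSIX regcomp() interface, whose extended regular expressions also bind an interval to the single preceding character. This is only an analogy: it exercises <regex.h>, not a flex or lex scanner, and the helper name ere_matches is hypothetical. The lex reading is emulated by adding explicit parentheses.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if `text` matches the POSIX extended regular expression
 * `pattern`, 0 if it does not, and -1 if the pattern fails to compile.
 * Hypothetical helper for illustration only. */
static int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int matched;

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    matched = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}
```

Under the reading shared by flex and POSIX, "^abc{1,3}$" accepts "abccc" but not "abcabc". The lex reading corresponds to "^(abc){1,3}$", which accepts "abcabc" but not "abccc".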
693.IP -
3681.IP -
694.B YY_START
695returns an integer value corresponding to the current start
696condition. You can subsequently use this value with
697.B BEGIN
698to return to that start condition.
699.SH MACROS AND FUNCTIONS YOU CAN REDEFINE
3682The precedence of the
3683.B ^
3684operator is different.
3685.I lex
3686interprets "^foo|bar" as "match either 'foo' at the beginning of a line,
3687or 'bar' anywhere", whereas
3688.I flex
3689interprets it as "match either 'foo' or 'bar' if they come at the beginning
3690of a line". The latter is in agreement with the POSIX specification.
700.IP -
3691.IP -
701.B YY_DECL
702controls how the scanning routine is declared.
703By default, it is "int yylex()", or, if prototypes are being
704used, "int yylex(void)". This definition may be changed by redefining
705the "YY_DECL" macro. Note that
706if you give arguments to the scanning routine using a
707K&R-style/non-prototyped function declaration, you must terminate
708the definition with a semi-colon (;).
3692The special table-size declarations such as
3693.B %a
3694supported by
3695.I lex
3696are not required by
3697.I flex
3698scanners;
3699.I flex
3700ignores them.
709.IP -
3701.IP -
710The nature of how the scanner
711gets its input can be controlled by redefining the
712.B YY_INPUT
713macro.
714YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its
715action is to place up to
716.I max_size
717characters in the character array
718.I buf
719and return in the integer variable
720.I result
721either the
722number of characters read or the constant YY_NULL (0 on Unix systems)
723to indicate EOF. The default YY_INPUT reads from the
724global file-pointer "yyin".
725A sample redefinition of YY_INPUT (in the definitions
726section of the input file):
3702The name
3703.B
3704FLEX_SCANNER
3705is #define'd so scanners may be written for use with either
3706.I flex
3707or
3708.I lex.
3709Scanners also include
3710.B YY_FLEX_MAJOR_VERSION
3711and
3712.B YY_FLEX_MINOR_VERSION
3713indicating which version of
3714.I flex
3715generated the scanner
3716(for example, for the 2.5 release, these defines would be 2 and 5
3717respectively).
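A scanner source meant for both tools might test these names as in the C sketch below. The function name generated_by_flex_at_least is hypothetical; only FLEX_SCANNER, YY_FLEX_MAJOR_VERSION and YY_FLEX_MINOR_VERSION come from the text above. Compiled outside a flex-generated scanner, none of the three is defined and the function reports 0.

```c
#include <assert.h>

/* Returns nonzero only when compiled as part of a scanner generated by
 * flex version major.minor or later; returns 0 under lex or when the
 * version macros are unavailable. Hypothetical helper for illustration. */
static int generated_by_flex_at_least(int major, int minor)
{
    assert(major >= 0 && minor >= 0);

#if defined(FLEX_SCANNER) && defined(YY_FLEX_MAJOR_VERSION) && \
    defined(YY_FLEX_MINOR_VERSION)
    return YY_FLEX_MAJOR_VERSION > major ||
           (YY_FLEX_MAJOR_VERSION == major &&
            YY_FLEX_MINOR_VERSION >= minor);
#else
    return 0;
#endif
}
```

In a real scanner the same #if test would more often guard declarations or macro definitions directly, rather than being wrapped in a function.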
3718.PP
3719The following
3720.I flex
3721features are not included in
3722.I lex
3723or the POSIX specification:
727.nf
728
3724.nf
3725
729 %{
730 #undef YY_INPUT
731 #define YY_INPUT(buf,result,max_size) \\
732 { \\
733 int c = getchar(); \\
734 result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \\
735 }
736 %}
3726 C++ scanners
3727 %option
3728 start condition scopes
3729 start condition stacks
3730 interactive/non-interactive scanners
3731 yy_scan_string() and friends
3732 yyterminate()
3733 yy_set_interactive()
3734 yy_set_bol()
3735 YY_AT_BOL()
3736 <<EOF>>
3737 <*>
3738 YY_DECL
3739 YY_START
3740 YY_USER_ACTION
3741 YY_USER_INIT
3742 #line directives
3743 %{}'s around actions
3744 multiple actions on a line
737
738.fi
3745
3746.fi
739.IP -
740When the scanner receives an end-of-file indication from YY_INPUT,
741it then checks the
742.B yywrap()
743function. If
744.B yywrap()
745returns false (zero), then it is assumed that the
746function has gone ahead and set up
747.I yyin
748to point to another input file, and scanning continues. If it returns
749true (non-zero), then the scanner terminates, returning 0 to its
750caller.
751.IP
752The default
753.B yywrap()
754always returns 1.
755.IP -
756YY_USER_ACTION
757can be redefined to provide an action
758which is always executed prior to the matched rule's action.
759.IP -
760The macro
761.B YY_USER_INIT
762may be redefined to provide an action which is always executed before
763the first scan.
764.IP -
765In the generated scanner, the actions are all gathered in one large
766switch statement and separated using
767.B YY_BREAK,
768which may be redefined. By default, it is simply a "break", to separate
769each rule's action from the following rule's.
770.SH FILES
771.TP
772.B \-ll
773library with which to link scanners to obtain the default versions
774of
775.I yywrap()
776and/or
777.I main().
778.TP
779.I lex.yy.c
780generated scanner (called
781.I lexyy.c
782on some systems).
783.TP
784.I lex.yy.cc
785generated C++ scanner class, when using
786.B -+.
787.TP
788.I <FlexLexer.h>
789header file defining the C++ scanner base class,
790.B FlexLexer,
791and its derived class,
792.B yyFlexLexer.
793.TP
794.I flex.skl
795skeleton scanner. This file is only used when building flex, not when
796flex executes.
797.TP
798.I lex.backup
799backing-up information for
800.B \-b
801flag (called
802.I lex.bck
803on some systems).
804.SH "SEE ALSO"
3747plus almost all of the flex flags.
3748The last feature in the list refers to the fact that with
3749.I flex
3750you can put multiple actions on the same line, separated with
3751semi-colons, while with
3752.I lex,
3753the following
3754.nf
3755
3756 foo handle_foo(); ++num_foos_seen;
3757
3758.fi
3759is (rather surprisingly) truncated to
3760.nf
3761
3762 foo handle_foo();
3763
3764.fi
3765.I flex
3766does not truncate the action. Actions that are not enclosed in
3767braces are simply terminated at the end of the line.
3768.SH DIAGNOSTICS
805.PP
3769.PP
806lexdoc(1), lex(1), yacc(1), sed(1), awk(1).
3770.I warning, rule cannot be matched
3771indicates that the given rule
3772cannot be matched because it follows other rules that will
3773always match the same text as it. For
3774example, in the following "foo" cannot be matched because it comes after
3775an identifier "catch-all" rule:
3776.nf
3777
3778 [a-z]+ got_identifier();
3779 foo got_foo();
3780
3781.fi
3782Using
3783.B REJECT
3784in a scanner suppresses this warning.
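The reason "foo" can never win is the tie-breaking order: the longest match is preferred, and among matches of equal length the rule listed first is chosen. The C sketch below models only that selection step. The name choose_rule and its array-of-lengths interface are hypothetical and are not how a generated scanner is organized.

```c
#include <assert.h>
#include <stddef.h>

/* match_len[i] is the length rule i would match at the current position
 * (0 means no match). Returns the index of the rule that wins, or -1 if
 * no rule matches. Hypothetical helper for illustration only. */
static int choose_rule(const size_t *match_len, size_t rule_count)
{
    int best = -1;
    size_t best_len = 0;
    size_t i;

    assert(match_len != NULL || rule_count == 0);

    for (i = 0; i < rule_count; ++i) {
        /* Strictly longer only: on a tie the earlier rule is kept. */
        if (match_len[i] > best_len) {
            best_len = match_len[i];
            best = (int)i;
        }
    }
    return best;
}
```

With the rules in the order shown above, the input "foo" gives both rules a length of 3, so rule 0, the catch-all, is selected every time. Listing "foo" first would let it win the tie, while a longer input such as "foobar" would still go to the catch-all.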
807.PP
3785.PP
808M. E. Lesk and E. Schmidt,
809.I LEX \- Lexical Analyzer Generator
810.SH DIAGNOSTICS
3786.I warning,
3787.B \-s
3788.I
3789option given but default rule can be matched
3790means that it is possible (perhaps only in a particular start condition)
3791that the default rule (match any single character) is the only one
3792that will match a particular input. Since
3793.B \-s
3794was given, presumably this is not intended.
811.PP
812.I reject_used_but_not_detected undefined
813or
3795.PP
3796.I reject_used_but_not_detected undefined
3797or
814.PP
815.I yymore_used_but_not_detected undefined -
816These errors can occur at compile time. They indicate that the
817scanner uses
818.B REJECT
819or
820.B yymore()
821but that
822.I flex
823failed to notice the fact, meaning that
824.I flex
825scanned the first two sections looking for occurrences of these actions
826and failed to find any, but somehow you snuck some in (via a #include
3798.I yymore_used_but_not_detected undefined -
3799These errors can occur at compile time. They indicate that the
3800scanner uses
3801.B REJECT
3802or
3803.B yymore()
3804but that
3805.I flex
3806failed to notice the fact, meaning that
3807.I flex
3808scanned the first two sections looking for occurrences of these actions
3809and failed to find any, but somehow you snuck some in (via a #include
827file, for example). Make an explicit reference to the action in your
828.I flex
829input file. (Note that previously
830.I flex
831supported a
832.B %used/%unused
833mechanism for dealing with this problem; this feature is still supported
834but now deprecated, and will go away soon unless the author hears from
835people who can argue compellingly that they need it.)
3810file, for example). Use
3811.B %option reject
3812or
3813.B %option yymore
3814to indicate to flex that you really do use these features.
836.PP
837.I flex scanner jammed -
838a scanner compiled with
839.B \-s
840has encountered an input string which wasn't matched by
3815.PP
3816.I flex scanner jammed -
3817a scanner compiled with
3818.B \-s
3819has encountered an input string which wasn't matched by
841any of its rules.
3820any of its rules. This error can also occur due to internal problems.
842.PP
3821.PP
843.I warning, rule cannot be matched
844indicates that the given rule
845cannot be matched because it follows other rules that will
846always match the same text as it. See
847.I lexdoc(1)
848for an example.
849.PP
850.I warning,
851.B \-s
852.I
853option given but default rule can be matched
854means that it is possible (perhaps only in a particular start condition)
855that the default rule (match any single character) is the only one
856that will match a particular input. Since
857.PP
858.I scanner input buffer overflowed -
859a scanner rule matched more text than the available dynamic memory.
860.PP
861.I token too large, exceeds YYLMAX -
862your scanner uses
863.B %array
864and one of its rules matched a string longer than the
865.B YYLMAX
866constant (8K bytes by default). You can increase the value by
867#define'ing
868.B YYLMAX

--- 5 unchanged lines hidden ---

874.I use the character 'x' -
875Your scanner specification includes recognizing the 8-bit character
876.I 'x'
877and you did not specify the \-8 flag, and your scanner defaulted to 7-bit
878because you used the
879.B \-Cf
880or
881.B \-CF
3822.I token too large, exceeds YYLMAX -
3823your scanner uses
3824.B %array
3825and one of its rules matched a string longer than the
3826.B YYLMAX
3827constant (8K bytes by default). You can increase the value by
3828#define'ing
3829.B YYLMAX

--- 5 unchanged lines hidden ---

3835.I use the character 'x' -
3836Your scanner specification includes recognizing the 8-bit character
3837.I 'x'
3838and you did not specify the \-8 flag, and your scanner defaulted to 7-bit
3839because you used the
3840.B \-Cf
3841or
3842.B \-CF
882table compression options.
3843table compression options. See the discussion of the
3844.B \-7
3845flag for details.
883.PP
884.I flex scanner push-back overflow -
885you used
886.B unput()
887to push back so much text that the scanner's buffer could not hold
888both the pushed-back text and the current token in
889.B yytext.
890Ideally the scanner should dynamically resize the buffer in this case, but at

--- 11 unchanged lines hidden ---

902This can occur in a scanner which is reentered after a long-jump
903has jumped out (or over) the scanner's activation frame. Before
904reentering the scanner, use:
905.nf
906
907 yyrestart( yyin );
908
909.fi
3846.PP
3847.I flex scanner push-back overflow -
3848you used
3849.B unput()
3850to push back so much text that the scanner's buffer could not hold
3851both the pushed-back text and the current token in
3852.B yytext.
3853Ideally the scanner should dynamically resize the buffer in this case, but at

--- 11 unchanged lines hidden ---

3865This can occur in a scanner which is reentered after a long-jump
3866has jumped out (or over) the scanner's activation frame. Before
3867reentering the scanner, use:
3868.nf
3869
3870 yyrestart( yyin );
3871
3872.fi
910or use C++ scanner classes (the
911.B \-+
912option), which are fully reentrant.
913.SH AUTHOR
914Vern Paxson, with the help of many ideas and much inspiration from
915Van Jacobson. Original version by Jef Poskanzer.
3873or, as noted above, switch to using the C++ scanner class.
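
The restart-after-long-jump advice can be illustrated with a small model. This is only a sketch: yyrestart_stub(), yylex_stub(), scanner_busy and scan_after_longjmp() are hypothetical stand-ins invented for the example, not code that flex generates. In a real scanner the call to make is yyrestart( yyin ), as shown above.

```c
#include <setjmp.h>
#include <stdio.h>

/* All names below are hypothetical stand-ins, not flex's own code. */
static jmp_buf recover_point;
static int scanner_busy;        /* models state left by an interrupted scan */

/* Stand-in for yyrestart(): discard the interrupted scan's state. */
static void yyrestart_stub(FILE *input_file)
{
    (void) input_file;          /* a real scanner would switch to this file */
    scanner_busy = 0;
}

/* Stand-in for yylex(): returns 0 on a clean scan, -1 if it is
 * reentered while state from an earlier scan is still present. */
static int yylex_stub(int bail_out)
{
    if (scanner_busy)
        return -1;
    scanner_busy = 1;
    if (bail_out)
        longjmp(recover_point, 1);  /* jump out over this activation frame */
    scanner_busy = 0;
    return 0;
}

/* Jump out of one scan, optionally restart, then scan again. */
int scan_after_longjmp(int call_restart)
{
    scanner_busy = 0;
    if (setjmp(recover_point) == 0)
        yylex_stub(1);              /* leaves via longjmp, never returns here */
    if (call_restart)
        yyrestart_stub(stdin);
    return yylex_stub(0);
}
```

In this model the second scan succeeds only when the restart call is made first. Without it, the stand-in scanner sees the stale state and reports failure.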
916.PP
3874.PP
917See lexdoc(1) for additional credits and the address to send comments to.
3875.I too many start conditions in <> construct! -
3876you listed more start conditions in a <> construct than exist (so
3877you must have listed at least one of them twice).
3878.SH FILES
3879.TP
3880.B \-ll
3881library with which scanners must be linked.
3882.TP
3883.I lex.yy.c
3884generated scanner (called
3885.I lexyy.c
3886on some systems).
3887.TP
3888.I lex.yy.cc
3889generated C++ scanner class, when using
3890.B \-+.
3891.TP
3892.I <FlexLexer.h>
3893header file defining the C++ scanner base class,
3894.B FlexLexer,
3895and its derived class,
3896.B yyFlexLexer.
3897.TP
3898.I flex.skl
3899skeleton scanner. This file is only used when building flex, not when
3900flex executes.
3901.TP
3902.I lex.backup
3903backing-up information for
3904.B \-b
3905flag (called
3906.I lex.bck
3907on some systems).
918.SH DEFICIENCIES / BUGS
919.PP
920Some trailing context
921patterns cannot be properly matched and generate
922warning messages ("dangerous trailing context"). These are
923patterns where the ending of the
924first part of the rule matches the beginning of the second
925part, such as "zx*/xy*", where the 'x*' matches the 'x' at

--- 15 unchanged lines hidden ---

941 %%
942 abc |
943 xyz/def
944
945.fi
946.PP
947Use of
948.B unput()
3908.SH DEFICIENCIES / BUGS
3909.PP
3910Some trailing context
3911patterns cannot be properly matched and generate
3912warning messages ("dangerous trailing context"). These are
3913patterns where the ending of the
3914first part of the rule matches the beginning of the second
3915part, such as "zx*/xy*", where the 'x*' matches the 'x' at

--- 15 unchanged lines hidden ---

3931 %%
3932 abc |
3933 xyz/def
3934
3935.fi
3936.PP
3937Use of
3938.B unput()
949or
950.B input()
951invalidates yytext and yyleng, unless the
952.B %array
953directive
954or the
955.B \-l
956option has been used.
957.PP
3939invalidates yytext and yyleng, unless the
3940.B %array
3941directive
3942or the
3943.B \-l
3944option has been used.
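
Since yytext and yyleng do not survive a call to unput() in the default mode, an action that still needs the matched text can copy it first. A minimal sketch, assuming a helper named copy_token() (hypothetical, not part of flex):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a NUL-terminated heap copy of the first
 * `length` bytes of `text`, or NULL if memory cannot be allocated.
 * The caller owns the copy and must free() it. */
char *copy_token(const char *text, size_t length)
{
    char *copy = malloc(length + 1);

    if (copy == NULL)
        return NULL;
    memcpy(copy, text, length);     /* copy only the matched bytes */
    copy[length] = '\0';
    return copy;
}
```

Inside an action this might be used as copy_token( yytext, yyleng ) before the first unput() call. The copy stays valid afterwards and is released with free() once the action is done with it.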
3945.PP
958Use of unput() to push back more text than was matched can
959result in the pushed-back text matching a beginning-of-line ('^')
960rule even though it didn't come at the beginning of the line
961(though this is rare!).
962.PP
963Pattern-matching of NUL's is substantially slower than matching other
964characters.
965.PP
966Dynamic resizing of the input buffer is slow, as it entails rescanning
967all the text matched so far by the current (generally huge) token.
968.PP
3946Pattern-matching of NUL's is substantially slower than matching other
3947characters.
3948.PP
3949Dynamic resizing of the input buffer is slow, as it entails rescanning
3950all the text matched so far by the current (generally huge) token.
3951.PP
969.I flex
970does not generate correct #line directives for code internal
971to the scanner; thus, bugs in
972.I flex.skl
973yield bogus line numbers.
974.PP
975Due to both buffering of input and read-ahead, you cannot intermix
976calls to <stdio.h> routines, such as, for example,
977.B getchar(),
978with
979.I flex
980rules and expect it to work. Call
981.B input()
982instead.

--- 11 unchanged lines hidden ---

994.B \-f
995or
996.B \-F
997options.
998.PP
999The
1000.I flex
1001internal algorithms need documentation.
3952Due to both buffering of input and read-ahead, you cannot intermix
3953calls to <stdio.h> routines, such as, for example,
3954.B getchar(),
3955with
3956.I flex
3957rules and expect it to work. Call
3958.B input()
3959instead.

--- 11 unchanged lines hidden ---

3971.B \-f
3972or
3973.B \-F
3974options.
3975.PP
3976The
3977.I flex
3978internal algorithms need documentation.
3979.SH SEE ALSO
3980.PP
3981lex(1), yacc(1), sed(1), awk(1).
3982.PP
3983John Levine, Tony Mason, and Doug Brown,
3984.I Lex & Yacc,
3985O'Reilly and Associates. Be sure to get the 2nd edition.
3986.PP
3987M. E. Lesk and E. Schmidt,
3988.I LEX \- Lexical Analyzer Generator
3989.PP
3990Alfred Aho, Ravi Sethi and Jeffrey Ullman,
3991.I Compilers: Principles, Techniques and Tools,
3992Addison-Wesley (1986). Describes the pattern-matching techniques used by
3993.I flex
3994(deterministic finite automata).
3995.SH AUTHOR
3996Vern Paxson, with the help of many ideas and much inspiration from
3997Van Jacobson. Original version by Jef Poskanzer. The fast table
3998representation is a partial implementation of a design done by Van
3999Jacobson. The implementation was done by Kevin Gong and Vern Paxson.
4000.PP
4001Thanks to the many
4002.I flex
4003beta-testers, feedbackers, and contributors, especially Francois Pinard,
4004Casey Leedom,
4005Robert Abramovitz,
4006Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai,
4007Neal Becker, Nelson H.F. Beebe, benson@odi.com,
4008Karl Berry, Peter A. Bigot, Simon Blanchard,
4009Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher,
4010Brian Clapper, J.T. Conklin,
4011Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David
4012Daniels, Chris G. Demetriou, Theo Deraadt,
4013Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
4014Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl,
4015Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz,
4016Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel,
4017Jan Hajic, Charles Hemphill, NORO Hideo,
4018Jarkko Hietaniemi, Scott Hofmann,
4019Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
4020Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
4021Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
4022Amir Katz, ken@ken.hilco.com, Kevin B. Kenny,
4023Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht,
4024Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle,
4025David Loffredo, Mike Long,
4026Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall,
4027Bengt Martensson, Chris Metcalf,
4028Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
4029G.T. Nicol, Landon Noll, James Nordby, Marc Nozell,
4030Richard Ohnemus, Karsten Pahnke,
4031Sven Panne, Roland Pesch, Walter Pelissero, Gaumond
4032Pierre, Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha,
4033Frederic Raimbault, Pat Rankin, Rick Richardson,
4034Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini,
4035Andreas Scherer, Darrell Schiebel, Raf Schietekat,
4036Doug Schmidt, Philippe Schnoebelen, Andreas Schwab,
4037Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Strömquist,
4038Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor,
4039Chris Thewalt, Richard M. Timoney, Jodi Tsai,
4040Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams, Ken
4041Yap, Ron Zellar, Nathan Zelle, David Zuhn,
4042and those whose names have slipped my marginal
4043mail-archiving skills but whose contributions are appreciated all the
4044same.
4045.PP
4046Thanks to Keith Bostic, Jon Forrest, Noah Friedman,
4047John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
4048Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various
4049distribution headaches.
4050.PP
4051Thanks to Esmond Pitt and Earle Horton for 8-bit character support; to
4052Benson Margulies and Fred Burke for C++ support; to Kent Williams and Tom
4053Epperly for C++ class support; to Ove Ewerlid for support of NUL's; and to
4054Eric Hughes for support of multiple buffers.
4055.PP
4056This work was primarily done when I was with the Real Time Systems Group
4057at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there
4058for the support I received.
4059.PP
4060Send comments to vern@ee.lbl.gov.