\input texinfo
@setfilename cppinternals.info
@settitle The GNU C Preprocessor Internals

@include gcc-common.texi

@ifinfo
@dircategory Software development
@direntry
* Cpplib: (cppinternals).      Cpplib internals.
@end direntry
@end ifinfo

@c @smallbook
@c @cropmarks
@c @finalout
@setchapternewpage odd
@ifinfo
This file documents the internals of the GNU C Preprocessor.

Copyright 2000, 2001, 2002, 2004, 2005 Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

@ignore
Permission is granted to process this file through Tex and print the
results, provided the printed document carries copying permission
notice identical to this one except for the removal of this paragraph
(this paragraph not being relevant to the printed manual).

@end ignore
Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided also that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions.
@end ifinfo

@titlepage
@title Cpplib Internals
@versionsubtitle
@author Neil Booth
@page
@vskip 0pt plus 1filll
@c man begin COPYRIGHT
Copyright @copyright{} 2000, 2001, 2002, 2004, 2005
Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided also that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions.
@c man end
@end titlepage
@contents
@page

@node Top
@top
@chapter Cpplib---the GNU C Preprocessor

The GNU C preprocessor is implemented as a library, @dfn{cpplib}, so
that it can be easily shared between a stand-alone preprocessor and a
preprocessor integrated with the C, C++ and Objective-C front ends.  It
is also available for use by other programs, though this is not
recommended as its exposed interface has not yet reached a point of
reasonable stability.

The library has been written to be re-entrant, so that it can be used
to preprocess many files simultaneously if necessary.  It has also been
written with the preprocessing token as the fundamental unit; the
preprocessor in previous versions of GCC would operate on text strings
as the fundamental unit.

This brief manual documents the internals of cpplib, and explains some
of the tricky issues.  It is intended that, along with the comments in
the source code, a reasonably competent C programmer should be able to
figure out what the code is doing, and why things have been implemented
the way they have.

@menu
* Conventions::         Conventions used in the code.
* Lexer::               The combined C, C++ and Objective-C Lexer.
* Hash Nodes::          All identifiers are entered into a hash table.
* Macro Expansion::     Macro expansion algorithm.
* Token Spacing::       Spacing and paste avoidance issues.
* Line Numbering::      Tracking location within files.
* Guard Macros::        Optimizing header files with guard macros.
* Files::               File handling.
* Concept Index::       Index.
@end menu

@node Conventions
@unnumbered Conventions
@cindex interface
@cindex header files

cpplib has two interfaces---one is exposed internally only, and the
other is for both internal and external use.

The convention is that functions and types that are exposed to multiple
files internally are prefixed with @samp{_cpp_}, and are to be found in
the file @file{internal.h}.  Functions and types exposed to external
clients are in @file{cpplib.h}, and prefixed with @samp{cpp_}.  For
historical reasons this is no longer quite true, but we should strive to
stick to it.

We are striving to reduce the information exposed in @file{cpplib.h} to the
bare minimum necessary, and then to keep it there.  This makes clear
exactly what external clients are entitled to assume, and allows us to
change internals in the future without worrying whether library clients
are perhaps relying on some kind of undocumented implementation-specific
behavior.

@node Lexer
@unnumbered The Lexer
@cindex lexer
@cindex newlines
@cindex escaped newlines

@section Overview
The lexer is contained in the file @file{lex.c}.  It is a hand-coded
lexer, and not implemented as a state machine.  It can understand C, C++
and Objective-C source code, and has been extended to allow reasonably
successful preprocessing of assembly language.  The lexer does not make
an initial pass to strip out trigraphs and escaped newlines, but handles
them as they are encountered in a single pass of the input file.  It
returns preprocessing tokens individually, not a line at a time.

It is mostly transparent to users of the library, since the library's
interface for obtaining the next token, @code{cpp_get_token}, takes care
of lexing new tokens, handling directives, and expanding macros as
necessary.  However, the lexer does expose some functionality so that
clients of the library can easily spell a given token, such as
@code{cpp_spell_token} and @code{cpp_token_len}.  These functions are
useful when generating diagnostics, and for emitting the preprocessed
output.

@section Lexing a token
Lexing of an individual token is handled by @code{_cpp_lex_direct} and
its subroutines.  In its current form the code is quite complicated,
with read ahead characters and such-like, since it strives to not step
back in the character stream in preparation for handling non-ASCII file
encodings.  The current plan is to convert any such files to UTF-8
before processing them.  This complexity is therefore unnecessary and
will be removed, so I'll not discuss it further here.

The job of @code{_cpp_lex_direct} is simply to lex a token.  It is not
responsible for issues like directive handling, returning lookahead
tokens directly, multiple-include optimization, or conditional block
skipping.  It necessarily has a minor r@^ole to play in memory
management of lexed lines.  I discuss these issues in a separate section
(@pxref{Lexing a line}).

The lexer places the token it lexes into storage pointed to by the
variable @code{cur_token}, and then increments it.  This variable is
important for correct diagnostic positioning.  Unless a specific line
and column are passed to the diagnostic routines, they will examine the
@code{line} and @code{col} values of the token just before the location
that @code{cur_token} points to, and use that location to report the
diagnostic.

The lexer does not consider whitespace to be a token in its own right.
If whitespace (other than a new line) precedes a token, it sets the
@code{PREV_WHITE} bit in the token's flags.  Each token has its
@code{line} and @code{col} variables set to the line and column of the
first character of the token.  This line number is the line number in
the translation unit, and can be converted to a source (file, line) pair
using the line map code.

The first token on a logical, i.e.@: unescaped, line has the flag
@code{BOL} set for beginning-of-line.  This flag is intended for
internal use, both to distinguish a @samp{#} that begins a directive
from one that doesn't, and to generate a call-back to clients that want
to be notified about the start of every non-directive line with tokens
on it.  Clients cannot reliably determine this for themselves: the first
token might be a macro, and the tokens of a macro expansion do not have
the @code{BOL} flag set.  The macro expansion may even be empty, and the
next token on the line certainly won't have the @code{BOL} flag set.
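
The two flags described above can be illustrated with a toy lexer.
This is only a sketch and is not taken from @file{lex.c}: the names
@code{toy_token}, @code{toy_lex}, @code{TOY_PREV_WHITE} and
@code{TOY_BOL} are invented for the example, and the lexer recognizes
nothing beyond words and single punctuation characters.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical flag values for this sketch; cpplib defines its own.  */
#define TOY_PREV_WHITE 1u   /* whitespace preceded the token */
#define TOY_BOL        2u   /* first token on its line */

struct toy_token {
  const char *start;   /* first character of the token */
  size_t len;          /* number of characters in the token */
  unsigned flags;      /* TOY_PREV_WHITE and/or TOY_BOL */
};

/* Lex SRC into at most MAX tokens.  A run of letters, digits and '_'
   forms one token; any other non-space character is a token by itself.
   Whitespace never becomes a token; it only sets a flag on the token
   that follows.  Returns the number of tokens stored.  */
size_t toy_lex(const char *src, struct toy_token *out, size_t max)
{
  size_t n = 0;
  unsigned pending = TOY_BOL;   /* flags for the next token */

  while (*src != '\0' && n < max) {
    if (*src == '\n') {
      pending = TOY_BOL;        /* a new line does not count as PREV_WHITE */
      src++;
    } else if (*src == ' ' || *src == '\t') {
      pending |= TOY_PREV_WHITE;
      src++;
    } else {
      const char *start = src;
      if (isalnum((unsigned char) *src) || *src == '_') {
        while (isalnum((unsigned char) *src) || *src == '_')
          src++;
      } else {
        src++;
      }
      out[n].start = start;
      out[n].len = (size_t) (src - start);
      out[n].flags = pending;
      pending = 0;
      n++;
    }
  }
  return n;
}
```

For the input @samp{a  +b} followed by an indented @samp{c} on the next
line, the sketch stores four tokens: @samp{a} carries only the
beginning-of-line flag, @samp{+} only the whitespace flag, @samp{b}
neither, and @samp{c} both.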

New lines are treated specially; exactly how the lexer handles them is
context-dependent.  The C standard mandates that directives are
terminated by the first unescaped newline character, even if it appears
in the middle of a macro expansion.  Therefore, if the state variable
@code{in_directive} is set, the lexer returns a @code{CPP_EOF} token,
which is normally used to indicate end-of-file, to indicate
end-of-directive.  In a directive a @code{CPP_EOF} token never means
end-of-file.  Conveniently, if the caller was @code{collect_args}, it
already handles @code{CPP_EOF} as if it were end-of-file, and reports an
error about an unterminated macro argument list.

The C standard also specifies that a new line in the middle of the
arguments to a macro is treated as whitespace.  This white space is
important in case the macro argument is stringified.  The state variable
@code{parsing_args} is nonzero when the preprocessor is collecting the
arguments to a macro call.  It is set to 1 when looking for the opening
parenthesis to a function-like macro, and 2 when collecting the actual
arguments up to the closing parenthesis, since these two cases need to
be distinguished sometimes.  One such time is here: the lexer sets the
@code{PREV_WHITE} flag of a token if it meets a new line when
@code{parsing_args} is set to 2.  It doesn't set it if it meets a new
line when @code{parsing_args} is 1, since then code like

@smallexample
#define foo() bar
foo
baz
@end smallexample

@noindent would be output with an erroneous space before @samp{baz}:

@smallexample
foo
 baz
@end smallexample

This is a good example of the subtlety of getting token spacing correct
in the preprocessor; there are plenty of tests in the testsuite for
corner cases like this.
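
The stringification case mentioned above can be checked with any
conforming compiler.  In this small example, where the macro name
@code{STR} and the function name are chosen only for illustration, the
new line inside the argument list is treated as whitespace, so the two
argument tokens end up separated by a single space in the string.

```c
/* Stringify the argument exactly as spelled.  */
#define STR(x) #x

/* The argument spans two lines; the new line between its two tokens
   counts as ordinary whitespace.  */
static const char spelled[] = STR(a
                                  b);

const char *stringified_argument(void)
{
  return spelled;
}
```

Here @code{stringified_argument} returns the string @samp{a b}.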

The lexer is written to treat each of @samp{\r}, @samp{\n}, @samp{\r\n}
and @samp{\n\r} as a single new line indicator.  This allows it to
transparently preprocess MS-DOS, Macintosh and Unix files without their
needing to pass through a special filter beforehand.
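
As a rough sketch of that rule, and not the code cpplib actually uses,
the following hypothetical helpers recognize the four new line
indicators and count each of them once.

```c
#include <stddef.h>

/* If S begins with a new line indicator, return its length in
   characters (1 or 2); otherwise return 0.  The pairs "\r\n" and
   "\n\r" each count as ONE new line.  */
size_t newline_length(const char *s)
{
  if (s[0] != '\n' && s[0] != '\r')
    return 0;
  /* A second, different line-ending character pairs with the first.  */
  if ((s[1] == '\n' || s[1] == '\r') && s[1] != s[0])
    return 2;
  return 1;
}

/* Count the new line indicators in TEXT.  */
size_t count_newlines(const char *text)
{
  size_t count = 0;

  while (*text != '\0') {
    size_t len = newline_length(text);
    if (len != 0) {
      count++;
      text += len;
    } else {
      text++;
    }
  }
  return count;
}
```

Two identical characters in a row, such as @samp{\n\n}, are two new
lines; only a mixed pair is folded into one.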

We also decided to treat a backslash, either @samp{\} or the trigraph
@samp{??/}, separated from one of the above newline indicators by
non-comment whitespace only, as intending to escape the newline.  Such
stray whitespace tends to be a typing mistake, and cannot reasonably be
mistaken for anything else in any of the C-family grammars.  Since
handling it this way is not strictly conforming to the ISO standard, the
library issues a warning wherever it encounters it.

Handling newlines like this is made simpler by doing it in one place
only.  The function @code{handle_newline} takes care of all newline
characters, and @code{skip_escaped_newlines} takes care of arbitrarily
long sequences of escaped newlines, deferring to @code{handle_newline}
to handle the newlines themselves.

The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines.  Trigraphs are processed before
any interpretation of the meaning of a character is made, and unfortunately
there is a trigraph representation for a backslash, so it is possible for
the trigraph @samp{??/} to introduce an escaped newline.

Escaped newlines are tedious because theoretically they can occur
anywhere---between the @samp{+} and @samp{=} of the @samp{+=} token,
within the characters of an identifier, and even between the @samp{*}
and @samp{/} that terminate a comment.  Moreover, you cannot be sure
there is just one---there might be an arbitrarily long sequence of them.

So, for example, the routine that lexes a number, @code{parse_number},
cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the @samp{\}
introducing an escaped newline, or the @samp{?} introducing the trigraph
sequence that represents the @samp{\} of an escaped newline.  If it
encounters a @samp{?} or @samp{\}, it calls @code{skip_escaped_newlines}
to skip over any potential escaped newlines before checking whether the
number has been finished.

Similarly, code in the main body of @code{_cpp_lex_direct} cannot simply
check for a @samp{=} after a @samp{+} character to determine whether it
has a @samp{+=} token; it needs to be prepared for an escaped newline of
some sort.  Such cases use the function @code{get_effective_char}, which
returns the first character after any intervening escaped newlines.
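
The idea behind these helpers can be sketched as follows.  This is a
simplified illustration with invented names, not the cpplib
implementation: it treats only @samp{\n} as a new line, and it ignores
the whitespace-tolerant extension and the warnings described earlier.

```c
#include <stddef.h>

/* If P points at an escaped new line, return the number of characters
   it occupies; otherwise return 0.  The backslash may be spelled
   directly or as the three-character trigraph.  */
static size_t escaped_newline_length(const char *p)
{
  if (p[0] == '\\' && p[1] == '\n')
    return 2;
  if (p[0] == '?' && p[1] == '?' && p[2] == '/' && p[3] == '\n')
    return 4;
  return 0;
}

/* Skip any run of escaped new lines at *PP, then return the next
   character and advance past it.  At the end of the input, return
   '\0' without advancing.  */
char toy_effective_char(const char **pp)
{
  size_t len;

  while ((len = escaped_newline_length(*pp)) != 0)
    *pp += len;
  if (**pp == '\0')
    return '\0';
  return *(*pp)++;
}

/* Return nonzero if the text at P spells "+=", allowing escaped new
   lines before or between the two characters.  */
int toy_is_plus_equals(const char *p)
{
  if (toy_effective_char(&p) != '+')
    return 0;
  return toy_effective_char(&p) == '=';
}
```

With this sketch, a @samp{+} followed by two escaped new lines and then
@samp{=} is still recognized as @samp{+=}, whereas @samp{+ =} with an
ordinary space is not.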

The lexer needs to keep track of the correct column position, including
counting tabs as specified by the @option{-ftabstop=} option.  This
should be done even within C-style comments; they can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.
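
A minimal sketch of the column arithmetic, assuming 1-based columns and
a tab stop width of at least 1, might look like this.  The function
name and signature are hypothetical and are not part of cpplib.

```c
/* Return the 1-based column of the character at byte offset OFFSET in
   LINE, where each tab advances to the next multiple of TABSTOP
   columns.  TABSTOP must be at least 1.  */
unsigned column_of(const char *line, unsigned offset, unsigned tabstop)
{
  unsigned col = 1;
  unsigned i;

  for (i = 0; i < offset && line[i] != '\0'; i++) {
    if (line[i] == '\t')
      col += tabstop - (col - 1) % tabstop;   /* jump to the next tab stop */
    else
      col++;
  }
  return col;
}
```

Every character before the offset is counted, including those inside a
comment, which is why text following a comment on the same line still
gets the right column.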

@anchor{Invalid identifiers}
Some identifiers, such as @code{__VA_ARGS__} and poisoned identifiers,
may be invalid and require a diagnostic.  However, if they appear in a
macro expansion we don't want to complain with each use of the macro.
It is therefore best to catch them during the lexing stage, in
@code{parse_identifier}.  In both cases, whether a diagnostic is needed
or not is dependent upon the lexer's state.  For example, we don't want
to issue a diagnostic for re-poisoning a poisoned identifier, or for
using @code{__VA_ARGS__} in the expansion of a variable-argument macro.
Therefore @code{parse_identifier} makes use of state flags to determine
whether a diagnostic is appropriate.  Since we change state on a
per-token basis, and don't lex whole lines at a time, this is not a
problem.

Another place where state flags are used to change behavior is whilst
lexing header names.  Normally, a @samp{<} is lexed as a token by
itself.  After a @code{#include} directive, though, it opens a header
name, and everything up to the nearest @samp{>} character is lexed as a
single token.  Note that we don't allow the terminators of header names
to be escaped; the first @samp{"} or @samp{>} terminates the header name.

Interpretation of some character sequences depends upon whether we are
lexing C, C++ or Objective-C, and on the revision of the standard in
force.  For example, @samp{::} is a single token in C++, but in C it is
two separate @samp{:} tokens and almost certainly a syntax error.  Such
cases are handled by @code{_cpp_lex_direct} based upon command-line
flags stored in the @code{cpp_options} structure.
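
The @samp{::} case can be sketched in a few lines.  The names below are
invented for the illustration, and the plain @code{cplusplus} argument
merely stands in for whatever option the real lexer consults.

```c
#include <stddef.h>

enum toy_type { TOY_COLON, TOY_SCOPE };

/* P points at a ':' character.  Store the type of the token that
   starts there in *TYPE and return the number of characters consumed.
   CPLUSPLUS is nonzero when lexing C++.  */
size_t toy_lex_colon(const char *p, int cplusplus, enum toy_type *type)
{
  if (cplusplus && p[1] == ':') {
    *type = TOY_SCOPE;   /* "::" is one token in C++ */
    return 2;
  }
  *type = TOY_COLON;     /* otherwise a lone ':' */
  return 1;
}
```

Called twice on @samp{::} with @code{cplusplus} set to zero, the sketch
yields two separate @samp{:} tokens, matching the C behavior described
above.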

Once a token has been lexed, it leads an independent existence.  The
spelling of numbers, identifiers and strings is copied to permanent
storage from the original input buffer, so a token remains valid and
correct even if its source buffer is freed with @code{_cpp_pop_buffer}.
The storage holding the spellings of such tokens remains until the
client program calls @code{cpp_destroy}, probably at the end of the
translation unit.

@anchor{Lexing a line}
@section Lexing a line
@cindex token run

When the preprocessor was changed to return pointers to tokens, one
feature I wanted was some sort of guarantee regarding how long a
returned pointer remains valid.  This is important to the stand-alone
preprocessor, the future direction of the C family front ends, and even
to cpplib itself internally.

Occasionally the preprocessor wants to be able to peek ahead in the
token stream.  For example, after the name of a function-like macro, it
wants to check the next token to see if it is an opening parenthesis.
Another example is that, after reading the first few tokens of a
@code{#pragma} directive and not recognizing it as a registered pragma,
it wants to backtrack and allow the user-defined handler for unknown
pragmas to access the full @code{#pragma} token stream.  The stand-alone
preprocessor wants to be able to test the current token with the
previous one to see if a space needs to be inserted to preserve their
separate tokenization upon re-lexing (paste avoidance), so it needs to
be sure the pointer to the previous token is still valid.  The
recursive-descent C++ parser wants to be able to perform tentative
parsing arbitrarily far ahead in the token stream, and then to be able
to jump back to a prior position in that stream if necessary.

The rule I chose, which is fairly natural, is to arrange that the
preprocessor lex all tokens on a line consecutively into a token buffer,
which I call a @dfn{token run}, and when meeting an unescaped new line
(newlines within comments do not count either), to start lexing back at
the beginning of the run.  Note that we do @emph{not} lex a line of
tokens at once; if we did that @code{parse_identifier} would not have
state flags available to warn about invalid identifiers (@pxref{Invalid
identifiers}).

In other words, accessing tokens that appeared earlier in the current
line is valid, but since each logical line overwrites the tokens of the
previous line, tokens from prior lines are unavailable.  In particular,
since a directive only occupies a single logical line, this means that
the directive handlers like the @code{#pragma} handler can jump around
in the directive's tokens if necessary.

Two issues remain: what about tokens that arise from macro expansions,
and what happens when we have a long line that overflows the token run?

Since we promise clients that we preserve the validity of pointers that
we have already returned for tokens that appeared earlier in the line,
we cannot reallocate the run.  Instead, on overflow it is expanded by
chaining a new token run on to the end of the existing one.
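
The chaining scheme can be sketched as below.  The structure and
function names are hypothetical, the token type is reduced to a single
integer, and the sketch leaves out everything else the real lexer
tracks; it only shows why earlier pointers survive an overflow.

```c
#include <stdlib.h>

struct toy_tok { int value; };

struct toy_run {
  struct toy_tok *base;    /* first slot of this run */
  struct toy_tok *limit;   /* one past the last slot */
  struct toy_run *next;    /* run chained on overflow, or NULL */
};

struct toy_lexer {
  struct toy_run *run;     /* run currently being filled */
  struct toy_tok *cur;     /* next free slot in RUN */
};

/* Allocate a run of COUNT token slots; return NULL on failure.  */
struct toy_run *toy_new_run(size_t count)
{
  struct toy_run *run = malloc(sizeof *run);

  if (run == NULL)
    return NULL;
  run->base = malloc(count * sizeof *run->base);
  if (run->base == NULL) {
    free(run);
    return NULL;
  }
  run->limit = run->base + count;
  run->next = NULL;
  return run;
}

/* Return a fresh token slot, or NULL on allocation failure.  On
   overflow, move to the next run in the chain, creating it if needed.
   The existing run is never reallocated, so pointers into it stay
   valid.  */
struct toy_tok *toy_next_slot(struct toy_lexer *lx, size_t run_size)
{
  if (lx->cur == lx->run->limit) {
    if (lx->run->next == NULL) {
      lx->run->next = toy_new_run(run_size);
      if (lx->run->next == NULL)
        return NULL;
    }
    lx->run = lx->run->next;
    lx->cur = lx->run->base;
  }
  return lx->cur++;
}

/* At an unescaped new line, start again at the beginning of FIRST.  */
void toy_rewind(struct toy_lexer *lx, struct toy_run *first)
{
  lx->run = first;
  lx->cur = first->base;
}
```

Chained runs are kept and reused after a rewind, so a long line pays
for the extra allocation only once.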

The tokens forming a macro's replacement list are collected by the
@code{#define} handler, and placed in storage that is only freed by
@code{cpp_destroy}.  So if a macro is expanded in the line of tokens,
the pointers to the tokens of its expansion that are returned will always
remain valid.  However, macros are a little trickier than that, since
they give rise to three sources of fresh tokens.  They are the built-in
macros like @code{__LINE__}, and the @samp{#} and @samp{##} operators
for stringification and token pasting.  I handled this by allocating
space for these tokens from the lexer's token run chain.  This means
they automatically receive the same lifetime guarantees as lexed tokens,
and we don't need to concern ourselves with freeing them.

Lexing into a line of tokens solves some of the token memory management
issues, but not all.  The opening parenthesis after a function-like
macro name might lie on a different line, and the front ends definitely
want the ability to look ahead past the end of the current line.  So
cpplib only moves back to the start of the token run at the end of a
line if the variable @code{keep_tokens} is zero.  Line-buffering is
quite natural for the preprocessor, and as a result the only time cpplib
needs to increment this variable is whilst looking for the opening
parenthesis to, and reading the arguments of, a function-like macro.  In
the near future cpplib will export an interface to increment and
decrement this variable, so that clients can share full control over the
lifetime of token pointers too.

The routine @code{_cpp_lex_token} handles moving to new token runs,
calling @code{_cpp_lex_direct} to lex new tokens, or returning
previously-lexed tokens if we stepped back in the token stream.  It also
checks each token for the @code{BOL} flag, which might indicate a
directive that needs to be handled, or require a start-of-line call-back
to be made.  @code{_cpp_lex_token} also handles skipping over tokens in
failed conditional blocks, and invalidates the control macro of the
multiple-include optimization if a token was successfully lexed outside
a directive.  In other words, its callers do not need to concern
themselves with such issues.

@node Hash Nodes
@unnumbered Hash Nodes
@cindex hash table
@cindex identifiers
@cindex macros
@cindex assertions
@cindex named operators

When cpplib encounters an ``identifier'', it generates a hash code for
it and stores it in the hash table.  By ``identifier'' we mean tokens
with type @code{CPP_NAME}; this includes identifiers in the usual C
sense, as well as keywords, directive names, macro names and so on.  For
example, all of @code{pragma}, @code{int}, @code{foo} and
@code{__GNUC__} are identifiers and hashed when lexed.

Each node in the hash table contains various information about the
identifier it represents, for example its length and type.  At any one
time, each identifier falls into exactly one of three categories:

@itemize @bullet
@item Macros

These have been declared to be macros, either on the command line or
with @code{#define}.  A few, such as @code{__TIME__}, are built-ins
entered in the hash table during initialization.  The hash node for a
normal macro points to a structure with more information about the
macro, such as whether it is function-like, how many arguments it takes,
and its expansion.  Built-in macros are flagged as special, and their
nodes instead contain an enum indicating which of the various built-in
macros each one is.

@item Assertions

Assertions are in a separate namespace to macros.  To enforce this, cpp
actually prepends a @code{#} character to the assertion's name before
hashing it and entering it in the hash table.  An assertion's node
points to a chain of answers to that assertion.

@item Void

Everything else falls into this category---an identifier that is not
currently a macro, or a macro that has since been undefined with
@code{#undef}.

When preprocessing C++, this category also includes the named operators,
such as @code{xor}.  In expressions these behave like the operators they
represent, but in contexts where the spelling of a token matters they
are spelt differently.  This spelling distinction is relevant when they
are operands of the stringizing and pasting macro operators @code{#} and
@code{##}.  Named operator hash nodes are flagged, both to catch the
spelling distinction and to prevent them from being defined as macros.
@end itemize

All occurrences of the same identifier share the same hash node.  Since
each identifier token, after lexing, contains a pointer to its hash
node, this is used to provide rapid lookup of various information.  For
example, when parsing a @code{#define} directive, cpplib flags each
parameter's identifier hash node with the index of that parameter.  This
makes duplicated parameter checking an O(1) operation for each
parameter.  Similarly, for each identifier in the macro's expansion,
lookup to see if it is a parameter, and which parameter it is, is also
an O(1) operation.  Further, each directive name, such as @code{endif},
has an associated directive enum stored in its hash node, so that
directive lookup is also O(1).
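
The benefit of sharing one node per identifier can be shown with a
small sketch.  Everything here is invented for the illustration (a
fixed-size chained table, a trivial hash function, and the names
@code{toy_node}, @code{toy_lookup} and @code{toy_check_params}); it is
not how cpplib lays out its nodes.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_BUCKETS 64

struct toy_node {
  char *name;              /* spelling of the identifier */
  int arg_index;           /* 1-based parameter index, or 0 if none */
  struct toy_node *next;   /* next node in the same bucket */
};

static struct toy_node *toy_table[TOY_BUCKETS];

static unsigned toy_hash(const char *s)
{
  unsigned h = 0;

  while (*s != '\0')
    h = h * 31u + (unsigned char) *s++;
  return h % TOY_BUCKETS;
}

/* Return the unique node for NAME, creating it on first use.
   Returns NULL on allocation failure.  */
struct toy_node *toy_lookup(const char *name)
{
  unsigned h = toy_hash(name);
  struct toy_node *n;

  for (n = toy_table[h]; n != NULL; n = n->next)
    if (strcmp(n->name, name) == 0)
      return n;
  n = malloc(sizeof *n);
  if (n == NULL)
    return NULL;
  n->name = malloc(strlen(name) + 1);
  if (n->name == NULL) {
    free(n);
    return NULL;
  }
  strcpy(n->name, name);
  n->arg_index = 0;
  n->next = toy_table[h];
  toy_table[h] = n;
  return n;
}

/* Mark PARAMS[0..COUNT-1] as macro parameters.  Each duplicate check
   is a single field test on the shared node.  Returns 0 on success or
   -1 if a name appears twice.  The marks are cleared before returning.  */
int toy_check_params(struct toy_node **params, int count)
{
  int i, j, result = 0;

  for (i = 0; i < count; i++) {
    if (params[i]->arg_index != 0) {
      result = -1;           /* already marked: duplicate parameter */
      break;
    }
    params[i]->arg_index = i + 1;
  }
  for (j = 0; j < i; j++)
    params[j]->arg_index = 0;
  return result;
}
```

A real implementation would keep the marks in place while the macro's
expansion is being read, so that each identifier there can be matched to
its parameter by the same field test; this sketch clears them at once to
stay short.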

@node Macro Expansion
@unnumbered Macro Expansion Algorithm
@cindex macro expansion

Macro expansion is a tricky operation, fraught with nasty corner cases
and situations that render what you thought was a nifty way to
optimize the preprocessor's expansion algorithm wrong in quite subtle
ways.

I strongly recommend you have a good grasp of how the C and C++
standards require macros to be expanded before diving into this
section, let alone the code!  If you don't have a clear mental
picture of how things like nested macro expansion, stringification and
token pasting are supposed to work, damage to your sanity can quickly
result.

@section Internal representation of macros
@cindex macro representation (internal)

The preprocessor stores macro expansions in tokenized form.  This
saves repeated lexing passes during expansion, at the cost of a small
increase in memory consumption on average.  The tokens are stored
contiguously in memory, so a pointer to the first one and a token
count are all you need to get the replacement list of a macro.

If the macro is a function-like macro the preprocessor also stores its
parameters, in the form of an ordered list of pointers to the hash
table entry of each parameter's identifier.  Further, in the macro's
stored expansion each occurrence of a parameter is replaced with a
special token of type @code{CPP_MACRO_ARG}.  Each such token holds the
index of the parameter it represents in the parameter list, which
allows rapid replacement of parameters with their arguments during
expansion.  Despite this optimization it is still necessary to store
the original parameters to the macro, both for dumping with e.g.,
@option{-dD}, and to warn about non-trivial macro redefinitions when
the parameter names have changed.
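
The following sketch shows how an indexed placeholder token makes
parameter replacement a simple array lookup.  The types and names are
invented for the example, each argument is restricted to a single
token, and indices are 0-based here purely for convenience; none of
this is meant to mirror the real structures.

```c
#include <stddef.h>

enum toy_kind { TOY_PLAIN, TOY_MACRO_ARG };

struct toy_token {
  enum toy_kind kind;
  const char *spelling;   /* used when kind is TOY_PLAIN */
  size_t arg_index;       /* used when kind is TOY_MACRO_ARG */
};

struct toy_macro {
  const struct toy_token *exp;   /* first token of the replacement list */
  size_t count;                  /* number of tokens in the list */
  size_t paramc;                 /* number of parameters */
};

/* Write the spellings of M's replacement list to OUT, at most MAX of
   them, substituting ARGS[i] for each placeholder whose index is i.
   Returns the number of spellings written.  */
size_t toy_substitute(const struct toy_macro *m, const char *const *args,
                      const char **out, size_t max)
{
  size_t i, n = 0;

  for (i = 0; i < m->count && n < max; i++) {
    const struct toy_token *t = &m->exp[i];
    out[n++] = (t->kind == TOY_MACRO_ARG) ? args[t->arg_index]
                                          : t->spelling;
  }
  return n;
}
```

A macro defined as @code{#define ADD(a, b) a + b} would be stored in
this scheme as three tokens: placeholder 0, a plain @samp{+}, and
placeholder 1.  No name comparison is needed at expansion time.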

@section Macro expansion overview
The preprocessor maintains a @dfn{context stack}, implemented as a
linked list of @code{cpp_context} structures, which together represent
the macro expansion state at any one time.  The @code{struct
cpp_reader} member variable @code{context} points to the current top
of this stack.  The top normally holds the unexpanded replacement list
of the innermost macro under expansion, except when cpplib is about to
pre-expand an argument, in which case it holds that argument's
unexpanded tokens.

When there are no macros under expansion, cpplib is in @dfn{base
context}.  All contexts other than the base context contain a
contiguous list of tokens delimited by a starting and ending token.
When not in base context, cpplib obtains the next token from the list
of the top context.  If there are no tokens left in the list, it pops
that context off the stack, and subsequent ones if necessary, until an
unexhausted context is found or it returns to base context.  In base
context, cpplib reads tokens directly from the lexer.
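
The token-fetching rule just described can be sketched as follows.
This is an illustration under simplifying assumptions, not the cpplib
code: tokens are plain nonzero integers, the lexer is a zero-terminated
array, and all names are hypothetical.

```c
#include <stddef.h>

struct toy_context {
  const int *first;           /* next token to return */
  const int *last;            /* one past the final token */
  struct toy_context *prev;   /* context below; NULL marks the base */
};

struct toy_reader {
  struct toy_context *context;   /* current top of the stack */
  const int *lexer;              /* stand-in lexer: 0-terminated array */
};

/* Push a context holding TOKENS[0..COUNT-1] on top of the stack.  */
void toy_push(struct toy_reader *r, struct toy_context *c,
              const int *tokens, size_t count)
{
  c->first = tokens;
  c->last = tokens + count;
  c->prev = r->context;
  r->context = c;
}

/* Return the next token.  Exhausted contexts are popped until one
   with tokens left is found; in base context the token comes from the
   lexer.  Returns 0 at the end of the input.  */
int toy_get_token(struct toy_reader *r)
{
  while (r->context->prev != NULL) {
    struct toy_context *c = r->context;
    if (c->first != c->last)
      return *c->first++;
    r->context = c->prev;   /* exhausted: pop it */
  }
  if (*r->lexer == 0)
    return 0;
  return *r->lexer++;
}
```

With an inner context pushed in the middle of an outer one, the sketch
returns the inner tokens first, then the rest of the outer tokens, and
only then resumes reading from the lexer.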
90075Sobrien
90075SobrienIf it encounters an identifier that is both a macro and enabled for
90075Sobrienexpansion, cpplib prepares to push a new context for that macro on the
90075Sobrienstack by calling the routine @code{enter_macro_context}.  When this
90075Sobrienroutine returns, the new context will contain the unexpanded tokens of
90075Sobrienthe replacement list of that macro.  In the case of function-like
90075Sobrienmacros, @code{enter_macro_context} also replaces any parameters in the
90075Sobrienreplacement list, stored as @code{CPP_MACRO_ARG} tokens, with the
90075Sobrienappropriate macro argument.  If the standard requires that the
90075Sobrienparameter be replaced with its expanded argument, the argument will
90075Sobrienhave been fully macro expanded first.
90075Sobrien
90075Sobrien@code{enter_macro_context} also handles special macros like
90075Sobrien@code{__LINE__}.  Although these macros expand to a single token which
90075Sobriencannot contain any further macros, for reasons of token spacing
90075Sobrien(@pxref{Token Spacing}) and simplicity of implementation, cpplib
90075Sobrienhandles these special macros by pushing a context containing just that
90075Sobrienone token.
90075Sobrien
90075SobrienThe final thing that @code{enter_macro_context} does before returning
90075Sobrienis to mark the macro disabled for expansion (except for special macros
90075Sobrienlike @code{__TIME__}).  The macro is re-enabled when its context is
90075Sobrienlater popped from the context stack, as described above.  This strict
90075Sobrienordering ensures that a macro is disabled whilst its expansion is
90075Sobrienbeing scanned, but that it is @emph{not} disabled whilst any arguments
90075Sobriento it are being expanded.
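
The push, fetch and pop behavior described in this section can be
sketched in a few lines of C.  This is only an illustration under
simplifying assumptions: the types @code{struct macro}, @code{struct
context} and @code{struct reader} and both functions are hypothetical,
tokens are plain strings, and base context is modeled as an empty
stack rather than as a context of its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified types; cpplib's own structures carry far
   more state than this.  */
struct macro {
  const char *name;
  bool disabled;            /* set while the macro's context is stacked */
};

struct context {
  struct context *prev;     /* next context down the stack */
  const char *const *cur;   /* next token to hand out */
  const char *const *end;   /* one past the last token */
  struct macro *macro;      /* macro being expanded, or NULL */
};

struct reader {
  struct context *context;  /* top of stack; NULL models base context */
};

/* Push M's replacement list, then mark M disabled, in that order.  */
void push_macro_context(struct reader *r, struct context *c,
                        struct macro *m,
                        const char *const *tokens, size_t count)
{
  assert(!m->disabled);     /* a disabled macro must not be entered */
  c->prev = r->context;
  c->cur = tokens;
  c->end = tokens + count;
  c->macro = m;
  r->context = c;
  m->disabled = true;
}

/* Return the next token from the context stack, or NULL once back in
   base context, where a real implementation would call the lexer.
   An exhausted context is popped only when a further token is wanted,
   so its macro stays disabled while its last token is examined.  */
const char *next_context_token(struct reader *r)
{
  while (r->context != NULL) {
    struct context *c = r->context;
    if (c->cur < c->end)
      return *c->cur++;
    if (c->macro != NULL)
      c->macro->disabled = false;   /* re-enable on pop */
    r->context = c->prev;
  }
  return NULL;
}
```

After pushing a two-token replacement list, the macro remains disabled
until a third token is requested; only that request pops the context
and re-enables the macro.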

@section Scanning the replacement list for macros to expand
The C standard states that, after any parameters have been replaced
with their possibly-expanded arguments, the replacement list is
scanned for nested macros.  Further, any identifiers in the
replacement list that are not expanded during this scan are never
again eligible for expansion in the future, if the reason they were
not expanded is that the macro in question was disabled.

Clearly this latter condition can only apply to tokens resulting from
argument pre-expansion.  Other tokens never have an opportunity to be
re-tested for expansion.  It is possible for identifiers that are
function-like macros to not expand initially but to expand during a
later scan.  This occurs when the identifier is the last token of an
argument (and therefore originally followed by a comma or a closing
parenthesis in its macro's argument list), and when it replaces its
parameter in the macro's replacement list, the subsequent token
happens to be an opening parenthesis (itself possibly the first token
of an argument).

It is important to note that when cpplib reads the last token of a
given context, that context still remains on the stack.  Only when
looking for the @emph{next} token do we pop it off the stack and drop
to a lower context.  This makes backing up by one token easy, but more
importantly ensures that the macro corresponding to the current
context is still disabled when we are considering the last token of
its replacement list for expansion (or indeed expanding it).  As an
example, which illustrates many of the points above, consider

@smallexample
#define foo(x) bar x
foo(foo) (2)
@end smallexample

@noindent which fully expands to @samp{bar foo (2)}.  During pre-expansion
of the argument, @samp{foo} does not expand even though the macro is
enabled, since it has no following parenthesis [pre-expansion of an
argument only uses tokens from that argument; it cannot take tokens
from whatever follows the macro invocation].  This still leaves the
argument token @samp{foo} eligible for future expansion.  Then, when
re-scanning after argument replacement, the token @samp{foo} is
rejected for expansion, and marked ineligible for future expansion,
since the macro is now disabled.  It is disabled because the
replacement list @samp{bar foo} of the macro is still on the context
stack.

If instead the algorithm looked for an opening parenthesis first and
then tested whether the macro was disabled, it would be subtly wrong.
In the example above, the replacement list of @samp{foo} would be
popped in the process of finding the parenthesis, re-enabling
@samp{foo} and expanding it a second time.
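
The example above can be observed from C by capturing the expansion as
a string.  The helper macros @code{STR} and @code{XSTR} and the
function names are introduced here purely for illustration.  The check
deliberately avoids depending on exact spacing and only verifies that
the outer @code{foo} was expanded while the inner one was left alone.

```c
#include <assert.h>
#include <string.h>

#define foo(x) bar x
#define STR(x) #x
#define XSTR(x) STR(x)  /* expands its argument before stringifying it */

/* The expansion of "foo(foo) (2)" captured as a string literal.  */
const char *expanded_text(void)
{
  return XSTR(foo(foo) (2));
}

/* Verify the behavior the text describes.  */
void check_expansion(void)
{
  const char *text = expanded_text();
  assert(strstr(text, "bar") == text);  /* outer foo was expanded */
  assert(strstr(text, "foo") != NULL);  /* inner foo was not expanded */
}
```

Had the inner @code{foo} been expanded a second time, the string would
contain no @samp{foo} at all.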

@section Looking for a function-like macro's opening parenthesis
Function-like macros only expand when immediately followed by an
opening parenthesis.  To check this, cpplib needs to temporarily
disable macros and read the next token.  Unfortunately, because of
spacing issues (@pxref{Token Spacing}), there can be fake padding
tokens in-between, and if the next real token is not a parenthesis
cpplib needs to be able to back up that one token as well as retain
the information in any intervening padding tokens.

Backing up more than one token when macros are involved is not
permitted by cpplib, because in general it might involve issues like
restoring popped contexts onto the context stack, which are too hard.
Instead, searching for the parenthesis is handled by a special
function, @code{funlike_invocation_p}, which remembers padding
information as it reads tokens.  If the next real token is not an
opening parenthesis, it backs up that one token, and then pushes an
extra context just containing the padding information if necessary.
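
The rule that a function-like macro name expands only when an opening
parenthesis follows is visible from ordinary C code.  In this small
example (all names are made up for illustration) a function and a
function-like macro share the name @code{twice}; which one is used
depends solely on the token that follows the name.

```c
#include <assert.h>

/* An ordinary function ...  */
int twice(int x) { return 2 * x; }

/* ... shadowed by a function-like macro of the same name.  */
#define twice(x) ((x) + 100)

/* Name followed by "(": the macro is expanded.  */
int call_macro(void) { return twice(1); }

/* Whitespace before "(" does not prevent expansion.  */
int call_spaced(void) { return twice (1); }

/* Name followed by ")": no expansion, so the function is called.  */
int call_function(void) { return (twice)(1); }

/* Name followed by ";": no expansion, so this is the function.  */
int (*const twice_ptr)(int) = twice;

void check_calls(void)
{
  assert(call_macro() == 101);
  assert(call_spaced() == 101);
  assert(call_function() == 2);
  assert(twice_ptr(3) == 6);
}
```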

@section Marking tokens ineligible for future expansion
As discussed above, cpplib needs a way of marking tokens as
unexpandable.  Since the tokens cpplib handles are read-only once they
have been lexed, it instead makes a copy of the token and adds the
flag @code{NO_EXPAND} to the copy.

For efficiency and to simplify memory management by avoiding having to
remember to free these tokens, they are allocated as temporary tokens
from the lexer's current token run (@pxref{Lexing a line}) using the
function @code{_cpp_temp_token}.  The tokens are then re-used once the
current line of tokens has been read in.

This might sound unsafe.  However, token runs are not re-used at the
end of a line if it happens to be in the middle of a macro argument
list, and cpplib only wants to back up more than one lexer token in
situations where no macro expansion is involved, so the optimization
is safe.

@node Token Spacing
@unnumbered Token Spacing
@cindex paste avoidance
@cindex spacing
@cindex token spacing

First, consider an issue that only concerns the stand-alone
preprocessor: there needs to be a guarantee that re-reading its preprocessed
output results in an identical token stream.  Without taking special
measures, this might not be the case because of macro substitution.
For example:

@smallexample
#define PLUS +
#define EMPTY
#define f(x) =x=
+PLUS -EMPTY- PLUS+ f(=)
        @expansion{} + + - - + + = = =
@emph{not}
        @expansion{} ++ -- ++ ===
@end smallexample

One solution would be to simply insert a space between all adjacent
tokens.  However, we would like to keep space insertion to a minimum,
both for aesthetic reasons and because it causes problems for people who
still try to abuse the preprocessor for things like Fortran source and
Makefiles.

For now, just notice that when tokens are added to (or removed from, as
shown by the @code{EMPTY} example) the original lexed token stream, we
need to check for accidental token pasting.  We call this @dfn{paste
avoidance}.  Token addition and removal can only occur because of macro
expansion, but accidental pasting can occur in many places: both before
and after each macro replacement, each argument replacement, and
additionally each token created by the @samp{#} and @samp{##} operators.
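
The following C fragment, with function names chosen only for this
illustration, relies on the token boundaries from the example above.
Compiling it directly merely shows that macro expansion works on
tokens.  Paste avoidance matters when the file is first run through the
stand-alone preprocessor: its textual output must still lex as
@samp{+ +} and @samp{- -}, because @samp{++} or @samp{--} in these
positions would be a syntax error.

```c
#include <assert.h>

#define PLUS +
#define EMPTY

/* "+PLUS" is the two tokens "+" "+": a binary plus and a unary plus.  */
int plus_unary_plus(int a, int b) { return a +PLUS b; }

/* "-EMPTY-" is the two tokens "-" "-": subtraction of a negated value.  */
int minus_negated(int a, int b) { return a -EMPTY- b; }

void check_token_boundaries(void)
{
  assert(plus_unary_plus(2, 3) == 5);   /* 2 + (+3) */
  assert(minus_negated(3, 2) == 5);     /* 3 - (-2) */
}
```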

Consider first how the preprocessor normally gets whitespace output
correct.  The @code{cpp_token} structure contains a flags byte, and one
of those flags is @code{PREV_WHITE}.  This is flagged by the lexer, and
indicates that the token was preceded by whitespace of some form other
than a new line.  The stand-alone preprocessor can use this flag to
decide whether to insert a space between tokens in the output.

Now consider the result of the following macro expansion:

@smallexample
#define add(x, y, z) x + y +z;
sum = add (1,2, 3);
        @expansion{} sum = 1 + 2 +3;
@end smallexample

The interesting thing here is that the tokens @samp{1} and @samp{2} are
output with a preceding space, and @samp{3} is output without a
preceding space, but when lexed none of these tokens had that property.
Careful consideration reveals that @samp{1} gets its preceding
whitespace from the space preceding @samp{add} in the macro invocation,
@emph{not} from the replacement list.  @samp{2} gets its whitespace from
the space preceding the parameter @samp{y} in the macro replacement
list, and @samp{3} has no preceding space because parameter @samp{z} has
none in the replacement list.

Once lexed, tokens are effectively fixed and cannot be altered, since
pointers to them might be held in many places, in particular by
in-progress macro expansions.  So instead of modifying the tokens
above, the preprocessor inserts a special token, which I call a
@dfn{padding token}, into the token stream to indicate that spacing of
the subsequent token is special.  The preprocessor inserts padding
tokens in front of every macro expansion and expanded macro argument.
These point to a @dfn{source token} from which the subsequent real token
should inherit its spacing.  In the above example, the source tokens are
@samp{add} in the macro invocation, and @samp{y} and @samp{z} in the
macro replacement list, respectively.

It is quite easy to get multiple padding tokens in a row, for example if
a macro's first replacement token expands straight into another macro.

@smallexample
#define foo bar
#define bar baz
[foo]
        @expansion{} [baz]
@end smallexample

Here, two padding tokens are generated, with sources the @samp{foo}
token between the brackets and the @samp{bar} token from @samp{foo}'s
replacement list, respectively.  Clearly the first padding token is the
one to use, so the output code should contain a rule that the first
padding token in a sequence is the one that matters.

But what happens when a macro expansion is left?  Adjusting the above
example slightly:

@smallexample
#define foo bar
#define bar EMPTY baz
#define EMPTY
[foo] EMPTY;
        @expansion{} [ baz] ;
@end smallexample

As shown, now there should be a space before @samp{baz} and the
semicolon in the output.

The rule we decided on above fails for @samp{baz}: we generate three
padding tokens, one per macro invocation, before the token @samp{baz}.
We would then have it take its spacing from the first of these, which
carries source token @samp{foo} with no leading space.

It is vital that cpplib get spacing correct in these examples since any
of these macro expansions could be stringified, where spacing matters.

So, this demonstrates that not just entering macro and argument
expansions, but leaving them requires special handling too.  I made
cpplib insert a padding token with a @code{NULL} source token when
leaving macro expansions, as well as after each replaced argument in a
macro's replacement list.  It also inserts appropriate padding tokens on
either side of tokens created by the @samp{#} and @samp{##} operators.
I expanded the rule so that, if we see a padding token with a
@code{NULL} source token, @emph{and} the source token chosen so far has
no leading space, then we behave as if we have seen no padding tokens at
all.  A quick check shows this rule will then get the above example
correct as well.
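
One way to read the padding rule is as a small fold over a run of
padding tokens.  The sketch below is an interpretation of the rule as
worded in this section, not a copy of cpplib's output code;
@code{struct token}, @code{note_padding} and @code{space_before} are
hypothetical names, and the token is reduced to the fields the rule
needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, pared-down token.  */
struct token {
  bool is_padding;             /* true for a padding token */
  bool prev_white;             /* PREV_WHITE: whitespace preceded it */
  const struct token *source;  /* padding only: spacing donor, or NULL */
};

/* Fold one padding token into SOURCE, the spacing source chosen so
   far.  The first padding token in a run wins, unless the source
   chosen so far has no leading space and this padding token has a
   NULL source; then we start over as if no padding had been seen.  */
const struct token *note_padding(const struct token *source,
                                 const struct token *pad)
{
  assert(pad->is_padding);
  if (source == NULL || (!source->prev_white && pad->source == NULL))
    return pad->source;
  return source;
}

/* Should a space be printed before the real token TOK, given the
   SOURCE left over from the preceding run of padding tokens?  */
bool space_before(const struct token *source, const struct token *tok)
{
  assert(!tok->is_padding);
  if (source == NULL)
    source = tok;              /* fall back on the token's own flag */
  return source->prev_white;
}
```

For @samp{[foo] EMPTY;} the padding run before @samp{baz} carries the
sources @samp{foo}, @samp{bar}, @samp{EMPTY} and then @code{NULL}.  The
final @code{NULL} resets the choice, so @samp{baz} falls back on its own
leading space, which matches the expected output.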

Now a relationship with paste avoidance is apparent: we have to be
careful about paste avoidance in exactly the same locations we have
padding tokens in order to get white space correct.  This makes
implementation of paste avoidance easy: wherever the stand-alone
preprocessor is fixing up spacing because of padding tokens, and it
turns out that no space is needed, it has to take the extra step to
check that a space is not needed after all to avoid an accidental paste.
The function @code{cpp_avoid_paste} advises whether a space is required
between two consecutive tokens.  To avoid excessive spacing, it tries
hard to only require a space if one is likely to be necessary, but for
reasons of efficiency it is slightly conservative and might recommend a
space where one is not strictly needed.
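
To give a feel for what such a conservative check can look like, here
is a sketch that works on token spellings alone.  It is not
@code{cpp_avoid_paste}: the function @code{needs_space} and its rules
are assumptions made for this illustration, the rule list is not
exhaustive (digraphs, for instance, are ignored), and it sometimes asks
for a space that is not strictly needed.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

static bool ident_char(unsigned char c)
{
  return isalnum(c) || c == '_';
}

/* Hypothetical helper: would printing spelling A directly followed by
   spelling B risk being lexed as something other than A then B?  */
bool needs_space(const char *a, const char *b)
{
  assert(a != NULL && b != NULL);
  size_t alen = strlen(a);
  if (alen == 0 || b[0] == '\0')
    return false;

  unsigned char last = (unsigned char) a[alen - 1];
  unsigned char first = (unsigned char) b[0];

  if (ident_char(last) && ident_char(first))
    return true;                 /* "x" "1" would read as "x1" */
  if (last == first && strchr("+-=&|<>:#/", first) != NULL)
    return true;                 /* "+" "+" would read as "++" */
  if (first == '=' && strchr("+-*/%&|^<>!", last) != NULL)
    return true;                 /* "+" "=" would read as "+=" */
  if (last == '-' && first == '>')
    return true;                 /* "-" ">" would read as "->" */
  if (last == '/' && first == '*')
    return true;                 /* would open a comment */
  if ((last == '.' || isdigit(last)) && (first == '.' || isdigit(first)))
    return true;                 /* "1" "." or "." "5": be cautious */
  return false;
}
```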

@node Line Numbering
@unnumbered Line numbering
@cindex line numbers

@section Just which line number anyway?

There are three reasonable requirements a cpplib client might have for
the line number of a token passed to it:

@itemize @bullet
@item
The source line it was lexed on.
@item
The line it is output on.  This can be different to the line it was
lexed on if, for example, there are intervening escaped newlines or
C-style comments.  For example:

@smallexample
foo /* @r{A long
comment} */ bar \
baz
@result{}
foo bar baz
@end smallexample

@item
If the token results from a macro expansion, the line of the macro name,
or possibly the line of the closing parenthesis in the case of
function-like macro expansion.
@end itemize

The @code{cpp_token} structure contains @code{line} and @code{col}
members.  The lexer fills these in with the line and column of the first
character of the token.  Consequently, but maybe unexpectedly, a token
from the replacement list of a macro expansion carries the location of
the token within the @code{#define} directive, because cpplib expands a
macro by returning pointers to the tokens in its replacement list.  The
current implementation of cpplib assigns tokens created from built-in
macros and the @samp{#} and @samp{##} operators the location of the most
recently lexed token.  This is because they are allocated from the
lexer's token runs, and because of the way the diagnostic routines infer
the appropriate location to report.

The diagnostic routines in cpplib display the location of the most
recently @emph{lexed} token, unless they are passed a specific line and
column to report.  For diagnostics regarding tokens that arise from
macro expansions, it might also be helpful for the user to see the
original location in the macro definition that the token came from.
Since that is exactly the information each token carries, such an
enhancement could be made relatively easily in future.

The stand-alone preprocessor faces a similar problem when determining
the correct line to output the token on: the position attached to a
token is fairly useless if the token came from a macro expansion.  All
tokens on a logical line should be output on its first physical line, so
the token's reported location is also wrong if it is part of a physical
line other than the first.

To solve these issues, cpplib provides a callback that is generated
whenever it lexes a preprocessing token that starts a new logical line
other than a directive.  It passes this token (which may be a
@code{CPP_EOF} token indicating the end of the translation unit) to the
callback routine, which can then use the line and column of this token
to produce correct output.

@section Representation of line numbers

As mentioned above, cpplib stores with each token the line number that
it was lexed on.  In fact, this number is not the number of the line in
the source file, but instead bears more resemblance to the number of the
line in the translation unit.

The preprocessor maintains a monotonically increasing line count, which
is incremented at every new line character (and also at the end of any
buffer that does not end in a new line).  Since a line number of zero is
useful to indicate certain special states and conditions, this variable
starts counting from one.

This variable therefore uniquely enumerates each line in the translation
unit.  With some simple infrastructure, it is straightforward to map
from this to the original source file and line number pair, saving space
whenever line number information needs to be saved.  The code that
implements this mapping lies in the files @file{line-map.c} and
@file{line-map.h}.
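
The mapping from a translation-unit line count back to a file and line
pair can be pictured as a sorted table searched by binary search.  The
sketch below illustrates the idea only; @code{struct map_entry},
@code{lookup_line} and @code{source_line} are hypothetical and much
simpler than the code in @file{line-map.c}.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical map entry: from logical line FROM_LINE onwards, lines
   come from FILE, starting at its line TO_LINE.  A table of entries
   is kept sorted by from_line.  */
struct map_entry {
  unsigned int from_line;
  const char *file;
  unsigned int to_line;
};

/* Find the last entry whose from_line is <= LOGICAL.  */
const struct map_entry *lookup_line(const struct map_entry *maps,
                                    size_t count, unsigned int logical)
{
  assert(count > 0 && maps[0].from_line <= logical);
  size_t lo = 0, hi = count;  /* invariant: maps[lo].from_line <= logical */
  while (hi - lo > 1) {
    size_t mid = lo + (hi - lo) / 2;
    if (maps[mid].from_line <= logical)
      lo = mid;
    else
      hi = mid;
  }
  return &maps[lo];
}

/* Translate a logical line into a line number within MAP's file.  */
unsigned int source_line(const struct map_entry *map, unsigned int logical)
{
  return map->to_line + (logical - map->from_line);
}
```

With entries for a main file that includes a three-line header on its
second line, logical line 4 resolves to line 2 of the header, and
logical line 7 to line 4 of the main file.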

Command-line macros and assertions are implemented by pushing a buffer
containing the right hand side of an equivalent @code{#define} or
@code{#assert} directive.  Some built-in macros are handled similarly.
Since these are all processed before the first line of the main input
file, it will typically have an assigned line closer to twenty than to
one.

@node Guard Macros
@unnumbered The Multiple-Include Optimization
@cindex guard macros
@cindex controlling macros
@cindex multiple-include optimization

Header files are often of the form

@smallexample
#ifndef FOO
#define FOO
@dots{}
#endif
@end smallexample

@noindent
to prevent the compiler from processing them more than once.  The
preprocessor notices such header files, so that if the header file
appears in a subsequent @code{#include} directive and @code{FOO} is
defined, then it is ignored and it doesn't preprocess or even re-open
the file a second time.  This is referred to as the @dfn{multiple
include optimization}.

Under what circumstances is such an optimization valid?  If the file
were included a second time, it can only be optimized away if that
inclusion would result in no tokens to return, and no relevant
directives to process.  Therefore the current implementation imposes
requirements and makes some allowances as follows:

@enumerate
@item
There must be no tokens outside the controlling @code{#if}-@code{#endif}
pair, but whitespace and comments are permitted.

@item
There must be no directives outside the controlling directive pair, but
the @dfn{null directive} (a line containing nothing other than a single
@samp{#} and possibly whitespace) is permitted.

@item
The opening directive must be of the form

@smallexample
#ifndef FOO
@end smallexample

or

@smallexample
#if !defined FOO     [equivalently, #if !defined(FOO)]
@end smallexample

@item
In the second form above, the tokens forming the @code{#if} expression
must have come directly from the source file---no macro expansion must
have been involved.  This is because macro definitions can change, and
tracking whether or not a relevant change has been made is not worth the
implementation cost.

@item
There can be no @code{#else} or @code{#elif} directives at the outer
conditional block level, because they would probably contain something
of interest to a subsequent pass.
@end enumerate

First, when pushing a new file on the buffer stack,
@code{stack_include_file} sets the controlling macro @code{mi_cmacro} to
@code{NULL}, and sets @code{mi_valid} to @code{true}.  This indicates
that the preprocessor has not yet encountered anything that would
invalidate the multiple-include optimization.  As described in the next
few paragraphs, these two variables having these values effectively
indicates top-of-file.

When about to return a token that is not part of a directive,
@code{_cpp_lex_token} sets @code{mi_valid} to @code{false}.  This
enforces the constraint that tokens outside the controlling conditional
block invalidate the optimization.

The @code{do_if}, when appropriate, and @code{do_ifndef} directive
handlers pass the controlling macro to the function
@code{push_conditional}.  cpplib maintains a stack of nested conditional
blocks, and after processing every opening conditional this function
pushes an @code{if_stack} structure onto the stack.  In this structure
it records the controlling macro for the block, provided there is one
and we're at top-of-file (as described above).  If an @code{#elif} or
@code{#else} directive is encountered, the controlling macro for that
block is cleared to @code{NULL}.  Otherwise, it survives until the
@code{#endif} closing the block, upon which @code{do_endif} sets
@code{mi_valid} to @code{true} and stores the controlling macro in
@code{mi_cmacro}.

@code{_cpp_handle_directive} clears @code{mi_valid} when processing any
directive other than an opening conditional and the null directive.
With this, and requiring top-of-file to record a controlling macro, and
no @code{#else} or @code{#elif} for it to survive and be copied to
@code{mi_cmacro} by @code{do_endif}, we have enforced the absence of
directives outside the main conditional block for the optimization to be
on.

Note that whilst we are inside the conditional block, @code{mi_valid} is
likely to be reset to @code{false}, but this does not matter since
the closing @code{#endif} restores it to @code{true} if appropriate.

Finally, since @code{_cpp_lex_direct} pops the file off the buffer stack
at @code{EOF} without returning a token, if the @code{#endif} directive
was not followed by any tokens, @code{mi_valid} is @code{true} and
@code{_cpp_pop_file_buffer} remembers the controlling macro associated
with the file.  Subsequent calls to @code{stack_include_file} result in
no buffer being pushed if the controlling macro is defined, effecting
the optimization.
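
The bookkeeping described above amounts to a small state machine.  The
sketch below replays it over a stream of abstract events.  The field
names @code{mi_valid} and @code{mi_cmacro} follow the text, but the
event type, the fixed-depth array standing in for the @code{if_stack}
list, and the function names are assumptions made for this
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MI_MAX_DEPTH 16

/* Abstract events; EV_IFNDEF stands for any opening conditional and
   carries its candidate controlling macro, or NULL if it has none.  */
enum event { EV_TOKEN, EV_IFNDEF, EV_ELSE, EV_ENDIF, EV_OTHER_DIRECTIVE };

struct mi_state {
  bool mi_valid;          /* nothing seen yet that defeats the optimization */
  const char *mi_cmacro;  /* candidate controlling macro, or NULL */
  const char *block_macro[MI_MAX_DEPTH];  /* one slot per open conditional */
  int depth;
};

void mi_start_file(struct mi_state *s)
{
  s->mi_valid = true;     /* together with a NULL macro: top-of-file */
  s->mi_cmacro = NULL;
  s->depth = 0;
}

void mi_event(struct mi_state *s, enum event ev, const char *macro)
{
  switch (ev) {
  case EV_TOKEN:
  case EV_OTHER_DIRECTIVE:
    s->mi_valid = false;
    break;
  case EV_IFNDEF:
    assert(s->depth < MI_MAX_DEPTH);
    /* Record the macro only if we are still at top-of-file.  */
    s->block_macro[s->depth++] =
        (s->mi_valid && s->mi_cmacro == NULL) ? macro : NULL;
    break;
  case EV_ELSE:
    assert(s->depth > 0);
    s->mi_valid = false;
    s->block_macro[s->depth - 1] = NULL;
    break;
  case EV_ENDIF:
    assert(s->depth > 0);
    s->mi_valid = false;
    s->depth--;
    if (s->depth == 0 && s->block_macro[0] != NULL) {
      s->mi_valid = true;               /* outermost guarded block closed */
      s->mi_cmacro = s->block_macro[0];
    }
    break;
  }
}

/* At end of file: the controlling macro to remember, or NULL.  */
const char *mi_end_file(const struct mi_state *s)
{
  return s->mi_valid ? s->mi_cmacro : NULL;
}
```

A file consisting of a single guarded block yields its guard macro,
while a token after the closing @code{#endif}, or an @code{#else} at
the outer level, yields @code{NULL}.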

A quick word on how we handle the

@smallexample
#if !defined FOO
@end smallexample

@noindent
case.  @code{_cpp_parse_expr} and @code{parse_defined} take steps to see
whether the three stages @samp{!}, @samp{defined-expression} and
@samp{end-of-directive} occur in order in a @code{#if} expression.  If
so, they return the guard macro to @code{do_if} in the variable
@code{mi_ind_cmacro}, and otherwise set it to @code{NULL}.
@code{enter_macro_context} sets @code{mi_valid} to @code{false}, so if a
macro was expanded whilst parsing any part of the expression, then the
top-of-file test in @code{push_conditional} fails and the optimization
is turned off.

@node Files
@unnumbered File Handling
@cindex files

Fairly obviously, the file handling code of cpplib resides in the file
@file{files.c}.  It takes care of the details of file searching,
opening, reading and caching, for both the main source file and all the
headers it recursively includes.

The basic strategy is to minimize the number of system calls.  On many
systems, the basic @code{open ()} and @code{fstat ()} system calls can
be quite expensive.  For every @code{#include}-d file, we need to try
all the directories in the search path until we find a match.  Some
projects, such as glibc, pass twenty or thirty include paths on the
command line, so this can rapidly become time consuming.

For a header file we have not encountered before we have little choice
but to do this.  However, it is often the case that the same headers are
repeatedly included, and in these cases we try to avoid repeating the
filesystem queries whilst searching for the correct file.

For each file we try to open, we store the constructed path in a splay
tree.  This path first undergoes simplification by the function
@code{_cpp_simplify_pathname}.  For example,
@file{/usr/include/bits/../foo.h} is simplified to
@file{/usr/include/foo.h} before we enter it in the splay tree and try
to @code{open ()} the file.  CPP will then find subsequent uses of
@file{foo.h}, even as @file{/usr/include/foo.h}, in the splay tree and
save system calls.
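
The kind of textual simplification mentioned above can be sketched as
follows.  The function @code{simplify_path} is a hypothetical stand-in
written for this illustration, not the code of
@code{_cpp_simplify_pathname}.  It works purely on the text of the
path, so it assumes that no directory component is a symbolic link.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Collapse ".", "dir/.." and repeated slashes in PATH, in place.  */
void simplify_path(char *path)
{
  assert(path != NULL);
  if (path[0] == '\0')
    return;

  bool absolute = (path[0] == '/');
  char *base = absolute ? path + 1 : path;  /* never back up past here */
  char *out = base;                         /* end of simplified prefix */
  const char *in = base;

  while (*in != '\0') {
    const char *end = in;                   /* component is [in, end) */
    while (*end != '\0' && *end != '/')
      end++;
    size_t len = (size_t) (end - in);
    bool dotdot = (len == 2 && in[0] == '.' && in[1] == '.');

    /* Locate the last component already written.  */
    char *last = out;
    while (last > base && last[-1] != '/')
      last--;
    bool last_is_dotdot =
        (out - last == 2 && last[0] == '.' && last[1] == '.');

    if (len == 0 || (len == 1 && in[0] == '.')) {
      /* Empty component or ".": contributes nothing.  */
    } else if (dotdot && out > base && !last_is_dotdot) {
      /* "dir/..": drop "dir" and the slash before it.  */
      out = (last > base) ? last - 1 : base;
    } else if (dotdot && absolute && out == base) {
      /* "/..": the root is its own parent.  */
    } else {
      if (out > base)
        *out++ = '/';
      memmove(out, in, len);                /* out never passes in */
      out += len;
    }
    in = (*end == '/') ? end + 1 : end;
  }

  if (out == path)
    *out++ = '.';           /* e.g. "a/.." names the current directory */
  *out = '\0';
}
```

Applied to the path from the text, @file{/usr/include/bits/../foo.h},
this produces @file{/usr/include/foo.h}, so both spellings would map to
the same key.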

Further, it is likely the file contents have also been cached, saving a
@code{read ()} system call.  We don't bother caching the contents of
header files that are re-inclusion protected, and whose re-inclusion
macro is defined when we leave the header file for the first time.  If
the host supports it, we try to map suitably large files into memory,
rather than reading them in directly.

The include paths are internally stored on a null-terminated
singly-linked list, starting with the @code{"header.h"} directory search
chain, which then links into the @code{<header.h>} directory chain.

Files included with the @code{<foo.h>} syntax start the lookup directly
in the second half of this chain.  However, files included with the
@code{"foo.h"} syntax start at the beginning of the chain, but with one
extra directory prepended.  This is the directory of the current file;
the one containing the @code{#include} directive.  Prepending this
directory on a per-file basis is handled by the function
@code{search_from}.
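
The shape of this chain, one list with two entry points, can be
sketched as below.  The types and functions are hypothetical and exist
only to show how the quoted form sees every directory the bracketed
form sees, plus those in front of it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node of the include chain.  */
struct search_dir {
  const char *name;
  struct search_dir *next;     /* NULL terminates the chain */
};

struct search_chains {
  struct search_dir *quote;    /* head: used for "header.h" */
  struct search_dir *bracket;  /* entry part-way down: used for <header.h> */
};

/* Where a lookup starts.  For the quoted form, CURRENT, the directory
   of the including file, is linked in front of the whole chain.  */
struct search_dir *search_start(const struct search_chains *c, int angle,
                                struct search_dir *current)
{
  if (angle)
    return c->bracket;
  assert(current != NULL);
  current->next = c->quote;
  return current;
}

/* Number of directories that would be tried starting from D.  */
size_t chain_length(const struct search_dir *d)
{
  size_t n = 0;
  for (; d != NULL; d = d->next)
    n++;
  return n;
}
```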

Note that a header included with a directory component, such as
@code{#include "mydir/foo.h"} and opened as
@file{/usr/local/include/mydir/foo.h}, will have the complete path minus
the basename @samp{foo.h} as the current directory.

Enough information is stored in the splay tree that CPP can immediately
tell whether it can skip the header file because of the multiple include
optimization, whether the file didn't exist or couldn't be opened for
some reason, or whether the header was flagged not to be re-used, as it
is with the obsolete @code{#import} directive.

For the benefit of MS-DOS filesystems with an 8.3 filename limitation,
CPP offers the ability to treat various include file names as aliases
for the real header files with shorter names.  The map from one to the
other is found in a special file called @samp{header.gcc}, stored in the
command line (or system) include directories to which the mapping
applies.  This may be higher up the directory tree than the full path to
the file minus the base name.

@node Concept Index
@unnumbered Concept Index
@printindex cp

@bye