<html lang="en">
<head>
<title>Tokenization - The C Preprocessor</title>
<meta http-equiv="Content-Type" content="text/html">
<meta name="description" content="The C Preprocessor">
<meta name="generator" content="makeinfo 4.13">
<link title="Top" rel="start" href="index.html#Top">
<link rel="up" href="Overview.html#Overview" title="Overview">
<link rel="prev" href="Initial-processing.html#Initial-processing" title="Initial processing">
<link rel="next" href="The-preprocessing-language.html#The-preprocessing-language" title="The preprocessing language">
<link href="http://www.gnu.org/software/texinfo/" rel="generator-home" title="Texinfo Homepage">
<!--
Copyright (C) 1987, 1989, 1991, 1992, 1993, 1994, 1995, 1996,
1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007,
2008, 2009, 2010, 2011
Free Software Foundation, Inc.

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation.  A copy of
the license is included in the
section entitled ``GNU Free Documentation License''.

This manual contains no Invariant Sections.  The Front-Cover Texts are
(a) (see below), and the Back-Cover Texts are (b) (see below).

(a) The FSF's Front-Cover Text is:

     A GNU Manual

(b) The FSF's Back-Cover Text is:

     You have freedom to copy and modify this GNU Manual, like GNU
     software.  Copies published by the Free Software Foundation raise
     funds for GNU development.
-->
<meta http-equiv="Content-Style-Type" content="text/css">
<style type="text/css"><!--
  pre.display { font-family:inherit }
  pre.format  { font-family:inherit }
  pre.smalldisplay { font-family:inherit; font-size:smaller }
  pre.smallformat  { font-family:inherit; font-size:smaller }
  pre.smallexample { font-size:smaller }
  pre.smalllisp    { font-size:smaller }
  span.sc    { font-variant:small-caps }
  span.roman { font-family:serif; font-weight:normal; }
  span.sansserif { font-family:sans-serif; font-weight:normal; }
--></style>
<link rel="stylesheet" type="text/css" href="../cs.css">
</head>
<body>
<div class="node">
<a name="Tokenization"></a>
<p>
Next:&nbsp;<a rel="next" accesskey="n" href="The-preprocessing-language.html#The-preprocessing-language">The preprocessing language</a>,
Previous:&nbsp;<a rel="previous" accesskey="p" href="Initial-processing.html#Initial-processing">Initial processing</a>,
Up:&nbsp;<a rel="up" accesskey="u" href="Overview.html#Overview">Overview</a>
<hr>
</div>

<h3 class="section">1.3 Tokenization</h3>

<p><a name="index-tokens-8"></a><a name="index-preprocessing-tokens-9"></a>After the textual transformations are finished, the input file is
converted into a sequence of <dfn>preprocessing tokens</dfn>.  These mostly
correspond to the syntactic tokens used by the C compiler, but there are
a few differences.  White space separates tokens; it is not itself a
token of any kind.  Tokens do not have to be separated by white space,
but white space is often necessary to avoid ambiguities.

   <p>When faced with a sequence of characters that has more than one possible
tokenization, the preprocessor is greedy.  It always makes each token,
starting from the left, as big as possible before moving on to the next
token.  For instance, <code>a+++++b</code> is interpreted as
<code>a&nbsp;++&nbsp;++&nbsp;+&nbsp;b<!-- /@w --></code>, not as <code>a&nbsp;++&nbsp;+&nbsp;++&nbsp;b<!-- /@w --></code>, even though the
latter tokenization could be part of a valid C program and the former
could not.

   <p>Once the input file is broken into tokens, the token boundaries never
change, except when the &lsquo;<samp><span class="samp">##</span></samp>&rsquo; preprocessing operator is used to paste
tokens together.  See <a href="Concatenation.html#Concatenation">Concatenation</a>.  For example,

<pre class="smallexample">     #define foo() bar
     foo()baz
          ==&gt; bar baz
     <em>not</em>
          ==&gt; barbaz
</pre>
   <p>The compiler does not re-tokenize the preprocessor's output.  Each
preprocessing token becomes one compiler token.

   <p><a name="index-identifiers-10"></a>Preprocessing tokens fall into five broad classes: identifiers,
preprocessing numbers, string literals, punctuators, and other.  An
<dfn>identifier</dfn> is the same as an identifier in C: any sequence of
letters, digits, or underscores, which begins with a letter or
underscore.  Keywords of C have no significance to the preprocessor;
they are ordinary identifiers.  You can define a macro whose name is a
keyword, for instance.  The only identifier which can be considered a
preprocessing keyword is <code>defined</code>.  See <a href="Defined.html#Defined">Defined</a>.

   <p>This is mostly true of other languages which use the C preprocessor.
However, a few of the keywords of C++ are significant even in the
preprocessor.  See <a href="C_002b_002b-Named-Operators.html#C_002b_002b-Named-Operators">C++ Named Operators</a>.

   <p>In the 1999 C standard, identifiers may contain letters which are not
part of the &ldquo;basic source character set&rdquo;, at the implementation's
discretion (such as accented Latin letters, Greek letters, or Chinese
ideograms).  This may be done with an extended character set, or the
&lsquo;<samp><span class="samp">\u</span></samp>&rsquo; and &lsquo;<samp><span class="samp">\U</span></samp>&rsquo; escape sequences.  The implementation of this
feature in GCC is experimental; such characters are only accepted in
the &lsquo;<samp><span class="samp">\u</span></samp>&rsquo; and &lsquo;<samp><span class="samp">\U</span></samp>&rsquo; forms and only if
<samp><span class="option">-fextended-identifiers</span></samp> is used.

   <p>As an extension, GCC treats &lsquo;<samp><span class="samp">$</span></samp>&rsquo; as a letter.  This is for
compatibility with some systems, such as VMS, where &lsquo;<samp><span class="samp">$</span></samp>&rsquo; is commonly
used in system-defined function and object names.  &lsquo;<samp><span class="samp">$</span></samp>&rsquo; is not a
letter in strictly conforming mode, or if you specify the <samp><span class="option">-$</span></samp>
option.  See <a href="Invocation.html#Invocation">Invocation</a>.

   <p><a name="index-numbers-11"></a><a name="index-preprocessing-numbers-12"></a>A <dfn>preprocessing number</dfn> has a rather bizarre definition.  The
category includes all the normal integer and floating point constants
one expects of C, but also a number of other things one might not
initially recognize as a number.  Formally, preprocessing numbers begin
with an optional period, a required decimal digit, and then continue
with any sequence of letters, digits, underscores, periods, and
exponents.  Exponents are the two-character sequences &lsquo;<samp><span class="samp">e+</span></samp>&rsquo;,
&lsquo;<samp><span class="samp">e-</span></samp>&rsquo;, &lsquo;<samp><span class="samp">E+</span></samp>&rsquo;, &lsquo;<samp><span class="samp">E-</span></samp>&rsquo;, &lsquo;<samp><span class="samp">p+</span></samp>&rsquo;, &lsquo;<samp><span class="samp">p-</span></samp>&rsquo;, &lsquo;<samp><span class="samp">P+</span></samp>&rsquo;, and
&lsquo;<samp><span class="samp">P-</span></samp>&rsquo;.  (The exponents that begin with &lsquo;<samp><span class="samp">p</span></samp>&rsquo; or &lsquo;<samp><span class="samp">P</span></samp>&rsquo; are new
to C99.  They are used for hexadecimal floating-point constants.)

   <p>The purpose of this unusual definition is to isolate the preprocessor
from the full complexity of numeric constants.  It does not have to
distinguish between lexically valid and invalid floating-point numbers,
which is complicated.  The definition also permits you to split an
identifier at any position and get exactly two tokens, which can then be
pasted back together with the &lsquo;<samp><span class="samp">##</span></samp>&rsquo; operator.

   <p>It's possible for preprocessing numbers to cause programs to be
misinterpreted.  For example, <code>0xE+12</code> is a preprocessing number
which does not translate to any valid numeric constant, and is therefore a
syntax error.  It does not mean <code>0xE&nbsp;+&nbsp;12<!-- /@w --></code>, which is what you
might have intended.

   <p><a name="index-string-literals-13"></a><a name="index-string-constants-14"></a><a name="index-character-constants-15"></a><a name="index-header-file-names-16"></a><!-- the @: prevents makeinfo from turning '' into ". -->
<dfn>String literals</dfn> are string constants, character constants, and
header file names (the argument of &lsquo;<samp><span class="samp">#include</span></samp>&rsquo;).<a rel="footnote" href="#fn-1" name="fnd-1"><sup>1</sup></a>  String constants and character
constants are straightforward: <tt>"<small class="dots">...</small>"</tt> or <tt>'<small class="dots">...</small>'</tt>.  In
either case embedded quotes should be escaped with a backslash:
<tt>'\''</tt> is the character constant for &lsquo;<samp><span class="samp">'</span></samp>&rsquo;.  There is no limit on
the length of a character constant, but the value of a character
constant that contains more than one character is
implementation-defined.  See <a href="Implementation-Details.html#Implementation-Details">Implementation Details</a>.

   <p>Header file names either look like string constants, <tt>"<small class="dots">...</small>"</tt>, or are
written with angle brackets instead, <tt>&lt;<small class="dots">...</small>&gt;</tt>.  In either case,
backslash is an ordinary character.  There is no way to escape the
closing quote or angle bracket.  The preprocessor looks for the header
file in different places depending on which form you use.  See <a href="Include-Operation.html#Include-Operation">Include Operation</a>.

   <p>No string literal may extend past the end of a line.  Older versions
of GCC accepted multi-line string constants.  You may use continued
lines instead, or string constant concatenation.  See <a href="Differences-from-previous-versions.html#Differences-from-previous-versions">Differences from previous versions</a>.

   <p><a name="index-punctuators-17"></a><a name="index-digraphs-18"></a><a name="index-alternative-tokens-19"></a><dfn>Punctuators</dfn> are all the usual bits of punctuation which are
meaningful to C and C++.  All but three of the punctuation characters in
ASCII are C punctuators.  The exceptions are &lsquo;<samp><span class="samp">@</span></samp>&rsquo;, &lsquo;<samp><span class="samp">$</span></samp>&rsquo;, and
&lsquo;<samp><span class="samp">`</span></samp>&rsquo;.  In addition, all the two- and three-character operators are
punctuators.  There are also six <dfn>digraphs</dfn>, which the C++ standard
calls <dfn>alternative tokens</dfn>; they are merely alternate ways to spell
other punctuators.  Digraphs are a second attempt to work around missing
punctuation in obsolete systems.  They have no negative side effects,
unlike trigraphs, but do not cover as much ground.  The digraphs and
their corresponding normal punctuators are:

<pre class="smallexample">     Digraph:        &lt;%  %&gt;  &lt;:  :&gt;  %:  %:%:
     Punctuator:      {   }   [   ]   #    ##
</pre>
   <p><a name="index-other-tokens-20"></a>Any other single character is considered &ldquo;other&rdquo;.  It is passed on to
the preprocessor's output unmolested.  The C compiler will almost
certainly reject source code containing &ldquo;other&rdquo; tokens.  In ASCII, the
only other characters are &lsquo;<samp><span class="samp">@</span></samp>&rsquo;, &lsquo;<samp><span class="samp">$</span></samp>&rsquo;, &lsquo;<samp><span class="samp">`</span></samp>&rsquo;, and control
characters other than NUL (all bits zero).  (Note that &lsquo;<samp><span class="samp">$</span></samp>&rsquo; is
normally considered a letter.)  All characters with the high bit set
(numeric range 0x80&ndash;0xFF) are also &ldquo;other&rdquo; in the present
implementation.  This will change when proper support for international
character sets is added to GCC.

   <p>NUL is a special case because of the high probability that its
appearance is accidental, and because it may be invisible to the user
(many terminals do not display NUL at all).  Within comments, NULs are
silently ignored, just as any other character would be.  In running
text, NUL is considered white space.  For example, these two directives
have the same meaning.

<pre class="smallexample">     #define X^@1
     #define X 1
</pre>
   <p class="noindent">(where &lsquo;<samp><span class="samp">^@</span></samp>&rsquo; is ASCII NUL).  Within string or character constants,
NULs are preserved.  In the latter two cases the preprocessor emits a
warning message.

   <div class="footnote">
<hr>
<h4>Footnotes</h4><p class="footnote"><small>[<a name="fn-1" href="#fnd-1">1</a>]</small> The C
standard uses the term <dfn>string literal</dfn> to refer only to what we are
calling <dfn>string constants</dfn>.</p>

   <hr></div>

   </body></html>