<html lang="en">
<head>
<title>Tokenization - The C Preprocessor</title>
<meta http-equiv="Content-Type" content="text/html">
<meta name="description" content="The C Preprocessor">
<meta name="generator" content="makeinfo 4.13">
<link title="Top" rel="start" href="index.html#Top">
<link rel="up" href="Overview.html#Overview" title="Overview">
<link rel="prev" href="Initial-processing.html#Initial-processing" title="Initial processing">
<link rel="next" href="The-preprocessing-language.html#The-preprocessing-language" title="The preprocessing language">
<link href="http://www.gnu.org/software/texinfo/" rel="generator-home" title="Texinfo Homepage">
<!--
Copyright (C) 1987, 1989, 1991, 1992, 1993, 1994, 1995, 1996,
1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007,
2008, 2009, 2010, 2011
Free Software Foundation, Inc.

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation. A copy of
the license is included in the
section entitled ``GNU Free Documentation License''.

This manual contains no Invariant Sections. The Front-Cover Texts are
(a) (see below), and the Back-Cover Texts are (b) (see below).

(a) The FSF's Front-Cover Text is:

     A GNU Manual

(b) The FSF's Back-Cover Text is:

     You have freedom to copy and modify this GNU Manual, like GNU
     software. Copies published by the Free Software Foundation raise
     funds for GNU development.
-->
<meta http-equiv="Content-Style-Type" content="text/css">
<style type="text/css"><!--
  pre.display { font-family:inherit }
  pre.format { font-family:inherit }
  pre.smalldisplay { font-family:inherit; font-size:smaller }
  pre.smallformat { font-family:inherit; font-size:smaller }
  pre.smallexample { font-size:smaller }
  pre.smalllisp { font-size:smaller }
  span.sc { font-variant:small-caps }
  span.roman { font-family:serif; font-weight:normal; }
  span.sansserif { font-family:sans-serif; font-weight:normal; }
--></style>
<link rel="stylesheet" type="text/css" href="../cs.css">
</head>
<body>
<div class="node">
<a name="Tokenization"></a>
<p>
Next: <a rel="next" accesskey="n" href="The-preprocessing-language.html#The-preprocessing-language">The preprocessing language</a>,
Previous: <a rel="previous" accesskey="p" href="Initial-processing.html#Initial-processing">Initial processing</a>,
Up: <a rel="up" accesskey="u" href="Overview.html#Overview">Overview</a>
<hr>
</div>

<h3 class="section">1.3 Tokenization</h3>

<p><a name="index-tokens-8"></a><a name="index-preprocessing-tokens-9"></a>After the textual transformations are finished, the input file is
converted into a sequence of <dfn>preprocessing tokens</dfn>. These mostly
correspond to the syntactic tokens used by the C compiler, but there are
a few differences. White space separates tokens; it is not itself a
token of any kind. Tokens do not have to be separated by white space,
but white space is often necessary to avoid ambiguity.

   <p>When faced with a sequence of characters that has more than one possible
tokenization, the preprocessor is greedy. It always makes each token,
starting from the left, as big as possible before moving on to the next
token.
For instance, <code>a+++++b</code> is interpreted as
<code>a ++ ++ + b<!-- /@w --></code>, not as <code>a ++ + ++ b<!-- /@w --></code>, even though the
latter tokenization could be part of a valid C program and the former
could not.

   <p>Once the input file is broken into tokens, the token boundaries never
change, except when the ‘<samp><span class="samp">##</span></samp>’ preprocessing operator is used to paste
tokens together. See <a href="Concatenation.html#Concatenation">Concatenation</a>. For example,

<pre class="smallexample">     #define foo() bar
     foo()baz
          ==&gt; bar baz
     <em>not</em>
          ==&gt; barbaz
</pre>
   <p>The compiler does not re-tokenize the preprocessor's output. Each
preprocessing token becomes one compiler token.

   <p><a name="index-identifiers-10"></a>Preprocessing tokens fall into five broad classes: identifiers,
preprocessing numbers, string literals, punctuators, and other. An
<dfn>identifier</dfn> is the same as an identifier in C: any sequence of
letters, digits, or underscores, which begins with a letter or
underscore. Keywords of C have no significance to the preprocessor;
they are ordinary identifiers. You can define a macro whose name is a
keyword, for instance. The only identifier that can be considered a
preprocessing keyword is <code>defined</code>. See <a href="Defined.html#Defined">Defined</a>.

   <p>The same is mostly true of other languages that use the C preprocessor.
However, a few of the keywords of C++ are significant even in the
preprocessor. See <a href="C_002b_002b-Named-Operators.html#C_002b_002b-Named-Operators">C++ Named Operators</a>.

   <p>In the 1999 C standard, identifiers may contain letters which are not
part of the “basic source character set”, at the implementation's
discretion (such as accented Latin letters, Greek letters, or Chinese
ideograms).
Such letters may be written using an extended character set, or with the
‘<samp><span class="samp">\u</span></samp>’ and ‘<samp><span class="samp">\U</span></samp>’ escape sequences. The implementation of this
feature in GCC is experimental; such characters are only accepted in
the ‘<samp><span class="samp">\u</span></samp>’ and ‘<samp><span class="samp">\U</span></samp>’ forms and only if
<samp><span class="option">-fextended-identifiers</span></samp> is used.

   <p>As an extension, GCC treats ‘<samp><span class="samp">$</span></samp>’ as a letter. This is for
compatibility with some systems, such as VMS, where ‘<samp><span class="samp">$</span></samp>’ is commonly
used in system-defined function and object names. ‘<samp><span class="samp">$</span></samp>’ is not a
letter in strictly conforming mode, or if you specify the <samp><span class="option">-$</span></samp>
option. See <a href="Invocation.html#Invocation">Invocation</a>.

   <p><a name="index-numbers-11"></a><a name="index-preprocessing-numbers-12"></a>A <dfn>preprocessing number</dfn> has a rather bizarre definition. The
category includes all the normal integer and floating point constants
one expects of C, but also a number of other things one might not
initially recognize as a number. Formally, preprocessing numbers begin
with an optional period, a required decimal digit, and then continue
with any sequence of letters, digits, underscores, periods, and
exponents. Exponents are the two-character sequences ‘<samp><span class="samp">e+</span></samp>’,
‘<samp><span class="samp">e-</span></samp>’, ‘<samp><span class="samp">E+</span></samp>’, ‘<samp><span class="samp">E-</span></samp>’, ‘<samp><span class="samp">p+</span></samp>’, ‘<samp><span class="samp">p-</span></samp>’, ‘<samp><span class="samp">P+</span></samp>’, and
‘<samp><span class="samp">P-</span></samp>’.
(The exponents that begin with ‘<samp><span class="samp">p</span></samp>’ or ‘<samp><span class="samp">P</span></samp>’ are new
to C99. They are used for hexadecimal floating-point constants.)

   <p>The purpose of this unusual definition is to isolate the preprocessor
from the full complexity of numeric constants. It does not have to
distinguish between lexically valid and invalid floating-point numbers,
which is complicated. The definition also permits you to split an
identifier at any position and get exactly two tokens, which can then be
pasted back together with the ‘<samp><span class="samp">##</span></samp>’ operator.

   <p>It's possible for preprocessing numbers to cause programs to be
misinterpreted. For example, <code>0xE+12</code> is a single preprocessing number
that does not translate to any valid numeric constant, so it is a
syntax error. It does not mean <code>0xE + 12<!-- /@w --></code>, which is what you
might have intended.

   <p><a name="index-string-literals-13"></a><a name="index-string-constants-14"></a><a name="index-character-constants-15"></a><a name="index-header-file-names-16"></a><!-- the @: prevents makeinfo from turning '' into ". -->
<dfn>String literals</dfn> are string constants, character constants, and
header file names (the argument of ‘<samp><span class="samp">#include</span></samp>’).<a rel="footnote" href="#fn-1" name="fnd-1"><sup>1</sup></a> String constants and character
constants are straightforward: <tt>"<small class="dots">...</small>"</tt> or <tt>'<small class="dots">...</small>'</tt>. In
either case embedded quotes should be escaped with a backslash:
<tt>'\''</tt> is the character constant for ‘<samp><span class="samp">'</span></samp>’. There is no limit on
the length of a character constant, but the value of a character
constant that contains more than one character is
implementation-defined.
See <a href="Implementation-Details.html#Implementation-Details">Implementation Details</a>.

   <p>Header file names either look like string constants, <tt>"<small class="dots">...</small>"</tt>, or are
written with angle brackets instead, <tt>&lt;<small class="dots">...</small>&gt;</tt>. In either case,
backslash is an ordinary character. There is no way to escape the
closing quote or angle bracket. The preprocessor looks for the header
file in different places depending on which form you use. See <a href="Include-Operation.html#Include-Operation">Include Operation</a>.

   <p>No string literal may extend past the end of a line. Older versions
of GCC accepted multi-line string constants. You may use continued
lines instead, or string constant concatenation. See <a href="Differences-from-previous-versions.html#Differences-from-previous-versions">Differences from previous versions</a>.

   <p><a name="index-punctuators-17"></a><a name="index-digraphs-18"></a><a name="index-alternative-tokens-19"></a><dfn>Punctuators</dfn> are all the usual bits of punctuation which are
meaningful to C and C++. All but three of the punctuation characters in
ASCII are C punctuators. The exceptions are ‘<samp><span class="samp">@</span></samp>’, ‘<samp><span class="samp">$</span></samp>’, and
‘<samp><span class="samp">`</span></samp>’. In addition, all the two- and three-character operators are
punctuators. There are also six <dfn>digraphs</dfn>, which the C++ standard
calls <dfn>alternative tokens</dfn>; these are merely alternate ways to spell
other punctuators. Digraphs are a second attempt to work around missing
punctuation in obsolete systems. They have no negative side effects,
unlike trigraphs, but do not cover as much ground.
The digraphs and
their corresponding normal punctuators are:

<pre class="smallexample">     Digraph:        &lt;%  %&gt;  &lt;:  :&gt;  %:  %:%:
     Punctuator:      {   }   [   ]   #    ##
</pre>
   <p><a name="index-other-tokens-20"></a>Any other single character is considered “other”. It is passed on to
the preprocessor's output unchanged. The C compiler will almost
certainly reject source code containing “other” tokens. In ASCII, the
only other characters are ‘<samp><span class="samp">@</span></samp>’, ‘<samp><span class="samp">$</span></samp>’, ‘<samp><span class="samp">`</span></samp>’, and control
characters other than NUL (all bits zero). (Note that ‘<samp><span class="samp">$</span></samp>’ is
normally considered a letter.) All characters with the high bit set
(numeric range 0x80–0xFF) are also “other” in the present
implementation. This will change when proper support for international
character sets is added to GCC.

   <p>NUL is a special case because of the high probability that its
appearance is accidental, and because it may be invisible to the user
(many terminals do not display NUL at all). Within comments, NULs are
silently ignored, just as any other character would be. In running
text, NUL is considered white space. For example, these two directives
have the same meaning.

<pre class="smallexample">     #define X^@1
     #define X 1
</pre>
   <p class="noindent">(where ‘<samp><span class="samp">^@</span></samp>’ is ASCII NUL). Within string or character constants,
NULs are preserved. In the latter two cases the preprocessor emits a
warning message.

   <div class="footnote">
<hr>
<h4>Footnotes</h4><p class="footnote"><small>[<a name="fn-1" href="#fnd-1">1</a>]</small> The C
standard uses the term <dfn>string literal</dfn> to refer only to what we are
calling <dfn>string constants</dfn>.</p>

   <hr></div>

   </body></html>