Phases of translation
The C++ source file is processed by the compiler as if the following phases take place, in this exact order:
Phase 1
1) The individual bytes of the source code file are mapped (in an implementation-defined manner) to the characters of the basic source character set. In particular, OS-dependent end-of-line indicators are replaced by newline characters.
2) The set of source file characters accepted is implementation-defined (since C++11). Any source file character that cannot be mapped to a character in the basic source character set is replaced by its universal character name (escaped with \u or \U) or by some implementation-defined form that is handled equivalently.
(until C++23)
Input files that are a sequence of UTF-8 code units (UTF-8 files) are guaranteed to be supported. The set of other supported kinds of input files is implementation-defined. If the set is non-empty, the kind of an input file is determined in an implementation-defined manner that includes a means of designating input files as UTF-8 files, independent of their content (recognizing the byte order mark is not sufficient).
(since C++23)
Phase 2
Phase 3
b) placeholder tokens produced by preprocessing import and module directives (i.e. import XXX; and module XXX;)
(since C++20)
- apostrophe (', U+0027),
- quotation mark (", U+0022), or
- a character not in the basic character set.
2) Any transformations performed during phase 1 and (until C++23) phase 2 between the initial and the final double quote of any raw string literal are reverted.
(since C++11)
Newlines are kept, and it is unspecified whether non-newline whitespace sequences may be collapsed into single space characters.
As characters from the source file are consumed to form the next preprocessing token (i.e., not being consumed as part of a comment or other forms of whitespace), universal character names are recognized and replaced by the designated element of the translation character set, except when matching a character sequence in:
(since C++23)
If the input has been parsed into preprocessing tokens up to a given character, the next preprocessing token is generally taken to be the longest sequence of characters that could constitute a preprocessing token, even if that would cause subsequent analysis to fail. This is commonly known as maximal munch.
    int foo = 1;
    int bar = 0xE+foo;      // error: invalid preprocessing number 0xE+foo
    int baz = 0xE + foo;    // OK
    int quux = bar+++++baz; // error: bar++ ++ +baz, not bar++ + ++baz
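Maximal munch also decides how well-formed code is read. The sketch below, with a hypothetical helper name, shows the common case of three consecutive plus signs: the longest token `++` is formed first, so the expression is a post-increment followed by an addition.

```cpp
// Hypothetical helper, for illustration only.
// The character sequence a+++b is tokenized as: a  ++  +  b
// so it means (a++) + b, never a + (++b).
inline int munch_demo(int& a, int b) {
    return a+++b;
}
```

Calling `munch_demo` with `a == 1` and `b == 2` yields 3 and leaves `a` equal to 2, because the old value of `a` takes part in the addition.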
The sole exceptions to the maximal munch rule are:
    #define R "x"
    const char* s = R"y";         // ill-formed raw string literal, not "x" "y"
    const char* s2 = R"(a)" "b)"; // a raw string literal followed by a normal string literal
    struct Foo { static const int value = 1; };
    std::vector<::Foo> x;  // OK, <: not taken as the alternative token for [
    extern int y<::>;      // OK, same as extern int y[];
    int z<:::Foo::value:>; // OK, same as int z[::Foo::value];
(since C++11)
- Header name preprocessing tokens are only formed within a #include or import (since C++20) directive or in a __has_include expression (since C++17).
std::vector<int> x; // OK, <int> not a header-name
Phase 4
Phase 5
1) All characters in character literals and string literals are converted from the source character set to the literal encoding (which may be a multibyte character encoding such as UTF-8, as long as the 96 characters of the basic character set have single-byte representations).
2) Escape sequences and universal character names in character literals and non-raw string literals are expanded and converted to the literal encoding.
If the character specified by a universal character name cannot be encoded as a single code point in the corresponding literal encoding, the result is implementation-defined, but is guaranteed not to be a null (wide) character.
Note: the conversion performed at this stage can be controlled by command-line options in some implementations. GCC and Clang use -finput-charset to specify the encoding of the source character set, and -fexec-charset and -fwide-exec-charset to specify the ordinary and wide literal encodings respectively. Visual Studio 2015 Update 2 and later use /source-charset and /execution-charset to specify the source character set and literal encoding respectively.
(until C++23)
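The expansion of escape sequences and universal character names into the literal encoding can be checked at compile time. The following is a minimal sketch, assuming an ASCII-compatible ordinary literal encoding; `utf8_code_unit` is a hypothetical helper introduced only for this example. UTF-8 literals are used for the universal character name because their encoding does not depend on compiler options.

```cpp
// The escape sequences \x41 and \n each expand to a single character,
// so the array holds two characters plus the terminating null.
static_assert(sizeof("\x41\n") == 3, "two characters plus terminator");

// U+00E9 is encoded in UTF-8 as the two code units 0xC3 0xA9,
// so the array holds two code units plus the terminating null.
static_assert(sizeof(u8"\u00E9") == 3, "two code units plus terminator");

// Hypothetical helper: returns the i-th code unit of the UTF-8 literal.
// The cast works whether the element type is char (before C++20)
// or char8_t (since C++20).
inline unsigned utf8_code_unit(int i) {
    return static_cast<unsigned char>(u8"\u00E9"[i]);
}
```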
For a sequence of two or more adjacent string literal tokens, a common encoding prefix is determined by the rules for string literal concatenation. Each such string literal token is then considered to have that common encoding prefix. (Character conversion is moved to phase 3.)
(since C++23)
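The effect of a common encoding prefix is visible in the type of the resulting literal. The sketch below is illustrative only, and `wide_joined` is a hypothetical name. It relies on the long-standing rule that when one adjacent literal has an encoding prefix and the other has none, the prefix applies to both.

```cpp
#include <type_traits>

// Only the first literal carries the L prefix, yet the whole sequence is
// treated as a wide string literal: four wchar_t elements plus the null.
static_assert(std::is_same<decltype(L"ab" "cd"), const wchar_t(&)[5]>::value,
              "the L prefix applies to both literals");

// The prefix may also appear on a later literal.
static_assert(std::is_same<decltype("ab" u"cd"), const char16_t(&)[5]>::value,
              "the u prefix applies to both literals");

// Hypothetical accessor for inspecting the characters at run time.
inline const wchar_t* wide_joined() {
    return L"ab" "cd";
}
```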
Phase 6
Adjacent string literals are concatenated.
Phase 7
Compilation takes place: each preprocessing token is converted to a token. The tokens are syntactically and semantically analyzed and translated as a translation unit.
Phase 8
Each translation unit is examined to produce a list of required template instantiations, including the ones requested by explicit instantiations. The definitions of the templates are located, and the required instantiations are performed to produce instantiation units.
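The two sources of required instantiations can be sketched as follows. The template `twice` and the function `use_implicit` are hypothetical names chosen for this example.

```cpp
// A function template; no code is generated from the template by itself.
template <typename T>
T twice(T value) {
    return value + value;
}

// Explicit instantiation definition: requests twice<double> in this
// translation unit even if nothing here calls it.
template double twice<double>(double);

// Implicit instantiation: calling twice with an int argument adds
// twice<int> to the list of required instantiations.
inline int use_implicit() {
    return twice(21);
}
```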
Phase 9
Translation units, instantiation units, and library components needed to satisfy external references are collected into a program image which contains information needed for execution in its execution environment.
Notes
Some compilers do not implement instantiation units (also known as template repositories or template registries) and simply compile each template instantiation at phase 7, storing the code in the object file where it is implicitly or explicitly requested, and then the linker collapses these compiled instantiations into one at phase 9.
Defect reports
The following behavior-changing defect reports were applied retroactively to previously published C++ standards.
DR | Applied to | Behavior as published | Correct behavior |
---|---|---|---|
CWG 787 | C++98 | the behavior was undefined if a non-empty source file does not end with a newline character at the end of phase 2 | add a terminating newline character in this case |
CWG 1775 | C++11 | forming a universal character name inside a raw string literal in phase 2 resulted in undefined behavior | made well-defined |
CWG 2747 | C++98 | phase 2 checked the end-of-file splice after splicing; this is unnecessary | removed the check |
P2621R2 | C++98 | universal character names were not allowed to be formed by line splicing or token concatenation | allowed |
References
- C++23 standard (ISO/IEC 14882:2023):
  - 5.2 Phases of translation [lex.phases]
- C++20 standard (ISO/IEC 14882:2020):
  - 5.2 Phases of translation [lex.phases]
- C++17 standard (ISO/IEC 14882:2017):
  - 5.2 Phases of translation [lex.phases]
- C++14 standard (ISO/IEC 14882:2014):
  - 2.2 Phases of translation [lex.phases]
- C++11 standard (ISO/IEC 14882:2011):
  - 2.2 Phases of translation [lex.phases]
- C++03 standard (ISO/IEC 14882:2003):
  - 2.1 Phases of translation [lex.phases]
- C++98 standard (ISO/IEC 14882:1998):
  - 2.1 Phases of translation [lex.phases]