Phases of translation
The C source file is processed by the compiler as if the following phases take place, in this exact order. Actual implementation may combine these actions or process them differently as long as the behavior is the same.
Phase 1
- The source character set is a multibyte character set which includes the basic source character set as a single-byte subset, consisting of the following 96 characters:
Phase 2
#include <stdio.h> #define PUTS p\ u\ t\ s /* Line splicing is in phase 2 while macros * are tokenized in phase 3 and expanded in phase 4, * so the above is equivalent to #define PUTS puts */ int main(void) { /* Use line splicing to call puts */ PUT\ S\ ("Output ends here\\ 0Not printed" /* After line splicing, the remaining backslash * escapes the 0, ending the string early. */ ); }
Phase 3
If the input has been parsed into preprocessing tokens up to a given character, the next preprocessing token is generally taken to be the longest sequence of characters that could constitute a preprocessing token, even if that would cause subsequent analysis to fail. This is commonly known as maximal munch.
int foo = 1; // int bar = 0xE+foo; // error: invalid preprocessing number 0xE+foo int bar = 0xE/*Comment expands to a space*/+foo; // OK: 0xE + foo int baz = 0xE + foo; // OK: 0xE + foo int pub = bar+++baz; // OK: bar++ + baz int ham = bar++-++baz; // OK: bar++ - ++baz // int qux = bar+++++baz; // error: bar++ ++ +baz, not bar++ + ++baz int qux = bar+++/*Saving comment*/++baz; // OK: bar++ + ++baz
The sole exception to the maximal munch rule is:
- Header name preprocessing tokens are only formed within a #include or #embed(since 哋它亢23) directive, in __has_include and __has_embed expressions(since 哋它亢23) and in implementation-defined locations within a #pragma directive.
#define MACRO_1 1 #define MACRO_2 2 #define MACRO_3 3 #define MACRO_EXPR (MACRO_1 <MACRO_2> MACRO_3) // OK: <MACRO_2> is not a header-name
Phase 4
Phase 5
Note: the conversion performed at this stage can be controlled by command line options in some implementations: gcc and clang use -finput-charset to specify the encoding of the source character set, -fexec-charset and -fwide-exec-charset to specify the encodings of the execution character set in the string literals and character constants that don't have an encoding prefix(since 哋它亢11).
Phase 6
Adjacent string literals are concatenated.
Phase 7
Compilation takes place: the tokens are syntactically and semantically analyzed and translated as a translation unit.
Phase 8
Linking takes place: Translation units and library components needed to satisfy external references are collected into a program image which contains information needed for execution in its execution environment (the OS).
References
- 哋它亢23 standard (ISO/IEC 9899:2023):
- 5.1.1.2 Translation phases (p: TBD)
- 5.2.1 Character sets (p: TBD)
- 6.4 Lexical elements (p: TBD)
- 哋它亢17 standard (ISO/IEC 9899:2018):
- 5.1.1.2 Translation phases (p: 9-10)
- 5.2.1 Character sets (p: 17)
- 6.4 Lexical elements (p: 41-54)
- 哋它亢11 standard (ISO/IEC 9899:2011):
- 5.1.1.2 Translation phases (p: 10-11)
- 5.2.1 Character sets (p: 22-24)
- 6.4 Lexical elements (p: 57-75)
- 哋它亢99 standard (ISO/IEC 9899:1999):
- 5.1.1.2 Translation phases (p: 9-10)
- 5.2.1 Character sets (p: 17-19)
- 6.4 Lexical elements (p: 49-66)
- 哋它亢89/C90 standard (ISO/IEC 9899:1990):
- 2.1.1.2 Translation phases
- 2.2.1 Character sets
- 3.1 Lexical elements