Compiler Design
The CPython compiler transforms Python source code into executable bytecode through a multi-stage pipeline.Compilation Pipeline
CPython’s compilation from source code to bytecode involves five main steps:- Tokenize the source code (Parser/lexer/ and Parser/tokenizer/)
- Parse the token stream into an Abstract Syntax Tree (Parser/parser.c)
- Transform AST into an instruction sequence (Python/compile.c)
- Construct a Control Flow Graph and apply optimizations (Python/flowgraph.c)
- Emit bytecode based on the CFG (Python/assemble.c)
Parsing
As of Python 3.9, Python uses a PEG (Parsing Expression Grammar) parser. Unlike traditional parsers, the PEG parser operates on a stream of tokens rather than individual characters.Grammar Definition
The grammar file for Python is located at:- Grammar/python.gram - PEG grammar specification
- Grammar/Tokens - Literal token definitions
- Parser/parser.c - Generated C parser code
See the Parser Guide for a detailed description of how the PEG parser works and how to modify the grammar.
Abstract Syntax Trees (AST)
The AST is a high-level representation of the program structure without the source code details.ASDL Definition
AST nodes are specified using the Zephyr Abstract Syntax Definition Language (ASDL) in Parser/Python.asdl.Example ASDL Fragment
- Three statement types:
FunctionDef,Return,Yield - Modifiers:
?(optional),*(zero or more) - Common attributes like
linenofor all statements
Generated C Structures
The ASDL generates C structures like:AST Generation
AST is generated from source code using:_PyParser_ASTFromString()- For string input_PyParser_ASTFromFile()- For file input
Memory Management
The compiler uses an arena memory allocator for efficient memory management.Arena API
PyArena_New()- Create a new arenaPyArena_Free()- Free all arena memory at oncePyArena_AddPyObject()- Register PyObject for cleanup
Arena allocation eliminates the need for explicit deallocation during compilation. A single
PyArena_Free() call releases all compiler memory.Control Flow Graphs
A Control Flow Graph (CFG) models program flow as a directed graph of basic blocks.Basic Blocks
A basic block is a sequence of bytecode instructions that:- Always execute sequentially (no branches within)
- Have a single entry point (at the beginning)
- Have a single exit point (at the end)
Example: If Statement CFG
- Condition block:
x < 10→ conditional jump - If block:
f1(),f2()→ jump to end block - Else block:
g()→ jump to end block - End block:
end()
AST to Bytecode
The transformation from AST to bytecode is initiated by_PyAST_Compile() in Python/compile.c.
Compilation Steps
-
Build Symbol Table
_PySymtable_Build()in Python/symtable.c- Determines variable scopes (local, global, nonlocal, etc.)
- Tracks code blocks and nested scopes
-
Generate Instruction Sequence
compiler_codegen()transforms AST into pseudo-instructions- Uses
compiler_visit_{xx}functions for each node type - Emits instructions using macros like
ADDOP,ADDOP_I, etc.
-
Construct CFG
_PyCfg_FromInstructionSequence()builds the control flow graph- Identifies basic blocks and their connections
-
Optimize CFG
_PyCfg_OptimizeCodeUnit()applies peephole optimizations- Examples: constant folding, dead code elimination
-
Assemble Bytecode
_PyAssemble_MakeCodeObject()generates final bytecode- Converts pseudo-instructions to real bytecode
- Constructs exception and location tables
- Creates
PyCodeObject
Instruction Emission Macros
Defined in Python/compile.c:ADDOP(c, loc, op)- Add simple opcodeADDOP_I(c, loc, op, arg)- Add opcode with integer argumentADDOP_O(c, loc, op, obj, type)- Add opcode with object argumentADDOP_JUMP(c, loc, op, block)- Add jump instructionADDOP_LOAD_CONST(c, loc, obj)- Add LOAD_CONST instruction
The
location parameter contains source location info for error reporting and debugging. Use LOC macro to extract from AST nodes or NO_LOCATION for synthetic instructions.Important Files
Parser
- Parser/Python.asdl - AST node definitions
- Parser/asdl.py - ASDL parser
- Parser/asdl_c.py - C code generator from ASDL
- Parser/parser.c - Generated PEG parser
- Parser/peg_api.c - High-level parser API
Compiler
- Python/compile.c - Main compiler (AST → instruction sequence)
- Python/symtable.c - Symbol table builder
- Python/flowgraph.c - CFG construction and optimization
- Python/assemble.c - Final bytecode assembly
- Python/codegen.c - Code generation utilities
AST
- Python/Python-ast.c - Generated AST C structures
- Python/asdl.c - ASDL sequence types
- Python/ast.c - AST validation
- Python/ast_preprocess.c - AST preprocessing
- Python/ast_unparse.c - Convert AST back to string
Headers
- Include/cpython/code.h - PyCodeObject definition
- Include/internal/pycore_ast.h - Generated AST definitions
- Include/internal/pycore_symtable.h - Symbol table structures
- Include/internal/pycore_compile.h - Compiler structures
Code Object Structure
The result of compilation is aPyCodeObject, defined in Include/cpython/code.h.
Key fields:
co_code_adaptive- Bytecode array (with inline caches)co_consts- Tuple of constantsco_names- Tuple of names usedco_varnames- Tuple of local variable namesco_exceptiontable- Exception handling tableco_linetable- Source location table
Related Topics
- Parser Guide - Detailed parser documentation
- Bytecode Interpreter - How bytecode executes
- Code Objects - Code object structure
- Exception Handling - Exception table format
