Guide to the Parser
CPython uses a PEG (Parsing Expression Grammar) parser introduced in Python 3.9 to replace the original LL(1) parser.PEG Parser Overview
The parser is generated from a grammar file by a parser generator, not hand-written.Key Files
- Grammar/python.gram - Python grammar specification
- Parser/parser.c - Generated C parser code
- Tools/peg_generator/ - Parser generator (Pegen)
To change Python’s syntax, modify Grammar/python.gram, then run
make regen-pegen.How PEG Parsers Work
PEG differs fundamentally from context-free grammars.Ordered Choice
The choice operator (|) is ordered:
- PEG: Try A first. If succeeds, done. Otherwise try B, then C.
- CFG: Deduce which alternative applies based on lookahead.
Critical Difference: In PEG,
A | B is NOT the same as B | A!Example: Order Matters
first_ruleacceptsaabut notaaa'a'matches first char, leaving'aa'for the final'a'(fails)
second_ruleacceptsaaabut notaa'aa'matches first two chars, leaving one for final'a'
Eagerness
PEG parsers are eager: Once an alternative succeeds, no backtracking to try other alternatives even if the overall parse fails.Common Mistake
This grammar is wrong:Grammar Syntax
Basic Rule Format
Grammar Expressions
| Expression | Description | Example |
|---|---|---|
# comment | Python-style comments | # This is a comment |
e1 e2 | Sequence | 'if' expression ':' |
e1 | e2 | Ordered choice | '(' expr ')' | atom |
(e) | Grouping | ('a' 'b')* |
[e] or e? | Optional | [','] or ','? |
e* | Zero or more | statement* |
e+ | One or more | 'a'+ |
s.e+ | Sep-separated | ','.expression+ |
&e | Positive lookahead | &'(' |
!e | Negative lookahead | !')' |
~ | Commit/cut | '(' ~ expr ')' |
Variables
Name sub-expressions for use in actions:Actions
Specify return value in curly braces:Critical: Actions must NOT mutate AST nodes passed via variables. Nodes are cached by memoization and might be reused. Create new copies if modification needed.
Left Recursion
PEG parsers typically don’t support left recursion, but CPython’s does!Examples
Simple:Memoization
Memoization caches parse results to avoid exponential time.Selective Memoization
CPython disables memoization by default except for rules marked(memo):
Left-recursive rules always use memoization (required for the algorithm).
Why Selective?
Full memoization:- Uses significant memory
- Has execution overhead (cache lookups)
- Often slower than re-parsing for simple rules
Keywords
Hard Keywords
Always reserved (single quotes in grammar):Soft Keywords
Contextual (double quotes in grammar):Listing Keywords
Error Handling
Generic Errors
PEG’s error heuristic: Report error at furthest token that failed to match.This heuristic works well for Python’s grammar but isn’t perfect. Lookaheads can affect error location.
Custom Errors
Useinvalid_ rules for better error messages:
Two-Phase Parsing
Phase 1: Parse withoutinvalid_ rules
- If succeeds: Return AST, skip phase 2
invalid_ rules (only if phase 1 failed)
- Try to match specific error patterns
- Raise custom
SyntaxErrorwith helpful message
All
invalid_ rules must start with invalid_ prefix to avoid impacting phase 1 performance.Testing Invalid Rules
Add syntax error after valid code:invalid_ rule works, error reports at $. If not, it reports elsewhere.
Automatic Variables
Pegen injects variables into action scope:p- Parser structureEXTRA- Expands to(_start_lineno, _start_col_offset, _end_lineno, _end_col_offset, p->arena)
Regenerating the Parser
After Grammar Changes
Regenerating Meta-Parser
If you modify Tools/peg_generator/pegen/metagrammar.gram:Tokenization
CPython’s tokenizer is separate from the parser (unlike many PEG implementations).Token List
Defined in Grammar/Tokens. After modifying:Why Separate Tokenizer?
Python needs custom tokenization for:- Indentation handling (INDENT/DEDENT tokens)
- Soft keywords (
async,awaitcompatibility) - Interactive mode
- Encoding detection
- Better error messages
Debugging
Python Parser (Experimentation)
Generate Python parser for testing:Verbose Mode
Compile Python in debug mode:>- Trying to parse rule+- Rule parsed successfully-- Rule failed!- Exception/error detected
Example Grammar
Simple arithmetic parser:- Parses expressions with
+,-,*,/ - Handles parentheses
- Respects operator precedence (multiplication before addition)
- Uses left recursion for left associativity
Testing
Grammar tests in: Parser generator tests:Related Topics
- Compiler Design - What happens after parsing
- Source Code Structure - Where parser files live
- PEP 617 - New PEG parser specification
