Overview
The Scanner (also called Lexer or Tokenizer) is the first phase of compilation. It reads the source code as a stream of characters and groups them into meaningful units called tokens.

Analogy: Like reading a sentence word-by-word instead of letter-by-letter.

"let x = 5;" → ["let", "x", "=", "5", ";"]
Token Types
The compiler recognizes these token categories:

Keywords
Reserved words with special meaning:
- let - Variable declaration
- print - Output statement
- leo, diego - Reserved (no-op)
Literals
Direct values:
- NUMERO - Integer literals (e.g., 42, 0, 100)
Identifiers
User-defined names:
- Variable names (e.g., x, suma, contador1)
- Must start with a letter or underscore
- Can contain letters, digits, and underscores
Operators
Mathematical operations:
- + (addition)
- - (subtraction)
- * (multiplication)
- / (division)
- = (assignment)
Delimiters
Grouping and separation:
- ( ) - Parentheses
- ; - Statement terminator
Special
Control tokens:
- FIN_ARCHIVO - End of input
- ERROR - Invalid character
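The categories above map naturally onto an enum. A minimal sketch of what TipoToken could contain — member names other than NUMERO, FIN_ARCHIVO, and ERROR are assumptions, since only those are quoted on this page:

```python
from enum import Enum, auto

class TipoToken(Enum):
    # Keywords (names assumed to mirror the source keywords)
    LET = auto()
    PRINT = auto()
    LEO = auto()
    DIEGO = auto()
    # Literals
    NUMERO = auto()
    # Identifiers
    IDENTIFICADOR = auto()
    # Operators (hypothetical member names)
    MAS = auto()          # +
    MENOS = auto()        # -
    POR = auto()          # *
    ENTRE = auto()        # /
    IGUAL = auto()        # =
    # Delimiters (hypothetical member names)
    PAREN_IZQ = auto()    # (
    PAREN_DER = auto()    # )
    PUNTO_COMA = auto()   # ;
    # Special
    FIN_ARCHIVO = auto()
    ERROR = auto()
```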
Token Structure
Each token carries, at minimum, its type (a TipoToken), the lexeme it matched, and the source position where it starts.

Example Tokens
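A minimal sketch of the Token dataclass, assuming it stores a type, the lexeme text, an optional literal value, and a line number — the exact field names are assumptions, not the ones in compfinal.py:

```python
from dataclasses import dataclass
from enum import Enum, auto

class TipoToken(Enum):       # abbreviated for this example
    LET = auto()
    IDENTIFICADOR = auto()
    NUMERO = auto()

@dataclass
class Token:
    tipo: TipoToken      # token category
    lexema: str          # exact source text
    literal: object      # parsed value (e.g., int for NUMERO), else None
    linea: int           # 1-based source line

# Example tokens for the statement: let x = 5;
t1 = Token(TipoToken.LET, "let", None, 1)
t2 = Token(TipoToken.IDENTIFICADOR, "x", None, 1)
t3 = Token(TipoToken.NUMERO, "5", 5, 1)
```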
Scanning Algorithm
The scanner uses a single-pass, character-by-character approach:

Initialize
- Set position to start of source code
- Prepare empty token list
- Initialize line/column tracking
Main Loop
While not at end of file:
- Mark start of new token
- Read next character
- Classify character
- Accumulate multi-character tokens
- Create token and add to list
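The loop above can be sketched end-to-end as follows. This is a simplified, self-contained version that returns (kind, lexeme) pairs; the real Scanner in compfinal.py builds Token objects and differs in detail, and the kind strings here are illustrative:

```python
def escanear(fuente: str) -> list[tuple[str, str]]:
    """Single pass over the source; one token per loop iteration."""
    palabras_clave = {"let", "print", "leo", "diego"}
    tokens, i = [], 0
    while i < len(fuente):
        inicio = i                             # mark start of new token
        c = fuente[i]; i += 1                  # read next character
        if c in " \t\r\n":                     # whitespace: no token
            continue
        if c == "/" and i < len(fuente) and fuente[i] == "/":
            while i < len(fuente) and fuente[i] != "\n":
                i += 1                         # // comment: consume, no token
        elif c.isdigit():                      # NUMERO: accumulate digits
            while i < len(fuente) and fuente[i].isdigit():
                i += 1
            tokens.append(("NUMERO", fuente[inicio:i]))
        elif c.isalpha() or c == "_":          # identifier or keyword
            while i < len(fuente) and (fuente[i].isalnum() or fuente[i] == "_"):
                i += 1
            lexema = fuente[inicio:i]
            kind = "KEYWORD" if lexema in palabras_clave else "IDENTIFICADOR"
            tokens.append((kind, lexema))
        elif c in "+-*/=();":
            tokens.append(("OPERADOR" if c in "+-*/=" else "DELIMITADOR", c))
        else:
            tokens.append(("ERROR", c))        # invalid character
    tokens.append(("FIN_ARCHIVO", ""))
    return tokens
```

For example, escanear("let x = 5;") classifies "let" as a keyword, "x" as an identifier, and "5" as a NUMERO before appending the FIN_ARCHIVO marker.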
Character Classification
Multi-Character Tokens
Numbers
Accumulate consecutive digits.

Identifiers and Keywords
Accumulate letters, digits, and underscores.

Comment Handling
Double-slash (//) comments are consumed without generating tokens.

Error Handling
Invalid Characters
Error Recovery
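A common recovery strategy — a sketch, not necessarily what compfinal.py does — is to emit an ERROR token for the offending character and keep scanning, so a single bad character does not hide later errors:

```python
def escanear_con_errores(fuente: str) -> list[tuple[str, str]]:
    """Emit one ERROR token per invalid character, then continue scanning."""
    tokens, i = [], 0
    while i < len(fuente):
        c = fuente[i]; i += 1
        if c.isspace():
            continue
        if c.isdigit() or c.isalpha() or c == "_" or c in "+-*/=();":
            tokens.append(("OK", c))           # valid character (simplified)
        else:
            tokens.append(("ERROR", c))        # recover: record and move on
    return tokens

# escanear_con_errores("a @ b $") yields two ERROR tokens, one each for @ and $
```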
Complete Example
- Input Code
- Token Stream
- Scanner Output
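As a worked example, scanning the statement from the Overview would produce a token stream like the following — the tipo names here are illustrative placeholders, not necessarily the TipoToken members in compfinal.py:

```python
# Input code:
fuente = "let x = 5;"

# Expected token stream as (tipo, lexema) pairs:
esperado = [
    ("LET", "let"),            # keyword
    ("IDENTIFICADOR", "x"),    # variable name
    ("IGUAL", "="),            # assignment operator
    ("NUMERO", "5"),           # integer literal
    ("PUNTO_COMA", ";"),       # statement terminator
    ("FIN_ARCHIVO", ""),       # end of input
]
```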
Implementation Details
Position Tracking
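A minimal sketch of line/column tracking inside the character-consuming helper. The class name and the linea/columna field names are hypothetical; only the _avanzar() helper name comes from this page:

```python
class Rastreador:
    """Tracks a 1-based line and column as characters are consumed."""
    def __init__(self, fuente: str):
        self.fuente = fuente
        self.pos = 0
        self.linea = 1
        self.columna = 1

    def _avanzar(self) -> str:
        """Consume and return the current character, updating position."""
        c = self.fuente[self.pos]
        self.pos += 1
        if c == "\n":            # newline: move to next line, reset column
            self.linea += 1
            self.columna = 1
        else:
            self.columna += 1
        return c
```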
Helper Methods
- _avanzar() - Consume the current character and advance the position
- _ver_actual() - Peek at the current character without consuming it
- _coincide() - Advance only if the current character matches an expected one; used to detect // comments
- _fin() - Check for EOF; True when all characters have been read

Performance
Time Complexity
O(n), where n = character count. Each character is read exactly once.
Space Complexity
O(t), where t = token count. Stores the list of tokens (typically t ≈ n/5).
Source Code Reference
Implementation
File:
compfinal.py

Lines: 207-491

Key Classes:
- TipoToken (enum) - Token type definitions
- Token (dataclass) - Token data structure
- Scanner (class) - Main lexer implementation
Key Methods:
- escanear_tokens() - Entry point
- _escanear_token() - Process one token
- _numero() - Scan numeric literal
- _identificador() - Scan identifier/keyword
Common Issues
Next Steps
Syntactic Analysis
See how tokens are structured into an Abstract Syntax Tree
API Reference
Detailed API documentation for the Scanner class