Overview
The lexer module provides fast tokenization of assembly language source code using the logos crate. It converts assembly source text into a stream of tokens for parsing.
Token Types
The lexer recognizes the following token categories:
Control Flow Instructions
- Halts program execution
- No operation (does nothing)
- Unconditional jump instruction
- Conditional jump instruction
- Call subroutine
- Return from subroutine
- Revert execution
Arithmetic Instructions
- Addition operation
- Subtraction operation
- Multiplication operation
- Division operation
- Modulo operation
- Add immediate value
Bitwise Instructions
- Bitwise AND
- Bitwise OR
- Bitwise XOR
- Bitwise NOT
- Shift left
- Shift right
Comparison Instructions
- Equal comparison
- Not equal comparison
- Less than comparison
- Greater than comparison
- Less than or equal comparison
- Greater than or equal comparison
- Check if value is zero
Memory Instructions
- Load 8-bit value from memory
- Load 64-bit value from memory
- Store 8-bit value to memory
- Store 64-bit value to memory
- Get memory size
- Copy memory region
Storage Instructions
- Load from persistent storage
- Store to persistent storage
Immediate Instructions
- Load immediate value into register
- Move value between registers
Context Instructions
- Get caller address
- Get call value
- Get current contract address
- Get current block number
- Get block timestamp
- Get remaining gas
Debug Instructions
- Log value for debugging
Operands and Symbols
- Register reference (R0-R15)
- Decimal number literal
- Hexadecimal number literal (0x prefix)
- Label or constant name
- Assembler directive (starts with .)
- Comma separator
- Colon for label definitions
Lexer API
Lexer::new
- Parameter: assembly source code to tokenize
- Returns: a new lexer instance ready to tokenize the source
Lexer::span
- Returns: the byte range of the current token
Lexer::slice
- Returns: the string content of the current token
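The relationship between span and slice can be illustrated with plain byte-range indexing: slice is simply the source text covered by span. The 4..6 range below is hand-computed for this sketch, not produced by the real lexer.

```rust
fn main() {
    let source = "ADD R1, R2";
    // span() reports the byte range of the current token;
    // slice() is equivalent to indexing the source with that range.
    let span = 4..6; // hand-computed byte range of "R1" in this sketch
    let slice = &source[span.clone()];
    assert_eq!(slice, "R1");
    println!("token at {:?} is {:?}", span, slice);
}
```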
Iterator Implementation
The Lexer type implements Iterator with items of type (Token, usize), where the tuple contains the token and its line number.
Returns the next token and its line number, or None at the end of input.
Features
- Case Insensitive: All instruction mnemonics are case-insensitive (ADD, add, Add all work)
- Line Tracking: Each token is tagged with its line number for error reporting
- Comment Support: Line comments starting with ; are automatically skipped
- Whitespace Handling: Spaces, tabs, and newlines are automatically skipped
- Error Recovery: Invalid characters are converted to error tokens
Usage Examples
Basic Tokenization
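A minimal self-contained sketch of the tokenization flow. The real lexer derives its rules with the logos crate; the variant names below (Instruction, Register, Number, Comma) are illustrative assumptions, not the crate's actual Token enum.

```rust
// Sketch of basic tokenization; variant names are assumptions.
#[derive(Debug, PartialEq)]
enum Token {
    Instruction(String), // case-insensitive mnemonic, e.g. HALT, NOP, ADD
    Register(u8),        // R0-R15
    Number(u64),         // decimal literal
    Comma,
}

fn classify(word: &str) -> Token {
    let upper = word.to_ascii_uppercase();
    // Registers: R0-R15, case-insensitive.
    if let Some(rest) = upper.strip_prefix('R') {
        if let Ok(n) = rest.parse::<u8>() {
            if n <= 15 {
                return Token::Register(n);
            }
        }
    }
    // Decimal number literal.
    if let Ok(n) = word.parse::<u64>() {
        return Token::Number(n);
    }
    // Everything else is treated as a mnemonic in this sketch;
    // the real lexer would reject unknown input as an error token.
    Token::Instruction(upper)
}

fn tokenize(src: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    for chunk in src.split_whitespace() {
        // Peel trailing commas off each whitespace-separated chunk so
        // "R1," yields Register(1) followed by Comma.
        let mut word = chunk;
        let mut commas = 0;
        while let Some(stripped) = word.strip_suffix(',') {
            word = stripped;
            commas += 1;
        }
        if !word.is_empty() {
            tokens.push(classify(word));
        }
        tokens.extend(std::iter::repeat_with(|| Token::Comma).take(commas));
    }
    tokens
}

fn main() {
    println!("{:?}", tokenize("add R1, r2, 42"));
}
```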
With Line Numbers
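A sketch of line tracking: each token is paired with its 1-based line number, mirroring the lexer's Iterator item type (Token, usize). Tokens are simplified to plain strings here.

```rust
// Pair every token with the line it came from, as the real
// iterator does with its (Token, usize) items.
fn tokens_with_lines(src: &str) -> Vec<(String, usize)> {
    let mut out = Vec::new();
    for (idx, line) in src.lines().enumerate() {
        for word in line.split_whitespace() {
            out.push((word.to_string(), idx + 1)); // 1-based line number
        }
    }
    out
}

fn main() {
    for (tok, line) in tokens_with_lines("nop\nadd R1, R2") {
        println!("line {}: {}", line, tok);
    }
}
```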
Handling Labels and Comments
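A sketch of how label definitions (identifier followed by a colon) and ; line comments interact: comments are stripped before tokenization, and a trailing colon marks a label. The Label(...) string rendering is illustrative only.

```rust
// Strip ; comments and recognize "name:" label definitions.
fn tokenize(src: &str) -> Vec<String> {
    let mut out = Vec::new();
    for line in src.lines() {
        // Line comments start with ';' and run to end of line.
        let code = line.split(';').next().unwrap_or("");
        for word in code.split_whitespace() {
            if let Some(name) = word.strip_suffix(':') {
                // Label definition: identifier followed by ':'
                out.push(format!("Label({})", name));
            } else {
                out.push(word.to_ascii_uppercase());
            }
        }
    }
    out
}

fn main() {
    println!("{:?}", tokenize("start: ; entry point\n  nop ; do nothing"));
}
```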
Hexadecimal Numbers
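A sketch of number-literal handling: a 0x prefix selects base-16 parsing, anything else is treated as a decimal literal, matching the two number patterns listed under Operands and Symbols.

```rust
// Parse a number token: 0x-prefixed hex, otherwise decimal.
fn parse_number(tok: &str) -> Option<u64> {
    if let Some(hex) = tok.strip_prefix("0x") {
        // Hexadecimal literal: 0x followed by hex digits.
        u64::from_str_radix(hex, 16).ok()
    } else {
        // Decimal literal.
        tok.parse().ok()
    }
}

fn main() {
    println!("{:?}", parse_number("0xFF")); // Some(255)
    println!("{:?}", parse_number("42"));   // Some(42)
}
```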
Register Range
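A sketch of register validation against the documented pattern ([Rr][0-9] or [Rr]1[0-5]): R0 through R15 are accepted case-insensitively, while out-of-range or padded forms like R16 and R07 are rejected.

```rust
// Validate a register token: R0-R15, case-insensitive, no zero padding.
fn parse_register(tok: &str) -> Option<u8> {
    let rest = tok.strip_prefix('R').or_else(|| tok.strip_prefix('r'))?;
    let n: u8 = rest.parse().ok()?;
    // Reject values above 15 and padded forms like "R07",
    // which the pattern [Rr][0-9] | [Rr]1[0-5] does not match.
    if n <= 15 && !(rest.len() > 1 && rest.starts_with('0')) {
        Some(n)
    } else {
        None
    }
}

fn main() {
    println!("{:?}", parse_register("r15")); // Some(15)
    println!("{:?}", parse_register("R16")); // None
}
```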
Directives
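A sketch of directive recognition against the documented pattern \.[a-z]+: a leading dot followed by one or more lowercase letters. The directive name used below is only an example.

```rust
// Check whether a token matches the directive pattern \.[a-z]+.
fn is_directive(tok: &str) -> bool {
    tok.strip_prefix('.')
        .map(|rest| !rest.is_empty() && rest.chars().all(|c| c.is_ascii_lowercase()))
        .unwrap_or(false)
}

fn main() {
    println!("{}", is_directive(".data")); // true
    println!("{}", is_directive("data"));  // false
}
```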
Token Patterns
- Keywords: case-insensitive instruction mnemonics (HALT, NOP, ADD, etc.)
- Registers: [Rr][0-9] or [Rr]1[0-5] (R0-R15)
- Decimal numbers: [0-9]+
- Hexadecimal numbers: 0x[0-9a-fA-F]+
- Identifiers: [a-zA-Z_][a-zA-Z0-9_]*
- Directives: \.[a-z]+
- Comments: ;[^\n]* (skipped automatically)