
Overview

The lexer module provides fast tokenization for assembly language source code using the logos crate. It converts assembly source text into a stream of tokens for parsing.

Token Types

The lexer recognizes the following token categories:

Control Flow Instructions

  • HALT: Halts program execution
  • NOP: No operation (does nothing)
  • JUMP: Unconditional jump instruction
  • JUMPI: Conditional jump instruction
  • CALL: Call subroutine
  • RET: Return from subroutine
  • REVERT: Revert execution

Arithmetic Instructions

  • ADD: Addition operation
  • SUB: Subtraction operation
  • MUL: Multiplication operation
  • DIV: Division operation
  • MOD: Modulo operation
  • ADDI: Add immediate value

Bitwise Instructions

  • AND: Bitwise AND
  • OR: Bitwise OR
  • XOR: Bitwise XOR
  • NOT: Bitwise NOT
  • SHL: Shift left
  • SHR: Shift right

Comparison Instructions

  • EQ: Equal comparison
  • NE: Not equal comparison
  • LT: Less than comparison
  • GT: Greater than comparison
  • LE: Less than or equal comparison
  • GE: Greater than or equal comparison
  • ISZERO: Check if value is zero

Memory Instructions

  • LOAD8: Load 8-bit value from memory
  • LOAD64: Load 64-bit value from memory
  • STORE8: Store 8-bit value to memory
  • STORE64: Store 64-bit value to memory
  • MSIZE: Get memory size
  • MCOPY: Copy memory region

Storage Instructions

  • SLOAD: Load from persistent storage
  • SSTORE: Store to persistent storage

Immediate Instructions

  • LOADI: Load immediate value into register
  • MOV: Move value between registers

Context Instructions

  • CALLER: Get caller address
  • CALLVALUE: Get call value
  • ADDRESS: Get current contract address
  • BLOCKNUMBER: Get current block number
  • TIMESTAMP: Get block timestamp
  • GAS: Get remaining gas

Debug Instructions

  • LOG: Log value for debugging

Operands and Symbols

  • Register(u8): Register reference (R0-R15)
  • Number(u64): Decimal number literal
  • HexNumber(u64): Hexadecimal number literal (0x prefix)
  • Identifier(String): Label or constant name
  • Directive(String): Assembler directive (starts with .)
  • Comma: Comma separator
  • Colon: Colon for label definitions

Lexer API

Lexer::new

pub fn new(source: &'source str) -> Self

Creates a new lexer for the given source code.

Parameters:
  • source (&str, required): Assembly source code to tokenize

Returns:
  • Lexer<'source>: A new lexer instance ready to tokenize the source

Lexer::span

pub fn span(&self) -> std::ops::Range<usize>

Returns the byte range of the current token in the source.

Returns:
  • Range<usize>: Byte range of the current token

Lexer::slice

pub fn slice(&self) -> &'source str

Returns the string slice of the current token.

Returns:
  • &'source str: String content of the current token
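
The two accessors above are related by a simple invariant: slice() is the source text covered by the byte range that span() reports. A stdlib-only sketch of that relationship (token_text and the hard-coded span are illustrative, not part of the crate's API):

```rust
// Sketch: slice() is equivalent to indexing the source with span().
fn token_text(source: &str, span: std::ops::Range<usize>) -> &str {
    &source[span]
}

fn main() {
    let source = "LOADI R0, 10";
    // Suppose span() reports 0..5 while the lexer sits on the first token.
    let span = 0..5;
    // slice() would then return "LOADI", the text at that byte range.
    assert_eq!(token_text(source, span), "LOADI");
    println!("span/slice invariant holds");
}
```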

Iterator Implementation

The Lexer implements Iterator with Item = (Token, usize): each item pairs a token with the line number it appears on.

fn next(&mut self) -> Option<(Token, usize)>

Returns:
  • Option<(Token, usize)>: The next token and its line number, or None at end of input

Features

  • Case Insensitive: All instruction mnemonics are case-insensitive (ADD, add, Add all work)
  • Line Tracking: Each token is tagged with its line number for error reporting
  • Comment Support: Line comments starting with ; are automatically skipped
  • Whitespace Handling: Spaces, tabs, and newlines are automatically skipped
  • Error Recovery: Invalid characters are converted to error tokens
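
The line-tracking feature can be illustrated without the crate itself: a 1-based line number for any token is the count of newlines before the token's starting byte offset, plus one. A stdlib-only sketch (line_of is a hypothetical helper, not the crate's implementation):

```rust
// Sketch: derive a 1-based line number from a token's starting byte offset
// by counting the newlines that precede it in the source.
fn line_of(source: &str, byte_offset: usize) -> usize {
    source[..byte_offset].bytes().filter(|&b| b == b'\n').count() + 1
}

fn main() {
    let source = "LOADI R0, 10\nHALT";
    // "HALT" starts at byte offset 13, which is on line 2.
    assert_eq!(line_of(source, 13), 2);
    // "LOADI" starts at offset 0, on line 1.
    assert_eq!(line_of(source, 0), 1);
    println!("line tracking ok");
}
```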

Usage Examples

Basic Tokenization

use minichain_assembler::lexer::Lexer;

let source = "LOADI R0, 10";
let tokens: Vec<_> = Lexer::new(source)
    .map(|(token, _line)| token)
    .collect();

// tokens: [LoadI, Register(0), Comma, Number(10)]

With Line Numbers

use minichain_assembler::lexer::Lexer;

let source = r#"
LOADI R0, 10
ADD R1, R0, R0
HALT
"#;

for (token, line) in Lexer::new(source) {
    println!("Line {}: {:?}", line, token);
}

Handling Labels and Comments

use minichain_assembler::lexer::Lexer;

let source = r#"
main:           ; Entry point
    LOADI R0, 10
    HALT
"#;

let tokens: Vec<_> = Lexer::new(source)
    .map(|(token, _)| token)
    .collect();

// Comments are automatically stripped
// tokens: [Identifier("main"), Colon, LoadI, Register(0), Comma, Number(10), Halt]

Hexadecimal Numbers

use minichain_assembler::lexer::Lexer;

let source = "LOADI R0, 0xFF";
let tokens: Vec<_> = Lexer::new(source)
    .map(|(token, _)| token)
    .collect();

// tokens: [LoadI, Register(0), Comma, HexNumber(255)]
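
The conversion from the "0xFF" slice to the stored value 255 can be sketched with the standard library alone: strip the 0x prefix and parse the remaining digits in base 16. parse_hex is a hypothetical helper for illustration, not the crate's actual callback:

```rust
// Sketch: turn a hex literal slice like "0xFF" into its u64 value.
fn parse_hex(slice: &str) -> Option<u64> {
    let digits = slice.strip_prefix("0x")?;
    u64::from_str_radix(digits, 16).ok()
}

fn main() {
    assert_eq!(parse_hex("0xFF"), Some(255));
    assert_eq!(parse_hex("0x10"), Some(16));
    // No 0x prefix: not a hex literal under the documented pattern.
    assert_eq!(parse_hex("FF"), None);
    println!("hex ok");
}
```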

Register Range

use minichain_assembler::lexer::Lexer;

let source = "R0 R15 R16";
let tokens: Vec<_> = Lexer::new(source)
    .map(|(token, _)| token)
    .collect();

// R0-R15 are valid registers
// R16 is out of range and becomes an Identifier
// tokens: [Register(0), Register(15), Identifier("R16")]
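
The fallback behavior above can be sketched in stdlib-only Rust. The crate's actual rule is defined with logos pattern attributes; parse_register here is a hypothetical helper showing only the R0-R15 range check and the fall-through for out-of-range names:

```rust
// Sketch: accept R0-R15 (case-insensitive prefix) and reject anything else,
// letting a name like "R16" fall through to the identifier rule instead.
fn parse_register(slice: &str) -> Option<u8> {
    let digits = slice.strip_prefix('R').or_else(|| slice.strip_prefix('r'))?;
    match digits.parse::<u8>() {
        Ok(n) if n <= 15 => Some(n),
        _ => None, // e.g. "R16" would be lexed as Identifier("R16")
    }
}

fn main() {
    assert_eq!(parse_register("R0"), Some(0));
    assert_eq!(parse_register("r15"), Some(15));
    assert_eq!(parse_register("R16"), None);
    println!("register ok");
}
```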

Directives

use minichain_assembler::lexer::Lexer;

let source = ".entry main";
let tokens: Vec<_> = Lexer::new(source)
    .map(|(token, _)| token)
    .collect();

// tokens: [Directive("entry"), Identifier("main")]

Token Patterns

  • Instructions: case-insensitive keywords (HALT, NOP, ADD, etc.)
  • Registers: [Rr][0-9] or [Rr]1[0-5] (R0-R15)
  • Decimal Numbers: [0-9]+
  • Hex Numbers: 0x[0-9a-fA-F]+
  • Identifiers: [a-zA-Z_][a-zA-Z0-9_]*
  • Directives: \.[a-z]+
  • Comments: ;[^\n]* (skipped automatically)
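
The identifier pattern [a-zA-Z_][a-zA-Z0-9_]* can be checked without a regex engine: the first character must be an ASCII letter or underscore, and every later character an ASCII letter, digit, or underscore. is_identifier is a stdlib-only illustration of the pattern, not the crate's code:

```rust
// Sketch: check a slice against the documented identifier pattern
// [a-zA-Z_][a-zA-Z0-9_]* using only the standard library.
fn is_identifier(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {}
        _ => return false, // empty string or invalid first character
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}

fn main() {
    assert!(is_identifier("main"));
    assert!(is_identifier("_loop2"));
    assert!(!is_identifier("2fast")); // cannot start with a digit
    println!("identifier ok");
}
```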
