
Lexer & Parser Architecture

The Lexer and Parser form the front-end of the AXON compiler, transforming raw source text into a structured Abstract Syntax Tree (AST).
Source (.axon text) → [Lexer] → Token Stream → [Parser] → Cognitive AST

Lexer (axon/compiler/lexer.py)

Overview

The Lexer is a hand-written, single-pass scanner that converts AXON source code into a sequence of tokens. Key Features:
  • Keyword vs identifier discrimination via lookup table
  • String literals with escape sequences (\n, \t, \")
  • Numeric literals: integers, floats, durations (10s, 5m, 2h)
  • Multi-character operators: -> (arrow), .. (range)
  • Comment stripping: // (line) and /* */ (block)
  • Line/column tracking for precise error reporting

Implementation

class Lexer:
    """Tokenizes AXON source code into a stream of Token objects."""

    def __init__(self, source: str, filename: str = "<stdin>"):
        self._source = source
        self._filename = filename
        self._pos = 0
        self._line = 1
        self._column = 1
        self._tokens: list[Token] = []

    def tokenize(self) -> list[Token]:
        """Scan the entire source and return all tokens."""
        while not self._at_end():
            self._skip_whitespace()
            if self._at_end():
                break
            self._scan_token()
        self._tokens.append(Token(TokenType.EOF, "", self._line, self._column))
        return self._tokens
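The `Token` objects emitted above pair a `TokenType` with the raw lexeme and its 1-indexed source position. A minimal sketch of what those definitions might look like (the real enum in lexer.py defines many more members; this subset is illustrative):

```python
from dataclasses import dataclass
from enum import Enum, auto


class TokenType(Enum):
    # Illustrative subset; the real enum in lexer.py is much larger.
    PERSONA = auto()
    IDENTIFIER = auto()
    STRING = auto()
    ARROW = auto()
    EOF = auto()


@dataclass(frozen=True)
class Token:
    """A single lexeme: its type, raw text, and 1-indexed source position."""
    type: TokenType
    value: str
    line: int
    column: int
```

Making the dataclass frozen keeps tokens immutable and hashable, which is convenient for testing; for example, `Token(TokenType.IDENTIFIER, "LegalExpert", 1, 9)` matches the positional construction used in `tokenize` above.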

Token Types

Keywords (reserved identifiers):
KEYWORDS = {
    "persona": TokenType.PERSONA,
    "flow": TokenType.FLOW,
    "step": TokenType.STEP,
    "anchor": TokenType.ANCHOR,
    "reason": TokenType.REASON,
    "probe": TokenType.PROBE,
    "weave": TokenType.WEAVE,
    "validate": TokenType.VALIDATE,
    "run": TokenType.RUN,
    # ...
}
Operators & Delimiters:
  • {, }, (, ), [, ] — Structural delimiters
  • :, ,, ., ? — Punctuation
  • -> — Arrow (return type, action)
  • .. — Range operator
  • ==, !=, <, >, <=, >= — Comparisons
Literals:
  • STRING: "hello world" with escape support
  • INTEGER: 42, -7
  • FLOAT: 3.14, -0.5
  • DURATION: 10s, 5m, 2h, 1d
  • BOOL: true, false

Character-Level Scanning

def _scan_token(self) -> None:
    line = self._line
    col = self._column
    ch = self._advance()

    match ch:
        case "{":
            self._emit(TokenType.LBRACE, "{", line, col)
        case "}":
            self._emit(TokenType.RBRACE, "}", line, col)
        # ...
        case ".":
            if self._match("."):
                self._emit(TokenType.DOTDOT, "..", line, col)
            else:
                self._emit(TokenType.DOT, ".", line, col)
        case "-":
            if self._match(">"):
                self._emit(TokenType.ARROW, "->", line, col)
            # ...
        case '"':
            self._scan_string(line, col)
        case _:
            if ch.isdigit():
                self._scan_number(line, col, first_char=ch)
            elif ch.isalpha() or ch == "_":
                self._scan_identifier(line, col, first_char=ch)
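`_scan_token` leans on cursor helpers (`_advance`, `_match`, `_peek`, `_at_end`) that are not shown above. A minimal, self-contained sketch of how they might work, using the `_pos`/`_line`/`_column` fields from `__init__` (details may differ from the real lexer.py):

```python
class Cursor:
    """Illustrative stand-in for the Lexer's character cursor."""

    def __init__(self, source: str):
        self._source = source
        self._pos = 0
        self._line = 1
        self._column = 1

    def _at_end(self) -> bool:
        return self._pos >= len(self._source)

    def _peek(self) -> str:
        # Look at the current character without consuming it.
        return "" if self._at_end() else self._source[self._pos]

    def _advance(self) -> str:
        # Consume the current character, keeping line/column in sync.
        ch = self._source[self._pos]
        self._pos += 1
        if ch == "\n":
            self._line += 1
            self._column = 1
        else:
            self._column += 1
        return ch

    def _match(self, expected: str) -> bool:
        # Consume the current character only if it equals `expected`.
        # This is the maximal-munch step that turns "-" + ">" into ARROW.
        if self._peek() == expected:
            self._advance()
            return True
        return False
```

Centralizing line/column bookkeeping in `_advance` is what makes the precise error positions elsewhere on this page cheap: no other method ever touches the counters.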

Duration Literal Handling

def _scan_number(self, start_line: int, start_col: int, ...) -> None:
    # ... scan integer/float digits into raw ...
    
    # Check for duration suffix
    if not self._at_end() and self._peek().isalpha():
        suffix = ""
        while not self._at_end() and self._peek().isalpha():
            suffix += self._advance()
        
        if suffix in ("s", "ms", "m", "h", "d"):
            self._emit(TokenType.DURATION, raw + suffix, start_line, start_col)
            return
Supported units: s (seconds), ms (milliseconds), m (minutes), h (hours), d (days)
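Later stages need duration lexemes normalized to a single unit. A hedged sketch of such a conversion (the helper name `duration_to_seconds` is an assumption, not the actual AXON API):

```python
import re

# Unit multipliers in seconds; mirrors the lexer's supported suffixes.
_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def duration_to_seconds(literal: str) -> float:
    """Convert a DURATION lexeme such as '10s' or '2h' to seconds."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m|h|d)", literal)
    if m is None:
        raise ValueError(f"not a duration literal: {literal!r}")
    value, unit = m.groups()
    return float(value) * _UNITS[unit]
```

Note the alternation order in the pattern: `ms` must be tried before `m` and `s` so that `500ms` is not mis-split.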

Parser (axon/compiler/parser.py)

Overview

The Parser uses recursive descent to transform the token stream into a cognitive Abstract Syntax Tree. Design Principle: Zero mechanical nodes. The AST contains only cognitive concepts:
  • PersonaDefinition (not ClassDecl)
  • IntentNode (not FunctionCall)
  • ReasonChain (not ForLoop)
  • AnchorConstraint (not AssertStatement)
  • ProbeDirective (not SelectQuery)
  • WeaveNode (not JoinExpression)

Implementation

class Parser:
    """Recursive descent parser for the AXON language."""

    def __init__(self, tokens: list[Token]):
        self._tokens = tokens
        self._pos = 0

    def parse(self) -> ProgramNode:
        """Parse the full program → ProgramNode."""
        program = ProgramNode(line=1, column=1)
        while not self._check(TokenType.EOF):
            decl = self._parse_declaration()
            if decl is not None:
                program.declarations.append(decl)
        return program
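The snippets on this page use a handful of token-cursor primitives (`_current`, `_check`, `_advance`, `_consume`) that parser.py defines once. A minimal sketch of how they might look, with illustrative stand-ins for the token and error types (the real definitions live in lexer.py and the errors module):

```python
from dataclasses import dataclass
from enum import Enum, auto


# Illustrative stand-ins for the real lexer.py / error definitions.
class TokenType(Enum):
    FLOW = auto()
    IDENTIFIER = auto()
    EOF = auto()


@dataclass
class Token:
    type: TokenType
    value: str
    line: int
    column: int


class AxonParseError(Exception):
    def __init__(self, message, *, line=0, column=0, expected="", found=""):
        super().__init__(f"{message} at {line}:{column}: "
                         f"expected {expected}, found {found!r}")


class Parser:
    def __init__(self, tokens: list[Token]):
        self._tokens = tokens
        self._pos = 0

    def _current(self) -> Token:
        # The token under the cursor, without consuming it.
        return self._tokens[self._pos]

    def _check(self, token_type: TokenType) -> bool:
        # True if the current token has the given type.
        return self._current().type is token_type

    def _advance(self) -> Token:
        # Consume and return the current token; never step past EOF.
        tok = self._current()
        if tok.type is not TokenType.EOF:
            self._pos += 1
        return tok

    def _consume(self, token_type: TokenType) -> Token:
        # Consume a token of the expected type or raise a parse error.
        if not self._check(token_type):
            tok = self._current()
            raise AxonParseError(
                "Unexpected token",
                line=tok.line, column=tok.column,
                expected=token_type.name, found=tok.value,
            )
        return self._advance()
```

Everything else in the parser — declaration dispatch, flow bodies, step bodies — is expressed in terms of these four operations, which is what keeps the recursive descent code readable.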

Grammar Structure

The parser follows the AXON grammar hierarchy:
Program        → Declaration*
Declaration    → Persona | Context | Anchor | Memory | Tool | Type | Flow | Intent | Run
Flow           → "flow" Identifier "(" Parameters? ")" ("->" ReturnType)? "{" Step* "}"
Step           → "step" Identifier "{" StepBody "}"
StepBody       → ("given" | "ask" | "probe" | "reason" | "weave" | "use" | ...)*

Top-Level Declaration Dispatch

def _parse_declaration(self) -> ASTNode | None:
    tok = self._current()

    match tok.type:
        case TokenType.IMPORT:
            return self._parse_import()
        case TokenType.PERSONA:
            return self._parse_persona()
        case TokenType.CONTEXT:
            return self._parse_context()
        case TokenType.ANCHOR:
            return self._parse_anchor()
        case TokenType.FLOW:
            return self._parse_flow()
        case TokenType.RUN:
            return self._parse_run()
        case _:
            raise AxonParseError(
                "Unexpected token at top level",
                expected="declaration",
                found=tok.value
            )

Parsing Example: Flow Definition

def _parse_flow(self) -> FlowDefinition:
    tok = self._consume(TokenType.FLOW)
    name = self._consume(TokenType.IDENTIFIER)
    node = FlowDefinition(name=name.value, line=tok.line, column=tok.column)

    # Parameters: (param: Type, ...)
    self._consume(TokenType.LPAREN)
    if not self._check(TokenType.RPAREN):
        node.parameters = self._parse_param_list()
    self._consume(TokenType.RPAREN)

    # Optional return type: -> ReturnType
    if self._check(TokenType.ARROW):
        self._advance()
        node.return_type = self._parse_type_expr()

    # Body
    self._consume(TokenType.LBRACE)
    while not self._check(TokenType.RBRACE):
        step = self._parse_flow_step()
        if step is not None:
            node.body.append(step)
    self._consume(TokenType.RBRACE)

    return node

Parsing Cognitive Steps

def _parse_step(self) -> StepNode:
    tok = self._consume(TokenType.STEP)
    name = self._consume(TokenType.IDENTIFIER)
    node = StepNode(name=name.value, line=tok.line, column=tok.column)
    self._consume(TokenType.LBRACE)

    while not self._check(TokenType.RBRACE):
        inner = self._current()

        match inner.type:
            case TokenType.GIVEN:
                self._advance()
                self._consume(TokenType.COLON)
                node.given = self._parse_expression_string()
            case TokenType.ASK:
                self._advance()
                self._consume(TokenType.COLON)
                node.ask = self._consume(TokenType.STRING).value
            case TokenType.PROBE:
                node.probe = self._parse_probe()
            case TokenType.REASON:
                node.reason = self._parse_reason()
            # ...

    self._consume(TokenType.RBRACE)
    return node

AST Node Hierarchy

Base Node

@dataclass
class ASTNode:
    """Base class for all AXON AST nodes."""
    line: int = 0
    column: int = 0

Declaration Nodes

PersonaDefinition — Cognitive identity:
@dataclass
class PersonaDefinition(ASTNode):
    name: str = ""
    domain: list[str] = field(default_factory=list)
    tone: str = ""
    confidence_threshold: float | None = None
    cite_sources: bool | None = None
    refuse_if: list[str] = field(default_factory=list)
FlowDefinition — Cognitive pipeline:
@dataclass
class FlowDefinition(ASTNode):
    name: str = ""
    parameters: list[ParameterNode] = field(default_factory=list)
    return_type: TypeExprNode | None = None
    body: list[ASTNode] = field(default_factory=list)  # Steps
AnchorConstraint — Hard constraint:
@dataclass
class AnchorConstraint(ASTNode):
    name: str = ""
    require: str = ""
    reject: list[str] = field(default_factory=list)
    confidence_floor: float | None = None
    on_violation: str = ""  # raise | warn | fallback

Cognitive Step Nodes

ReasonChain — Explicit reasoning:
@dataclass
class ReasonChain(ASTNode):
    name: str = ""
    about: str = ""
    given: str | list[str] = ""
    depth: int = 1
    show_work: bool = False
    chain_of_thought: bool = False
    ask: str = ""
    output_type: str = ""
ProbeDirective — Targeted extraction:
@dataclass
class ProbeDirective(ASTNode):
    target: str = ""  # What to probe
    fields: list[str] = field(default_factory=list)  # What to extract
WeaveNode — Semantic synthesis:
@dataclass
class WeaveNode(ASTNode):
    sources: list[str] = field(default_factory=list)
    target: str = ""
    format_type: str = ""
    priority: list[str] = field(default_factory=list)

Error Handling

Lexer Errors

raise AxonLexerError(
    "Unterminated string",
    line=start_line,
    column=start_col,
)

Parser Errors

raise AxonParseError(
    "Unexpected token",
    line=tok.line,
    column=tok.column,
    expected="step, probe, reason, validate",
    found=tok.value,
)
Both error types track line and column for precise diagnostics.
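A hedged sketch of what these exception classes might look like, given the keyword arguments used above (the real definitions in the AXON codebase may differ):

```python
class AxonError(Exception):
    """Base class for AXON front-end diagnostics with source positions."""

    def __init__(self, message: str, *, line: int = 0, column: int = 0):
        self.line = line
        self.column = column
        super().__init__(f"{message} ({line}:{column})")


class AxonLexerError(AxonError):
    """Raised for malformed lexemes, e.g. an unterminated string."""


class AxonParseError(AxonError):
    """Raised when the token stream does not match the grammar."""

    def __init__(self, message: str, *, line: int = 0, column: int = 0,
                 expected: str = "", found: str = ""):
        self.expected = expected
        self.found = found
        if expected:
            message = f"{message}: expected {expected}, found {found!r}"
        super().__init__(message, line=line, column=column)
```

A shared base class lets callers catch all front-end diagnostics with one `except AxonError` while still formatting lexer and parser failures differently.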

Example: Parsing a Persona

Input Source:
persona LegalExpert {
  domain: ["contract law", "IP"]
  tone: precise
  confidence_threshold: 0.85
  cite_sources: true
}
Lexer Output (token stream):
PERSONA("persona", 1:1)
IDENTIFIER("LegalExpert", 1:9)
LBRACE("{", 1:21)
IDENTIFIER("domain", 2:3)
COLON(":", 2:9)
LBRACKET("[", 2:11)
STRING("contract law", 2:12)
COMMA(",", 2:26)
...
Parser Output (AST):
PersonaDefinition(
    name="LegalExpert",
    domain=["contract law", "IP"],
    tone="precise",
    confidence_threshold=0.85,
    cite_sources=True,
    line=1,
    column=1
)

Next Steps

Type Checker

Learn how epistemic types are validated

AST to IR

See how the AST is lowered to model-agnostic IR
