Tokenization is the first step in parsing Lua code with Full Moon. The tokenizer (also called a lexer) converts a stream of characters into a sequence of meaningful tokens while preserving all formatting information.

What is a Token?

From src/tokenizer/structs.rs:427-467, a token represents a single meaningful unit of code:
/// A token consisting of its [`Position`] and a [`TokenType`]
pub struct Token {
    pub(crate) start_position: Position,
    pub(crate) end_position: Position,
    pub(crate) token_type: TokenType,
}

impl Token {
    /// The position a token begins at
    pub fn start_position(&self) -> Position
    
    /// The position a token ends at
    pub fn end_position(&self) -> Position
    
    /// The type of token as well as the data needed to represent it
    pub fn token_type(&self) -> &TokenType
    
    /// The kind of token with no additional data
    pub fn token_kind(&self) -> TokenKind
}

Token Types

From src/tokenizer/structs.rs:237-323, Full Moon defines the full set of token types. For example, identifiers:
TokenType::Identifier {
    identifier: ShortString,
}
// Examples: foo, myVariable, _internal

Symbols

From src/tokenizer/structs.rs:94-179, symbols represent keywords and operators:
/// A literal symbol, used for both words important to syntax 
/// (like while) and operators (like +)
pub enum Symbol {
    // Keywords
    And, Break, Do, Else, ElseIf, End, False, For,
    Function, If, In, Local, Nil, Not, Or, Repeat,
    Return, Then, True, Until, While,
    
    // Operators
    Plus,           // +
    Minus,          // -
    Star,           // *
    Slash,          // /
    Percent,        // %
    Caret,          // ^
    Hash,           // #
    TwoEqual,       // ==
    TildeEqual,     // ~=
    LessThanEqual,  // <=
    GreaterThanEqual, // >=
    LessThan,       // <
    GreaterThan,    // >
    
    // Delimiters
    LeftParen,      // (
    RightParen,     // )
    LeftBrace,      // {
    RightBrace,     // }
    LeftBracket,    // [
    RightBracket,   // ]
    
    // Punctuation
    Semicolon,      // ;
    Colon,          // :
    Comma,          // ,
    Dot,            // .
    TwoDots,        // ..
    Ellipsis,       // ...
    Equal,          // =
}
Symbols are feature-gated. For example, Symbol::Goto is only available with the lua52 or luajit feature flags.
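Enabling a version-specific symbol set is done through Cargo features. A sketch of the dependency entry, assuming the published full_moon crate; check crates.io for the current version and full feature list:

```toml
[dependencies]
# `lua52` gates symbols such as Symbol::Goto; `luajit` is another option.
full_moon = { version = "1", features = ["lua52"] }
```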

TokenReference: The Key to Losslessness

From src/tokenizer/structs.rs:586-604, TokenReference is what makes Full Moon lossless:
/// A reference to a token used by ASTs.
/// Dereferences to a [`Token`]
pub struct TokenReference {
    pub(crate) leading_trivia: Vec<Token>,
    pub(crate) token: Token,
    pub(crate) trailing_trivia: Vec<Token>,
}

impl TokenReference {
    /// Returns the inner token
    pub fn token(&self) -> &Token
    
    /// Returns the leading trivia
    pub fn leading_trivia(&self) -> impl Iterator<Item = &Token>
    
    /// Returns the trailing trivia
    pub fn trailing_trivia(&self) -> impl Iterator<Item = &Token>
}

What is Trivia?

Trivia refers to tokens that don’t affect program semantics:
  • Whitespace (spaces, tabs, newlines)
  • Comments (single-line and multi-line)
  • Shebang lines (#!/usr/bin/env lua)
From src/tokenizer/structs.rs:324-346:
impl TokenType {
    /// Returns whether a token can be practically ignored in most cases
    /// Comments and whitespace will return `true`
    pub fn is_trivia(&self) -> bool {
        matches!(
            self,
            TokenType::Shebang { .. }
                | TokenType::SingleLineComment { .. }
                | TokenType::MultiLineComment { .. }
                | TokenType::Whitespace { .. }
        )
    }
}

Trivia Attachment

Leading trivia is attached to the meaningful token that follows it, while trivia on the same line after a token (such as an inline comment) is attached to that token as trailing trivia:
-- This is a comment
local x = 1  -- inline comment
Tokenizes as:
  • Token: local with leading trivia "-- This is a comment\n"
  • Token: x
  • Token: =
  • Token: 1 with trailing trivia " -- inline comment"
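The attachment rule itself can be modeled with plain Rust types. This is a simplified sketch with illustrative names (`SimpleToken`, `Attached`), not Full Moon's actual API, and it only shows the leading-trivia half of the rule:

```rust
// Simplified model of trivia attachment; not Full Moon's real types.
#[derive(Debug, Clone)]
struct SimpleToken {
    text: String,
    is_trivia: bool,
}

#[derive(Debug)]
struct Attached {
    leading: Vec<String>,
    token: String,
}

// Collect trivia and attach it as leading trivia of the next
// meaningful token, mirroring how TokenReference stores it.
fn attach_leading(tokens: &[SimpleToken]) -> Vec<Attached> {
    let mut pending = Vec::new();
    let mut out = Vec::new();
    for t in tokens {
        if t.is_trivia {
            pending.push(t.text.clone());
        } else {
            out.push(Attached {
                leading: std::mem::take(&mut pending),
                token: t.text.clone(),
            });
        }
    }
    out
}

fn main() {
    let tokens = vec![
        SimpleToken { text: "-- This is a comment\n".into(), is_trivia: true },
        SimpleToken { text: "local".into(), is_trivia: false },
        SimpleToken { text: " ".into(), is_trivia: true },
        SimpleToken { text: "x".into(), is_trivia: false },
    ];
    let attached = attach_leading(&tokens);
    // `local` carries the comment as leading trivia; `x` carries the space.
    assert_eq!(attached[0].leading, vec!["-- This is a comment\n"]);
    assert_eq!(attached[1].token, "x");
}
```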

Position Tracking

From src/tokenizer/structs.rs:852-887, every token tracks its exact position:
/// Used to represent exact positions of tokens in code
pub struct Position {
    pub(crate) bytes: usize,
    pub(crate) line: usize,
    pub(crate) character: usize,
}

impl Position {
    /// How many bytes, ignoring lines, it would take to find this position
    pub fn bytes(self) -> usize {
        self.bytes
    }
    
    /// Index of the character on the line for this position
    pub fn character(self) -> usize {
        self.character
    }
    
    /// Line the position lies on
    pub fn line(self) -> usize {
        self.line
    }
}
Example:
local x = 1
print(x)
  • local: Position { bytes: 0, line: 1, character: 1 }
  • x (line 1): Position { bytes: 6, line: 1, character: 7 }
  • print: Position { bytes: 12, line: 2, character: 1 }
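The byte/line/character bookkeeping can be sketched as a simple scanner over the source. This is an illustration of the arithmetic only (a word scanner, not Full Moon's lexer); note that bytes are 0-indexed while line and character are 1-indexed, matching the example above:

```rust
// Simplified position tracker mirroring Full Moon's Position fields
// (bytes are 0-indexed; line and character are 1-indexed).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Pos {
    bytes: usize,
    line: usize,
    character: usize,
}

// Return each word-like token in `source` with its start position.
fn word_positions(source: &str) -> Vec<(String, Pos)> {
    let mut out = Vec::new();
    let mut pos = Pos { bytes: 0, line: 1, character: 1 };
    let mut current: Option<(String, Pos)> = None;
    for ch in source.chars() {
        if ch.is_alphanumeric() || ch == '_' {
            current.get_or_insert((String::new(), pos)).0.push(ch);
        } else if let Some(word) = current.take() {
            out.push(word);
        }
        // Advance the position past this character.
        pos.bytes += ch.len_utf8();
        if ch == '\n' {
            pos.line += 1;
            pos.character = 1;
        } else {
            pos.character += 1;
        }
    }
    if let Some(word) = current.take() {
        out.push(word);
    }
    out
}

fn main() {
    let positions = word_positions("local x = 1\nprint(x)");
    assert_eq!(positions[0], ("local".into(), Pos { bytes: 0, line: 1, character: 1 }));
    assert_eq!(positions[1], ("x".into(), Pos { bytes: 6, line: 1, character: 7 }));
    assert_eq!(positions[3], ("print".into(), Pos { bytes: 12, line: 2, character: 1 }));
}
```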

Using the Tokenizer

Direct Tokenization

use full_moon::tokenizer::{Lexer, LuaVersion};

let code = "local x = 1";
let mut lexer = Lexer::new_lazy(code, LuaVersion::new());

while let Some(result) = lexer.process_next() {
    match result {
        LexerResult::Ok(token) => {
            println!("Token: {:?} at {:?}",
                token.token_kind(),
                token.start_position());
        }
        LexerResult::Fatal(errors) => {
            eprintln!("Tokenization failed: {:?}", errors);
            break;
        }
        LexerResult::Recovered(token, errors) => {
            println!("Recovered with token: {:?}", token);
            eprintln!("Errors: {:?}", errors);
        }
    }
}

Via Parsing

Typically, you don’t use the tokenizer directly. The parser handles it:
use full_moon::parse;

let ast = parse("local x = 1")?;
// Tokenization happened automatically

String Literal Types

From src/tokenizer/structs.rs:889-918, Lua supports multiple string formats:
pub enum StringLiteralQuoteType {
    /// Strings formatted [[with brackets]]
    Brackets,
    /// Strings formatted "with double quotes"
    Double,
    /// Strings formatted 'with single quotes'
    Single,
}
Examples:
local a = "double quoted"
local b = 'single quoted'
local c = [[bracket string]]
local d = [=[
  multi-line with depth
]=]
The multi_line_depth field tracks the number of = signs:
  • [[string]] → depth 0
  • [=[string]=] → depth 1
  • [==[string]==] → depth 2
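Detecting the depth from an opening long bracket is a matter of counting the `=` signs between the two `[` characters. A standalone helper, not part of Full Moon's public API:

```rust
// Count the `=` signs between `[` and `[` in a long-bracket opener.
// Returns None if `source` does not start with a long-bracket opener.
fn long_bracket_depth(source: &str) -> Option<usize> {
    let rest = source.strip_prefix('[')?;
    let depth = rest.chars().take_while(|&c| c == '=').count();
    if rest[depth..].starts_with('[') {
        Some(depth)
    } else {
        None
    }
}

fn main() {
    assert_eq!(long_bracket_depth("[[string]]"), Some(0));
    assert_eq!(long_bracket_depth("[=[string]=]"), Some(1));
    assert_eq!(long_bracket_depth("[==[string]==]"), Some(2));
    // `[1]` is an index expression, not a long bracket.
    assert_eq!(long_bracket_depth("[1]"), None);
}
```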

Number Formats

The tokenizer preserves the exact text representation of numbers:
local a = 42        -- text: "42"
local b = 3.14      -- text: "3.14"
local c = 0xFF      -- text: "0xFF"
local d = 1e-10     -- text: "1e-10"
local e = 0b1010    -- text: "0b1010" (Luau only)
This allows formatters to preserve the programmer’s choice of number representation.
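The value of keeping the text can be seen by comparing it with reprinting from the parsed value alone. A minimal sketch with a stand-in struct (`NumberToken` is illustrative, not Full Moon's type):

```rust
// A number token that keeps both the parsed value and the exact
// source text, as Full Moon's number token type does with `text`.
struct NumberToken {
    value: f64,
    text: String,
}

fn main() {
    let token = NumberToken {
        value: 255.0,
        text: "0xFF".to_string(),
    };
    // Reprinting from the value alone loses the programmer's spelling...
    assert_eq!(token.value.to_string(), "255");
    // ...while reprinting the preserved text round-trips exactly.
    assert_eq!(token.text, "0xFF");
}
```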

Comment Formats

Single-line comments:
-- This is a comment
Multi-line comments use long brackets, and are tokenized as multi-line comments even when they fit on a single line:
--[[ bracket comment on one line ]]

--[[
  Multi-line comment
  with depth 0
]]

--[==[
  Multi-line comment
  with depth 2
]==]
The blocks field stores the depth (number of = signs).

Error Handling

From src/tokenizer/structs.rs:199-230, tokenization can fail with:
pub enum TokenizerErrorType {
    /// An unclosed multi-line comment was found
    UnclosedComment,
    
    /// An unclosed string was found
    UnclosedString,
    
    /// An invalid number was found
    InvalidNumber,
    
    /// An unexpected token was found
    UnexpectedToken(char),
    
    /// Symbol passed is not valid
    InvalidSymbol(String),
}
Example:
use full_moon::tokenizer::{Lexer, LuaVersion};

let bad_code = "local x = \"unclosed string";
let mut lexer = Lexer::new_lazy(bad_code, LuaVersion::new());

while let Some(result) = lexer.process_next() {
    if let LexerResult::Fatal(errors) = result {
        for error in errors {
            println!("Error: {}", error);
            // Prints: "unclosed string (line:1, char:11)"
        }
    }
}

Creating Tokens Programmatically

From src/tokenizer/structs.rs:606-738, you can create tokens for code generation:
use full_moon::tokenizer::{TokenReference, Symbol};

// Create a symbol with whitespace
let return_token = TokenReference::symbol("return ")?;
assert_eq!(return_token.token().token_type(), &TokenType::Symbol {
    symbol: Symbol::Return,
});

// Leading trivia: none
// Token: Symbol::Return
// Trailing trivia: one space
Use TokenReference::symbol() to create tokens with proper trivia parsing. The input string can include leading and trailing whitespace.

Luau Extensions

When the luau feature is enabled, additional tokens are available:
#[cfg(feature = "luau")]
TokenType::InterpolatedString {
    literal: ShortString,
    kind: InterpolatedStringKind,
}

pub enum InterpolatedStringKind {
    Begin,   // `start{
    Middle,  // }middle{
    End,     // }end`
    Simple,  // `simple`
}
Example:
local name = "World"
local greeting = `Hello, {name}!`
Tokenizes as:
  • InterpolatedString(Begin): `Hello,
  • Expression tokens: name
  • InterpolatedString(End): !`
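The Begin/Middle/End/Simple split can be modeled by scanning the string's contents for braces. This is a simplified sketch; real Luau interpolation also handles escapes and nested braces, which this ignores:

```rust
// Simplified classification of interpolated-string literal segments,
// mirroring InterpolatedStringKind. Escapes and nesting are ignored.
#[derive(Debug, PartialEq)]
enum Kind {
    Begin,
    Middle,
    End,
    Simple,
}

// Split the inside of a backtick string into its literal pieces.
fn literal_segments(inner: &str) -> Vec<(Kind, String)> {
    let parts: Vec<&str> = inner.split(|c| c == '{' || c == '}').collect();
    if parts.len() == 1 {
        // No braces at all: a single Simple segment.
        return vec![(Kind::Simple, parts[0].to_string())];
    }
    // Even indices hold literal text; odd indices hold expressions.
    let literals: Vec<&str> = parts.into_iter().step_by(2).collect();
    let last = literals.len() - 1;
    literals
        .into_iter()
        .enumerate()
        .map(|(i, text)| {
            let kind = match i {
                0 => Kind::Begin,
                i if i == last => Kind::End,
                _ => Kind::Middle,
            };
            (kind, text.to_string())
        })
        .collect()
}

fn main() {
    let segs = literal_segments("Hello, {name}!");
    assert_eq!(segs[0], (Kind::Begin, "Hello, ".to_string()));
    assert_eq!(segs[1], (Kind::End, "!".to_string()));
    assert_eq!(
        literal_segments("simple"),
        vec![(Kind::Simple, "simple".to_string())]
    );
}
```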

Performance Considerations

Full Moon’s tokenizer is designed for efficiency:
  1. Lazy evaluation: Use Lexer::new_lazy() to tokenize on-demand
  2. Zero-copy: Uses ShortString to avoid allocations for small strings
  3. Position tracking: Efficient byte/line/character updates
ShortString is a small string optimization that stores strings up to 23 bytes inline without heap allocation.

Token Display

From src/tokenizer/structs.rs:469-524, tokens can convert back to source:
impl fmt::Display for Token {
    fn fmt(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
        match self.token_type() {
            TokenType::Number { text } => text.fmt(formatter),
            TokenType::Identifier { identifier } => identifier.fmt(formatter),
            TokenType::StringLiteral { literal, multi_line_depth, quote_type } => {
                if *quote_type == StringLiteralQuoteType::Brackets {
                    write!(formatter, "[{0}[{1}]{0}]", 
                        "=".repeat(*multi_line_depth), literal)
                } else {
                    write!(formatter, "{0}{1}{0}", quote_type, literal)
                }
            }
            // ... other types
        }
    }
}
This ensures perfect round-tripping from code → tokens → code.
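The bracket-string arm above can be exercised in isolation. A standalone sketch using the same format string, so the depth and literal round-trip back to source text:

```rust
// Reconstruct a bracket string's source text from its literal and
// depth, using the same format as the Display arm shown above.
fn bracket_string_source(literal: &str, multi_line_depth: usize) -> String {
    format!("[{0}[{1}]{0}]", "=".repeat(multi_line_depth), literal)
}

fn main() {
    assert_eq!(bracket_string_source("hello", 0), "[[hello]]");
    assert_eq!(bracket_string_source("hello", 2), "[==[hello]==]");
}
```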

Next Steps

  • AST Structure: learn how tokens become AST nodes
  • Lossless Parsing: understand the complete lossless parsing system
