Tokenization is the first step in parsing Lua code with Full Moon. The tokenizer (also called a lexer) converts a stream of characters into a sequence of meaningful tokens while preserving all formatting information.

What is a Token?

From src/tokenizer/structs.rs:427-467, a token represents a single meaningful unit of code:
/// A token consisting of its [`Position`] and a [`TokenType`]
pub struct Token {
    pub(crate) start_position: Position,
    pub(crate) end_position: Position,
    pub(crate) token_type: TokenType,
}

impl Token {
    /// The position a token begins at
    pub fn start_position(&self) -> Position
    
    /// The position a token ends at
    pub fn end_position(&self) -> Position
    
    /// The type of token as well as the data needed to represent it
    pub fn token_type(&self) -> &TokenType
    
    /// The kind of token with no additional data
    pub fn token_kind(&self) -> TokenKind
}

Token Types

From src/tokenizer/structs.rs:237-323, Full Moon defines the full set of token types. For example, identifiers:
TokenType::Identifier {
    identifier: ShortString,
}
// Examples: foo, myVariable, _internal

Symbols

From src/tokenizer/structs.rs:94-179, symbols represent keywords and operators:
/// A literal symbol, used for both words important to syntax 
/// (like while) and operators (like +)
pub enum Symbol {
    // Keywords
    And, Break, Do, Else, ElseIf, End, False, For,
    Function, If, In, Local, Nil, Not, Or, Repeat,
    Return, Then, True, Until, While,
    
    // Operators
    Plus,           // +
    Minus,          // -
    Star,           // *
    Slash,          // /
    Percent,        // %
    Caret,          // ^
    Hash,           // #
    TwoEqual,       // ==
    TildeEqual,     // ~=
    LessThanEqual,  // <=
    GreaterThanEqual, // >=
    LessThan,       // <
    GreaterThan,    // >
    
    // Delimiters
    LeftParen,      // (
    RightParen,     // )
    LeftBrace,      // {
    RightBrace,     // }
    LeftBracket,    // [
    RightBracket,   // ]
    
    // Punctuation
    Semicolon,      // ;
    Colon,          // :
    Comma,          // ,
    Dot,            // .
    TwoDots,        // ..
    Ellipsis,       // ...
    Equal,          // =
}
Symbols are feature-gated. For example, Symbol::Goto is only available with the lua52 or luajit feature flags.
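Enabling a version-specific symbol set is done through Cargo features. A sketch of the dependency entry, assuming the published full_moon crate; check crates.io for the current version and full feature list:

```toml
[dependencies]
# `lua52` gates symbols such as Symbol::Goto; `luajit` is another option.
full_moon = { version = "1", features = ["lua52"] }
```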

TokenReference: The Key to Losslessness

From src/tokenizer/structs.rs:586-604, TokenReference is what makes Full Moon lossless:
/// A reference to a token used by ASTs.
/// Dereferences to a [`Token`]
pub struct TokenReference {
    pub(crate) leading_trivia: Vec<Token>,
    pub(crate) token: Token,
    pub(crate) trailing_trivia: Vec<Token>,
}

impl TokenReference {
    /// Returns the inner token
    pub fn token(&self) -> &Token
    
    /// Returns the leading trivia
    pub fn leading_trivia(&self) -> impl Iterator<Item = &Token>
    
    /// Returns the trailing trivia
    pub fn trailing_trivia(&self) -> impl Iterator<Item = &Token>
}

What is Trivia?

Trivia refers to tokens that don’t affect program semantics:
  • Whitespace (spaces, tabs, newlines)
  • Comments (single-line and multi-line)
  • Shebang lines (#!/usr/bin/env lua)
From src/tokenizer/structs.rs:324-346:
impl TokenType {
    /// Returns whether a token can be practically ignored in most cases
    /// Comments and whitespace will return `true`
    pub fn is_trivia(&self) -> bool {
        matches!(
            self,
            TokenType::Shebang { .. }
                | TokenType::SingleLineComment { .. }
                | TokenType::MultiLineComment { .. }
                | TokenType::Whitespace { .. }
        )
    }
}

Trivia Attachment

Leading trivia is attached to the meaningful token that follows it, while trivia on the same line after a token (such as an inline comment) is attached to that token as trailing trivia:
-- This is a comment
local x = 1  -- inline comment
Tokenizes as:
  • Token: local with leading trivia "-- This is a comment\n"
  • Token: x
  • Token: =
  • Token: 1 with trailing trivia " -- inline comment"
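The attachment rule itself can be modeled with plain Rust types. This is a simplified sketch with illustrative names (`SimpleToken`, `Attached`), not Full Moon's actual API, and it only shows the leading-trivia half of the rule:

```rust
// Simplified model of trivia attachment; not Full Moon's real types.
#[derive(Debug, Clone)]
struct SimpleToken {
    text: String,
    is_trivia: bool,
}

#[derive(Debug)]
struct Attached {
    leading: Vec<String>,
    token: String,
}

// Collect trivia and attach it as leading trivia of the next
// meaningful token, mirroring how TokenReference stores it.
fn attach_leading(tokens: &[SimpleToken]) -> Vec<Attached> {
    let mut pending = Vec::new();
    let mut out = Vec::new();
    for t in tokens {
        if t.is_trivia {
            pending.push(t.text.clone());
        } else {
            out.push(Attached {
                leading: std::mem::take(&mut pending),
                token: t.text.clone(),
            });
        }
    }
    out
}

fn main() {
    let tokens = vec![
        SimpleToken { text: "-- This is a comment\n".into(), is_trivia: true },
        SimpleToken { text: "local".into(), is_trivia: false },
        SimpleToken { text: " ".into(), is_trivia: true },
        SimpleToken { text: "x".into(), is_trivia: false },
    ];
    let attached = attach_leading(&tokens);
    // `local` carries the comment as leading trivia; `x` carries the space.
    assert_eq!(attached[0].leading, vec!["-- This is a comment\n"]);
    assert_eq!(attached[1].token, "x");
}
```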

Position Tracking

From src/tokenizer/structs.rs:852-887, every token tracks its exact position:
/// Used to represent exact positions of tokens in code
pub struct Position {
    pub(crate) bytes: usize,
    pub(crate) line: usize,
    pub(crate) character: usize,
}

impl Position {
    /// How many bytes, ignoring lines, it would take to find this position
    pub fn bytes(self) -> usize {
        self.bytes
    }
    
    /// Index of the character on the line for this position
    pub fn character(self) -> usize {
        self.character
    }
    
    /// Line the position lies on
    pub fn line(self) -> usize {
        self.line
    }
}
Example:
local x = 1
print(x)
  • local: Position { bytes: 0, line: 1, character: 1 }
  • x (line 1): Position { bytes: 6, line: 1, character: 7 }
  • print: Position { bytes: 12, line: 2, character: 1 }
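The byte/line/character bookkeeping can be sketched as a simple scanner over the source. This is an illustration of the arithmetic only (a word scanner, not Full Moon's lexer); note that bytes are 0-indexed while line and character are 1-indexed, matching the example above:

```rust
// Simplified position tracker mirroring Full Moon's Position fields
// (bytes are 0-indexed; line and character are 1-indexed).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Pos {
    bytes: usize,
    line: usize,
    character: usize,
}

// Return each word-like token in `source` with its start position.
fn word_positions(source: &str) -> Vec<(String, Pos)> {
    let mut out = Vec::new();
    let mut pos = Pos { bytes: 0, line: 1, character: 1 };
    let mut current: Option<(String, Pos)> = None;
    for ch in source.chars() {
        if ch.is_alphanumeric() || ch == '_' {
            current.get_or_insert((String::new(), pos)).0.push(ch);
        } else if let Some(word) = current.take() {
            out.push(word);
        }
        // Advance the position past this character.
        pos.bytes += ch.len_utf8();
        if ch == '\n' {
            pos.line += 1;
            pos.character = 1;
        } else {
            pos.character += 1;
        }
    }
    if let Some(word) = current.take() {
        out.push(word);
    }
    out
}

fn main() {
    let positions = word_positions("local x = 1\nprint(x)");
    assert_eq!(positions[0], ("local".into(), Pos { bytes: 0, line: 1, character: 1 }));
    assert_eq!(positions[1], ("x".into(), Pos { bytes: 6, line: 1, character: 7 }));
    assert_eq!(positions[3], ("print".into(), Pos { bytes: 12, line: 2, character: 1 }));
}
```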

Using the Tokenizer

Direct Tokenization

use full_moon::tokenizer::{Lexer, LuaVersion};

let code = "local x = 1";
let mut lexer = Lexer::new_lazy(code, LuaVersion::new());

while let Some(result) = lexer.process_next() {
    match result {
        LexerResult::Ok(token) => {
            println!("Token: {:?} at {:?}",
                token.token_kind(),
                token.start_position());
        }
        LexerResult::Fatal(errors) => {
            eprintln!("Tokenization failed: {:?}", errors);
            break;
        }
        LexerResult::Recovered(token, errors) => {
            println!("Recovered with token: {:?}", token);
            eprintln!("Errors: {:?}", errors);
        }
    }
}

Via Parsing

Typically, you don’t use the tokenizer directly. The parser handles it:
use full_moon::parse;

let ast = parse("local x = 1")?;
// Tokenization happened automatically

String Literal Types

From src/tokenizer/structs.rs:889-918, Lua supports multiple string formats:
pub enum StringLiteralQuoteType {
    /// Strings formatted [[with brackets]]
    Brackets,
    /// Strings formatted "with double quotes"
    Double,
    /// Strings formatted 'with single quotes'
    Single,
}
Examples:
local a = "double quoted"
local b = 'single quoted'
local c = [[bracket string]]
local d = [=[
  multi-line with depth
]=]
The multi_line_depth field tracks the number of = signs:
  • [[string]] → depth 0
  • [=[string]=] → depth 1
  • [==[string]==] → depth 2
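Detecting the depth from an opening long bracket is a matter of counting the `=` signs between the two `[` characters. A standalone helper, not part of Full Moon's public API:

```rust
// Count the `=` signs between `[` and `[` in a long-bracket opener.
// Returns None if `source` does not start with a long-bracket opener.
fn long_bracket_depth(source: &str) -> Option<usize> {
    let rest = source.strip_prefix('[')?;
    let depth = rest.chars().take_while(|&c| c == '=').count();
    if rest[depth..].starts_with('[') {
        Some(depth)
    } else {
        None
    }
}

fn main() {
    assert_eq!(long_bracket_depth("[[string]]"), Some(0));
    assert_eq!(long_bracket_depth("[=[string]=]"), Some(1));
    assert_eq!(long_bracket_depth("[==[string]==]"), Some(2));
    // `[1]` is an index expression, not a long bracket.
    assert_eq!(long_bracket_depth("[1]"), None);
}
```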

Number Formats

The tokenizer preserves the exact text representation of numbers:
local a = 42        -- text: "42"
local b = 3.14      -- text: "3.14"
local c = 0xFF      -- text: "0xFF"
local d = 1e-10     -- text: "1e-10"
local e = 0b1010    -- text: "0b1010" (Luau only)
This allows formatters to preserve the programmer’s choice of number representation.
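The value of keeping the text can be seen by comparing it with reprinting from the parsed value alone. A minimal sketch with a stand-in struct (`NumberToken` is illustrative, not Full Moon's type):

```rust
// A number token that keeps both the parsed value and the exact
// source text, as Full Moon's number token type does with `text`.
struct NumberToken {
    value: f64,
    text: String,
}

fn main() {
    let token = NumberToken {
        value: 255.0,
        text: "0xFF".to_string(),
    };
    // Reprinting from the value alone loses the programmer's spelling...
    assert_eq!(token.value.to_string(), "255");
    // ...while reprinting the preserved text round-trips exactly.
    assert_eq!(token.text, "0xFF");
}
```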

Comment Formats

Single-line comments:
-- This is a comment
Multi-line comments use long brackets, and are tokenized as multi-line comments even when they fit on a single line:
--[[ bracket comment on one line ]]

--[[
  Multi-line comment
  with depth 0
]]

--[==[
  Multi-line comment
  with depth 2
]==]
The blocks field stores the depth (number of = signs).

Error Handling

From src/tokenizer/structs.rs:199-230, tokenization can fail with:
pub enum TokenizerErrorType {
    /// An unclosed multi-line comment was found
    UnclosedComment,
    
    /// An unclosed string was found
    UnclosedString,
    
    /// An invalid number was found
    InvalidNumber,
    
    /// An unexpected token was found
    UnexpectedToken(char),
    
    /// Symbol passed is not valid
    InvalidSymbol(String),
}
Example:
use full_moon::tokenizer::{Lexer, LuaVersion};

let bad_code = "local x = \"unclosed string";
let mut lexer = Lexer::new_lazy(bad_code, LuaVersion::new());

while let Some(result) = lexer.process_next() {
    if let LexerResult::Fatal(errors) = result {
        for error in errors {
            println!("Error: {}", error);
            // Prints: "unclosed string (line:1, char:11)"
        }
    }
}

Creating Tokens Programmatically

From src/tokenizer/structs.rs:606-738, you can create tokens for code generation:
use full_moon::tokenizer::{TokenReference, Symbol};

// Create a symbol with whitespace
let return_token = TokenReference::symbol("return ")?;
assert_eq!(return_token.token().token_type(), &TokenType::Symbol {
    symbol: Symbol::Return,
});

// Leading trivia: none
// Token: Symbol::Return
// Trailing trivia: one space
Use TokenReference::symbol() to create tokens with proper trivia parsing. The input string can include leading and trailing whitespace.

Luau Extensions

When the luau feature is enabled, additional tokens are available:
#[cfg(feature = "luau")]
TokenType::InterpolatedString {
    literal: ShortString,
    kind: InterpolatedStringKind,
}

pub enum InterpolatedStringKind {
    Begin,   // `start{
    Middle,  // }middle{
    End,     // }end`
    Simple,  // `simple`
}
Example:
local name = "World"
local greeting = `Hello, {name}!`
Tokenizes as:
  • InterpolatedString(Begin): `Hello,
  • Expression tokens: name
  • InterpolatedString(End): !`
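The Begin/Middle/End/Simple split can be modeled by scanning the string's contents for braces. This is a simplified sketch; real Luau interpolation also handles escapes and nested braces, which this ignores:

```rust
// Simplified classification of interpolated-string literal segments,
// mirroring InterpolatedStringKind. Escapes and nesting are ignored.
#[derive(Debug, PartialEq)]
enum Kind {
    Begin,
    Middle,
    End,
    Simple,
}

// Split the inside of a backtick string into its literal pieces.
fn literal_segments(inner: &str) -> Vec<(Kind, String)> {
    let parts: Vec<&str> = inner.split(|c| c == '{' || c == '}').collect();
    if parts.len() == 1 {
        // No braces at all: a single Simple segment.
        return vec![(Kind::Simple, parts[0].to_string())];
    }
    // Even indices hold literal text; odd indices hold expressions.
    let literals: Vec<&str> = parts.into_iter().step_by(2).collect();
    let last = literals.len() - 1;
    literals
        .into_iter()
        .enumerate()
        .map(|(i, text)| {
            let kind = match i {
                0 => Kind::Begin,
                i if i == last => Kind::End,
                _ => Kind::Middle,
            };
            (kind, text.to_string())
        })
        .collect()
}

fn main() {
    let segs = literal_segments("Hello, {name}!");
    assert_eq!(segs[0], (Kind::Begin, "Hello, ".to_string()));
    assert_eq!(segs[1], (Kind::End, "!".to_string()));
    assert_eq!(
        literal_segments("simple"),
        vec![(Kind::Simple, "simple".to_string())]
    );
}
```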

Performance Considerations

Full Moon’s tokenizer is designed for efficiency:
  1. Lazy evaluation: Use Lexer::new_lazy() to tokenize on-demand
  2. Zero-copy: Uses ShortString to avoid allocations for small strings
  3. Position tracking: Efficient byte/line/character updates
ShortString is a small string optimization that stores strings up to 23 bytes inline without heap allocation.

Token Display

From src/tokenizer/structs.rs:469-524, tokens can convert back to source:
impl fmt::Display for Token {
    fn fmt(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
        match self.token_type() {
            TokenType::Number { text } => text.fmt(formatter),
            TokenType::Identifier { identifier } => identifier.fmt(formatter),
            TokenType::StringLiteral { literal, multi_line_depth, quote_type } => {
                if *quote_type == StringLiteralQuoteType::Brackets {
                    write!(formatter, "[{0}[{1}]{0}]", 
                        "=".repeat(*multi_line_depth), literal)
                } else {
                    write!(formatter, "{0}{1}{0}", quote_type, literal)
                }
            }
            // ... other types
        }
    }
}
This ensures perfect round-tripping from code → tokens → code.
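The bracket-string arm above can be exercised in isolation. A standalone sketch using the same format string, so the depth and literal round-trip back to source text:

```rust
// Reconstruct a bracket string's source text from its literal and
// depth, using the same format as the Display arm shown above.
fn bracket_string_source(literal: &str, multi_line_depth: usize) -> String {
    format!("[{0}[{1}]{0}]", "=".repeat(multi_line_depth), literal)
}

fn main() {
    assert_eq!(bracket_string_source("hello", 0), "[[hello]]");
    assert_eq!(bracket_string_source("hello", 2), "[==[hello]==]");
}
```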

Next Steps

  • AST Structure: learn how tokens become AST nodes
  • Lossless Parsing: understand the complete lossless parsing system
