Parsing and AST

Learn how to transform raw JavaScript source code into an Abstract Syntax Tree (AST) through lexical analysis, tokenization, and parsing. This is the foundational step in building a JavaScript engine.

Overview

Parsing is the process of converting JavaScript source code into a structured representation that the engine can understand and execute. This involves multiple stages working together to analyze the syntax and build an AST.

The parser is the first major component of a JavaScript engine, sitting between the raw source code and the interpreter/compiler.

Lexical analysis and tokenization

Lexical analysis is the process of breaking down source code into meaningful tokens. Each token represents a fundamental building block of the language.

Scan characters

Read the source code character by character, identifying patterns that form tokens.

Classify tokens

Categorize each token by type: keywords, identifiers, operators, literals, punctuation.

Generate token stream

Produce a sequence of tokens that will be consumed by the parser.

Token types

A robust tokenizer must recognize and classify different token types:

// Reserved words in JavaScript
function, const, let, var, if, else, while, for, return, class, extends

Proper handling of edge cases like Unicode identifiers, regex literals, and template strings is crucial for a production-ready tokenizer.

Recursive descent parser

The recursive descent parser is a top-down parsing technique where each non-terminal in the grammar is implemented as a function.

Parser structure

The parser consists of mutually recursive functions that mirror the grammar rules:

class Parser {
private:
    Lexer& lexer;
    Token currentToken;
    
    void advance() {
        currentToken = lexer.nextToken();
    }
    
    bool match(TokenType type) {
        if (currentToken.type == type) {
            advance();
            return true;
        }
        return false;
    }
    
public:
    ASTNode* parseProgram();
    ASTNode* parseStatement();
    ASTNode* parseExpression();
    ASTNode* parsePrimary();
};

Operator precedence

Handling operator precedence correctly is essential for parsing expressions:

Operator precedence table

Precedence	Operators	Associativity
15	`()`, `[]`, `.`	Left-to-right
14	`++`, `--`, `!`, `typeof`	Right-to-left
13	`**`	Right-to-left
12	`*`, `/`, `%`	Left-to-right
11	`+`, `-`	Left-to-right
10	`<`, `<=`, `>`, `>=`	Left-to-right
9	`==`, `===`, `!=`, `!==`	Left-to-right
5	`&&`	Left-to-right
4	`		`	Left-to-right
3	`=`, `+=`, `-=`, etc.	Right-to-left

AST construction and validation

The Abstract Syntax Tree is a hierarchical representation of the program structure. Each node represents a syntactic construct.

AST node types

Different AST nodes represent different language constructs:

Declarations

Function declarations
Variable declarations
Class declarations

Statements

Expression statements
If/else statements
Loop statements
Return statements

Expressions

Binary expressions
Call expressions
Member expressions
Literals

Special

Program (root node)
Block statements
Function parameters

class ASTNode {
public:
    enum class Type {
        Program,
        FunctionDeclaration,
        VariableDeclaration,
        BinaryExpression,
        CallExpression,
        Identifier,
        Literal
    };
    
    virtual ~ASTNode() = default;
    virtual Type getType() const = 0;
    virtual void accept(ASTVisitor& visitor) = 0;
};

AST validation

After construction, the AST should be validated for semantic correctness:

Type checking

Verify that operations are performed on compatible types.

Reference validation

Ensure all variables and functions are declared before use.

Control flow analysis

Check for unreachable code, missing return statements, and other control flow issues.

Scope analysis and symbol resolution

Scope analysis determines which declarations are visible at each point in the program and resolves variable references to their declarations.

Scope types

Global scope
Function scope
Block scope
Lexical scope

The outermost scope containing global variables and functions. Accessible from anywhere in the program.

// Global scope
const globalVar = 42;

function globalFunction() {
    return globalVar; // Can access global scope
}

Variables declared with var are function-scoped. They are visible throughout the entire function.

function example() {
    if (true) {
        var x = 10; // Function-scoped
    }
    console.log(x); // 10 - accessible outside if block
}

Variables declared with let and const are block-scoped. They are only visible within their block.

function example() {
    if (true) {
        let x = 10;   // Block-scoped
        const y = 20; // Block-scoped
    }
    console.log(x); // ReferenceError
}

Inner functions have access to variables in outer scopes through lexical scoping.

function outer() {
    const x = 10;
    
    function inner() {
        console.log(x); // Can access outer scope
    }
    
    return inner;
}

Symbol table implementation

A symbol table tracks variable declarations and their scopes:

class SymbolTable {
private:
    struct Symbol {
        std::string name;
        enum class Kind { Variable, Function, Parameter };
        Kind kind;
        int scopeLevel;
    };
    
    std::vector<std::unordered_map<std::string, Symbol>> scopes;
    int currentLevel = 0;
    
public:
    void enterScope() {
        scopes.push_back({});
        currentLevel++;
    }
    
    void exitScope() {
        scopes.pop_back();
        currentLevel--;
    }
    
    void define(const std::string& name, Symbol::Kind kind) {
        Symbol sym{name, kind, currentLevel};
        scopes.back()[name] = sym;
    }
    
    Symbol* resolve(const std::string& name) {
        // Search from innermost to outermost scope
        for (auto it = scopes.rbegin(); it != scopes.rend(); ++it) {
            auto found = it->find(name);
            if (found != it->end()) {
                return &found->second;
            }
        }
        return nullptr; // Undefined variable
    }
};

Scope analysis must handle hoisting for var declarations and functions, while maintaining temporal dead zones for let and const.

Variable hoisting

JavaScript’s hoisting behavior must be correctly implemented:

// This code:
console.log(x); // undefined (not ReferenceError)
var x = 5;

// Is interpreted as:
var x; // Hoisted to top
console.log(x);
x = 5;

// But let/const have temporal dead zone:
console.log(y); // ReferenceError
let y = 5;

Next steps

Bytecode and VM

Learn how to compile AST to bytecode and execute it in a stack-based virtual machine

Closures and prototypes

Implement closures with environment chains and prototype-based inheritance

Parsing & Execution

Runtime

Overview

Lexical analysis and tokenization

Token types

Recursive descent parser

Parser structure

Operator precedence

AST construction and validation

AST node types

Declarations

Statements

Expressions

Special

AST validation

Scope analysis and symbol resolution

Scope types

Symbol table implementation

Variable hoisting

Next steps

Bytecode and VM

Closures and prototypes

Build docs developers (and LLMs) love

Parsing & Execution

Runtime

​Overview

​Lexical analysis and tokenization

​Token types

​Recursive descent parser

​Parser structure

​Operator precedence

​AST construction and validation

​AST node types

Declarations

Statements

Expressions

Special

​AST validation

​Scope analysis and symbol resolution

​Scope types

​Symbol table implementation

​Variable hoisting

​Next steps

Bytecode and VM

Closures and prototypes

Build docs developers (and LLMs) love

Overview

Lexical analysis and tokenization

Token types

Recursive descent parser

Parser structure

Operator precedence

AST construction and validation

AST node types

AST validation

Scope analysis and symbol resolution

Scope types

Symbol table implementation

Variable hoisting

Next steps