HTML parsing

HTML parsing is the critical first step in transforming raw HTML markup into a structured document tree that browsers can work with. This process involves sophisticated state machines, error recovery, and careful handling of scripts.

Tokenization state machine

The HTML tokenizer implements a state machine that processes the input character stream and produces tokens. This is a complex process with numerous states and transitions.

The HTML Standard defines a tokenization state machine with 80 states. The code in this section models a small subset of them to illustrate the mechanism.

Initialize tokenizer state

Begin in the data state, the default state for parsing regular content. The tokenizer maintains a current state and processes characters one at a time.

Process character stream

Read characters from the input stream and transition between states based on the current character and current state. Each state has specific rules for handling different characters.

Emit tokens

When appropriate conditions are met, emit tokens such as:

Start tag tokens (e.g., <div>)
End tag tokens (e.g., </div>)
Character tokens (text content)
Comment tokens
DOCTYPE tokens
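
The token kinds listed above could be modeled with a single tagged struct. This is a minimal sketch rather than the representation any particular engine uses; `Token`, `TokenType`, and `describe` are hypothetical names, chosen to line up with the `token.type` and `token.tagName` fields used later on this page.

```cpp
#include <map>
#include <string>

// Hypothetical token model covering the kinds listed above.
enum class TokenType { StartTag, EndTag, Character, Comment, Doctype };

struct Token {
    TokenType type;
    std::string tagName;                            // used by start and end tags
    std::map<std::string, std::string> attributes;  // used by start tags only
    std::string data;                               // text, comment body, or DOCTYPE name
};

// Renders a token as a short debug string.
std::string describe(const Token& token) {
    switch (token.type) {
        case TokenType::StartTag:  return "StartTag(" + token.tagName + ")";
        case TokenType::EndTag:    return "EndTag(" + token.tagName + ")";
        case TokenType::Character: return "Character(" + token.data + ")";
        case TokenType::Comment:   return "Comment(" + token.data + ")";
        case TokenType::Doctype:   return "Doctype(" + token.data + ")";
    }
    return "Unknown";
}
```

A single struct keeps the sketch short; an engine might instead use a variant type so that each token kind carries only the fields it needs.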

Handle state transitions

Move between states based on the input. Commonly visited states include:

Data state
Tag open state
Tag name state
Before attribute name state
Attribute name state
Attribute value states (quoted and unquoted)
Script data states
Comment states

State machine implementation

The tokenizer processes characters and transitions states according to specific rules:

enum class State {
    Data,
    TagOpen,
    EndTagOpen,
    TagName,
    BeforeAttributeName,
    AttributeName,
    AfterAttributeName,
    BeforeAttributeValue,
    AttributeValueDoubleQuoted,
    AttributeValueSingleQuoted,
    AttributeValueUnquoted,
    ScriptData,
    CommentStart,
    Comment
};

// Feeds one character to the state machine (simplified; not all states shown).
void HTMLTokenizer::processCharacter(char c) {
    switch (current_state) {
        case State::Data:
            if (c == '<') {
                current_state = State::TagOpen;
            } else {
                emitCharacter(c);
            }
            break;

        case State::TagOpen:
            if (c == '/') {
                current_state = State::EndTagOpen;
            } else if (isAlpha(c)) {
                current_state = State::TagName;
                current_token.appendToTagName(toLowerCase(c));
            } else {
                // Simplified: the full tokenizer first checks for '!' (comment or
                // DOCTYPE) and '?'. Anything else is a parse error: emit '<' as
                // text and reprocess c in the data state.
                emitCharacter('<');
                current_state = State::Data;
                processCharacter(c);
            }
            break;

        case State::TagName:
            if (isWhitespace(c)) {
                current_state = State::BeforeAttributeName;
            } else if (c == '>') {
                emitCurrentToken();
                current_state = State::Data;
            } else {
                current_token.appendToTagName(toLowerCase(c));
            }
            break;

        // ... additional state handling
    }
}

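The <script> and <style> elements need special handling. After their start tag, the tree constructor switches the tokenizer to the script data state (for <script>) or the RAWTEXT state (for <style>), where everything except the matching end tag is treated as text rather than markup.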
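
To illustrate why markup-like text inside a script does not produce tags, the following sketch scans for the end tag that terminates a raw-text element. It is a simplification under stated assumptions: `findRawTextEnd` is a hypothetical helper, and it ignores the escaped script data states that the full tokenizer defines for `<!--` inside scripts.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns the index of the "</tagName" sequence that ends a raw-text element,
// or std::string::npos if there is none. The tag name is matched
// case-insensitively and must be followed by whitespace, '/', or '>'.
std::size_t findRawTextEnd(const std::string& input, std::size_t from,
                           const std::string& tagName) {
    for (std::size_t i = from; i + 2 + tagName.size() <= input.size(); ++i) {
        if (input[i] != '<' || input[i + 1] != '/') continue;

        bool matches = true;
        for (std::size_t k = 0; k < tagName.size(); ++k) {
            unsigned char actual = static_cast<unsigned char>(input[i + 2 + k]);
            unsigned char wanted = static_cast<unsigned char>(tagName[k]);
            if (std::tolower(actual) != std::tolower(wanted)) {
                matches = false;
                break;
            }
        }
        if (!matches) continue;

        // "</scripty>" must not end a <script> element, so check the next character.
        std::size_t after = i + 2 + tagName.size();
        if (after < input.size()) {
            char next = input[after];
            if (next == '>' || next == '/' || next == ' ' || next == '\t' ||
                next == '\n' || next == '\f') {
                return i;
            }
        }
    }
    return std::string::npos;
}
```

With this helper, a string literal such as '<div>' inside a script body is skipped over as plain text, and only the real </script> end tag stops the scan.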

Tree construction algorithm

Once tokens are generated, the tree construction algorithm builds the DOM tree. This process uses a stack-based approach with insertion modes and special rules for different contexts.

Core algorithm components

Stack of open elements

Maintains the hierarchy of currently open elements. Elements are pushed onto the stack when their start tag is encountered and popped when their end tag is processed. The stack represents the current nesting structure and is used to determine where new elements should be inserted.
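
A minimal sketch of such a stack is shown here. It stores tag names only, whereas a real parser stores element nodes; `OpenElementStack` and its methods are hypothetical names used for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stack of open elements that stores tag names only.
class OpenElementStack {
public:
    void push(const std::string& tagName) { elements_.push_back(tagName); }

    // The current node is the most recently opened element.
    const std::string& currentNode() const { return elements_.back(); }

    bool contains(const std::string& tagName) const {
        for (const auto& name : elements_) {
            if (name == tagName) return true;
        }
        return false;
    }

    // Pops elements up to and including the nearest one named tagName.
    // Does nothing if no such element is open.
    void popUntil(const std::string& tagName) {
        if (!contains(tagName)) return;
        while (!elements_.empty()) {
            std::string popped = elements_.back();
            elements_.pop_back();
            if (popped == tagName) break;
        }
    }

    std::size_t size() const { return elements_.size(); }

private:
    std::vector<std::string> elements_;
};
```

After pushing html, body, div, and p, a call to `popUntil("div")` leaves body as the current node, which is where the next element would be inserted.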

Insertion modes

The tree constructor operates in different insertion modes depending on the current context:

Initial mode: Before any content
Before html mode: Before the <html> element
Before head mode: Before the <head> element
In head mode: Inside the <head> element
After head mode: Between </head> and <body>
In body mode: Inside the <body> element (most common)
In table mode: Inside <table> elements
In select mode: Inside <select> elements
After body mode: After the </body> end tag
After after body mode: After the </html> end tag

Each mode has specific rules for handling different tokens.
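
For a well-formed document, the transitions between the first few modes can be sketched as a small function. This covers the happy path only: a real parser also handles unexpected tokens by implying the missing tags. `nextMode` and its string encoding of tokens are assumptions made for this example.

```cpp
#include <string>

enum class InsertionMode { Initial, BeforeHtml, BeforeHead, InHead, AfterHead, InBody };

// Hypothetical helper: returns the mode that follows `mode` after seeing `tag`.
// `tag` is "!doctype", a start tag name such as "html", or an end tag written "/head".
InsertionMode nextMode(InsertionMode mode, const std::string& tag) {
    switch (mode) {
        case InsertionMode::Initial:
            return tag == "!doctype" ? InsertionMode::BeforeHtml : mode;
        case InsertionMode::BeforeHtml:
            return tag == "html" ? InsertionMode::BeforeHead : mode;
        case InsertionMode::BeforeHead:
            return tag == "head" ? InsertionMode::InHead : mode;
        case InsertionMode::InHead:
            return tag == "/head" ? InsertionMode::AfterHead : mode;
        case InsertionMode::AfterHead:
            return tag == "body" ? InsertionMode::InBody : mode;
        case InsertionMode::InBody:
            return mode;
    }
    return mode;
}
```

Feeding the sequence "!doctype", "html", "head", "/head", "body" through this function moves the mode from Initial to InBody one step at a time.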

Active formatting elements

Tracks formatting elements (like <b>, <i>, <a>) that must be reopened after they are implicitly closed, for example when a <p> ends while a <b> inside it is still open. This ensures that markup like <p><b>Bold</p><p>Still bold</b></p> keeps the bold formatting in the second paragraph.
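
The reopening step can be sketched as follows. This is a deliberately simplified model: entries are tag names rather than element objects, the list has no scope markers, and `reconstructActiveFormatting` is a hypothetical free function rather than part of any real parser.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Reopens every listed formatting element that is no longer on the stack of
// open elements, in its original order, and returns the names it reopened.
std::vector<std::string> reconstructActiveFormatting(
        const std::vector<std::string>& activeFormatting,
        std::vector<std::string>& openElements) {
    std::vector<std::string> reopened;
    for (const auto& name : activeFormatting) {
        bool stillOpen = std::find(openElements.begin(), openElements.end(), name)
                         != openElements.end();
        if (!stillOpen) {
            openElements.push_back(name);  // a real parser creates a new element here
            reopened.push_back(name);
        }
    }
    return reopened;
}
```

If the list holds b and i while the stack holds only html, body, and p, both formatting elements are reopened inside the paragraph before the next text node is inserted.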

Foster parenting

Special handling for table-related content that appears in invalid positions. Content that shouldn’t be in tables is “fostered” out to maintain proper structure.
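
A minimal sketch of foster parenting on a toy tree is shown below. It assumes the table already has a parent and inserts the new node immediately before the table; `Node`, `appendChild`, and `fosterParent` are hypothetical names, and the fallback used when the table has no parent is omitted.

```cpp
#include <algorithm>
#include <memory>
#include <string>
#include <vector>

struct Node {
    std::string name;
    Node* parent = nullptr;
    std::vector<std::unique_ptr<Node>> children;
};

// Appends a new child to parent and returns a raw pointer to it.
Node* appendChild(Node& parent, const std::string& name) {
    parent.children.push_back(std::make_unique<Node>());
    Node* child = parent.children.back().get();
    child->name = name;
    child->parent = &parent;
    return child;
}

// Foster-parents a new node: instead of becoming a child of the table, it is
// inserted into the table's parent immediately before the table.
Node* fosterParent(Node& table, const std::string& name) {
    Node* parent = table.parent;
    if (parent == nullptr) return nullptr;  // simplified: table must be in the tree
    auto position = std::find_if(
        parent->children.begin(), parent->children.end(),
        [&table](const std::unique_ptr<Node>& c) { return c.get() == &table; });
    auto inserted = parent->children.insert(position, std::make_unique<Node>());
    (*inserted)->name = name;
    (*inserted)->parent = parent;
    return inserted->get();
}
```

Calling `fosterParent` for stray text leaves the table without that child and places the text as the table's previous sibling.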

Tree construction process

Receive token from tokenizer

The tree constructor receives tokens one at a time from the tokenizer. Each token type requires different handling.

Determine current insertion mode

Based on the current parsing context and open elements stack, determine which insertion mode is active.

Process token according to insertion mode

Apply the specific rules for the current insertion mode to handle the token. This may involve:

Creating new elements
Pushing elements onto the stack
Popping elements from the stack
Switching insertion modes
Triggering error recovery

Update DOM tree

Insert new nodes into the appropriate position in the DOM tree based on the algorithm’s rules.

void HTMLTreeConstructor::processToken(const Token& token) {
    switch (insertion_mode) {
        case InsertionMode::InBody:
            handleInBodyMode(token);
            break;
        case InsertionMode::InHead:
            handleInHeadMode(token);
            break;
        case InsertionMode::InTable:
            handleInTableMode(token);
            break;
        // ... other insertion modes
    }
}

void HTMLTreeConstructor::handleInBodyMode(const Token& token) {
    if (token.type == TokenType::StartTag) {
        if (token.tagName == "div" || token.tagName == "p" || 
            token.tagName == "section" /* ... other block elements */) {
            
            // Close any open <p> element
            if (hasElementInButtonScope("p")) {
                closePElement();
            }
            
            // Insert new element
            auto element = createElement(token.tagName, token.attributes);
            insertElement(element);
            openElements.push(element);
        }
        else if (token.tagName == "b" || token.tagName == "i" || 
                 token.tagName == "strong" /* ... formatting elements */) {
            reconstructActiveFormattingElements();
            auto element = createElement(token.tagName, token.attributes);
            insertElement(element);
            openElements.push(element);
            activeFormattingElements.push(element);
        }
    }
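    else if (token.type == TokenType::EndTag) {
        // Ignore end tags with no matching open element in scope; otherwise
        // popElementsUntil would empty the stack. hasElementInScope is assumed
        // to work like hasElementInButtonScope above.
        if (!hasElementInScope(token.tagName)) {
            reportParseError();
            return;
        }
        // Close elements whose end tags may be omitted (such as <li> or <p>)
        generateImpliedEndTags();
        if (currentNode()->tagName != token.tagName) {
            // Parse error: mismatched tags
            reportParseError();
        }
        popElementsUntil(token.tagName);
    }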
}

Error recovery and quirks mode

Browsers must handle malformed HTML gracefully. The HTML specification defines detailed error recovery procedures to ensure consistent rendering across browsers, even for invalid markup.

Error recovery mechanisms

Implied tags

The parser automatically inserts missing required tags. For example:

<table><tr><td>Cell</table>

The parser implies the missing </td> and </tr> end tags before closing the table. It also implies a <tbody> element around the row. Common implied tags:

<html> and </html>
<head> and </head>
<body> and </body>
</p> when a block element is encountered
</li> when another <li> is encountered

Misnested tags

When tags are improperly nested, the parser uses the active formatting elements list to reconstruct proper nesting:

<b>Bold <i>Bold Italic</b> Just Italic</i>

This is reconstructed as:

<b>Bold <i>Bold Italic</i></b><i> Just Italic</i>

The formatting is preserved even though the original markup was invalid.

Invalid nesting

Some elements cannot legally contain others. The parser moves content to valid locations:

<table>Text before<tr><td>Cell</td></tr></table>

The text “Text before” is fostered out of the table and placed before it, as text cannot be a direct child of <table>.

Unclosed tags

At the end of the document, all open elements are automatically closed in the correct order:

<html><body><div><p>Content

Automatically becomes:

<html><body><div><p>Content</p></div></body></html>
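
The closing order for unclosed elements follows directly from the stack of open elements: the innermost element is closed first. The sketch below prints the equivalent end tags for a given stack; `impliedEndTagsAtEof` is a hypothetical helper, and a real parser simply stops and leaves the tree as built instead of producing any text.

```cpp
#include <string>
#include <vector>

// Returns the end tags equivalent to closing every open element at end of input.
// `openElements` lists tag names from outermost to innermost.
std::string impliedEndTagsAtEof(const std::vector<std::string>& openElements) {
    std::string result;
    for (auto it = openElements.rbegin(); it != openElements.rend(); ++it) {
        result += "</" + *it + ">";
    }
    return result;
}
```

For the stack html, body, div, p this yields `</p></div></body></html>`, matching the example above.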

Quirks mode detection

Browsers use different rendering modes to maintain compatibility with legacy web pages. The parser selects the mode based on the DOCTYPE:

Standards mode

Triggered by a proper HTML5 DOCTYPE or other modern DOCTYPEs:

<!DOCTYPE html>

Renders according to modern web standards with strict layout rules.

Quirks mode

Triggered by a missing DOCTYPE or certain legacy DOCTYPEs:

<!-- No DOCTYPE -->
<html>...

Emulates rendering bugs in older browsers (IE 5, Netscape 4) for backward compatibility. Affects:

Box model calculations
Table layout
Font size calculations
Line height handling

Limited quirks mode

Triggered by certain DOCTYPEs (also called “almost standards mode”):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
  "http://www.w3.org/TR/html4/loose.dtd">

Only affects the layout of table cells that contain images; all other rendering follows standards mode.

Always use <!DOCTYPE html> in new projects to ensure standards mode rendering and consistent behavior across browsers.
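
The mode selection for the three cases above can be sketched as a single function. This is not the full rule set, which lists many more legacy public identifiers; it covers a missing DOCTYPE, `<!DOCTYPE html>`, and the HTML 4.01 Transitional and Frameset identifiers. `Doctype`, `DocumentMode`, and `detectMode` are hypothetical names.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

enum class DocumentMode { NoQuirks, LimitedQuirks, Quirks };

struct Doctype {
    std::string name;
    bool hasPublicId = false;
    std::string publicId;
    bool hasSystemId = false;
    std::string systemId;
};

static std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

static bool startsWith(const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// Pass nullptr when the document has no DOCTYPE at all.
DocumentMode detectMode(const Doctype* doctype) {
    if (doctype == nullptr || toLower(doctype->name) != "html") {
        return DocumentMode::Quirks;
    }
    if (!doctype->hasPublicId) return DocumentMode::NoQuirks;

    const std::string publicId = toLower(doctype->publicId);
    const bool html401Loose =
        startsWith(publicId, "-//w3c//dtd html 4.01 transitional//") ||
        startsWith(publicId, "-//w3c//dtd html 4.01 frameset//");
    if (html401Loose) {
        // The same public identifier selects full quirks mode when the
        // system identifier is missing.
        return doctype->hasSystemId ? DocumentMode::LimitedQuirks : DocumentMode::Quirks;
    }
    return DocumentMode::NoQuirks;
}
```

Note that the HTML 4.01 Transitional DOCTYPE only selects limited quirks mode when it includes the system identifier (the loose.dtd URL); without it, the same DOCTYPE selects quirks mode.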

Parser-blocking scripts and async/defer

Script elements can significantly impact HTML parsing performance and page load times. Understanding how scripts interact with the parser is crucial for optimization.

Parser-blocking behavior

By default, when the parser encounters a <script> tag:

Pause HTML parsing

The parser stops processing further HTML once it has read the complete <script> element, up to and including its </script> end tag.

Download script (if external)

If the script has a src attribute, the browser downloads the script file. This blocks all further parsing until the download completes.

Execute script

The JavaScript code is executed immediately. The script can access all DOM nodes that have been parsed so far but cannot access elements that appear later in the document.

Resume HTML parsing

Only after script execution completes does the parser continue processing the remaining HTML.

Parser-blocking scripts can severely impact page load performance, especially when scripts are large or the network is slow.

Script loading strategies

Modern HTML provides the async and defer attributes to control script loading and execution. Without either attribute, a script blocks parsing:

<!-- Blocks parsing until downloaded and executed -->
<script src="script.js"></script>

<div>This content waits for script.js to download and execute</div>

Comparison of script loading strategies

| Attribute | Download | Parsing | Execution timing | Order guaranteed |
| --- | --- | --- | --- | --- |
| None (default) | Blocks parsing | Blocked | Immediately after download | Yes |
| `async` | Parallel | Continues | As soon as downloaded | No |
| `defer` | Parallel | Continues | After parsing completes | Yes |
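
The ordering rules in this comparison can be made concrete with a small simulation. It is a sketch under simplifying assumptions: times are abstract integers, blocking scripts are left out because they would change the parse end time itself, and `Script`, `LoadMode`, and `executionOrder` are hypothetical names.

```cpp
#include <algorithm>
#include <string>
#include <vector>

enum class LoadMode { Async, Defer };

struct Script {
    std::string name;
    LoadMode mode;
    int downloadedAt;  // time at which the download finishes
};

// Returns script names in execution order. `scripts` is in document order and
// `parseEnd` is the time at which the parser finishes.
std::vector<std::string> executionOrder(const std::vector<Script>& scripts, int parseEnd) {
    struct Run { int time; std::string name; };
    std::vector<Run> runs;
    int lastDeferRun = parseEnd;
    for (const auto& script : scripts) {
        if (script.mode == LoadMode::Async) {
            // Async: runs as soon as its own download completes.
            runs.push_back({script.downloadedAt, script.name});
        } else {
            // Defer: waits for parsing, its own download, and earlier deferred scripts.
            lastDeferRun = std::max(lastDeferRun, script.downloadedAt);
            runs.push_back({lastDeferRun, script.name});
        }
    }
    // A stable sort keeps document order for scripts that become runnable together.
    std::stable_sort(runs.begin(), runs.end(),
                     [](const Run& a, const Run& b) { return a.time < b.time; });
    std::vector<std::string> order;
    for (const auto& run : runs) order.push_back(run.name);
    return order;
}
```

In this model a deferred script that downloads quickly still waits for every deferred script before it in the document, while an async script runs whenever its download happens to finish.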

Best practices for script loading:

Use defer for scripts that don’t need to execute immediately
Use async for independent scripts like analytics that don’t depend on other scripts
Place blocking scripts at the end of <body> if async/defer aren’t suitable
Use module scripts (<script type="module">), which defer by default

Implementation details

void HTMLTreeConstructor::handleScriptElement(Element* script) {
    bool parser_inserted = true;
    bool async = script->hasAttribute("async");
    bool defer = script->hasAttribute("defer");
    
    if (script->hasAttribute("src")) {
        // External script
        if (async) {
            // Load asynchronously, execute when ready
            script_loader->loadAsync(script);
            // Continue parsing immediately
        }
        else if (defer && parser_inserted) {
            // Load asynchronously, execute after parsing
            script_loader->loadDeferred(script);
            // Continue parsing immediately
        }
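        else {
            // Classic parser-blocking script: pause, fetch, execute, then resume.
            // Assumes the loader passes the fetched source to the callback.
            pauseParser();
            script_loader->loadSync(script, [this](const std::string& source) {
                executeScript(source);
                resumeParser();
            });
        }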
    }
    else {
        // Inline script - always blocks
        pauseParser();
        executeScript(script->textContent());
        resumeParser();
    }
}

You now understand the complete HTML parsing pipeline: from tokenization through tree construction to script handling. This knowledge is essential for building browser engines and optimizing page load performance.
