HTML parsing is the critical first step in transforming raw HTML markup into a structured document tree that browsers can work with. This process involves sophisticated state machines, error recovery, and careful handling of scripts.

Tokenization state machine

The HTML tokenizer implements a state machine that processes the input character stream and produces tokens. This is a complex process with numerous states and transitions.
The HTML Standard defines about 80 tokenizer states, each with its own rules for consuming the next input character. The simplified example in this section models only a small subset of them.
1. Initialize tokenizer state

Begin in the data state, the default state for parsing regular content. The tokenizer maintains a current state and processes characters one at a time.
2. Process character stream

Read characters from the input stream and transition between states based on the current character and current state. Each state has specific rules for handling different characters.
3. Emit tokens

When appropriate conditions are met, emit tokens such as:
  • Start tag tokens (e.g., <div>)
  • End tag tokens (e.g., </div>)
  • Character tokens (text content)
  • Comment tokens
  • DOCTYPE tokens
4. Handle state transitions

Move between states as dictated by the current state and the next input character. Common states include:
  • Data state
  • Tag open state
  • Tag name state
  • Before attribute name state
  • Attribute name state
  • Attribute value states (quoted and unquoted)
  • Script data states
  • Comment states
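The token kinds listed in step 3 could be modeled as one small struct. This is only a sketch: the names `TokenType`, `Token`, `Attribute`, and the `make...` helpers are hypothetical and not taken from any particular engine, and the `EndOfFile` kind is an addition that a real tokenizer would also need.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical token model; names are illustrative only.
enum class TokenType { StartTag, EndTag, Character, Comment, Doctype, EndOfFile };

struct Attribute {
    std::string name;
    std::string value;
};

struct Token {
    TokenType type;
    std::string data;                   // tag name, comment text, or character data
    std::vector<Attribute> attributes;  // only meaningful for start tags
    bool selfClosing;                   // true for syntax such as <br/>
};

// Convenience constructors for the most common token kinds.
inline Token makeStartTag(std::string name, std::vector<Attribute> attrs = {}) {
    return Token{TokenType::StartTag, std::move(name), std::move(attrs), false};
}

inline Token makeEndTag(std::string name) {
    return Token{TokenType::EndTag, std::move(name), {}, false};
}

inline Token makeCharacter(char c) {
    return Token{TokenType::Character, std::string(1, c), {}, false};
}
```

Keeping every kind in one struct makes it easy to pass tokens to the tree constructor by value; a production engine might prefer a variant type to avoid carrying unused fields.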

State machine implementation

The tokenizer processes characters and transitions states according to specific rules:
enum class State {
    Data,
    TagOpen,
    EndTagOpen,  // entered from TagOpen after "</"
    TagName,
    BeforeAttributeName,
    AttributeName,
    AfterAttributeName,
    BeforeAttributeValue,
    AttributeValueDoubleQuoted,
    AttributeValueSingleQuoted,
    AttributeValueUnquoted,
    ScriptData,
    CommentStart,
    Comment
};

void HTMLTokenizer::processCharacter(char c) {
    switch (current_state) {
        case State::Data:
            if (c == '<') {
                current_state = State::TagOpen;
            } else {
                emitCharacter(c);
            }
            break;
        
        case State::TagOpen:
            if (c == '/') {
                current_state = State::EndTagOpen;
            } else if (isAlpha(c)) {
                current_state = State::TagName;
                current_token.appendToTagName(toLowerCase(c));
            }
            break;
        
        case State::TagName:
            if (isWhitespace(c)) {
                current_state = State::BeforeAttributeName;
            } else if (c == '>') {
                emitCurrentToken();
                current_state = State::Data;
            } else {
                current_token.appendToTagName(toLowerCase(c));
            }
            break;
        
        // ... additional state handling
    }
}
The tokenizer must handle special parsing rules for <script> and <style> elements: <script> content is tokenized in the script data states and <style> content in the RAWTEXT state, where most HTML parsing rules don’t apply.
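For a version that can be compiled on its own, the following is a minimal sketch of the same idea. It assumes a much smaller machine than the real one: only the data, tag open, end tag open, and tag name states are modeled, plus a hypothetical `SkipToTagEnd` state that discards attributes. `MiniToken` and `tokenize` are illustrative names, and the error recovery is simplified rather than spec-accurate.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Illustrative mini tokenizer. Attributes, comments, DOCTYPEs, character
// references, and script data are deliberately not handled.
struct MiniToken {
    enum Kind { StartTag, EndTag, Text } kind;
    std::string value;
};

std::vector<MiniToken> tokenize(const std::string& input) {
    enum class State { Data, TagOpen, EndTagOpen, TagName, SkipToTagEnd };
    State state = State::Data;
    std::vector<MiniToken> tokens;
    std::string text;  // pending character data
    std::string name;  // tag name being built
    bool isEndTag = false;

    auto flushText = [&]() {
        if (!text.empty()) {
            tokens.push_back({MiniToken::Text, text});
            text.clear();
        }
    };
    auto emitTag = [&]() {
        tokens.push_back({isEndTag ? MiniToken::EndTag : MiniToken::StartTag, name});
        state = State::Data;
    };

    for (char c : input) {
        const unsigned char uc = static_cast<unsigned char>(c);
        switch (state) {
        case State::Data:
            if (c == '<') state = State::TagOpen;
            else text += c;
            break;
        case State::TagOpen:
        case State::EndTagOpen:
            if (state == State::TagOpen && c == '/') {
                state = State::EndTagOpen;
            } else if (std::isalpha(uc)) {
                flushText();
                isEndTag = (state == State::EndTagOpen);
                name.assign(1, static_cast<char>(std::tolower(uc)));
                state = State::TagName;
            } else if (state == State::TagOpen) {
                text += '<';  // not a tag after all: keep '<' as literal text
                if (c != '<') { text += c; state = State::Data; }
            } else {
                state = State::Data;  // simplified recovery: drop the malformed "</"
            }
            break;
        case State::TagName:
            if (c == '>') emitTag();
            else if (std::isspace(uc)) state = State::SkipToTagEnd;
            else name += static_cast<char>(std::tolower(uc));
            break;
        case State::SkipToTagEnd:
            if (c == '>') emitTag();  // everything before '>' (attributes) is ignored
            break;
        }
    }
    if (state == State::TagOpen) text += '<';  // input ended right after '<'
    flushText();
    return tokens;
}
```

Calling `tokenize("<p>Hi</P>")` yields three tokens: a start tag `p`, the text `Hi`, and an end tag `p` (tag names are lowercased, as in the switch statement above).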

Tree construction algorithm

Once tokens are generated, the tree construction algorithm builds the DOM tree. This process uses a stack-based approach with insertion modes and special rules for different contexts.

Core algorithm components

The stack of open elements maintains the hierarchy of currently open elements. Elements are pushed onto the stack when their start tag is encountered and popped when their end tag is processed. The stack represents the current nesting structure and is used to determine where new elements should be inserted.
The tree constructor operates in different insertion modes depending on the current context:
  • Initial mode: Before any content
  • Before html mode: Before the <html> element
  • Before head mode: Before the <head> element
  • In head mode: Inside the <head> element
  • After head mode: Between </head> and <body>
  • In body mode: Inside the <body> element (most common)
  • In table mode: Inside <table> elements
  • In select mode: Inside <select> elements
  • After body mode: After the </body> end tag
  • After after body mode: Final state
Each mode has specific rules for handling different tokens.
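The way a well-formed document walks through these modes can be sketched as a small transition function. This is an assumption-heavy illustration: `nextMode` and `TokenKind` are hypothetical names, only the "happy path" transitions are modeled, and leaving a table or select is reduced to "return to in body", whereas a real parser resets the insertion mode by inspecting the stack of open elements.

```cpp
#include <string>

// Mirrors the list of insertion modes above.
enum class InsertionMode {
    Initial, BeforeHtml, BeforeHead, InHead, AfterHead,
    InBody, InTable, InSelect, AfterBody, AfterAfterBody
};

enum class TokenKind { Doctype, StartTag, EndTag };

// Returns the mode to use after the token, or the unchanged mode when this
// sketch has no rule for the token.
InsertionMode nextMode(InsertionMode mode, TokenKind kind, const std::string& name) {
    const bool start = (kind == TokenKind::StartTag);
    const bool end = (kind == TokenKind::EndTag);
    switch (mode) {
    case InsertionMode::Initial:
        if (kind == TokenKind::Doctype) return InsertionMode::BeforeHtml;
        break;
    case InsertionMode::BeforeHtml:
        if (start && name == "html") return InsertionMode::BeforeHead;
        break;
    case InsertionMode::BeforeHead:
        if (start && name == "head") return InsertionMode::InHead;
        break;
    case InsertionMode::InHead:
        if (end && name == "head") return InsertionMode::AfterHead;
        break;
    case InsertionMode::AfterHead:
        if (start && name == "body") return InsertionMode::InBody;
        break;
    case InsertionMode::InBody:
        if (start && name == "table") return InsertionMode::InTable;
        if (start && name == "select") return InsertionMode::InSelect;
        if (end && name == "body") return InsertionMode::AfterBody;
        break;
    case InsertionMode::InTable:
        if (end && name == "table") return InsertionMode::InBody;  // simplified
        break;
    case InsertionMode::InSelect:
        if (end && name == "select") return InsertionMode::InBody;  // simplified
        break;
    case InsertionMode::AfterBody:
        if (end && name == "html") return InsertionMode::AfterAfterBody;
        break;
    case InsertionMode::AfterAfterBody:
        break;
    }
    return mode;
}
```

Feeding the tokens of a minimal document (DOCTYPE, `<html>`, `<head>`, `</head>`, `<body>`, `</body>`, `</html>`) through `nextMode` visits the modes in the order listed above, skipping the table and select modes.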
The list of active formatting elements tracks formatting elements (like <b>, <i>, <a>) that need to be reopened when the element containing them is closed while they are still open. This ensures that misnested markup like <p><b>Bold<p>Still bold</b> is handled correctly: the <b> is reopened inside the second paragraph.
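The reopening step can be sketched as follows. This is a deliberately reduced model, not the full algorithm: elements are identified by tag name only, scope markers and the limit on duplicate entries are ignored, and `FormattingState` and `reconstructActiveFormatting` are hypothetical names.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Simplified parser state: both containers hold tag names only.
struct FormattingState {
    std::vector<std::string> openElements;      // stack; back() is the current node
    std::vector<std::string> activeFormatting;  // in the order the elements were opened
};

// Reopens every active formatting element that is no longer on the stack of
// open elements. Returns the names that were reopened, in order.
std::vector<std::string> reconstructActiveFormatting(FormattingState& s) {
    std::vector<std::string> reopened;
    for (const std::string& name : s.activeFormatting) {
        const bool stillOpen =
            std::find(s.openElements.begin(), s.openElements.end(), name) !=
            s.openElements.end();
        if (!stillOpen) {
            s.openElements.push_back(name);  // a real parser creates a new element here
            reopened.push_back(name);
        }
    }
    return reopened;
}
```

For input such as `<p><b>one<p>two`, the second `<p>` closes the first paragraph and pops the `<b>` with it. With the stack at `html, body, p` and `b` still in the active list, the function pushes a fresh `b`, so the text `two` ends up bold as well.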
Foster parenting is the special handling for content that appears in invalid positions inside a table. Content that shouldn’t be directly inside a table is “fostered” out, typically to just before the <table> element, to maintain proper structure.

Tree construction process

1. Receive token from tokenizer

The tree constructor receives tokens one at a time from the tokenizer. Each token type requires different handling.
2. Determine current insertion mode

Based on the current parsing context and open elements stack, determine which insertion mode is active.
3. Process token according to insertion mode

Apply the specific rules for the current insertion mode to handle the token. This may involve:
  • Creating new elements
  • Pushing elements onto the stack
  • Popping elements from the stack
  • Switching insertion modes
  • Triggering error recovery
4. Update DOM tree

Insert new nodes into the appropriate position in the DOM tree based on the algorithm’s rules.
void HTMLTreeConstructor::processToken(const Token& token) {
    switch (insertion_mode) {
        case InsertionMode::InBody:
            handleInBodyMode(token);
            break;
        case InsertionMode::InHead:
            handleInHeadMode(token);
            break;
        case InsertionMode::InTable:
            handleInTableMode(token);
            break;
        // ... other insertion modes
    }
}

void HTMLTreeConstructor::handleInBodyMode(const Token& token) {
    if (token.type == TokenType::StartTag) {
        if (token.tagName == "div" || token.tagName == "p" || 
            token.tagName == "section" /* ... other block elements */) {
            
            // Close any open <p> element
            if (hasElementInButtonScope("p")) {
                closePElement();
            }
            
            // Insert new element
            auto element = createElement(token.tagName, token.attributes);
            insertElement(element);
            openElements.push(element);
        }
        else if (token.tagName == "b" || token.tagName == "i" || 
                 token.tagName == "strong" /* ... formatting elements */) {
            reconstructActiveFormattingElements();
            auto element = createElement(token.tagName, token.attributes);
            insertElement(element);
            openElements.push(element);
            activeFormattingElements.push(element);
        }
    }
    else if (token.type == TokenType::EndTag) {
        // Find matching open element and close it
        generateImpliedEndTags();
        if (currentNode()->tagName != token.tagName) {
            // Parse error: mismatched tags
            reportParseError();
        }
        popElementsUntil(token.tagName);
    }
}

Error recovery and quirks mode

Browsers must handle malformed HTML gracefully. The HTML specification defines detailed error recovery procedures to ensure consistent rendering across browsers, even for invalid markup.

Error recovery mechanisms

The parser automatically inserts missing required tags. For example:
<table><tr><td>Cell</table>
The parser will automatically insert the missing </td> and </tr> end tags before closing the table.
Common implied tags:
  • <html> and </html>
  • <head> and </head>
  • <body> and </body>
  • </p> when a block element is encountered
  • </li> when another <li> is encountered
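The end-tag half of this behavior can be illustrated with a small stack-based sketch. It assumes a simplified token format (strings such as `"<li>"`, `"</ul>"`, or plain text) and a hypothetical function `withImpliedEndTags`; it only models implied end tags, so implied start tags such as the `<tbody>` a real parser adds inside a table are omitted.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Rewrites a token sequence so that every implied end tag becomes explicit.
// Tokens starting with "</" are end tags, tokens starting with "<" are start
// tags, and anything else is text.
std::string withImpliedEndTags(const std::vector<std::string>& tokens) {
    std::vector<std::string> open;  // stack of open element names
    std::string out;
    auto closeTop = [&]() {
        out += "</" + open.back() + ">";
        open.pop_back();
    };

    for (const std::string& t : tokens) {
        if (t.size() > 2 && t[0] == '<' && t[1] == '/') {
            const std::string name = t.substr(2, t.size() - 3);
            // Stray end tags with no matching open element are ignored.
            if (std::find(open.begin(), open.end(), name) != open.end()) {
                while (open.back() != name) closeTop();  // implied end tags
                closeTop();
            }
        } else if (t.size() > 1 && t[0] == '<') {
            const std::string name = t.substr(1, t.size() - 2);
            // A new <li> implies </li> for the item that is still open.
            if (name == "li" && !open.empty() && open.back() == "li") closeTop();
            open.push_back(name);
            out += t;
        } else {
            out += t;
        }
    }
    while (!open.empty()) closeTop();  // end of input closes everything
    return out;
}
```

For the table example above, the tokens `<table>`, `<tr>`, `<td>`, `Cell`, `</table>` produce `<table><tr><td>Cell</td></tr></table>`.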

Quirks mode detection

Browsers use different rendering modes based on the DOCTYPE to maintain compatibility with legacy web pages.
The parser determines the rendering mode based on the DOCTYPE:
Standards mode is triggered by the HTML5 DOCTYPE or other modern DOCTYPEs:
<!DOCTYPE html>
Renders according to modern web standards with strict layout rules.
Quirks mode is triggered by a missing DOCTYPE or certain legacy DOCTYPEs:
<!-- No DOCTYPE -->
<html>...
Emulates rendering bugs in older browsers (IE 5, Netscape 4) for backward compatibility. Affects:
  • Box model calculations
  • Table layout
  • Font size calculations
  • Line height handling
Limited quirks mode (also called “almost standards mode”) is triggered by certain transitional and frameset DOCTYPEs:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
  "http://www.w3.org/TR/html4/loose.dtd">
It only affects the layout of images inside table cells; all other rendering follows standards mode.
Always use <!DOCTYPE html> in new projects to ensure standards mode rendering and consistent behavior across browsers.
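The mode selection can be sketched as a function over the parsed DOCTYPE. This is a partial illustration: the real rules cover a long list of legacy public identifiers, and only the HTML 4.01 and XHTML 1.0 Transitional/Frameset cases are checked here. The `Doctype` struct and `detectMode` are hypothetical names, and an empty string stands in for a missing identifier.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

enum class DocumentMode { NoQuirks, LimitedQuirks, Quirks };

// Hypothetical DOCTYPE record. An empty string means "identifier missing",
// which is a simplification of how a real parser tracks these fields.
struct Doctype {
    bool present;
    std::string name;
    std::string publicId;
    std::string systemId;
};

static std::string lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

static bool startsWith(const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

DocumentMode detectMode(const Doctype& d) {
    if (!d.present || lower(d.name) != "html") return DocumentMode::Quirks;

    const std::string pub = lower(d.publicId);  // identifiers compare case-insensitively
    const bool html401 = startsWith(pub, "-//w3c//dtd html 4.01 transitional//") ||
                         startsWith(pub, "-//w3c//dtd html 4.01 frameset//");
    if (html401) {
        // With a system identifier: limited quirks. Without one: full quirks.
        return d.systemId.empty() ? DocumentMode::Quirks : DocumentMode::LimitedQuirks;
    }
    if (startsWith(pub, "-//w3c//dtd xhtml 1.0 transitional//") ||
        startsWith(pub, "-//w3c//dtd xhtml 1.0 frameset//")) {
        return DocumentMode::LimitedQuirks;
    }
    return DocumentMode::NoQuirks;  // includes <!DOCTYPE html>
}
```

With this sketch, `<!DOCTYPE html>` maps to `NoQuirks`, a missing DOCTYPE maps to `Quirks`, and the HTML 4.01 Transitional DOCTYPE shown above (which carries a system identifier) maps to `LimitedQuirks`.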

Parser-blocking scripts and async/defer

Script elements can significantly impact HTML parsing performance and page load times. Understanding how scripts interact with the parser is crucial for optimization.

Parser-blocking behavior

By default, when the parser encounters a <script> tag:
1. Pause HTML parsing

The parser stops processing further HTML once it has read the complete <script> element, up to and including its </script> end tag.
2. Download script (if external)

If the script has a src attribute, the browser downloads the script file. This blocks all further parsing until the download completes.
3. Execute script

The JavaScript code is executed immediately. The script can access all DOM nodes that have been parsed so far but cannot access elements that appear later in the document.
4. Resume HTML parsing

Only after script execution completes does the parser continue processing the remaining HTML.
Parser-blocking scripts can severely impact page load performance, especially when scripts are large or the network is slow.

Script loading strategies

Modern HTML provides attributes to control script loading and execution:
<!-- Blocks parsing until downloaded and executed -->
<script src="script.js"></script>

<div>This content waits for script.js to download and execute</div>

Comparison of script loading strategies

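| Attribute      | Download       | Parsing   | Execution timing           | Order guaranteed |
|----------------|----------------|-----------|----------------------------|------------------|
| None (default) | Blocks parsing | Blocked   | Immediately after download | Yes              |
| async          | Parallel       | Continues | As soon as downloaded      | No               |
| defer          | Parallel       | Continues | After parsing completes    | Yes              |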
Best practices for script loading:
  • Use defer for scripts that don’t need to execute immediately
  • Use async for independent scripts like analytics that don’t depend on other scripts
  • Place blocking scripts at the end of <body> if async/defer aren’t suitable
  • Use module scripts (<script type="module">) which defer by default
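To see how the three strategies affect execution order, the following sketch simulates a parser clock. It rests on several assumptions that do not hold exactly in real browsers: downloads start when the parser reaches the tag (no preload scanner), script execution takes no time, and time is measured in arbitrary integer units. `ScriptTag`, `Loading`, and `executionOrder` are hypothetical names.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

enum class Loading { Blocking, Async, Defer };

struct ScriptTag {
    std::string name;
    Loading loading;
    int parseGap;      // time spent parsing the markup that precedes this tag
    int downloadTime;  // time needed to fetch the script
};

using Event = std::pair<int, std::string>;  // (execution time, script name)

// Returns script names in the order they would execute under the toy model.
// tailParseTime is the time needed to parse the markup after the last script.
std::vector<std::string> executionOrder(const std::vector<ScriptTag>& scripts,
                                        int tailParseTime) {
    std::vector<Event> events;
    std::vector<Event> deferred;  // (download finished, name), in document order
    int clock = 0;                // parser time

    for (const ScriptTag& s : scripts) {
        clock += s.parseGap;
        switch (s.loading) {
        case Loading::Blocking:
            clock += s.downloadTime;  // the parser waits for the download
            events.push_back(Event(clock, s.name));
            break;
        case Loading::Async:
            events.push_back(Event(clock + s.downloadTime, s.name));
            break;
        case Loading::Defer:
            deferred.push_back(Event(clock + s.downloadTime, s.name));
            break;
        }
    }

    // Deferred scripts run after parsing, in document order; a slow script
    // holds back the ones that follow it.
    int t = clock + tailParseTime;
    for (const Event& d : deferred) {
        t = std::max(t, d.first);
        events.push_back(Event(t, d.second));
    }

    std::stable_sort(events.begin(), events.end(),
                     [](const Event& a, const Event& b) { return a.first < b.first; });

    std::vector<std::string> order;
    for (const Event& e : events) order.push_back(e.second);
    return order;
}
```

For example, with a deferred `a.js` (gap 1, download 5), an async `b.js` (gap 1, download 1), a blocking `c.js` (gap 1, download 2), and a tail parse time of 1, the model executes `b.js`, then `c.js`, then `a.js`: the async script runs as soon as it arrives, and the deferred one waits until parsing has finished.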

Implementation details

void HTMLTreeConstructor::handleScriptElement(Element* script) {
    bool parser_inserted = true;
    bool async = script->hasAttribute("async");
    bool defer = script->hasAttribute("defer");
    
    if (script->hasAttribute("src")) {
        // External script
        if (async) {
            // Load asynchronously, execute when ready
            script_loader->loadAsync(script);
            // Continue parsing immediately
        }
        else if (defer && parser_inserted) {
            // Load asynchronously, execute after parsing
            script_loader->loadDeferred(script);
            // Continue parsing immediately
        }
        else {
            // Classic parser-blocking script
            pauseParser();
            script_loader->loadSync(script, [this]() {
                resumeParser();
            });
        }
    }
    else {
        // Inline script - always blocks
        pauseParser();
        executeScript(script->textContent());
        resumeParser();
    }
}
You now understand the complete HTML parsing pipeline: from tokenization through tree construction to script handling. This knowledge is essential for building browser engines and optimizing page load performance.
