Tokenization state machine
The HTML tokenizer implements a state machine that consumes the input character stream and produces tokens. The current HTML Standard defines 80 tokenization states; each state specifies which characters it consumes, which tokens it emits, and which state comes next.
Initialize tokenizer state
Begin in the data state, the default state for parsing regular content. The tokenizer maintains a current state and processes characters one at a time.
Process character stream
Read characters from the input stream and transition between states based on the current character and current state. Each state has specific rules for handling different characters.
Emit tokens
When appropriate conditions are met, emit tokens such as:
- Start tag tokens (e.g., `<div>`)
- End tag tokens (e.g., `</div>`)
- Character tokens (text content)
- Comment tokens
- DOCTYPE tokens
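The three steps above can be sketched as a small state machine. This is a deliberately reduced illustration, not the specification's algorithm: the names `State` and `tokenize` are hypothetical, only four states are modeled, and attributes, comments, DOCTYPEs, and character references are not handled. It also groups adjacent characters into one token, whereas the specification emits character tokens one at a time.

```python
from enum import Enum, auto


class State(Enum):
    DATA = auto()
    TAG_OPEN = auto()
    END_TAG_OPEN = auto()
    TAG_NAME = auto()


def tokenize(html):
    """Return (kind, value) tokens for a tiny subset of HTML (no attributes)."""
    state = State.DATA            # step 1: begin in the data state
    tokens, text, name = [], [], []
    is_end_tag = False

    def flush_text():
        # Emit any pending text as a single character token.
        if text:
            tokens.append(("characters", "".join(text)))
            text.clear()

    i = 0
    while i < len(html):          # step 2: process the character stream
        ch = html[i]
        i += 1
        if state is State.DATA:
            if ch == "<":
                state = State.TAG_OPEN
            else:
                text.append(ch)
        elif state is State.TAG_OPEN:
            if ch == "/":
                state = State.END_TAG_OPEN
            elif ch.isalpha():
                flush_text()
                is_end_tag, name = False, [ch.lower()]
                state = State.TAG_NAME
            else:
                text.append("<")  # not a tag: keep "<" as text
                state = State.DATA
                i -= 1            # reconsume ch in the data state
        elif state is State.END_TAG_OPEN:
            if ch.isalpha():
                flush_text()
                is_end_tag, name = True, [ch.lower()]
                state = State.TAG_NAME
            else:
                text.append("</")  # simplification: treat as literal text
                state = State.DATA
                i -= 1
        elif state is State.TAG_NAME:
            if ch == ">":         # step 3: emit a tag token
                kind = "end_tag" if is_end_tag else "start_tag"
                tokens.append((kind, "".join(name)))
                state = State.DATA
            else:
                name.append(ch.lower())
    flush_text()
    return tokens


print(tokenize("<div>Hi</div>"))
```

The explicit index, rather than a plain `for` loop, is what lets a state hand a character back to be reconsumed in another state, a pattern the real tokenizer relies on heavily.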
State machine implementation
The tokenizer processes characters and transitions between states according to the rules of the current state.
Tree construction algorithm
Once tokens are generated, the tree construction algorithm builds the DOM tree. This process uses a stack-based approach with insertion modes and special rules for different contexts.
Core algorithm components
Stack of open elements
Maintains the hierarchy of currently open elements. Elements are pushed onto the stack when their start tag is encountered and popped when their end tag is processed. The stack represents the current nesting structure and is used to determine where new elements should be inserted.
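As a rough illustration of the push and pop behavior, the sketch below records the stack after every tag token. The function name `trace_open_elements` and the `(kind, name)` token shape are assumptions made for this example. Void elements, scoping rules, and implied end tags are ignored.

```python
def trace_open_elements(tokens):
    """Return a snapshot of the stack of open elements after each tag token."""
    stack = []
    snapshots = []
    for kind, name in tokens:
        if kind == "start_tag":
            stack.append(name)
        elif kind == "end_tag" and name in stack:
            # Pop up to and including the nearest matching open element.
            while stack.pop() != name:
                pass
        else:
            continue  # text and unmatched end tags leave the stack unchanged
        snapshots.append(list(stack))
    return snapshots


tokens = [
    ("start_tag", "div"),
    ("start_tag", "p"),
    ("characters", "hi"),
    ("end_tag", "p"),
    ("end_tag", "div"),
]
print(trace_open_elements(tokens))
```

Note that an end tag closes every element opened after its match, which is one simple way a mis-nested document still ends up with a consistent stack.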
Insertion modes
The tree constructor operates in different insertion modes depending on the current context:
- Initial mode: Before any content
- Before html mode: Before the `<html>` element
- Before head mode: Before the `<head>` element
- In head mode: Inside the `<head>` element
- After head mode: Between `</head>` and `<body>`
- In body mode: Inside the `<body>` element (most common)
- In table mode: Inside `<table>` elements
- In select mode: Inside `<select>` elements
- After body mode: After the `</body>` end tag
- After after body mode: Final state
Active formatting elements
Tracks formatting elements (like `<b>`, `<i>`, `<a>`) that need to be reopened when their formatting scope is broken by block-level elements. This ensures that markup like `<b>Bold <div>Block</div> Still Bold</b>` is handled correctly.
Foster parenting
Special handling for table-related content that appears in invalid positions. Content that shouldn’t be in tables is “fostered” out to maintain proper structure.
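A minimal sketch of the core move, assuming a hypothetical `Node` class and `foster_parent` helper: the misplaced node is inserted into the table's parent, immediately before the table. The specification also covers the case where the table has no parent; this sketch simply rejects it.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def append(self, child):
        child.parent = self
        self.children.append(child)


def foster_parent(table, node):
    """Insert node as the sibling immediately before table."""
    parent = table.parent
    if parent is None:
        raise ValueError("this sketch only handles a table that has a parent")
    position = parent.children.index(table)
    parent.children.insert(position, node)
    node.parent = parent


# Roughly what happens to "oops" in: <table>oops<tr><td>cell</td></tr></table>
body = Node("body")
table = Node("table")
body.append(table)
foster_parent(table, Node("#text: oops"))
names = [child.name for child in body.children]
print(names)
```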
Tree construction process
Receive token from tokenizer
The tree constructor receives tokens one at a time from the tokenizer. Each token type requires different handling.
Determine current insertion mode
Based on the current parsing context and open elements stack, determine which insertion mode is active.
Process token according to insertion mode
Apply the specific rules for the current insertion mode to handle the token. This may involve:
- Creating new elements
- Pushing elements onto the stack
- Popping elements from the stack
- Switching insertion modes
- Triggering error recovery
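The loop described above can be sketched with a handful of insertion modes. Everything here is an assumption made for illustration: the `Mode`, `Node`, and `TreeBuilder` names, the `(kind, value)` token shape, and the reduced rules. In particular, this sketch treats anything other than `</head>` as body content, so head-only elements such as `<title>` are not modeled.

```python
from enum import Enum, auto


class Mode(Enum):
    BEFORE_HTML = auto()
    BEFORE_HEAD = auto()
    IN_HEAD = auto()
    AFTER_HEAD = auto()
    IN_BODY = auto()


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []

    def outline(self):
        """Nested-list view of this subtree, for easy inspection."""
        return [self.tag] + [
            c.outline() if isinstance(c, Node) else c for c in self.children
        ]


class TreeBuilder:
    def __init__(self):
        self.document = Node("#document")
        self.open_elements = [self.document]
        self.mode = Mode.BEFORE_HTML

    def insert(self, tag):
        node = Node(tag)
        self.open_elements[-1].children.append(node)
        self.open_elements.append(node)

    def process(self, token):
        # Reprocess the token until some mode reports it as consumed.
        while not self.step(token):
            pass

    def step(self, token):
        """Apply the current mode's rules; return True if the token is consumed."""
        kind, value = token
        if self.mode is Mode.BEFORE_HTML:
            self.insert("html")            # explicit or implied <html>
            self.mode = Mode.BEFORE_HEAD
            return token == ("start_tag", "html")
        if self.mode is Mode.BEFORE_HEAD:
            self.insert("head")            # explicit or implied <head>
            self.mode = Mode.IN_HEAD
            return token == ("start_tag", "head")
        if self.mode is Mode.IN_HEAD:
            self.open_elements.pop()       # explicit or implied </head>
            self.mode = Mode.AFTER_HEAD
            return token == ("end_tag", "head")
        if self.mode is Mode.AFTER_HEAD:
            self.insert("body")            # explicit or implied <body>
            self.mode = Mode.IN_BODY
            return token == ("start_tag", "body")
        # Mode.IN_BODY
        if kind == "start_tag":
            self.insert(value)
        elif kind == "characters":
            self.open_elements[-1].children.append(value)
        elif kind == "end_tag":
            # Indexes 0-2 hold #document, html and body; never pop those here.
            for i in range(len(self.open_elements) - 1, 2, -1):
                if self.open_elements[i].tag == value:
                    del self.open_elements[i:]
                    break
        return True


builder = TreeBuilder()
for token in [("start_tag", "p"), ("characters", "hi"), ("end_tag", "p")]:
    builder.process(token)
print(builder.document.outline())
```

Reprocessing a token after a mode switch is how a document that starts with a bare `<p>` still ends up with `html`, `head`, and `body` elements.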
Error recovery and quirks mode
Browsers must handle malformed HTML gracefully. The HTML specification defines detailed error recovery procedures to ensure consistent rendering across browsers, even for invalid markup.
Error recovery mechanisms
- Invalid nesting
Quirks mode detection
Browsers use different rendering modes based on the DOCTYPE to maintain compatibility with legacy web pages.
Standards mode
Triggered by the HTML5 DOCTYPE (`<!DOCTYPE html>`) or other modern DOCTYPEs. Renders according to modern web standards with strict layout rules.
Quirks mode
Triggered by a missing DOCTYPE or certain legacy DOCTYPEs. Emulates rendering bugs in older browsers (IE 5, Netscape 4) for backward compatibility. Affects:
- Box model calculations
- Table layout
- Font size calculations
- Line height handling
Limited quirks mode
Triggered by certain DOCTYPEs (also called “almost standards mode”). Affects only the layout of table cells that contain images; all other rendering follows standards mode.
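The three modes can be illustrated with a simplified classifier. The function name `document_mode` and its parameters are hypothetical, and the prefix tuples hold only a few of the public identifiers that the specification lists, so treat this as a sketch of the decision order rather than a complete detector.

```python
# Small, illustrative subsets of the public identifier prefixes in the spec.
QUIRKS_PREFIXES = (
    "-//w3c//dtd html 3.2 final//",
    "-//w3c//dtd html 4.0 transitional//",
)
# HTML 4.01 Transitional/Frameset: quirks without a system identifier,
# limited quirks with one.
HTML401_LOOSE_PREFIXES = (
    "-//w3c//dtd html 4.01 frameset//",
    "-//w3c//dtd html 4.01 transitional//",
)
LIMITED_QUIRKS_PREFIXES = (
    "-//w3c//dtd xhtml 1.0 frameset//",
    "-//w3c//dtd xhtml 1.0 transitional//",
)


def document_mode(name=None, public_id=None, system_id=None):
    """Classify a DOCTYPE as 'quirks', 'limited-quirks', or 'standards'.

    Pass no arguments to model a document without a DOCTYPE.
    """
    if name is None or name.lower() != "html":
        return "quirks"
    public = (public_id or "").lower()
    if public.startswith(QUIRKS_PREFIXES):
        return "quirks"
    if public.startswith(HTML401_LOOSE_PREFIXES):
        return "quirks" if system_id is None else "limited-quirks"
    if public.startswith(LIMITED_QUIRKS_PREFIXES):
        return "limited-quirks"
    return "standards"  # the specification calls this "no-quirks mode"


print(document_mode())         # no DOCTYPE at all
print(document_mode("html"))   # <!DOCTYPE html>
```

In a browser, the result of this decision is observable through `document.compatMode`, which reports `BackCompat` in quirks mode and `CSS1Compat` otherwise.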
Parser-blocking scripts and async/defer
Script elements can significantly impact HTML parsing performance and page load times. Understanding how scripts interact with the parser is crucial for optimization.
Parser-blocking behavior
By default, when the parser encounters a `<script>` tag:
Pause HTML parsing
The parser stops processing HTML tokens when it reaches a `<script>` element that has neither `async` nor `defer`.
Download script (if external)
If the script has a `src` attribute, the browser downloads the script file. This blocks all further parsing until the download completes.
Execute script
The JavaScript code is executed immediately. The script can access all DOM nodes that have been parsed so far but cannot access elements that appear later in the document.
Parser-blocking scripts can severely impact page load performance, especially when scripts are large or the network is slow.
Script loading strategies
Modern HTML provides the `async` and `defer` attributes to control script loading and execution.
Comparison of script loading strategies
| Attribute | Download | Parsing | Execution timing | Order guaranteed |
|---|---|---|---|---|
| None (default) | Blocks parsing | Blocked | Immediately after download | Yes |
| `async` | Parallel | Continues | As soon as downloaded | No |
| `defer` | Parallel | Continues | After parsing completes | Yes |
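To make the table concrete, the following toy model computes the order in which scripts would run. It rests on several stated assumptions: the function name `execution_order` and the `(name, mode, download_ms)` tuples are invented for this example, each download starts only when the parser reaches the tag (real browsers often start earlier through speculative preloading), and parsing and execution themselves take no time.

```python
def execution_order(scripts):
    """scripts: (name, mode, download_ms) tuples in document order.

    mode is "block" (no attribute), "async", or "defer".
    """
    clock = 0        # time at which the parser reaches the next script
    events = []      # (execution_time, name)
    deferred = []    # (ready_time, name), kept in document order
    for name, mode, download_ms in scripts:
        ready = clock + download_ms
        if mode == "async":
            events.append((ready, name))      # runs whenever it arrives
        elif mode == "defer":
            deferred.append((ready, name))    # waits for the end of parsing
        else:
            clock = ready                     # parser is stalled until then
            events.append((clock, name))
    # Deferred scripts run after parsing, strictly in document order.
    time = clock
    for ready, name in deferred:
        time = max(time, ready)
        events.append((time, name))
    # Stable sort: events at the same instant keep their insertion order.
    return [name for _, name in sorted(events, key=lambda event: event[0])]


scripts = [
    ("a.js", "async", 300),
    ("b.js", "defer", 50),
    ("c.js", "block", 100),
    ("d.js", "async", 10),
]
print(execution_order(scripts))
```

In this run the slow `async` script finishes last even though it appears first, while the `defer` script waits for the blocking script ahead of it despite downloading faster.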
Implementation details
You now understand the complete HTML parsing pipeline: from tokenization through tree construction to script handling. This knowledge is essential for building browser engines and optimizing page load performance.