Overview
The DOCX backend (MsWordDocumentBackend) parses Microsoft Word documents (.docx files) and converts them directly to DoclingDocument format. It preserves document structure, formatting, and embedded content without requiring ML-based analysis.
Features
- Complete structure preservation - Headings, paragraphs, lists, tables
- Rich formatting support - Bold, italic, underline, strikethrough, superscript, subscript
- Hyperlinks and cross-references - Preserves internal and external links
- Table extraction - Full table structure with merged cells
- Image extraction - Embedded pictures and diagrams
- Equation support - Converts Office Math (OMML) to LaTeX
- Textbox content - Extracts text from textboxes and shapes
- Comments - Preserves document comments
- Header and footer - Extracts header/footer content
- List numbering - Maintains numbered and bulleted lists
Usage
Basic Conversion
With Format Options
Supported Elements
Text and Formatting
Paragraphs and Headings
The backend automatically detects:
- Heading levels (H1-H9) based on paragraph styles
- Title and subtitle styles
- Normal paragraphs and body text
- Numbered headings (preserves numbering)
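As an illustration of style-based heading detection, here is a minimal sketch that maps a paragraph style name to a heading level. The helper `heading_level` and its return convention (0 for Title, `None` for body text) are hypothetical and not taken from the backend's source.

```python
import re
from typing import Optional

# Hypothetical helper, not part of Docling: matches "Heading 1" .. "Heading 9".
_HEADING_RE = re.compile(r"^heading\s*([1-9])$", re.IGNORECASE)


def heading_level(style_name: str) -> Optional[int]:
    """Return 0 for Title, 1-9 for 'Heading N', and None for anything else."""
    name = style_name.strip()
    if name.lower() == "title":
        return 0
    match = _HEADING_RE.match(name)
    return int(match.group(1)) if match else None
```

Styles such as "Normal" fall through to `None`, which a caller could treat as body text.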
Text Formatting
Supported inline formatting:
- Bold (<w:b>)
- Italic (<w:i>)
- Underline (<w:u>)
- Strikethrough (<w:strike>)
- Subscript and superscript (<w:vertAlign>)
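To show how these WordprocessingML elements might be read, the sketch below inspects the `<w:rPr>` block of a single run using only the standard library. The function `run_formatting`, its dictionary keys, and the treatment of a bare `<w:u/>` as underlined are assumptions made for illustration; the backend's own logic may differ.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = f"{{{W}}}val"  # fully qualified name of the w:val attribute


def _toggle(rpr: ET.Element, tag: str) -> bool:
    """A toggle such as <w:b/> is on unless its w:val is false, 0, or off."""
    el = rpr.find(f"w:{tag}", NS)
    if el is None:
        return False
    return el.get(VAL, "true") not in ("false", "0", "off")


def run_formatting(run_xml: str) -> dict:
    """Hypothetical helper: summarize the inline formatting of one <w:r>."""
    run = ET.fromstring(run_xml)
    rpr = run.find("w:rPr", NS)
    if rpr is None:
        rpr = ET.Element("rPr")  # empty placeholder: no formatting set
    underline = rpr.find("w:u", NS)
    vert = rpr.find("w:vertAlign", NS)
    script = vert.get(VAL) if vert is not None else None
    return {
        "bold": _toggle(rpr, "b"),
        "italic": _toggle(rpr, "i"),
        "strikethrough": _toggle(rpr, "strike"),
        # Assumption: a <w:u> element counts as underline unless val is "none".
        "underline": underline is not None
        and underline.get(VAL, "single") != "none",
        # "superscript" or "subscript"; "baseline" is reported as None.
        "script": script if script in ("superscript", "subscript") else None,
    }


SAMPLE_RUN = (
    f'<w:r xmlns:w="{W}"><w:rPr><w:b/><w:i w:val="0"/>'
    '<w:u w:val="single"/><w:vertAlign w:val="superscript"/></w:rPr>'
    "<w:t>x</w:t></w:r>"
)
print(run_formatting(SAMPLE_RUN))
```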
Hyperlinks
Internal and external hyperlinks are preserved.
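In WordprocessingML, an external link carries an `r:id` attribute that points into the part's relationships, while an internal link carries a `w:anchor` attribute naming a bookmark. The sketch below resolves both cases. The function `hyperlink_target`, the plain-dict relationship table, and the `#bookmark` notation for internal targets are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def hyperlink_target(link_xml: str, relationships: Dict[str, str]) -> Optional[str]:
    """Hypothetical helper: return the URL or '#bookmark' for a <w:hyperlink>."""
    link = ET.fromstring(link_xml)
    rel_id = link.get(f"{{{R}}}id")
    if rel_id is not None:
        # External link: look the relationship ID up in the supplied table.
        return relationships.get(rel_id)
    anchor = link.get(f"{{{W}}}anchor")
    # Internal link: refer to the bookmark by name (notation is an assumption).
    return f"#{anchor}" if anchor else None


RELS = {"rId5": "https://example.com/"}
EXTERNAL = f'<w:hyperlink xmlns:w="{W}" xmlns:r="{R}" r:id="rId5"/>'
INTERNAL = f'<w:hyperlink xmlns:w="{W}" w:anchor="section_2"/>'
print(hyperlink_target(EXTERNAL, RELS), hyperlink_target(INTERNAL, RELS))
```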
Lists
The backend fully supports Word’s list structures:
- Bulleted lists - Unordered lists with various bullet styles
- Numbered lists - Ordered lists with automatic numbering
- Multi-level lists - Nested list hierarchies
- Mixed lists - Combination of numbered and bulleted items
Tables
Complete table extraction with:
- Cell content and formatting
- Merged cells (rowspan/colspan)
- Header row detection
- Nested table support
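Word expresses merged cells through `<w:gridSpan>` (horizontal span) and `<w:vMerge>` (vertical merge, with a "restart" cell followed by continuation cells). The following sketch turns already-parsed cell markers into rowspan/colspan values. The `RawCell` and `Cell` types and the `resolve_spans` function are hypothetical and simplified; they are not the backend's data model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RawCell:
    text: str
    grid_span: int = 1             # value of <w:gridSpan w:val="...">
    v_merge: Optional[str] = None  # "restart", "continue", or None


@dataclass
class Cell:
    text: str
    row: int
    col: int
    rowspan: int = 1
    colspan: int = 1


def resolve_spans(rows: List[List[RawCell]]) -> List[Cell]:
    """Collapse vertical-merge continuation cells into the cell that started them."""
    cells: List[Cell] = []
    open_merges: Dict[int, Cell] = {}  # start column -> cell being extended
    for r, row in enumerate(rows):
        col = 0
        for raw in row:
            if raw.v_merge == "continue" and col in open_merges:
                open_merges[col].rowspan += 1
            else:
                cell = Cell(raw.text, r, col, 1, raw.grid_span)
                cells.append(cell)
                if raw.v_merge == "restart":
                    open_merges[col] = cell
                else:
                    open_merges.pop(col, None)
            col += raw.grid_span
    return cells


TABLE = [
    [RawCell("A", grid_span=2), RawCell("B", v_merge="restart")],
    [RawCell("C"), RawCell("D"), RawCell("", v_merge="continue")],
]
for c in resolve_spans(TABLE):
    print(c)
```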
Images and Diagrams
Extracts embedded images:
- Inline pictures
- Floating images
- DrawingML shapes (requires LibreOffice)
- VML graphics
Equations
Office Math Markup Language (OMML) equations are converted to LaTeX.
Textboxes
Content from textboxes and shapes is extracted:
- Modern Word textboxes (<w:txbxContent>)
- Legacy VML textboxes
- DrawingML shape text
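For the modern textbox case, a rough sketch of pulling paragraph text out of `<w:txbxContent>` elements could look like this. The function `textbox_paragraphs` is hypothetical, and the sample XML is heavily simplified: real documents wrap the textbox in drawing and shape elements, and nested textboxes would need deduplication that is not handled here.

```python
import xml.etree.ElementTree as ET
from typing import List

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def textbox_paragraphs(xml_fragment: str) -> List[str]:
    """Hypothetical helper: text of each paragraph inside any <w:txbxContent>."""
    root = ET.fromstring(xml_fragment)
    paragraphs: List[str] = []
    for box in root.iter(f"{{{W}}}txbxContent"):
        for para in box.iter(f"{{{W}}}p"):
            # A paragraph's text is the concatenation of its <w:t> runs.
            text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
            if text:
                paragraphs.append(text)
    return paragraphs


SAMPLE = (
    f'<w:body xmlns:w="{W}">'
    "<w:p><w:r><w:t>Body text</w:t></w:r></w:p>"
    "<w:txbxContent><w:p>"
    "<w:r><w:t>Side </w:t></w:r><w:r><w:t>note</w:t></w:r>"
    "</w:p></w:txbxContent>"
    "</w:body>"
)
print(textbox_paragraphs(SAMPLE))
```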
DrawingML Support
For complex DrawingML elements (charts, diagrams, SmartArt), Docling can use LibreOffice for conversion.
Setup
Comments
Document comments are extracted and linked to their annotated paragraphs.
Header and Footer
Header and footer content is extracted as furniture-layer content.
Advanced Features
Numbered Headings
When a Word document uses numbered headings (e.g., “1.2.3 Section Title”), the numbering is preserved in the output.
List Counters
The backend tracks list counters across the document:
- Separate counters per list ID and level
- Automatic reset on new sequences
- Support for custom start numbers
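The counter behaviour listed above can be sketched with a small class keyed by (list ID, level). The class `ListCounters`, its method names, and the reset rule (advancing a level clears all deeper levels of the same list) are assumptions chosen for illustration, not the backend's actual implementation.

```python
from typing import Dict, Tuple

Key = Tuple[int, int]  # (list ID, nesting level)


class ListCounters:
    """Hypothetical tracker: one counter per list ID and nesting level."""

    def __init__(self) -> None:
        self._counts: Dict[Key, int] = {}
        self._starts: Dict[Key, int] = {}

    def set_start(self, list_id: int, level: int, start: int) -> None:
        """Register a custom start number for one list level."""
        self._starts[(list_id, level)] = start

    def next(self, list_id: int, level: int) -> int:
        """Return the number for the next item at this list ID and level."""
        key = (list_id, level)
        start = self._starts.get(key, 1)
        self._counts[key] = self._counts.get(key, start - 1) + 1
        # Assumed reset rule: a new item restarts every deeper level of its list.
        for other in list(self._counts):
            if other[0] == list_id and other[1] > level:
                del self._counts[other]
        return self._counts[key]


counters = ListCounters()
counters.set_start(2, 0, 5)  # list 2 starts numbering at 5
sequence = [counters.next(1, 0), counters.next(1, 1), counters.next(1, 1),
            counters.next(1, 0), counters.next(1, 1), counters.next(2, 0)]
print(sequence)
```

Keeping the list ID in the key means two interleaved lists never disturb each other's numbering.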
Style Detection
Automatic detection of Word styles:
- Built-in styles (Heading 1-9, Title, Normal, etc.)
- Custom user styles
- Style inheritance
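Style inheritance in Word works through a "based on" link from a custom style to its parent. One way to resolve a custom style to a built-in one is to walk that chain, as sketched below. The dictionary layout (`name`, `based_on`), the `BUILT_IN` set, and the function `resolve_builtin` are hypothetical simplifications of what a styles part contains.

```python
from typing import Dict, Optional

# Assumed set of built-in style names, following the list above.
BUILT_IN = {"Title", "Normal"} | {f"Heading {n}" for n in range(1, 10)}


def resolve_builtin(style_id: str, styles: Dict[str, dict]) -> Optional[str]:
    """Follow based_on links until a built-in style name is reached."""
    seen = set()
    current: Optional[str] = style_id
    while current is not None and current not in seen:
        seen.add(current)  # guards against cyclic style definitions
        style = styles.get(current)
        if style is None:
            return None
        if style["name"] in BUILT_IN:
            return style["name"]
        current = style.get("based_on")
    return None


STYLES = {
    "MyChapter": {"name": "My Chapter", "based_on": "Heading1"},
    "Heading1": {"name": "Heading 1", "based_on": "Normal"},
    "Normal": {"name": "Normal", "based_on": None},
    "Loop": {"name": "Loop", "based_on": "Loop"},
}
print(resolve_builtin("MyChapter", STYLES))
```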
Limitations
Performance
- Speed: Very fast for declarative conversion (no ML models)
- Memory: Low memory footprint
- Concurrency: Thread-safe per document instance
Troubleshooting
Missing images
Cause: DrawingML shapes require LibreOffice
Solution: Install LibreOffice (see DrawingML Support)
Incorrect list numbering
Cause: Custom numbering formats or a broken document
Solution: Check the source document in Word and ensure the numbering is valid
Missing text from textboxes
Cause: Nested or complex textbox structures
Workaround: The backend attempts multiple textbox formats; some edge cases may still fail to extract
Equation rendering issues
Cause: Complex OMML structures
Note: Most standard equations convert correctly to LaTeX
Export Formats
After conversion, the document can be exported to various formats.
See Also
- Backends Overview - Backend architecture
- PPTX Backend - PowerPoint processing
- XLSX Backend - Excel processing
- DocumentConverter - Main conversion API