Overview
The XLSX backend (MsExcelDocumentBackend) parses Microsoft Excel workbooks (.xlsx files) and converts them to DoclingDocument format. Each worksheet becomes a page, with data clusters automatically detected and extracted as tables.
Features
- Sheet-by-page conversion - Each worksheet becomes a document page
- Automatic table detection - Groups connected cells into logical tables
- Merged cell handling - Properly handles cell spans
- Image extraction - Embedded pictures and charts
- Gap tolerance - Configurable gap bridging for disconnected data
- Singleton cell handling - Option to treat single cells as text
- Hidden sheet support - Processes visible and hidden sheets
- Formula values - Extracts calculated values (not formulas)
Usage
Basic Conversion
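A minimal sketch of a basic conversion, assuming docling is installed and that its general-purpose DocumentConverter entry point is used. The helper name convert_workbook and the file name report.xlsx are hypothetical and chosen only for illustration.

```python
def convert_workbook(path):
    """Convert an .xlsx workbook and return the resulting DoclingDocument."""
    # Imported inside the function so this sketch can be loaded
    # even in an environment where docling is not installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(path)
    return result.document


# Usage (hypothetical file name):
#   doc = convert_workbook("report.xlsx")
#   print(len(doc.tables), "table(s) detected")
#   print(doc.export_to_markdown())
```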
With Backend Options
MsExcelBackendOptions
Configuration options for Excel parsing.
Parameters
- Backend type identifier. Always set to "xlsx" for Excel backends.
- treat_singleton_as_text: Whether to treat singleton cells (1x1 tables with empty neighboring cells) as TextItem instead of TableItem. Use when:
  - The spreadsheet contains scattered labels or single values
  - You want individual cells as text rather than 1x1 tables
- gap_tolerance: The tolerance (in number of empty rows/columns) for merging nearby data clusters into a single table.
  - 0 (strict): Cells must be adjacent to be in the same table
  - 1: Allows 1 empty row/column between data
  - 2+: Bridges larger gaps
- Enable fetching of remote resources referenced in the workbook.
- Enable fetching of local resources referenced in the workbook.
Table Detection
The backend uses a flood-fill (BFS) algorithm to detect contiguous data regions:
Algorithm
Flood fill
Starting from each unvisited cell, expand to find connected cells
- Respects gap_tolerance for bridging gaps
- Creates a rectangular bounding box around the connected cells
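The flood-fill idea can be sketched in plain Python. This is an illustration under stated assumptions, not the backend's actual implementation: the sheet is modelled as a 2D list where None marks an empty cell, the function name find_tables is hypothetical, and neighbors are searched only along rows and columns.

```python
from collections import deque


def find_tables(grid, gap_tolerance=0):
    """Return 0-based bounding boxes (min_row, min_col, max_row, max_col)
    for clusters of non-empty cells. `grid` is a 2D list; None means empty."""
    n_rows = len(grid)
    n_cols = len(grid[0]) if grid else 0
    visited = set()
    tables = []
    # A step of `reach` cells skips over `gap_tolerance` empty cells.
    reach = gap_tolerance + 1

    for r0 in range(n_rows):
        for c0 in range(n_cols):
            if grid[r0][c0] is None or (r0, c0) in visited:
                continue
            # Breadth-first search from an unvisited, non-empty cell.
            queue = deque([(r0, c0)])
            visited.add((r0, c0))
            min_r = max_r = r0
            min_c = max_c = c0
            while queue:
                r, c = queue.popleft()
                min_r, max_r = min(min_r, r), max(max_r, r)
                min_c, max_c = min(min_c, c), max(max_c, c)
                # Look up, down, left and right, at most `reach` cells away.
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    for step in range(1, reach + 1):
                        nr, nc = r + dr * step, c + dc * step
                        if not (0 <= nr < n_rows and 0 <= nc < n_cols):
                            break
                        if grid[nr][nc] is not None and (nr, nc) not in visited:
                            visited.add((nr, nc))
                            queue.append((nr, nc))
            tables.append((min_r, min_c, max_r, max_c))
    return tables


# Hypothetical sheet: two 2x2-ish blocks split by one empty column,
# plus a lone cell two empty rows below.
grid = [
    ["a", "b", None, "x"],
    ["c", "d", None, "y"],
    [None, None, None, None],
    [None, None, None, None],
    ["t", None, None, None],
]
print(find_tables(grid, gap_tolerance=0))
print(find_tables(grid, gap_tolerance=1))
```

In this sketch a larger tolerance only widens the search radius of each BFS step; overlapping bounding boxes are not reconciled, which a real implementation would likely need to handle.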
Example
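Given a spreadsheet with a Name/Age/City block in A1:C3, an ID/Score block in E1:F3, and the single cell “Total: 2 people” in A5:
gap_tolerance=0 (default):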
- Table 1: A1:C3 (Name/Age/City table)
- Table 2: E1:F3 (ID/Score table)
- Table 3: A5:A5 (“Total: 2 people”)
gap_tolerance=1:
- Table 1: A1:F3 (All data merged into one table)
- Table 2: A5:A5 (Still separate due to 1-row gap)
treat_singleton_as_text=True:
- Table 1: A1:C3
- Table 2: E1:F3
- Text: “Total: 2 people” (as TextItem, not TableItem)
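The effect of treat_singleton_as_text on the regions above can be sketched as a small classification step. This is a simplified illustration: the function name classify_region and the "text"/"table" labels are hypothetical stand-ins for TextItem and TableItem, and bounds are assumed to be 0-based inclusive cell indices.

```python
def classify_region(bounds, treat_singleton_as_text=False):
    """Label a detected region as "text" or "table".

    bounds = (min_row, min_col, max_row, max_col), 0-based and inclusive.
    """
    min_r, min_c, max_r, max_c = bounds
    # A singleton region covers exactly one cell.
    is_singleton = min_r == max_r and min_c == max_c
    if is_singleton and treat_singleton_as_text:
        return "text"
    return "table"


# The three regions from the example: A1:C3, E1:F3 and A5:A5.
regions = [(0, 0, 2, 2), (0, 4, 2, 5), (4, 0, 4, 0)]
print([classify_region(b) for b in regions])
print([classify_region(b, treat_singleton_as_text=True) for b in regions])
```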
Merged Cells
Proper handling of Excel merged cells:
- Correct span calculation (rowspan, colspan)
- Hidden cells in merged regions excluded
- Cell content from top-left anchor cell
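The span arithmetic can be illustrated with a short, standard-library sketch. It assumes merge regions are given as A1-style range strings such as "B2:D3"; the helper names below are hypothetical and are not part of the backend.

```python
import re


def col_to_index(letters):
    """Convert column letters to a 0-based index ("A" -> 0, "AA" -> 26)."""
    idx = 0
    for ch in letters.upper():
        idx = idx * 26 + (ord(ch) - ord("A") + 1)
    return idx - 1


def parse_cell(ref):
    """Convert a reference such as "B2" to a 0-based (row, col) pair."""
    m = re.fullmatch(r"([A-Za-z]+)([0-9]+)", ref)
    if m is None:
        raise ValueError(f"not a cell reference: {ref!r}")
    return int(m.group(2)) - 1, col_to_index(m.group(1))


def merged_span(range_ref):
    """Return the anchor cell and spans of a merge range such as "B2:D3"."""
    start, end = range_ref.split(":")
    r0, c0 = parse_cell(start)
    r1, c1 = parse_cell(end)
    return {"anchor": (r0, c0), "rowspan": r1 - r0 + 1, "colspan": c1 - c0 + 1}


def covered_cells(range_ref):
    """List the non-anchor cells of a merge range, which carry no content."""
    span = merged_span(range_ref)
    r0, c0 = span["anchor"]
    return [
        (r, c)
        for r in range(r0, r0 + span["rowspan"])
        for c in range(c0, c0 + span["colspan"])
        if (r, c) != (r0, c0)
    ]


print(merged_span("B2:D3"))
print(covered_cells("A1:B2"))
```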
Images
Extracts embedded images and charts:
- Inline images
- Floating images
- Two-cell anchors (position and size)
- One-cell anchors (position only)
Worksheet Organization
Each worksheet creates a section group.
Hidden Sheets
Hidden worksheets are marked with the INVISIBLE content layer.
Provenance and Coordinates
Bounding boxes use 0-based cell indices as the coordinate system.
Advanced Usage
Extract Specific Tables
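One way to pick out the tables of a single worksheet is to filter on provenance. The sketch below assumes a converted document exposing a tables list whose items carry a prov list with a page_no attribute, and that page numbers follow worksheet order starting at 1. The helper name tables_on_page is hypothetical.

```python
def tables_on_page(doc, page_no):
    """Return the tables whose first provenance entry is on `page_no`."""
    selected = []
    for table in doc.tables:
        # Items without provenance information cannot be tied to a sheet.
        if table.prov and table.prov[0].page_no == page_no:
            selected.append(table)
    return selected


# Usage (assuming `doc` is the document produced by a conversion):
#   first_sheet_tables = tables_on_page(doc, 1)
#   print(len(first_sheet_tables), "table(s) on the first sheet")
```

Each returned item can then be exported individually, for example with the table item's export_to_dataframe method.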
Process Large Workbooks
Custom Gap Tolerance
Performance
- Speed: Fast for moderate-sized workbooks
- Memory: Memory usage scales with data size
- Concurrency: Thread-safe per document instance
Limitations
Troubleshooting
Too many small tables
Cause: Strict gap tolerance (default 0)
Solution: Increase gap_tolerance
Unwanted 1x1 tables
Cause: Singleton cells treated as tables
Solution: Enable treat_singleton_as_text
Missing data
Possible causes:
- Hidden sheets (check content layer)
- Empty cells not creating tables
- Data outside detected bounds
Incorrect merged cells
Solution: Check the Excel file for corrupted merge regions.
The backend respects Excel’s merge definitions exactly.
Use Cases
Data Extraction
Extract tabular data from Excel reports for analysis or database import
Report Processing
Convert financial or operational reports to structured format
Data Migration
Transform Excel data for import into other systems
Archive Processing
Extract and preserve data from Excel archives
Export Formats
See Also
- Backends Overview - Backend architecture
- DOCX Backend - Word document processing
- PPTX Backend - PowerPoint processing
- Table Export - Table extraction examples