Overview
Once you’ve converted a document to a DoclingDocument, Docling offers multiple export formats:
- Markdown: Human-readable text with formatting
- HTML: Rich HTML with embedded or linked images
- JSON: Structured data for programmatic access
- DocTags: Structured text format for downstream NLP
- Plain Text: Unformatted text content
- YAML: Human-readable structured data
Quick Export
A single converted document can be exported to any of the formats above.
Markdown Export
Markdown is the most common export format, ideal for RAG, documentation, and human reading.
Basic Markdown
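A minimal sketch of a basic Markdown export, assuming Docling is installed and that `DocumentConverter` (from `docling.document_converter`) returns a result whose `document` offers `export_to_markdown()`. The helper names are illustrative, not part of Docling.

```python
from pathlib import Path


def convert_to_markdown(source: str) -> str:
    """Convert a local path or URL with Docling and return Markdown text."""
    # Lazy import: only this function needs Docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    return result.document.export_to_markdown()


def write_text_file(text: str, out_path: Path) -> Path:
    """Write text as UTF-8, creating parent directories if needed."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

For example, `write_text_file(convert_to_markdown("report.pdf"), Path("out/report.md"))`, where `report.pdf` is a placeholder input.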
Plain Text (No Formatting)
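A small sketch of the plain-text variant, assuming `doc` is a DoclingDocument and that `export_to_markdown()` accepts `strict_text=True` as described on this page. Both helper names are hypothetical.

```python
def export_plain_text(doc) -> str:
    """Return the document text with Markdown formatting stripped."""
    # strict_text=True asks the Markdown exporter for unformatted text.
    return doc.export_to_markdown(strict_text=True)


def formatting_overhead(doc) -> int:
    """Characters that Markdown formatting adds over the plain-text export."""
    return len(doc.export_to_markdown()) - len(export_plain_text(doc))
```

`formatting_overhead` is only a rough way to see how much smaller the plain-text output is for a given document.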
Image Handling
Control how images are included:
- Placeholder (Default)
- Embedded (Base64)
- Referenced (File Paths)
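The three modes above can be compared side by side. This is a sketch that assumes `export_to_markdown()` takes an `image_mode` keyword and that the modes are enum members with a `.name`, as with `ImageRefMode` from `docling_core.types.doc`. The function name is illustrative.

```python
from typing import Dict, Iterable


def export_markdown_variants(doc, modes: Iterable) -> Dict[str, str]:
    """Export the same document once per image mode, keyed by mode name."""
    return {mode.name: doc.export_to_markdown(image_mode=mode) for mode in modes}
```

With Docling available, a call might look like `export_markdown_variants(doc, [ImageRefMode.PLACEHOLDER, ImageRefMode.EMBEDDED, ImageRefMode.REFERENCED])`.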
Save with Options
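A sketch of saving directly to disk, assuming the document provides `save_as_markdown(filename, image_mode=...)`. The wrapper and its `stem` parameter are hypothetical conveniences.

```python
from pathlib import Path


def save_markdown(doc, out_dir: Path, stem: str, image_mode) -> Path:
    """Save doc as <out_dir>/<stem>.md using the given image mode."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{stem}.md"
    # save_as_markdown writes the file itself.
    doc.save_as_markdown(target, image_mode=image_mode)
    return target
```

Returning the target path makes it easy to log or post-process the written file.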
HTML Export
HTML export creates rich, formatted output with embedded or linked images.
Basic HTML
HTML with Embedded Images
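One way to confirm that an HTML export is truly standalone is to check that every image is inlined. The sketch below uses only the standard library and assumes embedded images appear as `<img>` tags with `data:` URIs, which is an assumption about the exported markup rather than something this page states.

```python
from html.parser import HTMLParser
from typing import List


class _ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.sources.extend(v for k, v in attrs if k == "src" and v)


def image_sources(html: str) -> List[str]:
    """Return the src values of all images in the HTML, in document order."""
    parser = _ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


def is_self_contained(html: str) -> bool:
    """True when every image is inlined as a data: URI."""
    return all(src.startswith("data:") for src in image_sources(html))
```

With Docling, the input would typically come from something like `doc.export_to_html(image_mode=ImageRefMode.EMBEDDED)`.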
HTML with Page Images
JSON Export
JSON export provides structured, machine-readable document data.
Basic JSON
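A minimal sketch of a JSON export, assuming the document offers `export_to_dict()` returning JSON-serializable data. The helper name is illustrative.

```python
import json


def to_json_text(doc, indent: int = 2) -> str:
    """Serialize the document's dictionary form to a JSON string."""
    # ensure_ascii=False keeps non-ASCII characters readable in the output.
    return json.dumps(doc.export_to_dict(), indent=indent, ensure_ascii=False)
```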
JSON Structure
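To get a feel for the structure of an exported file, it can help to count the entries in each top-level list. The sketch below hard-codes no field names; it assumes only that collections such as text items or tables are stored as top-level JSON arrays.

```python
import json
from typing import Dict


def summarize_structure(json_text: str) -> Dict[str, int]:
    """Count entries in each list-valued top-level field of an exported document."""
    data = json.loads(json_text)
    return {key: len(value) for key, value in data.items() if isinstance(value, list)}
```

Fields that are not lists (strings, nested objects) are skipped, so the result is a compact inventory of the document's collections.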
JSON with Images
DocTags Export
DocTags is a structured text format designed for NLP pipelines.
The DocTags format is ideal for:
- Named entity recognition (NER)
- Document classification
- Information extraction
- Custom NLP pipelines
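As a first step in such a pipeline, the sketch below counts opening tags in a DocTags string. It assumes tags follow an XML-like `<name>` form; the sample in the usage note is made up for illustration and is not real Docling output.

```python
import re
from collections import Counter

# Matches opening tags such as <text>; closing tags like </text> are ignored.
_OPEN_TAG = re.compile(r"<([A-Za-z_][A-Za-z0-9_]*)>")


def tag_histogram(doctags: str) -> Counter:
    """Count opening tags in a DocTags string, by tag name."""
    return Counter(_OPEN_TAG.findall(doctags))
```

For example, `tag_histogram("<doctag><text>Hello</text></doctag>")` counts one `doctag` and one `text`. With Docling, the input would come from the DocTags export, assuming a docling-core release that exposes `export_to_doctags()`.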
YAML Export
YAML export produces a human-readable structured data format.
Batch Export
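A batch export can be sketched as a loop over documents and exporter callables. The function below is a hypothetical helper, not a Docling API; it only assumes each exporter returns a string.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Tuple


def export_batch(
    docs: Iterable[Tuple[str, object]],
    out_dir: Path,
    exporters: Dict[str, Callable[[object], str]],
) -> List[Path]:
    """Write every (name, doc) pair once per exporter; return the written paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, doc in docs:
        for suffix, export in exporters.items():
            target = out_dir / f"{name}.{suffix}"
            target.write_text(export(doc), encoding="utf-8")
            written.append(target)
    return written
```

With Docling, `exporters` might be `{"md": lambda d: d.export_to_markdown(), "html": lambda d: d.export_to_html()}`, and `docs` could be built from `DocumentConverter().convert_all(paths)` as `(r.input.file.stem, r.document)` pairs, assuming those attributes are available in your version.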
Multiple documents can be exported to several formats in one run.
Multimodal Export (Parquet)
Export page images, text, and metadata to Parquet for machine learning.
Parquet export is useful for:
- Training multimodal ML models
- Building document datasets
- Efficient storage of page images + text
- Integration with data science workflows
Custom Export Pipeline
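A custom export usually starts by walking the document's items. The sketch below assumes `doc.iterate_items()` yields `(item, level)` pairs and that text-bearing items expose a `.text` attribute. The function name is illustrative.

```python
from typing import List, Tuple


def text_outline(doc) -> List[Tuple[int, str]]:
    """Return (nesting level, text) for every item that carries text."""
    outline = []
    for item, level in doc.iterate_items():
        # Items without a .text attribute (or with empty text) are skipped.
        text = getattr(item, "text", "")
        if text:
            outline.append((level, text))
    return outline
```

The resulting list can be rendered into whatever custom format you need, for example an indented outline.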
The document structure can also be accessed programmatically to build custom exports.
Export Comparison
| Format | Use Case | Images | Structure | Size |
|---|---|---|---|---|
| Markdown | RAG, documentation, human reading | Embedded/Linked | Basic | Small |
| HTML | Web display, rich previews | Embedded/Linked | Rich | Medium |
| JSON | API integration, programmatic access | Embedded/Linked | Full | Medium |
| DocTags | NLP pipelines, text analysis | No | Semantic | Small |
| Plain Text | Search indexing, simple RAG | No | None | Smallest |
| YAML | Configuration, human editing | Embedded/Linked | Full | Medium |
| Parquet | ML datasets, analytics | Raw bytes | Full + metadata | Large |
Best Practices
Choose the right format for your use case
- RAG/Search: Markdown or Plain Text
- Web display: HTML with embedded images
- API integration: JSON
- NLP pipelines: DocTags
- ML training: Parquet
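The recommendations above can be encoded as a small lookup, for instance to pick output formats from a configuration value. The keys and function name below are hypothetical; only the use-case-to-format pairs come from the list.

```python
# Use case -> recommended formats, taken from the list above.
RECOMMENDED_FORMATS = {
    "rag": ("markdown", "plain_text"),
    "search": ("markdown", "plain_text"),
    "web_display": ("html",),
    "api_integration": ("json",),
    "nlp_pipelines": ("doctags",),
    "ml_training": ("parquet",),
}


def recommend_formats(use_case: str) -> tuple:
    """Look up recommended formats; accepts spaces or hyphens in the name."""
    key = use_case.strip().lower().replace(" ", "_").replace("-", "_")
    try:
        return RECOMMENDED_FORMATS[key]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}") from None
```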
Consider image handling
- Standalone files: Use ImageRefMode.EMBEDDED
- Separate image files: Use ImageRefMode.REFERENCED and save images separately
- Text-only: Use ImageRefMode.PLACEHOLDER or strict_text=True
Optimize for file size
- Use strict_text=True for the smallest Markdown output
- Use ImageRefMode.PLACEHOLDER to exclude image data
- Use JSON over YAML for large datasets (more compact)
Next Steps
- Basic Conversion: Learn the fundamentals of document conversion
- Batch Processing: Export large document collections efficiently
- LangChain Integration: Use exports in RAG pipelines with LangChain
- LlamaIndex Integration: Build search indexes with LlamaIndex