Docling is available as an official LlamaIndex extension, providing two powerful components: the Docling Reader for document loading and the Docling Node Parser for intelligent chunking.

Overview

The LlamaIndex Docling integration provides:
  • Docling Reader - Load documents with high-fidelity structural preservation
  • Docling Node Parser - Parse documents into LlamaIndex nodes with structure awareness
  • Lossless Serialization - Preserve complete document structure as JSON
  • Flexible Export - Export to simplified formats like Markdown when needed
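The difference between lossless JSON and simplified Markdown export matters for structure-sensitive chunking. As a toy illustration (plain Python dicts, not Docling's actual data model): round-tripping a nested document tree through JSON preserves it exactly, while flattening to Markdown keeps the text but discards structural metadata.

```python
import json

# Toy document tree standing in for a rich document model
# (illustrative only -- not Docling's actual schema).
doc = {
    "title": "Report",
    "sections": [
        {"heading": "Intro", "paragraphs": ["Welcome."], "page": 1},
        {"heading": "Results", "paragraphs": ["42% uplift."], "page": 2},
    ],
}

# Lossless: a JSON round-trip reproduces the structure exactly.
restored = json.loads(json.dumps(doc))
assert restored == doc

# Lossy: flattening to Markdown keeps the text but drops
# structural metadata such as page numbers.
markdown = "\n".join(
    f"## {s['heading']}\n" + "\n".join(s["paragraphs"]) for s in doc["sections"]
)
print(markdown)
```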

Installation

pip install llama-index-readers-docling llama-index-node-parser-docling

Components

Docling Reader

The Docling Reader loads document files and populates LlamaIndex Document objects with Docling’s rich data model.

Basic Usage

from llama_index.readers.docling import DoclingReader

# Create reader
reader = DoclingReader()

# Load documents
documents = reader.load_data(file_path="document.pdf")

# Access document content
for doc in documents:
    print(doc.text)
    print(doc.metadata)

Export Formats

from llama_index.readers.docling import DoclingReader

# Export as Markdown (the default)
reader = DoclingReader(
    export_type=DoclingReader.ExportType.MARKDOWN
)
docs = reader.load_data(file_path="document.pdf")

# Export as JSON (lossless)
reader = DoclingReader(
    export_type=DoclingReader.ExportType.JSON
)
docs = reader.load_data(file_path="document.pdf")

Docling Node Parser

The Docling Node Parser uses knowledge of Docling’s format to intelligently parse documents into LlamaIndex Node objects for downstream usage.

Basic Usage

from llama_index.readers.docling import DoclingReader
from llama_index.node_parser.docling import DoclingNodeParser

# Load documents with Docling Reader
reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
documents = reader.load_data(file_path="document.pdf")

# Parse into nodes
parser = DoclingNodeParser()
nodes = parser.get_nodes_from_documents(documents)

# Nodes are ready for embedding and retrieval
for node in nodes:
    print(node.text)
    print(node.metadata)

Advanced Parsing Options

The node parser delegates chunking to a Docling chunker rather than fixed-size splitting, so options like chunk_size and chunk_overlap do not apply. To customize chunking, pass a configured chunker instead:

from docling_core.transforms.chunker import HierarchicalChunker
from llama_index.node_parser.docling import DoclingNodeParser

# Configure the parser with a Docling chunker
parser = DoclingNodeParser(
    chunker=HierarchicalChunker()
)

nodes = parser.get_nodes_from_documents(documents)

Complete RAG Pipeline

Here’s a full example combining both components:
from llama_index.readers.docling import DoclingReader
from llama_index.node_parser.docling import DoclingNodeParser
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding

# 1. Load documents
reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
documents = reader.load_data(file_path="document.pdf")

# 2. Parse into nodes
parser = DoclingNodeParser()
nodes = parser.get_nodes_from_documents(documents)

# 3. Create embeddings and index
embed_model = OpenAIEmbedding()
index = VectorStoreIndex(nodes, embed_model=embed_model)

# 4. Query the index
query_engine = index.as_query_engine()
response = query_engine.query("What is the main topic of this document?")
print(response)
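Conceptually, steps 3 and 4 embed each node, store the vectors, and at query time rank nodes by similarity to the embedded query. A minimal sketch with a toy bag-of-words "embedding" (illustrative only — LlamaIndex uses real embedding models and vector stores):

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy bag-of-words "embedding" -- stands in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stand-ins for parsed node texts.
nodes = ["Docling preserves table structure", "Invoices are due in thirty days"]
index = [(n, embed(n)) for n in nodes]

def query(q, k=1):
    # Rank stored nodes by similarity to the embedded query.
    qv = embed(q)
    return sorted(index, key=lambda p: cosine(qv, p[1]), reverse=True)[:k]

best = query("which tool preserves table structure")[0][0]
print(best)  # prints "Docling preserves table structure"
```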

Features

  • Structure-Aware - Preserves document hierarchy and relationships
  • Lossless Export - JSON export maintains complete document structure
  • Smart Chunking - Node parser respects document structure when chunking
  • Rich Metadata - Includes page numbers, headings, and structural information
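To make the metadata concrete, here is an illustrative filter over parser-style nodes. The nodes and keys below are hypothetical stand-ins (plain dicts, not the parser's exact schema — see the API reference for the real keys), but they show how structural metadata enables filtered retrieval:

```python
# Illustrative nodes with structure-style metadata
# (keys here are an assumption, not the exact parser schema).
nodes = [
    {"text": "Quarterly revenue grew 8%.",
     "metadata": {"headings": ["Results"], "page": 3, "label": "paragraph"}},
    {"text": "| Q | Revenue |\n| 1 | 10M |",
     "metadata": {"headings": ["Results"], "page": 4, "label": "table"}},
]

# Structural metadata lets you retrieve by element type, e.g. tables only:
tables = [n for n in nodes if n["metadata"]["label"] == "table"]
print(len(tables))  # prints 1
```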

Use Cases

Knowledge Base RAG

# Process multiple documents
reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)

documents = []
for file in ["doc1.pdf", "doc2.docx", "doc3.pptx"]:
    docs = reader.load_data(file_path=file)
    documents.extend(docs)

parser = DoclingNodeParser()
nodes = parser.get_nodes_from_documents(documents)

index = VectorStoreIndex(nodes)

Table-Aware Retrieval

# Docling preserves table structure; enable table-structure
# recognition by passing a configured DocumentConverter
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(do_table_structure=True)
reader = DoclingReader(
    export_type=DoclingReader.ExportType.JSON,
    doc_converter=DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    ),
)

documents = reader.load_data(file_path="report.pdf")
parser = DoclingNodeParser()
nodes = parser.get_nodes_from_documents(documents)

# Tables are preserved in node content; table nodes can be
# identified via the labels of their doc_items metadata
for node in nodes:
    labels = [item.get("label") for item in node.metadata.get("doc_items", [])]
    if "table" in labels:
        print("Found table:", node.text)

Integration Benefits

  1. Official Components - Maintained as official LlamaIndex integrations
  2. Two-Component System - Reader and Parser work together seamlessly
  3. Format Flexibility - Choose between lossless JSON or simplified Markdown
  4. Production Ready - Used in real-world RAG applications
Resources

  • Tutorial - Step-by-step guide
  • Reader Docs - API reference for Docling Reader
  • Parser Docs - API reference for Node Parser
  • GitHub - Source code
  • PyPI Packages
