Docling is available as a document converter in Haystack, enabling high-fidelity document processing in your Haystack pipelines.

Overview

The Docling Haystack integration provides:
  • Document conversion for Haystack pipelines
  • Support for multiple document formats (PDF, DOCX, PPTX, etc.)
  • High-fidelity table and layout extraction
  • Easy integration with existing Haystack workflows

Installation

pip install docling-haystack

Quick Start

Here’s a simple example of running the Docling converter on its own:
from docling_haystack.converter import DoclingConverter

# Create converter
converter = DoclingConverter()

# Convert a document (run() takes the input files as `paths`)
result = converter.run(
    paths=["document.pdf"]
)

# Access converted documents
for doc in result["documents"]:
    print(doc.content)
    print(doc.meta)
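
Printing every document in full gets noisy for long files. The helper below is a minimal sketch of a more compact preview; `preview_documents` is a hypothetical name, not part of either library, and it assumes only that each document exposes `.content` and `.meta` as in the loop above. The stand-in objects and the `source` metadata key are illustrative.

```python
from types import SimpleNamespace


def preview_documents(documents, max_chars=80):
    """Return one short summary line per document.

    Works with any object that exposes `.content` and `.meta`.
    """
    lines = []
    for index, doc in enumerate(documents):
        # Collapse newlines so each preview stays on one line.
        text = (doc.content or "").replace("\n", " ")
        if len(text) > max_chars:
            text = text[:max_chars].rstrip() + "..."
        meta_keys = sorted((doc.meta or {}).keys())
        lines.append(f"[{index}] {text} (meta keys: {meta_keys})")
    return lines


# Stand-in objects so the sketch runs without converting a real file.
# The "source" key is illustrative, not a documented metadata field.
sample = [
    SimpleNamespace(
        content="Docling converts PDFs.\nTables are preserved.",
        meta={"source": "document.pdf"},
    ),
    SimpleNamespace(content=None, meta={}),
]
for line in preview_documents(sample, max_chars=30):
    print(line)
```

With real output, the call would be `preview_documents(result["documents"])`.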

Building a RAG Pipeline

from docling_haystack.converter import DoclingConverter
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Create components
document_store = InMemoryDocumentStore()
converter = DoclingConverter()
splitter = DocumentSplitter(split_length=500, split_overlap=50)
embedder = SentenceTransformersDocumentEmbedder()
writer = DocumentWriter(document_store=document_store)

# Build pipeline
pipeline = Pipeline()
pipeline.add_component("converter", converter)
pipeline.add_component("splitter", splitter)
pipeline.add_component("embedder", embedder)
pipeline.add_component("writer", writer)

# Connect components
pipeline.connect("converter.documents", "splitter.documents")
pipeline.connect("splitter.documents", "embedder.documents")
pipeline.connect("embedder.documents", "writer.documents")

# Run pipeline
result = pipeline.run({
    "converter": {"paths": ["document.pdf"]}
})
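
To make the `split_length=500, split_overlap=50` settings above more concrete, here is a simplified, pure-Python illustration of overlapping word windows. It is not Haystack's implementation; it assumes word-based splitting, and `split_words` is a hypothetical helper used only to show that each window starts `split_length - split_overlap` words after the previous one.

```python
def split_words(text, split_length=500, split_overlap=50):
    """Split text into word windows that share `split_overlap` words."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")
    words = text.split()
    step = split_length - split_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Small numbers make the overlap easy to see.
words = " ".join(f"w{i}" for i in range(12))
chunks = split_words(words, split_length=5, split_overlap=2)
for chunk in chunks:
    print(chunk)
```

With a length of 5 and an overlap of 2, consecutive chunks share their last and first two words, which helps keep context that would otherwise be cut at a chunk boundary.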

Advanced Configuration

Custom Conversion Options

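from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_haystack.converter import DoclingConverter

# Configure Docling options for PDF input
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True

# Wrap the options in a Docling DocumentConverter
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# Create the Haystack converter around the configured Docling converter
converter = DoclingConverter(converter=doc_converter)

result = converter.run(paths=["document.pdf"])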

Batch Processing

from docling_haystack.converter import DoclingConverter
import glob

converter = DoclingConverter()

# Process multiple files
files = glob.glob("documents/*.pdf")
result = converter.run(paths=files)

print(f"Converted {len(result['documents'])} documents")
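
The `glob` call above matches only PDFs in a single folder. As a sketch of one way to gather several formats recursively and feed them to the converter in smaller groups, the helpers below use only the standard library. `collect_sources` and `batched` are hypothetical names, and the extension list is an assumption to adjust to your own data.

```python
from pathlib import Path


def collect_sources(directory, extensions=(".pdf", ".docx", ".pptx")):
    """Return sorted file paths under `directory` with a matching extension."""
    root = Path(directory)
    if not root.is_dir():
        return []
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` entries."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


files = collect_sources("documents")
for batch in batched(files, 10):
    # Each batch could be passed to the converter's run() call.
    print(f"Would convert {len(batch)} files")
```

Working in batches keeps one failing or very large file from holding up the whole collection, at the cost of several `run()` calls.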

Features

Pipeline Integration

Seamlessly integrates into Haystack pipelines

Multi-Format Support

Supports PDF, DOCX, PPTX, HTML, and more

Table Extraction

Accurately extracts table structures

OCR Support

Process scanned documents and images

Complete RAG Application

from docling_haystack.converter import DoclingConverter
from haystack import Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Initialize document store
document_store = InMemoryDocumentStore()

# Indexing pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", DoclingConverter())
indexing_pipeline.add_component("writer", DocumentWriter(document_store))
indexing_pipeline.connect("converter.documents", "writer.documents")

# Index documents
indexing_pipeline.run({"converter": {"paths": ["document.pdf"]}})

# Query pipeline
template = """
Given the following documents, answer the question.

Documents:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Question: {{ question }}
Answer:
"""

query_pipeline = Pipeline()
query_pipeline.add_component("retriever", InMemoryBM25Retriever(document_store))
query_pipeline.add_component("prompt_builder", PromptBuilder(template=template))
# OpenAIGenerator reads its API key from the OPENAI_API_KEY environment variable
query_pipeline.add_component("llm", OpenAIGenerator())

query_pipeline.connect("retriever.documents", "prompt_builder.documents")
query_pipeline.connect("prompt_builder.prompt", "llm.prompt")

# Query
response = query_pipeline.run({
    "retriever": {"query": "What is the main topic?"},
    "prompt_builder": {"question": "What is the main topic?"}
})

print(response["llm"]["replies"][0])
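
The query above passes the same question twice, once to the retriever and once to the prompt builder. A small helper keeps the two in sync; this is a minimal sketch, `build_query_inputs` is a hypothetical name, and it assumes the component names `retriever` and `prompt_builder` used in the pipeline above.

```python
def build_query_inputs(question):
    """Build the run() input so retriever and prompt share one question."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    return {
        "retriever": {"query": question},
        "prompt_builder": {"question": question},
    }


inputs = build_query_inputs("What is the main topic?")
print(inputs)
```

The query call then becomes `query_pipeline.run(build_query_inputs("What is the main topic?"))`.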

Use Cases

1. Document Indexing: Convert and index large document collections
2. RAG Applications: Build question-answering systems over documents
3. Content Extraction: Extract structured content from unstructured documents
4. Search Pipelines: Enable semantic search over document collections

Resources

Documentation

Official Haystack integration docs

GitHub

Source code and examples

Example Notebook

Complete RAG example

PyPI

Package repository
