Docling is available as an official LangChain extension, `langchain-docling`, which loads and converts documents for use in your LangChain applications.

Overview

The LangChain Docling integration allows you to:
  • Load documents with high-fidelity structural preservation
  • Extract tables, images, and complex layouts accurately
  • Convert documents to LangChain Document objects
  • Build RAG applications with superior document understanding

Installation

pip install langchain-docling

Quick Start

Here’s a simple example of using Docling with LangChain:
from langchain_docling import DoclingLoader

# Load a document
loader = DoclingLoader(file_path="document.pdf")
documents = loader.load()

# Use in your LangChain pipeline
for doc in documents:
    print(doc.page_content)
    print(doc.metadata)

Advanced Usage

Custom Conversion Options

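from langchain_docling import DoclingLoader
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Configure Docling's PDF pipeline options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True              # run OCR on scanned pages and images
pipeline_options.do_table_structure = True  # recover table rows and columns

# Build a converter that applies these options to PDF input
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# Create loader with the custom converter
loader = DoclingLoader(
    file_path="document.pdf",
    converter=converter
)

documents = loader.load()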

Building a RAG Pipeline

from langchain_docling import DoclingLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Load documents
loader = DoclingLoader(file_path="document.pdf")
documents = loader.load()

# Split documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
splits = text_splitter.split_documents(documents)

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(splits, embeddings)

# Query
retriever = vectorstore.as_retriever()
results = retriever.invoke("What is the main topic?")
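
To make the `chunk_size` and `chunk_overlap` settings above concrete, here is a simplified sliding-window splitter in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter also tries to break on separators such as paragraphs and sentences). `sliding_window_chunks` is a hypothetical helper that only illustrates how consecutive chunks share `chunk_overlap` characters.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows whose neighbours share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Illustrative input: 2500 characters of repeating lowercase letters
text = "".join(chr(97 + i % 26) for i in range(2500))
chunks = sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])          # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # True
```

With these settings each window advances by 800 characters, so the last 200 characters of one chunk reappear at the start of the next. That repetition is what lets a retriever match a passage that would otherwise straddle a chunk boundary.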

Features

High-Fidelity Parsing

Preserves document structure including tables, lists, and headings

OCR Support

Extract text from scanned documents and images

Table Extraction

Accurately parse complex table structures

Metadata Enrichment

Includes document metadata like page numbers and structure
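
As a rough sketch of how page numbers might be read back from a loaded chunk, the helper below walks a metadata dictionary. The layout it expects (`dl_meta` → `doc_items` → `prov` → `page_no`) is an assumption about how the loader nests provenance information. Print `doc.metadata` from your installed version to confirm the actual keys. `page_numbers` is a hypothetical helper and `sample_metadata` is hand-written for illustration, not real loader output.

```python
from typing import Any


def page_numbers(metadata: dict[str, Any]) -> list[int]:
    """Collect the sorted, de-duplicated page numbers referenced by one chunk's metadata."""
    pages = set()
    # Assumed layout: metadata["dl_meta"]["doc_items"][i]["prov"][j]["page_no"]
    doc_items = metadata.get("dl_meta", {}).get("doc_items", [])
    for item in doc_items:
        for prov in item.get("prov", []):
            page_no = prov.get("page_no")
            if isinstance(page_no, int):
                pages.add(page_no)
    return sorted(pages)


# Hand-written sample, not real loader output
sample_metadata = {
    "source": "document.pdf",
    "dl_meta": {
        "headings": ["Introduction"],
        "doc_items": [
            {"label": "text", "prov": [{"page_no": 2}]},
            {"label": "table", "prov": [{"page_no": 2}, {"page_no": 3}]},
        ],
    },
}
print(page_numbers(sample_metadata))  # [2, 3]
```

A helper like this could be used to show page citations next to retrieved passages in a RAG answer.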

Supported Document Formats

The LangChain Docling integration supports:
  • PDF documents
  • Microsoft Word (DOCX)
  • PowerPoint (PPTX)
  • HTML files
  • Images (with OCR)
  • And more
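
One way to apply this list in practice is to narrow a directory down to files with matching extensions before loading them. The sketch below uses only the Python standard library. `SUPPORTED_SUFFIXES` is an illustrative subset chosen for this example (the image suffixes in particular are assumptions), and `find_loadable_files` is a hypothetical helper, not part of `langchain-docling`.

```python
from pathlib import Path

# Illustrative subset only; extend it to match the formats your installation supports
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".pptx", ".html", ".png", ".jpg", ".jpeg"}


def find_loadable_files(root: str) -> list[str]:
    """Return sorted paths under root whose extension is in SUPPORTED_SUFFIXES."""
    files = (p for p in Path(root).rglob("*") if p.is_file())
    # Compare suffixes case-insensitively so "REPORT.PDF" is also picked up
    return sorted(str(p) for p in files if p.suffix.lower() in SUPPORTED_SUFFIXES)
```

Each returned path can then be passed to `DoclingLoader(file_path=...)` as in the Quick Start.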

Integration Benefits

1. Official Integration

Maintained as part of the LangChain ecosystem with full support

2. Easy to Use

Simple API that follows LangChain conventions

3. Production Ready

Battle-tested in real-world applications

4. Active Development

Regular updates and improvements

Resources

Documentation

Official LangChain integration docs

GitHub

Source code and examples

Tutorial

Step-by-step guide

PyPI

Package repository

Example Notebooks

For complete working examples, check out:

Next Steps
