Overview

This example shows you how to crawl entire documentation sites and convert them into RAG-ready datasets. Firecrawl extracts clean markdown content from web pages, making it perfect for building knowledge bases that power AI applications.
This example was tested in real hackathons for building knowledge bases from documentation sites.

What you’ll build

A Python script that:
  • Crawls documentation sites automatically
  • Extracts clean markdown content (no navigation or footers)
  • Structures data in a format ready for RAG frameworks
  • Provides statistics on the crawled content

Prerequisites

Before you start, make sure you have:
1. Get a Firecrawl API key

Sign up at Firecrawl and get your API key from the dashboard.
2. Install the Firecrawl SDK

pip install firecrawl-py
3. Set your API key

export FIRECRAWL_API_KEY="your_api_key_here"
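Because the script reads the key from the environment, it is worth failing fast when the variable is missing. A minimal sketch (the has_api_key helper is ours, not part of the Firecrawl SDK):

```python
import os

def has_api_key(env=os.environ) -> bool:
    """Return True if a non-empty FIRECRAWL_API_KEY is present."""
    return bool(env.get("FIRECRAWL_API_KEY"))
```

Call it at the top of your script and exit with a clear message if it returns False, rather than letting the SDK raise an authentication error mid-crawl.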

Complete code

Here’s the full implementation that crawls documentation and saves it for RAG:
firecrawl-rag-dataset.py
#!/usr/bin/env python
"""
Firecrawl + RAG Dataset Creation
Personal experience: Used for building knowledge bases from documentation
"""

import os
from firecrawl import Firecrawl
import json

# Initialize Firecrawl
firecrawl = Firecrawl(api_key=os.getenv("FIRECRAWL_API_KEY"))

def crawl_documentation(base_url: str, max_pages: int = 50) -> list:
    """
    Crawl entire documentation site and return markdown pages
    
    Args:
        base_url: Starting URL to crawl
        max_pages: Maximum pages to crawl
    
    Returns:
        List of documents with markdown content
    """
    print(f"Starting crawl of {base_url}...")
    
    # Crawl the entire site
    response = firecrawl.crawl(
        base_url,
        limit=max_pages,
        scrape_options={
            "formats": ["markdown"],
            "onlyMainContent": True,  # Skip nav, footer, etc.
        }
    )
    
    documents = []
    # Recent firecrawl-py versions return a crawl job object whose pages
    # live under response.data; each page exposes markdown and metadata.
    for page in response.data:
        markdown = page.markdown or ""
        meta = page.metadata or {}
        # metadata may be a dict or a typed object depending on SDK version
        get = meta.get if isinstance(meta, dict) else (lambda k, d=None: getattr(meta, k, d))
        documents.append({
            "url": get("sourceURL") or get("source_url"),
            "title": get("title"),
            "content": markdown,
            "metadata": {
                "word_count": len(markdown.split()),
                "source": base_url
            }
        })
    
    print(f"Crawled {len(documents)} pages")
    return documents

def save_for_rag(documents: list, output_file: str = "rag_dataset.json"):
    """
    Save crawled documents in RAG-ready format
    """
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(documents, f, indent=2, ensure_ascii=False)
    
    print(f"Saved {len(documents)} documents to {output_file}")
    
    # Print stats (guard against an empty crawl to avoid division by zero)
    if documents:
        total_words = sum(doc["metadata"]["word_count"] for doc in documents)
        print(f"Total words: {total_words:,}")
        print(f"Avg words per page: {total_words // len(documents):,}")

if __name__ == "__main__":
    # Example: Crawl a documentation site
    docs = crawl_documentation(
        "https://docs.python.org/3/tutorial/",
        max_pages=30
    )
    
    save_for_rag(docs)
    
    # Now use with LangChain/LlamaIndex for RAG!
    print("\nReady to use with RAG frameworks:")
    print("- Load JSON into LangChain DocumentLoader")
    print("- Create embeddings with OpenAI/HuggingFace")
    print("- Store in Chroma/Pinecone vector DB")

How it works

The script creates a Firecrawl client using your API key from environment variables:
firecrawl = Firecrawl(api_key=os.getenv("FIRECRAWL_API_KEY"))
The crawl_documentation() function crawls the entire site starting from a base URL:
  • limit controls the maximum number of pages to crawl
  • formats: ["markdown"] extracts content as clean markdown
  • onlyMainContent: True removes navigation, footers, and other UI elements
This ensures you only get the actual documentation content without noise.
Each document includes:
  • url: Original page URL for source attribution
  • title: Page title
  • content: Clean markdown content
  • metadata: Word count and source information
This structure works seamlessly with RAG frameworks like LangChain and LlamaIndex.
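Most RAG frameworks embed fixed-size chunks rather than whole pages, so each document's content field typically gets split before embedding. A framework-free sketch of word-based chunking with overlap (the chunk_text name and the 200/40 sizes are our assumptions, not Firecrawl defaults):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list:
    """Split text into overlapping word-based chunks for embedding."""
    words = text.split()
    if not words:
        return []
    step = max(chunk_size - overlap, 1)  # advance by chunk size minus overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk already covers the tail of the text
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both sides; production pipelines usually prefer a framework's own splitter (e.g. a recursive character splitter), which respects markdown structure.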
The save_for_rag() function saves all documents to a JSON file with UTF-8 encoding and provides useful statistics:
  • Total documents crawled
  • Total word count across all pages
  • Average words per page
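Before wiring the file into a pipeline, a quick sanity check over the saved JSON catches malformed records early. A sketch (validate_dataset is a hypothetical helper, not part of Firecrawl):

```python
import json

# Keys every record written by save_for_rag() should carry
REQUIRED_KEYS = {"url", "title", "content", "metadata"}

def validate_dataset(path: str) -> tuple:
    """Return (document count, URLs of records with missing or extra keys)."""
    with open(path, encoding="utf-8") as f:
        docs = json.load(f)
    bad = [doc.get("url") for doc in docs if set(doc) != REQUIRED_KEYS]
    return len(docs), bad
```

An empty second element means every record matches the expected shape.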

Usage instructions

1. Run the script

python firecrawl-rag-dataset.py
The script will crawl Python documentation and save it to rag_dataset.json.
2. Customize for your use case

Change the base URL and max pages:
docs = crawl_documentation(
    "https://your-docs-site.com/",
    max_pages=100  # Adjust as needed
)
3. Integrate with RAG frameworks

Use the generated JSON file with your preferred RAG framework. Recent LangChain releases split these imports into separate packages (pip install langchain-community langchain-openai chromadb jq); older versions exposed them under langchain.* instead:
from langchain_community.document_loaders import JSONLoader
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Load the crawled data
loader = JSONLoader(
    file_path="rag_dataset.json",
    jq_schema=".[]",
    text_content=False
)
documents = loader.load()

# Create embeddings and store in vector DB
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)

Use cases

Documentation chatbots

Build AI assistants that answer questions about your product documentation

Knowledge base search

Create semantic search over large documentation sites

Content migration

Extract and migrate documentation from legacy systems

Competitive analysis

Analyze and compare documentation from multiple sources
Pro tip: Start with a smaller max_pages value (10-20) to test the output format before crawling entire sites. Firecrawl API calls are rate-limited based on your plan.
Respect robots.txt and terms of service when crawling websites. Only crawl documentation you have permission to access.
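The robots.txt check can be scripted with the standard library before kicking off a crawl (allowed_to_crawl is our helper name, not a Firecrawl API):

```python
from urllib.robotparser import RobotFileParser

def allowed_to_crawl(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Check a fetched robots.txt body against a URL before crawling it."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())  # parse the rules from the raw text
    return rp.can_fetch(agent, url)
```

Fetch https://the-site/robots.txt once, then filter your seed URLs through this check before handing them to the crawler.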

Next steps

Build docs developers (and LLMs) love