Overview

This example shows you how to crawl entire documentation sites and convert them into RAG-ready datasets. Firecrawl extracts clean markdown content from web pages, making it perfect for building knowledge bases that power AI applications.
This example was tested in real hackathons for building knowledge bases from documentation sites.

What you’ll build

A Python script that:
  • Crawls documentation sites automatically
  • Extracts clean markdown content (no navigation or footers)
  • Structures data in a format ready for RAG frameworks
  • Provides statistics on the crawled content

Prerequisites

Before you start, make sure you have:
1. Get a Firecrawl API key

Sign up at Firecrawl and get your API key from the dashboard.
2. Install the Firecrawl SDK

pip install firecrawl-py
3. Set your API key

export FIRECRAWL_API_KEY="your_api_key_here"
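Because the script reads the key from the environment, it is worth failing fast when the variable is missing. A minimal sketch (the has_api_key helper is ours, not part of the Firecrawl SDK):

```python
import os

def has_api_key(env=os.environ) -> bool:
    """Return True if a non-empty FIRECRAWL_API_KEY is present."""
    return bool(env.get("FIRECRAWL_API_KEY"))
```

Call it at the top of your script and exit with a clear message if it returns False, rather than letting the SDK raise an authentication error mid-crawl.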

Complete code

Here’s the full implementation that crawls documentation and saves it for RAG:
firecrawl-rag-dataset.py
#!/usr/bin/env python
"""
Firecrawl + RAG Dataset Creation
Personal experience: Used for building knowledge bases from documentation
"""

import os
from firecrawl import Firecrawl
import json

# Initialize Firecrawl
firecrawl = Firecrawl(api_key=os.getenv("FIRECRAWL_API_KEY"))

def crawl_documentation(base_url: str, max_pages: int = 50) -> list:
    """
    Crawl entire documentation site and return markdown pages
    
    Args:
        base_url: Starting URL to crawl
        max_pages: Maximum pages to crawl
    
    Returns:
        List of documents with markdown content
    """
    print(f"Starting crawl of {base_url}...")
    
    # Crawl the entire site
    response = firecrawl.crawl(
        base_url,
        limit=max_pages,
        scrape_options={
            "formats": ["markdown"],
            "onlyMainContent": True,  # Skip nav, footer, etc.
        }
    )
    
    documents = []
    # Recent firecrawl-py versions return a crawl job object whose pages
    # live under response.data; each page exposes markdown and metadata.
    for page in response.data:
        markdown = page.markdown or ""
        meta = page.metadata or {}
        # metadata may be a dict or a typed object depending on SDK version
        get = meta.get if isinstance(meta, dict) else (lambda k, d=None: getattr(meta, k, d))
        documents.append({
            "url": get("sourceURL") or get("source_url"),
            "title": get("title"),
            "content": markdown,
            "metadata": {
                "word_count": len(markdown.split()),
                "source": base_url
            }
        })
    
    print(f"Crawled {len(documents)} pages")
    return documents

def save_for_rag(documents: list, output_file: str = "rag_dataset.json"):
    """
    Save crawled documents in RAG-ready format
    """
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(documents, f, indent=2, ensure_ascii=False)
    
    print(f"Saved {len(documents)} documents to {output_file}")
    
    # Print stats (guard against an empty crawl to avoid division by zero)
    if documents:
        total_words = sum(doc["metadata"]["word_count"] for doc in documents)
        print(f"Total words: {total_words:,}")
        print(f"Avg words per page: {total_words // len(documents):,}")

if __name__ == "__main__":
    # Example: Crawl a documentation site
    docs = crawl_documentation(
        "https://docs.python.org/3/tutorial/",
        max_pages=30
    )
    
    save_for_rag(docs)
    
    # Now use with LangChain/LlamaIndex for RAG!
    print("\nReady to use with RAG frameworks:")
    print("- Load JSON into LangChain DocumentLoader")
    print("- Create embeddings with OpenAI/HuggingFace")
    print("- Store in Chroma/Pinecone vector DB")

How it works

The script creates a Firecrawl client using your API key from environment variables:
firecrawl = Firecrawl(api_key=os.getenv("FIRECRAWL_API_KEY"))
The crawl_documentation() function crawls the entire site starting from a base URL:
  • limit controls the maximum number of pages to crawl
  • formats: ["markdown"] extracts content as clean markdown
  • onlyMainContent: True removes navigation, footers, and other UI elements
This ensures you only get the actual documentation content without noise.
Each document includes:
  • url: Original page URL for source attribution
  • title: Page title
  • content: Clean markdown content
  • metadata: Word count and source information
This structure works seamlessly with RAG frameworks like LangChain and LlamaIndex.
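Most RAG frameworks embed fixed-size chunks rather than whole pages, so each document's content field typically gets split before embedding. A framework-free sketch of word-based chunking with overlap (the chunk_text name and the 200/40 sizes are our assumptions, not Firecrawl defaults):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list:
    """Split text into overlapping word-based chunks for embedding."""
    words = text.split()
    if not words:
        return []
    step = max(chunk_size - overlap, 1)  # advance by chunk size minus overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk already covers the tail of the text
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both sides; production pipelines usually prefer a framework's own splitter (e.g. a recursive character splitter), which respects markdown structure.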
The save_for_rag() function saves all documents to a JSON file with UTF-8 encoding and provides useful statistics:
  • Total documents crawled
  • Total word count across all pages
  • Average words per page
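Before wiring the file into a pipeline, a quick sanity check over the saved JSON catches malformed records early. A sketch (validate_dataset is a hypothetical helper, not part of Firecrawl):

```python
import json

# Keys every record written by save_for_rag() should carry
REQUIRED_KEYS = {"url", "title", "content", "metadata"}

def validate_dataset(path: str) -> tuple:
    """Return (document count, URLs of records with missing or extra keys)."""
    with open(path, encoding="utf-8") as f:
        docs = json.load(f)
    bad = [doc.get("url") for doc in docs if set(doc) != REQUIRED_KEYS]
    return len(docs), bad
```

An empty second element means every record matches the expected shape.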

Usage instructions

1. Run the script

python firecrawl-rag-dataset.py
The script will crawl Python documentation and save it to rag_dataset.json.
2. Customize for your use case

Change the base URL and max pages:
docs = crawl_documentation(
    "https://your-docs-site.com/",
    max_pages=100  # Adjust as needed
)
3. Integrate with RAG frameworks

Use the generated JSON file with your preferred RAG framework. Recent LangChain releases split these imports into separate packages (pip install langchain-community langchain-openai chromadb jq); older versions exposed them under langchain.* instead:
from langchain_community.document_loaders import JSONLoader
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Load the crawled data
loader = JSONLoader(
    file_path="rag_dataset.json",
    jq_schema=".[]",
    text_content=False
)
documents = loader.load()

# Create embeddings and store in vector DB
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)

Use cases

Documentation chatbots

Build AI assistants that answer questions about your product documentation

Knowledge base search

Create semantic search over large documentation sites

Content migration

Extract and migrate documentation from legacy systems

Competitive analysis

Analyze and compare documentation from multiple sources
Pro tip: Start with a smaller max_pages value (10-20) to test the output format before crawling entire sites. Firecrawl API calls are rate-limited based on your plan.
Respect robots.txt and terms of service when crawling websites. Only crawl documentation you have permission to access.
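The robots.txt check can be scripted with the standard library before kicking off a crawl (allowed_to_crawl is our helper name, not a Firecrawl API):

```python
from urllib.robotparser import RobotFileParser

def allowed_to_crawl(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Check a fetched robots.txt body against a URL before crawling it."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())  # parse the rules from the raw text
    return rp.can_fetch(agent, url)
```

Fetch https://the-site/robots.txt once, then filter your seed URLs through this check before handing them to the crawler.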

Next steps

Build docs developers (and LLMs) love