Overview
This example shows you how to crawl entire documentation sites and convert them into RAG-ready datasets. Firecrawl extracts clean markdown content from web pages, making it perfect for building knowledge bases that power AI applications.
This example was tested in real hackathons for building knowledge bases from documentation sites.
What you’ll build
A Python script that:
- Crawls documentation sites automatically
- Extracts clean markdown content (no navigation or footers)
- Structures data in a format ready for RAG frameworks
- Provides statistics on the crawled content
Prerequisites
Before you start, make sure you have:
- A Firecrawl API key: sign up at Firecrawl and get your API key from the dashboard.
Complete code
Here’s the full implementation that crawls documentation and saves it for RAG:
firecrawl-rag-dataset.py
How it works
1. Initialize Firecrawl client
The script creates a Firecrawl client using your API key from environment variables:
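A minimal sketch of this step. It assumes the key is stored in an environment variable named FIRECRAWL_API_KEY (the original does not name the variable) and that the firecrawl-py package exposes a FirecrawlApp class accepting api_key. The helper names get_api_key and make_client are hypothetical.

```python
import os


def get_api_key(env=None):
    """Return the Firecrawl API key, failing early with a clear message."""
    env = os.environ if env is None else env
    # Assumption: the variable is called FIRECRAWL_API_KEY.
    key = env.get("FIRECRAWL_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set FIRECRAWL_API_KEY before running the script")
    return key


def make_client():
    """Create the client; the SDK is imported lazily so the key check runs without it."""
    # Assumption: firecrawl-py is installed; the class name may differ between SDK versions.
    from firecrawl import FirecrawlApp
    return FirecrawlApp(api_key=get_api_key())
```

Reading the key from the environment keeps it out of source control, and failing before the first request makes a missing key easier to diagnose than an authentication error mid-crawl.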
2. Crawl documentation site
The crawl_documentation() function crawls the entire site starting from a base URL:
- limit: Controls the maximum number of pages to crawl
- formats: ["markdown"]: Extracts content as clean markdown
- onlyMainContent: True: Removes navigation, footers, and other UI elements
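The crawl options could be assembled roughly as follows. This is a sketch under several assumptions: the client offers a crawl_url(url, params=...) method, the scrape options are nested under a scrapeOptions key, and pages come back under a data key. All three vary between SDK versions, so check them against the version you have installed. build_crawl_params is a hypothetical helper.

```python
def build_crawl_params(limit=50):
    """Collect the three options described above into one request payload."""
    return {
        "limit": limit,  # maximum number of pages to crawl
        # Assumption: scrape options are nested under "scrapeOptions".
        "scrapeOptions": {
            "formats": ["markdown"],   # return clean markdown
            "onlyMainContent": True,   # drop navigation, footers, other UI
        },
    }


def crawl_documentation(client, base_url, limit=50):
    """Crawl base_url and return the list of page records."""
    # Assumption: the client exposes crawl_url(url, params=...).
    result = client.crawl_url(base_url, params=build_crawl_params(limit))
    # Assumption: pages are returned under a "data" key.
    if isinstance(result, dict):
        return result.get("data", [])
    return list(result or [])
```

Passing the client in as an argument keeps the function easy to exercise with a stub before spending API credits on a real crawl.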
3. Structure data for RAG
Each document includes:
- url: Original page URL for source attribution
- title: Page title
- content: Clean markdown content
- metadata: Word count and source information
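One way to map a crawled page onto that shape is sketched below. The input field names (markdown, metadata.sourceURL, metadata.title) and the output keys word_count and source are assumptions for illustration, and to_rag_document is a hypothetical helper.

```python
def to_rag_document(page):
    """Map one crawled page onto the url/title/content/metadata shape."""
    # Assumption: each page is a dict with "markdown" and "metadata" entries.
    meta = page.get("metadata") or {}
    content = page.get("markdown") or ""
    return {
        "url": meta.get("sourceURL", ""),
        "title": meta.get("title", "Untitled"),
        "content": content,
        "metadata": {
            # Whitespace-separated tokens; a rough count, not a tokenizer.
            "word_count": len(content.split()),
            "source": "firecrawl",
        },
    }
```

Keeping the URL next to the content lets a RAG pipeline cite the source page for each retrieved chunk.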
4. Save to JSON file
The save_for_rag() function saves all documents to a JSON file with UTF-8 encoding and provides useful statistics:
- Total documents crawled
- Total word count across all pages
- Average words per page
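A sketch of what such a function might look like, using only the standard library. The default filename, the statistics key names, and the assumption that each document carries metadata.word_count are illustrative choices, not taken from the original script.

```python
import json


def save_for_rag(documents, path="rag_dataset.json"):
    """Write documents as UTF-8 JSON and return summary statistics."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        json.dump(documents, f, ensure_ascii=False, indent=2)

    total_docs = len(documents)
    # Assumption: every document has metadata["word_count"].
    total_words = sum(d["metadata"]["word_count"] for d in documents)
    stats = {
        "total_documents": total_docs,
        "total_words": total_words,
        # Guard against division by zero when nothing was crawled.
        "average_words_per_page": total_words / total_docs if total_docs else 0,
    }
    for name, value in stats.items():
        print(f"{name}: {value}")
    return stats
```

Returning the statistics as well as printing them makes it straightforward to log them or to fail a pipeline run when a crawl comes back empty.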
Usage instructions
Use cases
Documentation chatbots
Build AI assistants that answer questions about your product documentation
Knowledge base search
Create semantic search over large documentation sites
Content migration
Extract and migrate documentation from legacy systems
Competitive analysis
Analyze and compare documentation from multiple sources
Next steps
- Check out the AI & ML resources for more tools and platforms
- Explore the Serper search example for real-time data enrichment
- Review the voice agent example to add voice interfaces