## Overview

`DocumentScraperGraph` is a scraping pipeline that extracts information from Markdown documents, using a language model to interpret the content and answer natural language prompts.
## Class Signature

```python
class DocumentScraperGraph(AbstractGraph):
    def __init__(
        self,
        prompt: str,
        source: str,
        config: dict,
        schema: Optional[Type[BaseModel]] = None,
    )
```
## Constructor Parameters

### prompt (str, required)

The natural language prompt describing what information to extract from the document.

### source (str, required)

The source Markdown file or directory. Can be:

- Path to a single `.md` file (e.g., `"README.md"`)
- Path to a directory containing multiple Markdown files (e.g., `"./docs/"`)

### config (dict, required)

Configuration parameters for the graph. Must include:

- `llm`: LLM configuration (e.g., `{"model": "openai/gpt-4o"}`)

Optional parameters:

- `verbose` (bool): Enable detailed logging
- `headless` (bool): Run in headless mode
- `additional_info` (str): Extra context for the LLM
- `loader_kwargs` (dict): Parameters for document loading
- `storage_state` (str): Browser state file path

### schema (Type[BaseModel], default: None)

Optional Pydantic model defining the expected output structure.
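When a schema is supplied, the graph's answer is validated against it. The snippet below is independent of the library and uses made-up data; it only shows what that Pydantic validation amounts to:

```python
from typing import List
from pydantic import BaseModel, Field, ValidationError

class Feature(BaseModel):
    name: str = Field(description="Feature name")
    description: str = Field(description="Feature description")

class FeatureList(BaseModel):
    features: List[Feature]

# A dict shaped like a well-formed LLM answer validates cleanly.
answer = {"features": [{"name": "Schema validation",
                        "description": "Typed, validated output"}]}
parsed = FeatureList.model_validate(answer)
print(parsed.features[0].name)  # Schema validation

# A malformed answer (missing field) raises ValidationError instead.
try:
    FeatureList.model_validate({"features": [{"name": "Oops"}]})
except ValidationError:
    print("validation failed")
```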
## Attributes

- `prompt`: The user's extraction prompt.
- `source`: The Markdown file path or directory path.
- `config`: Configuration dictionary for the graph.
- `schema`: Optional output schema for structured data extraction.
- `llm_model`: The configured language model instance.
- `input_key`: Either `"md"` (single file) or `"md_dir"` (directory), based on the source.
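The input-key selection can be pictured as a simple suffix check. This is an illustrative approximation, not the library's exact code:

```python
def guess_input_key(source: str) -> str:
    """Illustrative approximation: a path ending in .md is treated as a
    single file ("md"), anything else as a directory of Markdown files
    ("md_dir"). The library's actual check may differ."""
    return "md" if source.endswith(".md") else "md_dir"

print(guess_input_key("README.md"))  # md
print(guess_input_key("./docs/"))    # md_dir
```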
## Methods

### run()

Executes the document scraping process and returns the extracted information.

**Returns:** The extracted information from the Markdown document(s), or `"No answer found."` if extraction fails.
## Basic Usage

```python
from scrapegraphai.graphs import DocumentScraperGraph

graph_config = {
    "llm": {
        "model": "openai/gpt-4o",
        "api_key": "your-api-key"
    }
}

doc_scraper = DocumentScraperGraph(
    prompt="List all the main features mentioned in the documentation.",
    source="README.md",
    config=graph_config
)

result = doc_scraper.run()
print(result)
```
## Example Markdown Document

````markdown
# ScrapeGraphAI

A powerful web scraping library powered by AI.

## Features

- Natural language prompts for data extraction
- Support for multiple LLM providers (OpenAI, Anthropic, etc.)
- Schema-based output validation
- Browser automation support

## Installation

```bash
pip install scrapegraphai
```

## Quick Start

```python
from scrapegraphai.graphs import SmartScraperGraph

graph = SmartScraperGraph(
    prompt="Extract the title",
    source="https://example.com",
    config={"llm": {"model": "openai/gpt-4o"}}
)
result = graph.run()
```

## Supported Models

- OpenAI (GPT-4, GPT-3.5)
- Anthropic (Claude)
- Google (Gemini)
- Local models via Ollama
````
## Query Examples
### Extract Specific Sections
```python
doc_scraper = DocumentScraperGraph(
    prompt="Extract the installation instructions",
    source="README.md",
    config=graph_config
)
result = doc_scraper.run()
```

### Extract Code Examples

```python
doc_scraper = DocumentScraperGraph(
    prompt="Extract all Python code examples from the documentation",
    source="docs/tutorial.md",
    config=graph_config
)
result = doc_scraper.run()
```

### List Features with Descriptions

```python
doc_scraper = DocumentScraperGraph(
    prompt="List all features with their descriptions",
    source="FEATURES.md",
    config=graph_config
)
result = doc_scraper.run()
```

### API Documentation

```python
doc_scraper = DocumentScraperGraph(
    prompt="Extract all API endpoints with their parameters and return types",
    source="docs/api.md",
    config=graph_config
)
result = doc_scraper.run()
```
## Structured Output with Schema

```python
from pydantic import BaseModel, Field
from typing import List

class Feature(BaseModel):
    name: str = Field(description="Feature name")
    description: str = Field(description="Feature description")

class CodeExample(BaseModel):
    language: str = Field(description="Programming language")
    code: str = Field(description="Code snippet")
    description: str = Field(description="What the code does")

class Documentation(BaseModel):
    features: List[Feature]
    examples: List[CodeExample]
    summary: str = Field(description="Overall summary")

doc_scraper = DocumentScraperGraph(
    prompt="Extract features, code examples, and provide a summary",
    source="README.md",
    config=graph_config,
    schema=Documentation
)

result = doc_scraper.run()
# Result is automatically validated against the schema
```
## Multiple Markdown Files

```python
# Directory structure:
# docs/
# ├── getting-started.md
# ├── api-reference.md
# ├── examples.md
# └── faq.md

doc_scraper = DocumentScraperGraph(
    prompt="Summarize all the documentation and create a table of contents",
    source="./docs/",  # Directory path
    config=graph_config
)

result = doc_scraper.run()
# Automatically processes all Markdown files in the directory
```
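Assuming the loader simply walks the directory, the set of files the graph will process can be sketched like this. `list_markdown_files` is a hypothetical helper, not part of the library:

```python
import tempfile
from pathlib import Path

def list_markdown_files(directory: str) -> list[str]:
    """Collect every .md file under a directory, sorted for a stable order.
    Illustrative only: the real loader is configured via loader_kwargs."""
    return sorted(str(p) for p in Path(directory).rglob("*.md"))

# Demo against a throwaway directory.
with tempfile.TemporaryDirectory() as docs:
    for name in ("getting-started.md", "faq.md", "notes.txt"):
        (Path(docs) / name).write_text("# stub\n")
    found = list_markdown_files(docs)
    print([Path(f).name for f in found])  # ['faq.md', 'getting-started.md']
```

Note that the non-Markdown `notes.txt` is skipped, matching the "Automatically processes all Markdown files" behavior described above.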
## Graph Workflow

The DocumentScraperGraph uses the following node pipeline:

```
FetchNode → ParseNode → GenerateAnswerNode
```

- **FetchNode**: Loads the Markdown file(s)
- **ParseNode**: Parses the Markdown content without HTML parsing
- **GenerateAnswerNode**: Extracts information based on the prompt (with the `is_md_scraper=True` flag)
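As a toy illustration of that pipeline, the stubs below stand in for the real nodes (which live in `scrapegraphai.nodes` and carry far more configuration); they only show how state flows from node to node:

```python
def fetch_node(state):
    # FetchNode would load the .md file(s) from disk here.
    state["doc"] = "# Demo\nA small Markdown document."
    return state

def parse_node(state):
    # ParseNode keeps the Markdown as-is (no HTML parsing) and chunks it.
    state["parsed_doc"] = [state["doc"]]
    return state

def generate_answer_node(state):
    # Stand-in for the LLM call driven by the user's prompt.
    state["answer"] = f"{len(state['parsed_doc'])} chunk(s) analyzed"
    return state

state = {"user_prompt": "Summarize the document"}
for node in (fetch_node, parse_node, generate_answer_node):
    state = node(state)
print(state["answer"])  # 1 chunk(s) analyzed
```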
## Use Cases

- **Documentation Analysis**: Extract information from technical documentation
- **README Parsing**: Parse project README files
- **Knowledge Base Querying**: Query Markdown-based knowledge bases
- **Content Migration**: Extract structured data from Markdown content
- **Documentation Generation**: Extract info to generate other doc formats
## Advanced Usage

### With Additional Context

```python
config = {
    "llm": {"model": "openai/gpt-4o"},
    "additional_info": """
    This is technical documentation for a Python library.
    Focus on code examples and API specifications.
    """
}

doc_scraper = DocumentScraperGraph(
    prompt="Extract all function signatures with their parameters and descriptions",
    source="docs/api-reference.md",
    config=config
)
```
### Complex Schema for API Documentation

```python
from pydantic import BaseModel
from typing import List, Optional

class Parameter(BaseModel):
    name: str
    type: str
    required: bool
    description: str

class APIEndpoint(BaseModel):
    method: str
    path: str
    parameters: List[Parameter]
    returns: str
    description: str
    example: Optional[str] = None

class APIDocumentation(BaseModel):
    endpoints: List[APIEndpoint]
    base_url: Optional[str] = None

config = {
    "llm": {"model": "openai/gpt-4o"},
    "additional_info": "Extract RESTful API endpoint documentation"
}

doc_scraper = DocumentScraperGraph(
    prompt="Extract all API endpoints with complete documentation",
    source="docs/api.md",
    config=config,
    schema=APIDocumentation
)

result = doc_scraper.run()
```
### Tutorial Extraction

```python
from pydantic import BaseModel
from typing import List, Optional

class TutorialStep(BaseModel):
    step_number: int
    title: str
    description: str
    code: Optional[str] = None
    notes: Optional[str] = None

class Tutorial(BaseModel):
    title: str
    steps: List[TutorialStep]
    prerequisites: List[str]
    estimated_time: Optional[str] = None

doc_scraper = DocumentScraperGraph(
    prompt="Extract the complete tutorial with all steps, code examples, and prerequisites",
    source="docs/tutorial.md",
    config=graph_config,
    schema=Tutorial
)

result = doc_scraper.run()
```
### FAQ Extraction

```python
from pydantic import BaseModel
from typing import List, Optional

class FAQItem(BaseModel):
    question: str
    answer: str
    category: Optional[str] = None

class FAQ(BaseModel):
    items: List[FAQItem]
    total_count: int

doc_scraper = DocumentScraperGraph(
    prompt="Extract all FAQ items with questions and answers, and categorize them if possible",
    source="docs/faq.md",
    config=graph_config,
    schema=FAQ
)

result = doc_scraper.run()
```
## Accessing Results

```python
result = doc_scraper.run()

# Get the answer
print("Answer:", result)

# Access the full state
final_state = doc_scraper.get_state()
raw_doc = final_state.get("doc")
parsed_doc = final_state.get("parsed_doc")
answer = final_state.get("answer")

print(f"Document length: {len(str(raw_doc))} characters")
print(f"Parsed chunks: {len(parsed_doc) if isinstance(parsed_doc, list) else 1}")

# Execution info
exec_info = doc_scraper.get_execution_info()
for node_info in exec_info:
    print(f"{node_info['node_name']}: {node_info['exec_time']:.2f}s")
    print(f"Tokens: {node_info['total_tokens']}")
    print(f"Cost: ${node_info['total_cost_USD']:.4f}")
```
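Because each execution-info entry is a plain dict, totals are easy to aggregate. The sample data below is fabricated for illustration:

```python
# Fabricated exec_info entries shaped like the fields used above.
exec_info = [
    {"node_name": "FetchNode", "exec_time": 0.12,
     "total_tokens": 0, "total_cost_USD": 0.0},
    {"node_name": "ParseNode", "exec_time": 0.05,
     "total_tokens": 0, "total_cost_USD": 0.0},
    {"node_name": "GenerateAnswerNode", "exec_time": 2.31,
     "total_tokens": 1850, "total_cost_USD": 0.0241},
]

total_time = sum(n["exec_time"] for n in exec_info)
total_cost = sum(n["total_cost_USD"] for n in exec_info)
print(f"{total_time:.2f}s, ${total_cost:.4f}")  # 2.48s, $0.0241
```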
## Working with Large Documents

```python
config = {
    "llm": {
        "model": "openai/gpt-4o",
        "model_tokens": 128000  # Use a model with a larger context window
    },
    "additional_info": "Focus on the most relevant sections for the query"
}

doc_scraper = DocumentScraperGraph(
    prompt="Summarize the key architectural decisions",
    source="docs/architecture.md",
    config=config
)
```
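If the model's window still isn't enough, a document can be pre-chunked before scraping. Below is a naive sketch that splits on H2 headers; the real ParseNode chunks by token count, so treat this as illustrative only:

```python
def split_by_headers(markdown: str) -> list[str]:
    """Split a large Markdown document at H2 headers so each piece can be
    scraped separately. Naive: ignores fenced code blocks containing '## '."""
    sections, current = [], []
    for line in markdown.splitlines():
        if line.startswith("## ") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections

doc = "# Title\nintro\n## A\naaa\n## B\nbbb"
print(len(split_by_headers(doc)))  # 3
```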
## Markdown Features Support

DocumentScraperGraph handles various Markdown features:

- **Headers (H1-H6)**: Structural navigation
- **Code blocks**: Both inline and fenced
- **Lists**: Ordered and unordered
- **Tables**: Data extraction
- **Links**: Reference extraction
- **Quotes**: Emphasis detection
- **Bold/Italic**: Text emphasis
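When a prompt targets code specifically, it can help to see what fenced blocks look like to a parser. The helper below is library-independent and only illustrates the idea:

```python
import re

FENCE = "`" * 3  # a literal triple-backtick fence

def extract_code_blocks(markdown: str):
    """Pull (language, code) pairs out of fenced code blocks."""
    pattern = re.compile(FENCE + r"(\w*)\n(.*?)" + FENCE, re.DOTALL)
    return [(lang or "text", code.strip())
            for lang, code in pattern.findall(markdown)]

doc = f"Intro\n{FENCE}python\nprint('hi')\n{FENCE}\nand\n{FENCE}\nplain\n{FENCE}"
print(extract_code_blocks(doc))  # [('python', "print('hi')"), ('text', 'plain')]
```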
## Error Handling

```python
import os

try:
    # Check that the file exists before running
    if not os.path.exists("docs/file.md"):
        raise FileNotFoundError("Markdown file not found")

    result = doc_scraper.run()

    if result == "No answer found.":
        print("Failed to extract information from document")
    else:
        print(f"Success: {result}")
except FileNotFoundError as e:
    print(f"File error: {e}")
except Exception as e:
    print(f"Error during processing: {e}")
```
## Tips for Better Results

- **Understand structure**: Review the document's structure before querying
- **Be specific**: Clear prompts get better answers
- **Use a schema**: Define schemas for type-safe output
- **Section targeting**: Reference specific sections in your prompt
- **Provide context**: Use `additional_info` for domain knowledge
- **Test queries**: Start simple and iterate
- **Handle code blocks**: Specify explicitly if you want code extracted
## Performance Considerations

- **Document Size**: Large documents may exceed LLM context limits
- **Multiple Files**: Processing multiple files increases execution time
- **Code Blocks**: Many code blocks increase token usage
- **Complex Queries**: More complex extraction requires more tokens
## Format Comparison

| Format | Use Case | Complexity | DocumentScraperGraph Support |
|---|---|---|---|
| Markdown | Documentation | Low-Medium | Yes (this graph) |
| HTML | Web pages | Medium-High | Use SmartScraperGraph |
| JSON | Structured data | Low | Use JSONScraperGraph |
| CSV | Tabular data | Low | Use CSVScraperGraph |
| XML | Config/Data | Medium | Use XMLScraperGraph |
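The table suggests a simple dispatch rule. `pick_graph` below is a hypothetical helper sketching that rule; the graph class names are real, but the function itself is not part of the library:

```python
GRAPH_BY_EXTENSION = {
    ".md": "DocumentScraperGraph",
    ".json": "JSONScraperGraph",
    ".csv": "CSVScraperGraph",
    ".xml": "XMLScraperGraph",
}

def pick_graph(source: str) -> str:
    """Map a source to the graph class suggested by the table above."""
    if source.startswith("http"):
        return "SmartScraperGraph"  # web pages go through the HTML graph
    for ext, graph in GRAPH_BY_EXTENSION.items():
        if source.endswith(ext):
            return graph
    return "DocumentScraperGraph"  # e.g., a directory of Markdown files

print(pick_graph("docs/api.md"))          # DocumentScraperGraph
print(pick_graph("https://example.com"))  # SmartScraperGraph
```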