Overview

DocumentScraperGraph is specialized for extracting information from plain text documents and markdown files. Unlike SmartScraperGraph, which fetches and parses HTML, this graph works directly on raw text content, avoiding the parsing overhead entirely.

Features

  • Extract structured data from plain text and markdown
  • No HTML parsing overhead
  • Efficient text chunking for large documents
  • Schema-based output for structured data
  • Supports both single files and directories

Parameters

The DocumentScraperGraph constructor accepts the following parameters:
DocumentScraperGraph(
    prompt: str,              # Natural language description of what to extract
    source: str,              # Text content, .md file path, or directory
    config: dict,             # Configuration dictionary
    schema: Optional[BaseModel] = None  # Pydantic schema for structured output
)

Configuration Options

| Parameter         | Type | Default  | Description                               |
| ----------------- | ---- | -------- | ----------------------------------------- |
| `llm`             | dict | Required | LLM model configuration                   |
| `verbose`         | bool | `False`  | Enable detailed logging                   |
| `additional_info` | str  | `None`   | Additional context for the LLM            |
| `loader_kwargs`   | dict | `{}`     | Additional arguments for document loading |
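
A configuration dictionary exercising these options might look like the sketch below. The model name, API key source, and `additional_info` text are placeholders; substitute your own values.

```python
import os

# Example configuration using the options above.
# Model name and API key are placeholders; substitute your own.
graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o",
    },
    "verbose": True,                                          # detailed logging
    "additional_info": "The documents are API changelogs.",   # extra context for the LLM
    "loader_kwargs": {},                                      # passed through to the document loader
}
```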

Usage Examples

import os
import json
from dotenv import load_dotenv
from scrapegraphai.graphs import DocumentScraperGraph

load_dotenv()

openai_key = os.getenv("OPENAI_API_KEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "openai/gpt-4o",
    }
}

# Example: Direct text input
source = """
    The Divine Comedy, Italian La Divina Commedia, original name La commedia, 
    long narrative poem written in Italian circa 1308/21 by Dante. It is usually 
    held to be one of the world's great works of literature. Divided into three 
    major sections—Inferno, Purgatorio, and Paradiso—the narrative traces the 
    journey of Dante from darkness and error to the revelation of the divine light, 
    culminating in the Beatific Vision of God. Dante is guided by the Roman poet 
    Virgil, who represents the epitome of human knowledge, from the dark wood 
    through the descending circles of the pit of Hell (Inferno). He then climbs 
    the mountain of Purgatory, guided by the Roman poet Statius, who represents 
    the fulfilment of human knowledge, and is finally led by his lifelong love, 
    the Beatrice of his earlier poetry, through the celestial spheres of Paradise.
"""

document_scraper = DocumentScraperGraph(
    prompt="Summarize the text and find the main topics",
    source=source,
    config=graph_config,
)

result = document_scraper.run()
print(json.dumps(result, indent=4))

Input Types

Direct Text

Pass text content directly as a string:
text_content = """
Your document content here.
Can span multiple lines.
"""

document_scraper = DocumentScraperGraph(
    prompt="Extract key information",
    source=text_content,
    config=graph_config,
)

Markdown File

Process a single markdown file:
document_scraper = DocumentScraperGraph(
    prompt="Summarize the documentation",
    source="/path/to/document.md",
    config=graph_config,
)

Directory of Files

Process all markdown files in a directory:
document_scraper = DocumentScraperGraph(
    prompt="Extract all code examples",
    source="/path/to/docs/directory/",
    config=graph_config,
)

Schema-Based Extraction

Use Pydantic schemas for structured output:
import os

from pydantic import BaseModel, Field
from typing import List

class Topic(BaseModel):
    name: str = Field(description="Topic name")
    description: str = Field(description="Topic description")

class DocumentSummary(BaseModel):
    title: str = Field(description="Document title")
    main_topics: List[Topic] = Field(description="Main topics discussed")
    key_takeaways: List[str] = Field(description="Key takeaways")

graph_config = {
    "llm": {
        "model": "openai/gpt-4o",
        "api_key": os.getenv("OPENAI_API_KEY"),
    },
}

document_text = """
[Your document text here]
"""

document_scraper = DocumentScraperGraph(
    prompt="Analyze this document",
    source=document_text,
    config=graph_config,
    schema=DocumentSummary
)

result = document_scraper.run()
# Returns a dictionary shaped like the DocumentSummary schema

Advanced Examples

Extract Code Snippets

from pydantic import BaseModel
from typing import List

class CodeSnippet(BaseModel):
    language: str
    code: str
    description: str

class CodeExamples(BaseModel):
    snippets: List[CodeSnippet]

document_scraper = DocumentScraperGraph(
    prompt="Extract all code examples with their programming language and description",
    source="/path/to/technical/doc.md",
    config=graph_config,
    schema=CodeExamples
)

result = document_scraper.run()
for snippet in result['snippets']:
    print(f"Language: {snippet['language']}")
    print(f"Code: {snippet['code']}")
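
Because the result is a plain dictionary, standard post-processing applies. As one illustrative example (the sample data below mimics the shape of the `CodeExamples` schema above), snippets can be grouped by language:

```python
from collections import defaultdict

# Sample result shaped like the CodeExamples schema above (illustration only).
result = {
    "snippets": [
        {"language": "python", "code": "print('hi')", "description": "greeting"},
        {"language": "bash", "code": "ls -la", "description": "list files"},
        {"language": "python", "code": "x = 1", "description": "assignment"},
    ]
}

# Group snippet code by language for easier downstream use.
by_language = defaultdict(list)
for snippet in result["snippets"]:
    by_language[snippet["language"]].append(snippet["code"])

print(sorted(by_language))          # → ['bash', 'python']
print(len(by_language["python"]))   # → 2
```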

Research Paper Analysis

from pydantic import BaseModel
from typing import List

class ResearchPaper(BaseModel):
    title: str
    authors: List[str]
    abstract: str
    methodology: str
    key_findings: List[str]
    conclusions: str

paper_text = """
[Your research paper text here]
"""

document_scraper = DocumentScraperGraph(
    prompt="Analyze this research paper and extract structured information",
    source=paper_text,
    config=graph_config,
    schema=ResearchPaper
)

result = document_scraper.run()

Meeting Notes Extraction

from pydantic import BaseModel
from typing import List

class MeetingNotes(BaseModel):
    date: str
    attendees: List[str]
    topics_discussed: List[str]
    action_items: List[str]
    decisions_made: List[str]

meeting_notes = """
[Your meeting notes here]
"""

document_scraper = DocumentScraperGraph(
    prompt="Extract meeting information including attendees, topics, and action items",
    source=meeting_notes,
    config=graph_config,
    schema=MeetingNotes
)

result = document_scraper.run()
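
The structured output is easy to turn into other formats. A small sketch, using sample data shaped like the `MeetingNotes` schema above, rendering action items as a markdown checklist:

```python
# Sample result shaped like the MeetingNotes schema above (illustration only).
result = {
    "date": "2024-05-01",
    "attendees": ["Ada", "Grace"],
    "topics_discussed": ["Roadmap"],
    "action_items": ["Ship v2 docs", "Review PR backlog"],
    "decisions_made": ["Adopt weekly triage"],
}

# Render action items as a markdown checklist.
checklist = "\n".join(f"- [ ] {item}" for item in result["action_items"])
print(checklist)
# → - [ ] Ship v2 docs
# → - [ ] Review PR backlog
```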

How It Works

  1. Fetch: Loads text content from string, file, or directory
  2. Parse: Chunks text without HTML parsing (more efficient)
  3. Generate: Extracts information based on your prompt
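
The Parse step can be pictured with a simplified, hypothetical sketch: the library's actual chunker works on model tokens, but word-count-based chunking conveys the idea.

```python
def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Split text into chunks of at most max_words words.

    Simplified illustration of the Parse step; the library's real
    chunker operates on model tokens, not words.
    """
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]

chunks = chunk_text("one two three four five", max_words=2)
print(chunks)  # → ['one two', 'three four', 'five']
```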

Performance Benefits

DocumentScraperGraph is optimized for text:
| Feature         | DocumentScraperGraph | SmartScraperGraph |
| --------------- | -------------------- | ----------------- |
| HTML Parsing    | ❌ No                | ✅ Yes            |
| Browser Loading | ❌ No                | ✅ Yes            |
| Speed           | ⚡ Faster            | 🐌 Slower         |
| Best For        | Text/Markdown        | HTML/Web pages    |
Use DocumentScraperGraph for pure text content to avoid unnecessary HTML parsing overhead.

Output Format

The run() method returns extracted data:
result = document_scraper.run()
# Returns: dictionary with the extracted information;
# when a schema is provided, the dictionary follows that schema's shape

Use Cases

  • Documentation Analysis: Extract information from technical documentation
  • Research: Analyze research papers and extract key findings
  • Meeting Notes: Structure unstructured meeting notes
  • Content Summarization: Summarize long-form text content
  • Code Documentation: Extract code examples from documentation
  • Legal Documents: Extract clauses and key terms
  • Academic Papers: Structure abstracts, methodologies, and conclusions

Processing Multiple Documents

# Process all markdown files in a directory
docs_dir = "/path/to/documentation/"

document_scraper = DocumentScraperGraph(
    prompt="Extract all API endpoints and their descriptions",
    source=docs_dir,
    config=graph_config,
)

result = document_scraper.run()
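
If you prefer per-file results instead of passing the whole directory, you can collect the markdown files yourself and run one graph per file. A sketch (the helper name and the commented loop are illustrative, not library API):

```python
import tempfile
from pathlib import Path

def list_markdown_files(docs_dir: str) -> list[str]:
    """Collect markdown file names under docs_dir, sorted for stable ordering."""
    return sorted(p.name for p in Path(docs_dir).rglob("*.md"))

# Quick demonstration on a throwaway directory:
with tempfile.TemporaryDirectory() as d:
    for name in ("b.md", "a.md", "notes.txt"):
        (Path(d) / name).write_text("sample")
    found = list_markdown_files(d)
print(found)  # → ['a.md', 'b.md']

# Per-file scraping loop (requires a configured graph_config):
# results = {}
# for path in list_markdown_files("/path/to/documentation/"):
#     scraper = DocumentScraperGraph(
#         prompt="Extract all API endpoints and their descriptions",
#         source=path,
#         config=graph_config,
#     )
#     results[path] = scraper.run()
```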

Error Handling

try:
    result = document_scraper.run()
    if result:
        print("Extraction successful:", result)
    else:
        print("No information extracted")
except FileNotFoundError:
    print("Document file not found")
except Exception as e:
    print(f"Error during extraction: {e}")
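
LLM API calls can fail transiently, so wrapping `run()` in a small retry helper is often worthwhile. This is a generic pattern, not part of the library; the demonstration uses a stub in place of a real scraper:

```python
import time

def run_with_retries(run_fn, max_attempts=3, delay_seconds=0.0):
    """Call run_fn, retrying on exceptions up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_fn()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(delay_seconds)

# Demonstration with a stub that fails once, then succeeds
# (in real use, pass e.g. document_scraper.run):
calls = {"n": 0}
def flaky_run():
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient error")
    return {"status": "ok"}

result = run_with_retries(flaky_run)
print(result, calls["n"])  # → {'status': 'ok'} 2
```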

Performance Tips

  • DocumentScraperGraph is faster than SmartScraperGraph for text content
  • No browser overhead or HTML parsing
  • Use model_tokens to control chunk size for large documents
  • Provide specific prompts for better extraction accuracy

Related Pages

  • SmartScraperGraph: for HTML/web page scraping
  • ScriptCreatorGraph: generate scraping scripts
