Overview
JSONScraperGraph defines a scraping pipeline for JSON files. It allows you to query JSON data using natural language without writing complex JSON path queries or parsing code.
Class Signature
class JSONScraperGraph(AbstractGraph):
    def __init__(
        self,
        prompt: str,
        source: str,
        config: dict,
        schema: Optional[Type[BaseModel]] = None,
    )
Constructor Parameters
prompt (str): The natural language query to extract information from the JSON file.
source (str): The source JSON file or directory. Can be:
- Path to a single JSON file (e.g., "data.json")
- Path to a directory containing multiple JSON files (e.g., "./json_data/")
config (dict): Configuration parameters for the graph. Must include:
- llm: LLM configuration (e.g., {"model": "openai/gpt-4o"})
Optional parameters:
- verbose (bool): Enable detailed logging
- headless (bool): Run in headless mode
- additional_info (str): Extra context for the LLM
schema (Type[BaseModel], default None): Optional Pydantic model defining the expected output structure.
Attributes
source: The JSON file path or directory path.
config: Configuration dictionary for the graph.
schema: Optional output schema for structured data extraction.
llm_model: The configured language model instance.
input_key: Either "json" (single file) or "json_dir" (directory), chosen based on the source.
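For illustration, the selection rule can be sketched as follows (a hypothetical helper mirroring the documented behavior, not part of the library's API):

```python
def resolve_input_key(source: str) -> str:
    # Hypothetical helper: a path ending in ".json" is treated as a single
    # file, anything else as a directory of JSON files.
    return "json" if source.endswith(".json") else "json_dir"
```

So "data/chioggia.json" resolves to "json", while "./json_data/" resolves to "json_dir".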
Methods
run()
Executes the JSON querying process and returns the answer.
The extracted information from the JSON file(s), or "No answer found." if extraction fails.
Basic Usage
from scrapegraphai.graphs import JSONScraperGraph

graph_config = {
    "llm": {
        "model": "openai/gpt-4o",
        "api_key": "your-api-key"
    }
}

json_scraper = JSONScraperGraph(
    prompt="List all the attractions in Chioggia.",
    source="data/chioggia.json",
    config=graph_config
)

result = json_scraper.run()
print(result)
Example JSON File
{
  "city": "Chioggia",
  "country": "Italy",
  "population": 50000,
  "attractions": [
    {
      "name": "Piazza Vigo",
      "type": "square",
      "description": "Main square with historic buildings",
      "rating": 4.5
    },
    {
      "name": "Museo della Laguna Sud",
      "type": "museum",
      "description": "Museum showcasing local history",
      "rating": 4.2
    },
    {
      "name": "Cathedral of Santa Maria",
      "type": "church",
      "description": "Historic cathedral from the 11th century",
      "rating": 4.7
    }
  ],
  "restaurants": [
    {
      "name": "Ristorante El Gato",
      "cuisine": "Italian",
      "price_range": "$$",
      "rating": 4.6
    }
  ]
}
Query Examples
json_scraper = JSONScraperGraph(
    prompt="What is the population of the city?",
    source="data/chioggia.json",
    config=graph_config
)
result = json_scraper.run()
# Output: "The population is 50,000"

Nested Data Access
json_scraper = JSONScraperGraph(
    prompt="List all museum attractions with their ratings",
    source="data/chioggia.json",
    config=graph_config
)
result = json_scraper.run()

Aggregation
json_scraper = JSONScraperGraph(
    prompt="What is the average rating of all attractions?",
    source="data/chioggia.json",
    config=graph_config
)
result = json_scraper.run()

Filtering
json_scraper = JSONScraperGraph(
    prompt="List all attractions with a rating above 4.5",
    source="data/chioggia.json",
    config=graph_config
)
result = json_scraper.run()
Structured Output with Schema
from pydantic import BaseModel, Field
from typing import List

class Attraction(BaseModel):
    name: str = Field(description="Attraction name")
    type: str = Field(description="Type of attraction")
    rating: float = Field(description="Rating score")

class AttractionList(BaseModel):
    attractions: List[Attraction]
    count: int = Field(description="Total number of attractions")

json_scraper = JSONScraperGraph(
    prompt="Extract all attractions with their details",
    source="data/chioggia.json",
    config=graph_config,
    schema=AttractionList
)

result = json_scraper.run()
# Result is automatically validated against the schema
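If you ever need to re-validate a raw result yourself, the same schema can be applied directly with Pydantic. The sketch below assumes Pydantic v2's `model_validate` (on v1, use `parse_obj` instead); the `raw` dict is illustrative:

```python
from typing import List
from pydantic import BaseModel

class Attraction(BaseModel):
    name: str
    rating: float

class AttractionList(BaseModel):
    attractions: List[Attraction]
    count: int

# A raw dict shaped like the scraper's output for the sample file
raw = {"attractions": [{"name": "Piazza Vigo", "rating": 4.5}], "count": 1}
validated = AttractionList.model_validate(raw)  # raises ValidationError on mismatch
```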
Complex JSON Structures
{
  "api_response": {
    "status": "success",
    "data": {
      "users": [
        {
          "id": 1,
          "name": "John Doe",
          "profile": {
            "email": "[email protected]",
            "addresses": [
              {
                "type": "home",
                "city": "New York",
                "country": "USA"
              }
            ]
          }
        }
      ]
    },
    "metadata": {
      "total_records": 1,
      "page": 1
    }
  }
}

json_scraper = JSONScraperGraph(
    prompt="Extract all user email addresses and their cities",
    source="data/api_response.json",
    config=graph_config
)
result = json_scraper.run()
Multiple JSON Files
# Directory structure:
# json_data/
# ├── users_2023.json
# ├── users_2024.json
# └── users_2025.json
json_scraper = JSONScraperGraph(
    prompt="Count total users across all files",
    source="./json_data/",  # Directory path
    config=graph_config
)
result = json_scraper.run()
# Automatically processes all JSON files in the directory
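A directory source is expanded to the *.json files it contains; a rough equivalent of that expansion step (a hypothetical helper, not the library's actual API) looks like this:

```python
from pathlib import Path

def list_json_files(directory: str) -> list[str]:
    # Hypothetical helper: collect every .json file in the directory,
    # in a stable (sorted) order.
    return sorted(str(p) for p in Path(directory).glob("*.json"))
```

Non-JSON files in the directory are simply ignored.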
Graph Workflow
The JSONScraperGraph uses a simple node pipeline:
FetchNode → GenerateAnswerNode
- FetchNode: Loads and parses the JSON file(s)
- GenerateAnswerNode: Processes the JSON data and answers the query
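As a rough sketch of that flow (these are stand-in functions, not the library's actual node classes), each node receives a shared state dict, enriches it, and passes it along:

```python
import json

def fetch_node(state: dict) -> dict:
    # Stand-in for FetchNode: loads and parses the JSON file named in the state
    with open(state["source"], "r") as f:
        state["doc"] = json.load(f)
    return state

def generate_answer_node(state: dict) -> dict:
    # Stand-in for GenerateAnswerNode: in the real graph an LLM answers the
    # prompt against the parsed document; here a stub reports what it would do.
    state["answer"] = (
        f"Would answer {state['prompt']!r} over "
        f"{len(str(state['doc']))} characters of JSON"
    )
    return state

def run_pipeline(state: dict) -> str:
    # Run the two nodes in sequence, threading the state through
    for node in (fetch_node, generate_answer_node):
        state = node(state)
    return state["answer"]
```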
Advanced Usage
With Additional Context
config = {
    "llm": {"model": "openai/gpt-4o"},
    "additional_info": """
    This JSON contains e-commerce product data.
    Prices are in USD. Focus on products with high ratings.
    """
}

json_scraper = JSONScraperGraph(
    prompt="Identify top-rated products under $100",
    source="data/products.json",
    config=config
)
API Response Analysis
from pydantic import BaseModel
from typing import List, Optional

class ErrorInfo(BaseModel):
    code: str
    message: str
    field: Optional[str] = None

class APIAnalysis(BaseModel):
    status: str
    success_rate: float
    error_types: List[ErrorInfo]
    insights: str

config = {
    "llm": {"model": "openai/gpt-4o"},
    "additional_info": "Analyze API responses for error patterns"
}

json_scraper = JSONScraperGraph(
    prompt="Analyze the API responses and identify common error patterns",
    source="data/api_logs.json",
    config=config,
    schema=APIAnalysis
)

result = json_scraper.run()
Use Cases
- API Response Analysis: Query and analyze API responses
- Configuration Files: Extract information from config files
- Data Exploration: Explore JSON datasets without coding
- Log Analysis: Query application logs in JSON format
- Data Transformation: Extract and transform JSON data
Example: E-commerce Product Analysis
{
  "products": [
    {
      "id": "P001",
      "name": "Laptop Pro",
      "category": "Electronics",
      "price": 1299.99,
      "stock": 45,
      "ratings": {
        "average": 4.5,
        "count": 230
      },
      "reviews": [
        {"rating": 5, "comment": "Excellent!"},
        {"rating": 4, "comment": "Good value"}
      ]
    }
  ]
}

from pydantic import BaseModel
from typing import List

class ProductInsight(BaseModel):
    product: str
    price: float
    rating: float
    review_sentiment: str

class Insights(BaseModel):
    top_products: List[ProductInsight]
    categories_summary: dict
    recommendations: str

json_scraper = JSONScraperGraph(
    prompt="""Analyze products and provide:
    1. Top 3 products by rating
    2. Summary by category
    3. Recommendations for inventory
    """,
    source="data/products.json",
    config=graph_config,
    schema=Insights
)

result = json_scraper.run()
Accessing Results
result = json_scraper.run()
# Get the answer
print("Answer:", result)
# Access full state
final_state = json_scraper.get_state()
raw_data = final_state.get("doc")
answer = final_state.get("answer")
print(f"Processed JSON data size: {len(str(raw_data))} characters")
# Execution info
exec_info = json_scraper.get_execution_info()
for node_info in exec_info:
    print(f"{node_info['node_name']}: {node_info['exec_time']:.2f}s")
    print(f"Tokens: {node_info['total_tokens']}")
    print(f"Cost: ${node_info['total_cost_USD']:.4f}")
Working with Large JSON Files
config = {
    "llm": {
        "model": "openai/gpt-4o",
        "model_tokens": 128000  # Use a model with a larger context window
    },
    "additional_info": "Focus on the most relevant data for the query"
}

json_scraper = JSONScraperGraph(
    prompt="Summarize key metrics from the data",
    source="data/large_dataset.json",
    config=config
)
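Before handing a very large file to the graph, a cheap pre-flight size check can flag inputs likely to blow the context budget. This is a hypothetical helper, not a library feature; the 4-bytes-per-token figure is only a common rule of thumb:

```python
import os

def may_exceed_context(path: str, max_bytes: int = 400_000) -> bool:
    # Hypothetical pre-flight check: at roughly 4 bytes per token,
    # 400 kB is on the order of a 100k-token context window.
    return os.path.getsize(path) > max_bytes
```

If the check fires, consider pre-filtering the JSON down to the fields your query actually needs before running the scraper.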
Error Handling
import json
try:
    # Validate JSON first
    with open("data/file.json", "r") as f:
        json.load(f)  # Check if valid JSON

    result = json_scraper.run()

    if result == "No answer found.":
        print("Failed to extract information from JSON")
    else:
        print(f"Success: {result}")
except json.JSONDecodeError as e:
    print(f"Invalid JSON format: {e}")
except FileNotFoundError:
    print("JSON file not found")
except Exception as e:
    print(f"Error during processing: {e}")
Tips for Better Results
- Understand structure: Know your JSON structure before querying
- Be specific: Clear queries get better answers
- Use schema: Define schemas for type-safe output
- Provide context: Use additional_info for domain knowledge
- Test queries: Start simple and iterate
- Validate JSON: Ensure JSON is well-formed
Comparison with Other Formats
| Feature | JSON | CSV | XML |
|---|---|---|---|
| Structure | Nested/Hierarchical | Flat/Tabular | Nested/Hierarchical |
| Complexity | Medium-High | Low | High |
| Use Case | APIs, Config | Data tables | Documents, Config |
| Performance | Good | Fast | Slower |
Performance Considerations
- File Size: Large JSON files may exceed LLM context limits
- Nesting Depth: Deeply nested structures consume more tokens
- Multiple Files: Processing multiple files increases execution time
- Complex Queries: More complex analysis requires more tokens