Schemas
Schemas in ScrapeGraphAI use Pydantic models to define the structure and validation rules for extracted data. They ensure your scraping results are consistently formatted and type-safe.
Why Use Schemas?
Schemas provide several benefits:
Type Safety: enforce data types and validation rules
Structure: define the exact output format
Documentation: self-documenting data models
IDE Support: autocomplete and type hints
Basic Schema
Schemas are Pydantic BaseModel classes:
from pydantic import BaseModel, Field
class Product(BaseModel):
    name: str = Field(description="The product name")
    price: float = Field(description="Price in USD")
    available: bool = Field(description="Whether product is in stock")
Using the Schema
from scrapegraphai.graphs import SmartScraperGraph
graph_config = {
    "llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-..."}
}

scraper = SmartScraperGraph(
    prompt="Extract product information",
    source="https://example.com/product",
    config=graph_config,
    schema=Product  # Pass the schema class
)

result = scraper.run()
print(result)
# Output: {'name': 'Laptop', 'price': 999.99, 'available': True}
The LLM is automatically guided to generate output matching your schema structure.
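To see the structure the LLM is being steered toward, you can inspect the JSON schema Pydantic derives from your model. A minimal sketch using Pydantic v2's model_json_schema() (exactly how ScrapeGraphAI forwards this to the LLM is an internal detail and may differ):

```python
from pydantic import BaseModel, Field

class Product(BaseModel):
    name: str = Field(description="The product name")
    price: float = Field(description="Price in USD")
    available: bool = Field(description="Whether product is in stock")

# The generated JSON schema carries field types, descriptions,
# and the list of required fields -- this is what shapes the output
schema = Product.model_json_schema()
print(schema["properties"]["price"]["description"])  # Price in USD
print(schema["required"])  # ['name', 'price', 'available']
```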
Field Descriptions
Always include descriptions using Field(). These help the LLM understand what to extract:
from pydantic import BaseModel, Field
class Article(BaseModel):
    title: str = Field(
        description="The main headline or title of the article"
    )
    author: str = Field(
        description="Full name of the article author"
    )
    published_date: str = Field(
        description="Publication date in YYYY-MM-DD format"
    )
    summary: str = Field(
        description="A brief 2-3 sentence summary of the article content"
    )
    tags: list[str] = Field(
        description="List of relevant topic tags or categories"
    )
Clear, detailed descriptions significantly improve extraction accuracy.
Complex Schemas
Nested Objects
Create hierarchical data structures:
from pydantic import BaseModel, Field
from typing import List
class Address(BaseModel):
    street: str = Field(description="Street address")
    city: str = Field(description="City name")
    country: str = Field(description="Country name")
    postal_code: str = Field(description="Postal/ZIP code")

class Contact(BaseModel):
    email: str = Field(description="Email address")
    phone: str = Field(description="Phone number")

class Company(BaseModel):
    name: str = Field(description="Company name")
    description: str = Field(description="Company description")
    address: Address = Field(description="Company address")
    contact: Contact = Field(description="Contact information")
    employee_count: int = Field(description="Number of employees")

# Usage
scraper = SmartScraperGraph(
    prompt="Extract company information",
    source="https://example.com/about",
    config=graph_config,
    schema=Company
)

result = scraper.run()
print(result['address']['city'])  # Access nested data
Lists of Objects
Extract multiple items with a wrapper class:
from pydantic import BaseModel, Field
from typing import List
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD")
    rating: float = Field(description="Average rating out of 5")
    reviews: int = Field(description="Number of reviews")

class ProductList(BaseModel):
    products: List[Product] = Field(
        description="List of all products found on the page"
    )

# Usage
scraper = SmartScraperGraph(
    prompt="Extract all products from the catalog",
    source="https://example.com/products",
    config=graph_config,
    schema=ProductList
)

result = scraper.run()
for product in result['products']:
    print(f"{product['name']}: ${product['price']}")
When extracting multiple items, always use a wrapper class with a List field.
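Because the graph returns a plain dict, you can re-validate the result against the wrapper model before using it downstream. A small sketch, where the raw dict stands in for a scraper result:

```python
from typing import List
from pydantic import BaseModel, Field

class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD")

class ProductList(BaseModel):
    products: List[Product] = Field(
        description="List of all products found on the page"
    )

# Stand-in for a SmartScraperGraph result
raw = {"products": [{"name": "Laptop", "price": 999.99},
                    {"name": "Mouse", "price": 24.50}]}

# model_validate raises ValidationError if the structure is wrong
validated = ProductList.model_validate(raw)
print(validated.products[0].name)  # Laptop
```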
Optional Fields
Make fields optional when data might not be available:
from pydantic import BaseModel, Field
from typing import Optional
class JobPosting(BaseModel):
    title: str = Field(description="Job title")
    company: str = Field(description="Company name")
    location: str = Field(description="Job location")
    salary: Optional[str] = Field(
        default=None,
        description="Salary range if available"
    )
    remote: Optional[bool] = Field(
        default=None,
        description="Whether job is remote"
    )
    description: str = Field(description="Job description")
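If a page omits an optional field, validation still succeeds and the field is set to its default instead of raising an error. A quick sketch, using a trimmed-down JobPosting and a dict standing in for scraped data with no salary information:

```python
from typing import Optional
from pydantic import BaseModel, Field

class JobPosting(BaseModel):
    title: str = Field(description="Job title")
    salary: Optional[str] = Field(
        default=None,
        description="Salary range if available"
    )
    remote: Optional[bool] = Field(
        default=None,
        description="Whether job is remote"
    )

# 'salary' and 'remote' are absent from the data; validation still passes
job = JobPosting.model_validate({"title": "Backend Engineer"})
print(job.salary, job.remote)  # None None
```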
Default Values
Provide defaults for fields:
from pydantic import BaseModel, Field
class Review(BaseModel):
    author: str = Field(description="Reviewer name")
    rating: int = Field(description="Rating from 1-5")
    comment: str = Field(description="Review text")
    verified: bool = Field(
        default=False,
        description="Whether purchase is verified"
    )
    helpful_count: int = Field(
        default=0,
        description="Number of helpful votes"
    )
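When a field with a default is missing from the extracted data, the default value is filled in. A minimal demonstration with a trimmed-down Review model:

```python
from pydantic import BaseModel, Field

class Review(BaseModel):
    author: str = Field(description="Reviewer name")
    rating: int = Field(description="Rating from 1-5")
    verified: bool = Field(default=False, description="Whether purchase is verified")
    helpful_count: int = Field(default=0, description="Number of helpful votes")

# 'verified' and 'helpful_count' are missing, so defaults apply
review = Review.model_validate({"author": "Sam", "rating": 4})
print(review.verified, review.helpful_count)  # False 0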
Field Validation
Add custom validation rules:
from pydantic import BaseModel, Field, field_validator
from typing import List
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD", gt=0)  # Must be > 0
    rating: float = Field(
        description="Rating from 0-5",
        ge=0,  # Greater than or equal to 0
        le=5   # Less than or equal to 5
    )
    tags: List[str] = Field(description="Product tags")

    @field_validator('price')
    @classmethod
    def price_must_be_positive(cls, v):
        if v <= 0:
            raise ValueError('Price must be positive')
        return v

    @field_validator('tags')
    @classmethod
    def tags_must_not_be_empty(cls, v):
        if not v:
            raise ValueError('At least one tag required')
        return v
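Validators run whenever the model is instantiated or validated, so a malformed extraction fails loudly rather than slipping through. A sketch showing the tags validator rejecting an empty list (assuming Pydantic v2):

```python
from typing import List
from pydantic import BaseModel, Field, ValidationError, field_validator

class Product(BaseModel):
    name: str = Field(description="Product name")
    tags: List[str] = Field(description="Product tags")

    @field_validator('tags')
    @classmethod
    def tags_must_not_be_empty(cls, v):
        if not v:
            raise ValueError('At least one tag required')
        return v

try:
    Product.model_validate({"name": "Laptop", "tags": []})
except ValidationError as exc:
    # Each failed check appears as an entry in exc.errors()
    print(exc.errors()[0]["loc"])  # ('tags',)
```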
Common Patterns
E-commerce Product
from pydantic import BaseModel, Field
from typing import List, Optional
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Current price in USD")
    original_price: Optional[float] = Field(
        default=None,
        description="Original price if discounted"
    )
    description: str = Field(description="Product description")
    images: List[str] = Field(description="List of product image URLs")
    available: bool = Field(description="Whether in stock")
    rating: Optional[float] = Field(
        default=None,
        description="Average rating 0-5"
    )
    review_count: Optional[int] = Field(
        default=0,
        description="Number of reviews"
    )

class Products(BaseModel):
    products: List[Product]
News Article
from pydantic import BaseModel, Field
from typing import List, Optional
class Article(BaseModel):
    title: str = Field(description="Article headline")
    author: str = Field(description="Author name")
    published_date: str = Field(description="Publication date")
    category: str = Field(description="Article category")
    summary: str = Field(description="Brief summary")
    content: str = Field(description="Full article text")
    tags: List[str] = Field(description="Article tags")
    image_url: Optional[str] = Field(
        default=None,
        description="Featured image URL"
    )

class Articles(BaseModel):
    articles: List[Article]
Job Listings
from pydantic import BaseModel, Field
from typing import List, Optional
class Job(BaseModel):
    title: str = Field(description="Job title")
    company: str = Field(description="Company name")
    location: str = Field(description="Job location")
    salary_range: Optional[str] = Field(
        default=None,
        description="Salary range"
    )
    job_type: str = Field(description="Full-time, part-time, contract, etc.")
    remote: bool = Field(description="Whether job is remote")
    description: str = Field(description="Job description")
    requirements: List[str] = Field(description="Required qualifications")
    posted_date: str = Field(description="Date posted")

class JobListings(BaseModel):
    jobs: List[Job]
Restaurant Menu
from pydantic import BaseModel, Field
from typing import List, Optional

class MenuItem(BaseModel):
    name: str = Field(description="Dish name")
    description: str = Field(description="Dish description")
    price: float = Field(description="Price in USD")
    category: str = Field(description="Menu category (appetizer, entree, etc.)")
    dietary: Optional[List[str]] = Field(
        default=None,
        description="Dietary tags (vegetarian, vegan, gluten-free, etc.)"
    )

class Menu(BaseModel):
    restaurant_name: str = Field(description="Restaurant name")
    items: List[MenuItem] = Field(description="All menu items")
Schema with Search Graph
Schemas work with all graph types:
from pydantic import BaseModel, Field
from typing import List
from scrapegraphai.graphs import SearchGraph
class Event(BaseModel):
    name: str = Field(description="Event name")
    date: str = Field(description="Event date")
    location: str = Field(description="Event location")
    description: str = Field(description="Event description")

class Events(BaseModel):
    events: List[Event]

search_config = {
    "llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-..."},
    "verbose": True
}

search_graph = SearchGraph(
    prompt="Find upcoming tech conferences in 2024",
    config=search_config,
    schema=Events
)

result = search_graph.run()
print(result)
Schema with JSON Sources
Schemas also structure data from JSON scraping:
from pydantic import BaseModel, Field
from typing import List
from scrapegraphai.graphs import JSONScraperGraph

class User(BaseModel):
    id: int = Field(description="User ID")
    name: str = Field(description="Full name")
    email: str = Field(description="Email address")

class Users(BaseModel):
    users: List[User]

json_config = {
    "llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-..."},
}

json_scraper = JSONScraperGraph(
    prompt="Extract all user information",
    source="users.json",
    config=json_config,
    schema=Users
)

result = json_scraper.run()
Without Schemas
You can also scrape without defining a schema:
scraper = SmartScraperGraph(
    prompt="Extract product name, price, and description",
    source="https://example.com/product",
    config=graph_config
    # No schema parameter
)

result = scraper.run()
# LLM returns unstructured JSON
print(result)
Without a schema, output structure is less predictable. Schemas are recommended for production use.
Best Practices
Always Add Descriptions
# Good
name: str = Field(description="Product name as shown on the page")

# Bad
name: str
Descriptions guide the LLM on what to extract.
Use Descriptive Field Names
# Good
published_date: str = Field(description="Publication date")
author_name: str = Field(description="Author full name")

# Bad
dt: str
auth: str
Make Optional Fields Explicit
# Good
salary: Optional[str] = Field(default=None, description="Salary if listed")

# Bad (raises error if not found)
salary: str = Field(description="Salary")
Use Wrapper Classes for Lists
# Good
class Products(BaseModel):
    products: List[Product]

# Avoid returning List[Product] directly
Keep Schemas Focused
Don't try to extract everything. Create focused schemas for specific data:
# Good: Focused on products
class Product(BaseModel):
    name: str
    price: float
    available: bool

# Bad: Too many unrelated fields
class Page(BaseModel):
    product_name: str
    product_price: float
    site_title: str
    footer_text: str
    ad_content: str
Troubleshooting
Schema Not Being Followed
Check field descriptions - Make them clear and specific
Simplify the schema - Start simple, add complexity gradually
Verify data exists - Ensure the data is on the page
Use verbose mode - See what’s being sent to the LLM
config = {
    "llm": { ... },
    "verbose": True  # Enable detailed logging
}
Missing Optional Fields
Ensure optional fields have defaults:
# Correct
rating: Optional[float] = Field(default=None, description="Rating")

# Will error if not found
rating: Optional[float] = Field(description="Rating")
Validation Errors
Check field constraints:
price: float = Field(description="Price", gt=0)  # Must be positive
rating: float = Field(description="Rating", ge=0, le=5)  # 0-5 range
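When a constraint fails, the resulting ValidationError lists one entry per violated rule, which helps pinpoint whether the page data or the schema bounds are at fault. A sketch assuming Pydantic v2:

```python
from pydantic import BaseModel, Field, ValidationError

class Product(BaseModel):
    price: float = Field(description="Price", gt=0)
    rating: float = Field(description="Rating", ge=0, le=5)

try:
    Product(price=-1.0, rating=7.0)
except ValidationError as exc:
    # One entry per violated constraint: price > 0 and rating <= 5
    for err in exc.errors():
        print(err["loc"], err["type"])
```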
Next Steps
Configuration: learn about graph configuration
Examples: see complete schema examples
Pydantic Docs: learn more about Pydantic
API Reference: view API documentation