
Schemas

Schemas in ScrapeGraphAI use Pydantic models to define the structure and validation rules for extracted data. They ensure your scraping results are consistently formatted and type-safe.

Why Use Schemas?

Schemas provide several benefits:

  - Type Safety: enforce data types and validation rules
  - Structure: define the exact output format
  - Documentation: self-documenting data models
  - IDE Support: autocomplete and type hints

Basic Schema

Schemas are Pydantic BaseModel classes:
from pydantic import BaseModel, Field

class Product(BaseModel):
    name: str = Field(description="The product name")
    price: float = Field(description="Price in USD")
    available: bool = Field(description="Whether product is in stock")

Using the Schema

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-..."}
}

scraper = SmartScraperGraph(
    prompt="Extract product information",
    source="https://example.com/product",
    config=graph_config,
    schema=Product  # Pass the schema class
)

result = scraper.run()
print(result)
# Output: {'name': 'Laptop', 'price': 999.99, 'available': True}
The LLM is automatically guided to generate output matching your schema structure.
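Because the schema is a plain Pydantic model, you can also re-validate the returned dictionary yourself for type-safe downstream use. A minimal sketch, assuming Pydantic v2 and a hypothetical result dict shaped like the output above:

```python
from pydantic import BaseModel, Field

class Product(BaseModel):
    name: str = Field(description="The product name")
    price: float = Field(description="Price in USD")
    available: bool = Field(description="Whether product is in stock")

# Hypothetical dict as returned by scraper.run()
result = {"name": "Laptop", "price": 999.99, "available": True}

# Re-validate into a typed object: wrong types or missing fields fail loudly here
product = Product.model_validate(result)
print(product.price)      # typed attribute access instead of dict keys
print(product.available)
```

This gives you attribute access and validation errors at the boundary instead of silent `KeyError`s deeper in your pipeline.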

Field Descriptions

Always include descriptions using Field(). These help the LLM understand what to extract:
from pydantic import BaseModel, Field

class Article(BaseModel):
    title: str = Field(
        description="The main headline or title of the article"
    )
    author: str = Field(
        description="Full name of the article author"
    )
    published_date: str = Field(
        description="Publication date in YYYY-MM-DD format"
    )
    summary: str = Field(
        description="A brief 2-3 sentence summary of the article content"
    )
    tags: list[str] = Field(
        description="List of relevant topic tags or categories"
    )
Clear, detailed descriptions significantly improve extraction accuracy.
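The descriptions matter because they end up in the JSON Schema that guides the model. You can inspect what the LLM will see with Pydantic's `model_json_schema()` (a sketch assuming Pydantic v2):

```python
from pydantic import BaseModel, Field

class Article(BaseModel):
    title: str = Field(description="The main headline or title of the article")
    tags: list[str] = Field(description="List of relevant topic tags or categories")

schema = Article.model_json_schema()
# Each Field description is carried into the generated JSON Schema
print(schema["properties"]["title"]["description"])
print(schema["properties"]["tags"]["description"])
```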

Complex Schemas

Nested Objects

Create hierarchical data structures:
from pydantic import BaseModel, Field
from typing import List

class Address(BaseModel):
    street: str = Field(description="Street address")
    city: str = Field(description="City name")
    country: str = Field(description="Country name")
    postal_code: str = Field(description="Postal/ZIP code")

class Contact(BaseModel):
    email: str = Field(description="Email address")
    phone: str = Field(description="Phone number")

class Company(BaseModel):
    name: str = Field(description="Company name")
    description: str = Field(description="Company description")
    address: Address = Field(description="Company address")
    contact: Contact = Field(description="Contact information")
    employee_count: int = Field(description="Number of employees")

# Usage
scraper = SmartScraperGraph(
    prompt="Extract company information",
    source="https://example.com/about",
    config=graph_config,
    schema=Company
)

result = scraper.run()
print(result['address']['city'])  # Access nested data
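Nested schemas validate recursively, so you can also load the whole result into typed objects and use attribute access. A minimal sketch with a hypothetical result dict, assuming Pydantic v2:

```python
from pydantic import BaseModel, Field

class Address(BaseModel):
    street: str = Field(description="Street address")
    city: str = Field(description="City name")
    country: str = Field(description="Country name")
    postal_code: str = Field(description="Postal/ZIP code")

class Company(BaseModel):
    name: str = Field(description="Company name")
    address: Address = Field(description="Company address")

# Hypothetical scraper output
data = {
    "name": "Acme Corp",
    "address": {"street": "1 Main St", "city": "Berlin",
                "country": "Germany", "postal_code": "10115"},
}

# The nested dict is validated into an Address instance automatically
company = Company.model_validate(data)
print(company.address.city)
```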

Lists of Objects

Extract multiple items with a wrapper class:
from pydantic import BaseModel, Field
from typing import List

class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD")
    rating: float = Field(description="Average rating out of 5")
    reviews: int = Field(description="Number of reviews")

class ProductList(BaseModel):
    products: List[Product] = Field(
        description="List of all products found on the page"
    )

# Usage
scraper = SmartScraperGraph(
    prompt="Extract all products from the catalog",
    source="https://example.com/products",
    config=graph_config,
    schema=ProductList
)

result = scraper.run()
for product in result['products']:
    print(f"{product['name']}: ${product['price']}")
When extracting multiple items, always use a wrapper class with a List field.
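One payoff of the wrapper class is that the whole list validates in one step, after which you can work with typed items directly. A sketch with hypothetical data, assuming Pydantic v2:

```python
from typing import List
from pydantic import BaseModel, Field

class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD")

class ProductList(BaseModel):
    products: List[Product] = Field(
        description="List of all products found on the page"
    )

# Hypothetical scraper output
result = {"products": [{"name": "Mouse", "price": 19.99},
                       {"name": "Keyboard", "price": 49.99}]}

catalog = ProductList.model_validate(result)
# Every item is now a Product instance, so typed operations are safe
cheapest = min(catalog.products, key=lambda p: p.price)
print(cheapest.name)
```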

Optional Fields

Make fields optional when data might not be available:
from pydantic import BaseModel, Field
from typing import Optional

class JobPosting(BaseModel):
    title: str = Field(description="Job title")
    company: str = Field(description="Company name")
    location: str = Field(description="Job location")
    salary: Optional[str] = Field(
        default=None,
        description="Salary range if available"
    )
    remote: Optional[bool] = Field(
        default=None,
        description="Whether job is remote"
    )
    description: str = Field(description="Job description")

Default Values

Provide defaults for fields:
from pydantic import BaseModel, Field

class Review(BaseModel):
    author: str = Field(description="Reviewer name")
    rating: int = Field(description="Rating from 1-5")
    comment: str = Field(description="Review text")
    verified: bool = Field(
        default=False,
        description="Whether purchase is verified"
    )
    helpful_count: int = Field(
        default=0,
        description="Number of helpful votes"
    )

Field Validation

Add custom validation rules:
from pydantic import BaseModel, Field, field_validator
from typing import List

class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD", gt=0)  # Must be > 0
    rating: float = Field(
        description="Rating from 0-5",
        ge=0,  # Greater than or equal to 0
        le=5   # Less than or equal to 5
    )
    tags: List[str] = Field(description="Product tags")
    
    @field_validator('price')
    @classmethod
    def price_must_be_positive(cls, v):
        if v <= 0:
            raise ValueError('Price must be positive')
        return v
    
    @field_validator('tags')
    @classmethod
    def tags_must_not_be_empty(cls, v):
        if not v:
            raise ValueError('At least one tag required')
        return v
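When a constraint is violated, Pydantic raises a `ValidationError` rather than returning bad data. A minimal sketch of the failure path, assuming Pydantic v2 and using just the `gt=0` constraint from above:

```python
from pydantic import BaseModel, Field, ValidationError

class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD", gt=0)

def fails(data: dict) -> bool:
    """Return True if the data violates the schema's constraints."""
    try:
        Product.model_validate(data)
        return False
    except ValidationError:
        return True

print(fails({"name": "Broken", "price": -5}))  # True, gt=0 rejects -5
print(fails({"name": "OK", "price": 5}))       # False, passes validation
```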

Common Patterns

E-commerce Product

from pydantic import BaseModel, Field
from typing import List, Optional

class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Current price in USD")
    original_price: Optional[float] = Field(
        default=None,
        description="Original price if discounted"
    )
    description: str = Field(description="Product description")
    images: List[str] = Field(description="List of product image URLs")
    available: bool = Field(description="Whether in stock")
    rating: Optional[float] = Field(
        default=None,
        description="Average rating 0-5"
    )
    review_count: Optional[int] = Field(
        default=0,
        description="Number of reviews"
    )

class Products(BaseModel):
    products: List[Product]
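Once validated, patterns like this support simple derived values, e.g. a discount percentage from `price` and `original_price`. A hypothetical helper (not part of ScrapeGraphAI), assuming Pydantic v2:

```python
from typing import Optional
from pydantic import BaseModel, Field

class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Current price in USD")
    original_price: Optional[float] = Field(
        default=None,
        description="Original price if discounted"
    )

def discount_pct(p: Product) -> float:
    # 0.0 when the item is not discounted or has no original price
    if p.original_price is None or p.original_price <= 0:
        return 0.0
    return round(100 * (1 - p.price / p.original_price), 1)

sale = Product(name="Headphones", price=75.0, original_price=100.0)
print(discount_pct(sale))
```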

News Article

from pydantic import BaseModel, Field
from typing import List, Optional

class Article(BaseModel):
    title: str = Field(description="Article headline")
    author: str = Field(description="Author name")
    published_date: str = Field(description="Publication date")
    category: str = Field(description="Article category")
    summary: str = Field(description="Brief summary")
    content: str = Field(description="Full article text")
    tags: List[str] = Field(description="Article tags")
    image_url: Optional[str] = Field(
        default=None,
        description="Featured image URL"
    )

class Articles(BaseModel):
    articles: List[Article]

Job Listings

from pydantic import BaseModel, Field
from typing import List, Optional

class Job(BaseModel):
    title: str = Field(description="Job title")
    company: str = Field(description="Company name")
    location: str = Field(description="Job location")
    salary_range: Optional[str] = Field(
        default=None,
        description="Salary range"
    )
    job_type: str = Field(description="Full-time, part-time, contract, etc.")
    remote: bool = Field(description="Whether job is remote")
    description: str = Field(description="Job description")
    requirements: List[str] = Field(description="Required qualifications")
    posted_date: str = Field(description="Date posted")

class JobListings(BaseModel):
    jobs: List[Job]

Restaurant Menu

from pydantic import BaseModel, Field
from typing import List, Optional

class MenuItem(BaseModel):
    name: str = Field(description="Dish name")
    description: str = Field(description="Dish description")
    price: float = Field(description="Price in USD")
    category: str = Field(description="Menu category (appetizer, entree, etc.)")
    dietary: Optional[List[str]] = Field(
        default=None,
        description="Dietary tags (vegetarian, vegan, gluten-free, etc.)"
    )

class Menu(BaseModel):
    restaurant_name: str = Field(description="Restaurant name")
    items: List[MenuItem] = Field(description="All menu items")

Schema with Search Graph

Schemas work with all graph types:
from pydantic import BaseModel, Field
from typing import List
from scrapegraphai.graphs import SearchGraph

class Event(BaseModel):
    name: str = Field(description="Event name")
    date: str = Field(description="Event date")
    location: str = Field(description="Event location")
    description: str = Field(description="Event description")

class Events(BaseModel):
    events: List[Event]

search_config = {
    "llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-..."},
    "verbose": True
}

search_graph = SearchGraph(
    prompt="Find upcoming tech conferences in 2024",
    config=search_config,
    schema=Events
)

result = search_graph.run()
print(result)

Schema with JSON Sources

Schemas also structure data from JSON scraping:
from pydantic import BaseModel, Field
from typing import List
from scrapegraphai.graphs import JSONScraperGraph

class User(BaseModel):
    id: int = Field(description="User ID")
    name: str = Field(description="Full name")
    email: str = Field(description="Email address")

class Users(BaseModel):
    users: List[User]

json_config = {
    "llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-..."},
}

json_scraper = JSONScraperGraph(
    prompt="Extract all user information",
    source="users.json",
    config=json_config,
    schema=Users
)

result = json_scraper.run()

Without Schemas

You can also scrape without defining a schema:
scraper = SmartScraperGraph(
    prompt="Extract product name, price, and description",
    source="https://example.com/product",
    config=graph_config
    # No schema parameter
)

result = scraper.run()
# LLM returns unstructured JSON
print(result)
Without a schema, output structure is less predictable. Schemas are recommended for production use.

Best Practices

Write descriptive field descriptions:
# Good
name: str = Field(description="Product name as shown on the page")

# Bad
name: str
Descriptions guide the LLM on what to extract.

Use clear, descriptive field names:
# Good
published_date: str = Field(description="Publication date")
author_name: str = Field(description="Author full name")

# Bad
dt: str
auth: str

Mark fields optional when data may be missing:
# Good
salary: Optional[str] = Field(default=None, description="Salary if listed")

# Bad (raises error if not found)
salary: str = Field(description="Salary")

Wrap lists in a container model:
# Good
class Products(BaseModel):
    products: List[Product]

# Avoid returning List[Product] directly

Keep schemas focused. Don't try to extract everything; create focused schemas for specific data:
# Good: Focused on products
class Product(BaseModel):
    name: str
    price: float
    available: bool

# Bad: Too many unrelated fields
class Page(BaseModel):
    product_name: str
    product_price: float
    site_title: str
    footer_text: str
    ad_content: str

Troubleshooting

Schema Not Being Followed

  1. Check field descriptions - Make them clear and specific
  2. Simplify the schema - Start simple, add complexity gradually
  3. Verify data exists - Ensure the data is on the page
  4. Use verbose mode - See what’s being sent to the LLM
config = {
    "llm": {...},
    "verbose": True  # Enable detailed logging
}

Missing Optional Fields

Ensure optional fields have defaults:
# Correct
rating: Optional[float] = Field(default=None, description="Rating")

# Will error if not found
rating: Optional[float] = Field(description="Rating")
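The difference is that in Pydantic v2 an `Optional` annotation alone does not make a field optional; without a default it is still required. A sketch contrasting the two forms:

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class WithDefault(BaseModel):
    rating: Optional[float] = Field(default=None, description="Rating")

class NoDefault(BaseModel):
    rating: Optional[float] = Field(description="Rating")

# With a default, a missing field validates and falls back to None
print(WithDefault.model_validate({}).rating)

# Without a default, the field is still required despite Optional
try:
    NoDefault.model_validate({})
    missing_ok = True
except ValidationError:
    missing_ok = False
print(missing_ok)  # the missing field raises a ValidationError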

Validation Errors

Check field constraints:
price: float = Field(description="Price", gt=0)  # Must be positive
rating: float = Field(description="Rating", ge=0, le=5)  # 0-5 range

Next Steps

  - Configuration: learn about graph configuration
  - Examples: see complete schema examples
  - Pydantic Docs: learn more about Pydantic
  - API Reference: view API documentation
