Schemas
Schemas in ScrapeGraphAI use Pydantic models to define the structure and validation rules for extracted data. They ensure your scraping results are consistently formatted and type-safe.
Why Use Schemas?
Schemas provide several benefits:
Type Safety: enforce data types and validation rules
Structure: define the exact output format
Documentation: self-documenting data models
IDE Support: autocomplete and type hints
Basic Schema
Schemas are Pydantic BaseModel classes:
from pydantic import BaseModel, Field
class Product(BaseModel):
    name: str = Field(description="The product name")
    price: float = Field(description="Price in USD")
    available: bool = Field(description="Whether product is in stock")
Using the Schema
from scrapegraphai.graphs import SmartScraperGraph
graph_config = {
    "llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-..."}
}

scraper = SmartScraperGraph(
    prompt="Extract product information",
    source="https://example.com/product",
    config=graph_config,
    schema=Product  # Pass the schema class
)

result = scraper.run()
print(result)
# Output: {'name': 'Laptop', 'price': 999.99, 'available': True}
The LLM is automatically guided to generate output matching your schema structure.
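To see the structure the LLM is being steered toward, you can inspect the JSON schema Pydantic derives from your model. A minimal sketch using Pydantic v2's model_json_schema() (exactly how ScrapeGraphAI forwards this to the LLM is an internal detail and may differ):

```python
from pydantic import BaseModel, Field

class Product(BaseModel):
    name: str = Field(description="The product name")
    price: float = Field(description="Price in USD")
    available: bool = Field(description="Whether product is in stock")

# The generated JSON schema carries field types, descriptions,
# and the list of required fields -- this is what shapes the output
schema = Product.model_json_schema()
print(schema["properties"]["price"]["description"])  # Price in USD
print(schema["required"])  # ['name', 'price', 'available']
```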
Field Descriptions
Always include descriptions using Field(). These help the LLM understand what to extract:
from pydantic import BaseModel, Field
class Article(BaseModel):
    title: str = Field(
        description="The main headline or title of the article"
    )
    author: str = Field(
        description="Full name of the article author"
    )
    published_date: str = Field(
        description="Publication date in YYYY-MM-DD format"
    )
    summary: str = Field(
        description="A brief 2-3 sentence summary of the article content"
    )
    tags: list[str] = Field(
        description="List of relevant topic tags or categories"
    )
Clear, detailed descriptions significantly improve extraction accuracy.
Complex Schemas
Nested Objects
Create hierarchical data structures:
from pydantic import BaseModel, Field
from typing import List
class Address(BaseModel):
    street: str = Field(description="Street address")
    city: str = Field(description="City name")
    country: str = Field(description="Country name")
    postal_code: str = Field(description="Postal/ZIP code")

class Contact(BaseModel):
    email: str = Field(description="Email address")
    phone: str = Field(description="Phone number")

class Company(BaseModel):
    name: str = Field(description="Company name")
    description: str = Field(description="Company description")
    address: Address = Field(description="Company address")
    contact: Contact = Field(description="Contact information")
    employee_count: int = Field(description="Number of employees")

# Usage
scraper = SmartScraperGraph(
    prompt="Extract company information",
    source="https://example.com/about",
    config=graph_config,
    schema=Company
)

result = scraper.run()
print(result['address']['city'])  # Access nested data
Lists of Objects
Extract multiple items with a wrapper class:
from pydantic import BaseModel, Field
from typing import List
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD")
    rating: float = Field(description="Average rating out of 5")
    reviews: int = Field(description="Number of reviews")

class ProductList(BaseModel):
    products: List[Product] = Field(
        description="List of all products found on the page"
    )

# Usage
scraper = SmartScraperGraph(
    prompt="Extract all products from the catalog",
    source="https://example.com/products",
    config=graph_config,
    schema=ProductList
)

result = scraper.run()
for product in result['products']:
    print(f"{product['name']}: ${product['price']}")
When extracting multiple items, always use a wrapper class with a List field.
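Because the graph returns a plain dict, you can re-validate the result against the wrapper model before using it downstream. A small sketch, where the raw dict stands in for a scraper result:

```python
from typing import List
from pydantic import BaseModel, Field

class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD")

class ProductList(BaseModel):
    products: List[Product] = Field(
        description="List of all products found on the page"
    )

# Stand-in for a SmartScraperGraph result
raw = {"products": [{"name": "Laptop", "price": 999.99},
                    {"name": "Mouse", "price": 24.50}]}

# model_validate raises ValidationError if the structure is wrong
validated = ProductList.model_validate(raw)
print(validated.products[0].name)  # Laptop
```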
Optional Fields
Make fields optional when data might not be available:
from pydantic import BaseModel, Field
from typing import Optional
class JobPosting(BaseModel):
    title: str = Field(description="Job title")
    company: str = Field(description="Company name")
    location: str = Field(description="Job location")
    salary: Optional[str] = Field(
        default=None,
        description="Salary range if available"
    )
    remote: Optional[bool] = Field(
        default=None,
        description="Whether job is remote"
    )
    description: str = Field(description="Job description")
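If a page omits an optional field, validation still succeeds and the field is set to its default instead of raising an error. A quick sketch, using a trimmed-down JobPosting and a dict standing in for scraped data with no salary information:

```python
from typing import Optional
from pydantic import BaseModel, Field

class JobPosting(BaseModel):
    title: str = Field(description="Job title")
    salary: Optional[str] = Field(
        default=None,
        description="Salary range if available"
    )
    remote: Optional[bool] = Field(
        default=None,
        description="Whether job is remote"
    )

# 'salary' and 'remote' are absent from the data; validation still passes
job = JobPosting.model_validate({"title": "Backend Engineer"})
print(job.salary, job.remote)  # None None
```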
Default Values
Provide defaults for fields:
from pydantic import BaseModel, Field
class Review(BaseModel):
    author: str = Field(description="Reviewer name")
    rating: int = Field(description="Rating from 1-5")
    comment: str = Field(description="Review text")
    verified: bool = Field(
        default=False,
        description="Whether purchase is verified"
    )
    helpful_count: int = Field(
        default=0,
        description="Number of helpful votes"
    )
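When a field with a default is missing from the extracted data, the default value is filled in. A minimal demonstration with a trimmed-down Review model:

```python
from pydantic import BaseModel, Field

class Review(BaseModel):
    author: str = Field(description="Reviewer name")
    rating: int = Field(description="Rating from 1-5")
    verified: bool = Field(default=False, description="Whether purchase is verified")
    helpful_count: int = Field(default=0, description="Number of helpful votes")

# 'verified' and 'helpful_count' are missing, so defaults apply
review = Review.model_validate({"author": "Sam", "rating": 4})
print(review.verified, review.helpful_count)  # False 0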
Field Validation
Add custom validation rules:
from pydantic import BaseModel, Field, field_validator
from typing import List
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Price in USD", gt=0)  # Must be > 0
    rating: float = Field(
        description="Rating from 0-5",
        ge=0,  # Greater than or equal to 0
        le=5   # Less than or equal to 5
    )
    tags: List[str] = Field(description="Product tags")

    @field_validator('price')
    @classmethod
    def price_must_be_positive(cls, v):
        if v <= 0:
            raise ValueError('Price must be positive')
        return v

    @field_validator('tags')
    @classmethod
    def tags_must_not_be_empty(cls, v):
        if not v:
            raise ValueError('At least one tag required')
        return v
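Validators run whenever the model is instantiated or validated, so a malformed extraction fails loudly rather than slipping through. A sketch showing the tags validator rejecting an empty list (assuming Pydantic v2):

```python
from typing import List
from pydantic import BaseModel, Field, ValidationError, field_validator

class Product(BaseModel):
    name: str = Field(description="Product name")
    tags: List[str] = Field(description="Product tags")

    @field_validator('tags')
    @classmethod
    def tags_must_not_be_empty(cls, v):
        if not v:
            raise ValueError('At least one tag required')
        return v

try:
    Product.model_validate({"name": "Laptop", "tags": []})
except ValidationError as exc:
    # Each failed check appears as an entry in exc.errors()
    print(exc.errors()[0]["loc"])  # ('tags',)
```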
Common Patterns
E-commerce Product
from pydantic import BaseModel, Field
from typing import List, Optional
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Current price in USD")
    original_price: Optional[float] = Field(
        default=None,
        description="Original price if discounted"
    )
    description: str = Field(description="Product description")
    images: List[str] = Field(description="List of product image URLs")
    available: bool = Field(description="Whether in stock")
    rating: Optional[float] = Field(
        default=None,
        description="Average rating 0-5"
    )
    review_count: Optional[int] = Field(
        default=0,
        description="Number of reviews"
    )

class Products(BaseModel):
    products: List[Product]
News Article
from pydantic import BaseModel, Field
from typing import List, Optional
class Article(BaseModel):
    title: str = Field(description="Article headline")
    author: str = Field(description="Author name")
    published_date: str = Field(description="Publication date")
    category: str = Field(description="Article category")
    summary: str = Field(description="Brief summary")
    content: str = Field(description="Full article text")
    tags: List[str] = Field(description="Article tags")
    image_url: Optional[str] = Field(
        default=None,
        description="Featured image URL"
    )

class Articles(BaseModel):
    articles: List[Article]
Job Listings
from pydantic import BaseModel, Field
from typing import List, Optional
class Job(BaseModel):
    title: str = Field(description="Job title")
    company: str = Field(description="Company name")
    location: str = Field(description="Job location")
    salary_range: Optional[str] = Field(
        default=None,
        description="Salary range"
    )
    job_type: str = Field(description="Full-time, part-time, contract, etc.")
    remote: bool = Field(description="Whether job is remote")
    description: str = Field(description="Job description")
    requirements: List[str] = Field(description="Required qualifications")
    posted_date: str = Field(description="Date posted")

class JobListings(BaseModel):
    jobs: List[Job]
Restaurant Menu
from pydantic import BaseModel, Field
from typing import List, Optional

class MenuItem(BaseModel):
    name: str = Field(description="Dish name")
    description: str = Field(description="Dish description")
    price: float = Field(description="Price in USD")
    category: str = Field(description="Menu category (appetizer, entree, etc.)")
    dietary: Optional[List[str]] = Field(
        default=None,
        description="Dietary tags (vegetarian, vegan, gluten-free, etc.)"
    )

class Menu(BaseModel):
    restaurant_name: str = Field(description="Restaurant name")
    items: List[MenuItem] = Field(description="All menu items")
Schema with Search Graph
Schemas work with all graph types:
from pydantic import BaseModel, Field
from typing import List
from scrapegraphai.graphs import SearchGraph
class Event(BaseModel):
    name: str = Field(description="Event name")
    date: str = Field(description="Event date")
    location: str = Field(description="Event location")
    description: str = Field(description="Event description")

class Events(BaseModel):
    events: List[Event]

search_config = {
    "llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-..."},
    "verbose": True
}

search_graph = SearchGraph(
    prompt="Find upcoming tech conferences in 2024",
    config=search_config,
    schema=Events
)

result = search_graph.run()
print(result)
Schema with JSON Sources
Schemas also structure data from JSON scraping:
from pydantic import BaseModel, Field
from typing import List
from scrapegraphai.graphs import JSONScraperGraph

class User(BaseModel):
    id: int = Field(description="User ID")
    name: str = Field(description="Full name")
    email: str = Field(description="Email address")

class Users(BaseModel):
    users: List[User]

json_config = {
    "llm": {"model": "openai/gpt-4o-mini", "api_key": "sk-..."},
}

json_scraper = JSONScraperGraph(
    prompt="Extract all user information",
    source="users.json",
    config=json_config,
    schema=Users
)

result = json_scraper.run()
Without Schemas
You can also scrape without defining a schema:
scraper = SmartScraperGraph(
    prompt="Extract product name, price, and description",
    source="https://example.com/product",
    config=graph_config
    # No schema parameter
)

result = scraper.run()
# LLM returns unstructured JSON
print(result)
Without a schema, output structure is less predictable. Schemas are recommended for production use.
Best Practices
Always Add Descriptions
# Good
name: str = Field(description="Product name as shown on the page")

# Bad
name: str
Descriptions guide the LLM on what to extract.
Use Descriptive Field Names
# Good
published_date: str = Field(description="Publication date")
author_name: str = Field(description="Author full name")

# Bad
dt: str
auth: str
Make Optional Fields Explicit
# Good
salary: Optional[str] = Field(default=None, description="Salary if listed")

# Bad (raises error if not found)
salary: str = Field(description="Salary")
Use Wrapper Classes for Lists
# Good
class Products(BaseModel):
    products: List[Product]

# Avoid returning List[Product] directly
Keep Schemas Focused
Don't try to extract everything. Create focused schemas for specific data:
# Good: Focused on products
class Product(BaseModel):
    name: str
    price: float
    available: bool

# Bad: Too many unrelated fields
class Page(BaseModel):
    product_name: str
    product_price: float
    site_title: str
    footer_text: str
    ad_content: str
Troubleshooting
Schema Not Being Followed
Check field descriptions - Make them clear and specific
Simplify the schema - Start simple, add complexity gradually
Verify data exists - Ensure the data is on the page
Use verbose mode - See what’s being sent to the LLM
config = {
    "llm": { ... },
    "verbose": True  # Enable detailed logging
}
Missing Optional Fields
Ensure optional fields have defaults:
# Correct
rating: Optional[float] = Field(default=None, description="Rating")

# Will error if not found
rating: Optional[float] = Field(description="Rating")
Validation Errors
Check field constraints:
price: float = Field(description="Price", gt=0)  # Must be positive
rating: float = Field(description="Rating", ge=0, le=5)  # 0-5 range
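When a constraint fails, the resulting ValidationError lists one entry per violated rule, which helps pinpoint whether the page data or the schema bounds are at fault. A sketch assuming Pydantic v2:

```python
from pydantic import BaseModel, Field, ValidationError

class Product(BaseModel):
    price: float = Field(description="Price", gt=0)
    rating: float = Field(description="Rating", ge=0, le=5)

try:
    Product(price=-1.0, rating=7.0)
except ValidationError as exc:
    # One entry per violated constraint: price > 0 and rating <= 5
    for err in exc.errors():
        print(err["loc"], err["type"])
```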
Next Steps
Configuration: learn about graph configuration
Examples: see complete schema examples
Pydantic Docs: learn more about Pydantic
API Reference: view API documentation