Overview

The web scraping module provides multiple functions for extracting content from URLs with different levels of processing. All functions are located in prediction_market_agent.tools.web_scrape.

Functions

web_scrape (Markdown)

Extracts web content and converts it to clean markdown format with automatic caching and retry logic. Location: prediction_market_agent.tools.web_scrape.markdown
Parameters:
  • url (str, required): The URL to scrape
  • timeout (int, default: 10): Request timeout in seconds
Returns:
  • str | None: Markdown-formatted text content, or None if the request fails or the content is non-HTML
Features:
  • Cached for 1 day using @db_cache decorator
  • Automatic retry with 3 attempts and 1-second delays
  • Removes scripts, styles, images, and non-content elements
  • User-agent spoofing to avoid bot detection
  • Returns None for non-HTML content
from prediction_market_agent.tools.web_scrape.markdown import web_scrape

markdown_content = web_scrape(
    url="https://example.com/article",
    timeout=10
)

if markdown_content:
    print(f"Scraped {len(markdown_content)} characters")
The function uses tenacity for retry logic and markdownify for HTML-to-markdown conversion:
import requests
import tenacity
from requests import Response

@tenacity.retry(
    stop=tenacity.stop_after_attempt(3),
    wait=tenacity.wait_fixed(1),
    reraise=True,
)
def fetch_html(url: str, timeout: int) -> Response:
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0"
    }
    response = requests.get(url, headers=headers, timeout=timeout)
    return response
Elements removed during scraping:
  • <script>, <style>, <noscript> tags
  • <link>, <head> sections
  • <image> and <img> tags
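The actual implementation relies on BeautifulSoup and markdownify; as a rough, dependency-free illustration of the element-removal step, here is a stdlib-only sketch (the class name and sample HTML are invented for illustration):

```python
from html.parser import HTMLParser

# Tags whose text content is dropped, mirroring the list above.
# <img>/<link> carry no text, so a text extractor skips them implicitly.
DROPPED = {"script", "style", "noscript", "head"}


class ContentExtractor(HTMLParser):
    """Collects text while skipping non-content elements."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a dropped element
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in DROPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


html = "<head><style>p{}</style></head><body><p>Hello</p><script>x()</script></body>"
parser = ContentExtractor()
parser.feed(html)
print(" ".join(parser.chunks))  # -> Hello
```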

web_scrape (Basic Summary)

Scrapes a URL and automatically summarizes content exceeding 10,000 characters using an LLM. Location: prediction_market_agent.tools.web_scrape.basic_summary
Parameters:
  • objective (str, required): The objective that defines what content to extract and summarize
  • url (str, required): The URL of the website to scrape
Returns:
  • str: Extracted text content, summarized if longer than 10,000 characters
Summary Process:
  • Uses LangChain’s load_summarize_chain with map-reduce strategy
  • Splits text into 10,000 character chunks with 500 character overlap
  • Employs OpenAI’s GPT model (configurable via DEFAULT_OPENAI_MODEL)
  • Temperature set to 0 for consistent results
from prediction_market_agent.tools.web_scrape.basic_summary import web_scrape

result = web_scrape(
    objective="Extract information about prediction markets",
    url="https://example.com/long-article"
)
The summary function uses the DEFAULT_OPENAI_MODEL from prediction_market_agent.utils and requires OPENAI_API_KEY to be set in the environment.
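The splitting itself is delegated to LangChain, but the chunking arithmetic can be sketched in plain Python. The function name below is hypothetical, and LangChain's real splitter additionally prefers natural boundaries such as paragraph breaks:

```python
def split_with_overlap(text: str, chunk_size: int = 10_000, overlap: int = 500) -> list[str]:
    """Character-level approximation of the pre-summarization split:
    fixed-size windows where each chunk repeats the last `overlap`
    characters of its predecessor."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


text = "".join(str(i % 10) for i in range(25_000))
chunks = split_with_overlap(text)
print(len(chunks), len(chunks[0]))  # 3 10000
```

Each chunk is summarized independently (the "map" step) and the partial summaries are then combined (the "reduce" step).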

web_scrape_structured

Scrapes content while preserving HTML structure and hierarchy. Location: prediction_market_agent.tools.web_scrape.structured_summary
Parameters:
  • url (str, required): The URL to scrape (automatically prefixed with https:// if the protocol is missing)
  • remove_a_links (bool): Whether to remove anchor tags from the output
Returns:
  • str: Structured text content with preserved hierarchy
Output Format: Unlike plain text scrapers, this preserves hierarchical structure ideal for tables and nested content:
A Historical look at Gnosis, GNO's price
    GNO/USD Pair
        GNO
        USD
        16 January 2021
        106.76
        GNO
        USD
        16 January 2022
        398.11
from prediction_market_agent.tools.web_scrape.structured_summary import (
    web_scrape_structured
)

structured_content = web_scrape_structured(
    url="https://example.com/data-page",
    remove_a_links=True
)
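For intuition on how nesting depth becomes indentation, here is a minimal stdlib sketch; it is not the module's actual implementation (which uses BeautifulSoup), and it assumes well-formed HTML without void tags such as `<br>`:

```python
from html.parser import HTMLParser


class OutlineExtractor(HTMLParser):
    """Emits text indented by its nesting depth, roughly mirroring
    the structured output shown above. Assumes every opened tag is
    closed (void tags would inflate the depth counter)."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0
        self.lines: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append("    " * (self.depth - 1) + text)


html = "<div><h2>GNO/USD Pair</h2><ul><li>16 January 2021</li><li>106.76</li></ul></div>"
p = OutlineExtractor()
p.feed(html)
print("\n".join(p.lines))
```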

web_scrape_structured_and_summarized

Combines structured scraping with LLM-based summarization. Location: prediction_market_agent.tools.web_scrape.structured_summary
Parameters:
  • objective (str, required): The objective defining what information to extract
  • url (str, required): The URL to scrape
  • remove_a_links (bool): Whether to remove anchor tags
Returns:
  • str: Summarized structured content focused on the objective

WebScrapingTool Class

A tool wrapper that exposes web scraping to microchain agents via LLM function calling. Location: prediction_market_agent.tools.web_scrape.basic_summary

Schema

web_scraping_schema = {
    "type": "function",
    "function": {
        "name": "web_scraping",
        "parameters": {
            "type": "object",
            "properties": {
                "objective": {
                    "type": "string",
                    "description": "The objective that defines the content to be scraped from the website.",
                },
                "url": {
                    "type": "string",
                    "description": "The URL of the website to be scraped.",
                },
            },
            "required": ["objective", "url"],
        },
        "description": "Web scrape a URL to retrieve information relevant to the objective.",
    },
}

Usage

from prediction_market_agent.tools.web_scrape.basic_summary import WebScrapingTool

tool = WebScrapingTool()
result = tool.fn(objective="...", url="...")
schema = tool.schema
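In a function-calling loop, the model's tool call has to be routed back to tool.fn. Below is a minimal sketch of such a dispatcher, assuming an OpenAI-style payload of {"name": ..., "arguments": "<json>"}; the dispatcher and the stub tool are illustrative, not part of the package:

```python
import json


def dispatch_tool_call(tool_call: dict, tools: dict) -> str:
    """Route an OpenAI-style tool call {"name": ..., "arguments": "<json>"}
    to the matching tool's fn. `tools` maps tool names to objects
    exposing a .fn callable, as WebScrapingTool does."""
    fn = tools[tool_call["name"]].fn
    kwargs = json.loads(tool_call["arguments"])
    return fn(**kwargs)


# Stub standing in for WebScrapingTool, for illustration only.
class FakeTool:
    @staticmethod
    def fn(objective: str, url: str) -> str:
        return f"scraped {url} for: {objective}"


result = dispatch_tool_call(
    {"name": "web_scraping", "arguments": '{"objective": "markets", "url": "https://example.com"}'},
    {"web_scraping": FakeTool()},
)
print(result)  # scraped https://example.com for: markets
```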

Helper Functions

fetch_html

Location: prediction_market_agent.tools.web_scrape.markdown Internal helper with automatic retry logic.
Parameters:
  • url (str, required): URL to fetch
  • timeout (int, required): Request timeout in seconds
Returns:
  • Response: HTTP response object from the requests library

clean_soup

Location: prediction_market_agent.tools.web_scrape.structured_summary Cleans BeautifulSoup Tag objects for structured extraction.
Parameters:
  • soup (Tag, required): BeautifulSoup Tag object to clean
  • remove_a_links (bool): Whether to remove anchor elements
Returns:
  • Tag: Cleaned Tag object
Cleaning operations:
  • Removes all attributes except href
  • Removes noscript, script, style tags
  • Optionally removes anchor tags
  • Removes HTML comments
  • Removes empty elements

Error Handling

All scraping functions can raise:
  • requests.RequestException - Network or HTTP errors
  • requests.Timeout - Request timeout exceeded
  • requests.HTTPError - HTTP error responses (404, 403, etc.)

Exception Handler

Use the tool exception handler for graceful error handling:
from prediction_market_agent.tools.tool_exception_handler import tool_exception_handler
from prediction_market_agent.tools.web_scrape.structured_summary import web_scrape_structured
import requests

web_scrape_safe = tool_exception_handler(
    map_exception_to_output={
        requests.exceptions.HTTPError: "Couldn't reach the URL.",
        requests.exceptions.Timeout: "Request timed out.",
    }
)(web_scrape_structured)

result = web_scrape_safe(url="https://example.com")
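Conceptually, tool_exception_handler maps listed exception types to friendly output strings. A minimal stand-in decorator with the same shape (illustrative only; the real implementation may differ):

```python
import functools


def exception_handler(map_exception_to_output: dict):
    """Returns a decorator that converts the listed exception types
    into friendly strings. Exact subclasses not present in the map
    would still raise, which keeps the sketch simple."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except tuple(map_exception_to_output) as e:
                return map_exception_to_output[type(e)]

        return wrapper

    return decorator


@exception_handler({TimeoutError: "Request timed out."})
def flaky(url: str) -> str:
    raise TimeoutError


print(flaky("https://example.com"))  # Request timed out.
```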

Configuration

Environment Variables

  • OPENAI_API_KEY (str, required): Needed for the summarization features in the basic_summary and structured_summary modules

Database Cache

The markdown scraper uses @db_cache which requires:
  • SQLALCHEMY_DB_URL (str, optional): Database URL for caching; an in-memory cache is used if not set
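db_cache itself is backed by the database above; for intuition only, a time-based in-memory equivalent of the 1-day expiry might look like this (illustrative sketch, not the package's implementation):

```python
import functools
import time


def ttl_cache(seconds: float):
    """In-memory stand-in for @db_cache's 1-day expiry: memoizes by
    positional arguments and recomputes entries older than `seconds`."""

    def decorator(fn):
        store: dict = {}

        @functools.wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and now - hit[0] < seconds:
                return hit[1]  # fresh cache hit
            value = fn(*args)
            store[args] = (now, value)
            return value

        return wrapper

    return decorator


calls = []


@ttl_cache(seconds=86_400)  # one day
def fetch(url: str) -> str:
    calls.append(url)
    return f"content of {url}"


fetch("https://example.com")
fetch("https://example.com")
print(len(calls))  # 1 -- the second call is served from the cache
```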

Best Practices

Choose the Right Tool

  • Use markdown scraper for general content
  • Use basic_summary for long articles
  • Use structured for tables and hierarchical data

Handle Errors Gracefully

Always wrap scraping calls with error handlers or use tool_exception_handler

Respect Timeouts

Set appropriate timeouts based on expected page load times (default: 10s)

Cache Results

Use the markdown scraper’s built-in caching or implement your own for custom scrapers

Dependencies

pip install requests beautifulsoup4 markdownify tenacity langchain langchain-openai
