Overview

The web scraping module provides multiple functions for extracting content from URLs with different levels of processing. All functions are located in prediction_market_agent.tools.web_scrape.

Functions

web_scrape (Markdown)

Extracts web content and converts it to clean markdown format with automatic caching and retry logic. Location: prediction_market_agent.tools.web_scrape.markdown
Parameters:
  • url (str, required): The URL to scrape
  • timeout (int, default: 10): Request timeout in seconds
Returns:
  • str | None: Markdown-formatted text content, or None if the request fails or the content is non-HTML
Features:
  • Cached for 1 day using @db_cache decorator
  • Automatic retry with 3 attempts and 1-second delays
  • Removes scripts, styles, images, and non-content elements
  • User-agent spoofing to avoid bot detection
  • Returns None for non-HTML content
from prediction_market_agent.tools.web_scrape.markdown import web_scrape

markdown_content = web_scrape(
    url="https://example.com/article",
    timeout=10
)

if markdown_content:
    print(f"Scraped {len(markdown_content)} characters")
The function uses tenacity for retry logic and markdownify for HTML-to-markdown conversion:
import requests
import tenacity
from requests import Response

@tenacity.retry(
    stop=tenacity.stop_after_attempt(3),
    wait=tenacity.wait_fixed(1),
    reraise=True,
)
def fetch_html(url: str, timeout: int) -> Response:
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0"
    }
    response = requests.get(url, headers=headers, timeout=timeout)
    return response
Elements removed during scraping:
  • <script>, <style>, <noscript> tags
  • <link>, <head> sections
  • <image> and <img> tags
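The actual implementation relies on BeautifulSoup and markdownify; as a rough, dependency-free illustration of the element-removal step, here is a stdlib-only sketch (the class name and sample HTML are invented for illustration):

```python
from html.parser import HTMLParser

# Tags whose text content is dropped, mirroring the list above.
# <img>/<link> carry no text, so a text extractor skips them implicitly.
DROPPED = {"script", "style", "noscript", "head"}


class ContentExtractor(HTMLParser):
    """Collects text while skipping non-content elements."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a dropped element
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in DROPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


html = "<head><style>p{}</style></head><body><p>Hello</p><script>x()</script></body>"
parser = ContentExtractor()
parser.feed(html)
print(" ".join(parser.chunks))  # -> Hello
```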

web_scrape (Basic Summary)

Scrapes a URL and automatically summarizes content exceeding 10,000 characters using an LLM. Location: prediction_market_agent.tools.web_scrape.basic_summary
Parameters:
  • objective (str, required): The objective that defines what content to extract and summarize
  • url (str, required): The URL of the website to scrape
Returns:
  • str: Extracted text content, summarized if longer than 10,000 characters
Summary Process:
  • Uses LangChain’s load_summarize_chain with map-reduce strategy
  • Splits text into 10,000 character chunks with 500 character overlap
  • Employs OpenAI’s GPT model (configurable via DEFAULT_OPENAI_MODEL)
  • Temperature set to 0 for consistent results
from prediction_market_agent.tools.web_scrape.basic_summary import web_scrape

result = web_scrape(
    objective="Extract information about prediction markets",
    url="https://example.com/long-article"
)
The summary function uses the DEFAULT_OPENAI_MODEL from prediction_market_agent.utils and requires OPENAI_API_KEY to be set in the environment.
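The splitting itself is delegated to LangChain, but the chunking arithmetic can be sketched in plain Python. The function name below is hypothetical, and LangChain's real splitter additionally prefers natural boundaries such as paragraph breaks:

```python
def split_with_overlap(text: str, chunk_size: int = 10_000, overlap: int = 500) -> list[str]:
    """Character-level approximation of the pre-summarization split:
    fixed-size windows where each chunk repeats the last `overlap`
    characters of its predecessor."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


text = "".join(str(i % 10) for i in range(25_000))
chunks = split_with_overlap(text)
print(len(chunks), len(chunks[0]))  # 3 10000
```

Each chunk is summarized independently (the "map" step) and the partial summaries are then combined (the "reduce" step).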

web_scrape_structured

Scrapes content while preserving HTML structure and hierarchy. Location: prediction_market_agent.tools.web_scrape.structured_summary
Parameters:
  • url (str, required): The URL to scrape (automatically prefixed with https:// if the protocol is missing)
  • remove_a_links (bool): Whether to remove anchor tags from the output
Returns:
  • str: Structured text content with preserved hierarchy
Output Format: Unlike plain text scrapers, this preserves hierarchical structure ideal for tables and nested content:
A Historical look at Gnosis, GNO's price
    GNO/USD Pair
        GNO
        USD
        16 January 2021
        106.76
        GNO
        USD
        16 January 2022
        398.11
from prediction_market_agent.tools.web_scrape.structured_summary import (
    web_scrape_structured
)

structured_content = web_scrape_structured(
    url="https://example.com/data-page",
    remove_a_links=True
)
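For intuition on how nesting depth becomes indentation, here is a minimal stdlib sketch; it is not the module's actual implementation (which uses BeautifulSoup), and it assumes well-formed HTML without void tags such as `<br>`:

```python
from html.parser import HTMLParser


class OutlineExtractor(HTMLParser):
    """Emits text indented by its nesting depth, roughly mirroring
    the structured output shown above. Assumes every opened tag is
    closed (void tags would inflate the depth counter)."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0
        self.lines: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append("    " * (self.depth - 1) + text)


html = "<div><h2>GNO/USD Pair</h2><ul><li>16 January 2021</li><li>106.76</li></ul></div>"
p = OutlineExtractor()
p.feed(html)
print("\n".join(p.lines))
```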

web_scrape_structured_and_summarized

Combines structured scraping with LLM-based summarization. Location: prediction_market_agent.tools.web_scrape.structured_summary
Parameters:
  • objective (str, required): The objective defining what information to extract
  • url (str, required): The URL to scrape
  • remove_a_links (bool): Whether to remove anchor tags
Returns:
  • str: Summarized structured content focused on the objective

WebScrapingTool Class

A tool wrapper that exposes web scraping to microchain agents via LLM function calling. Location: prediction_market_agent.tools.web_scrape.basic_summary

Schema

web_scraping_schema = {
    "type": "function",
    "function": {
        "name": "web_scraping",
        "parameters": {
            "type": "object",
            "properties": {
                "objective": {
                    "type": "string",
                    "description": "The objective that defines the content to be scraped from the website.",
                },
                "url": {
                    "type": "string",
                    "description": "The URL of the website to be scraped.",
                },
            },
            "required": ["objective", "url"],
        },
        "description": "Web scrape a URL to retrieve information relevant to the objective.",
    },
}

Usage

from prediction_market_agent.tools.web_scrape.basic_summary import WebScrapingTool

tool = WebScrapingTool()
result = tool.fn(objective="...", url="...")
schema = tool.schema
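In a function-calling loop, the model's tool call has to be routed back to tool.fn. Below is a minimal sketch of such a dispatcher, assuming an OpenAI-style payload of {"name": ..., "arguments": "<json>"}; the dispatcher and the stub tool are illustrative, not part of the package:

```python
import json


def dispatch_tool_call(tool_call: dict, tools: dict) -> str:
    """Route an OpenAI-style tool call {"name": ..., "arguments": "<json>"}
    to the matching tool's fn. `tools` maps tool names to objects
    exposing a .fn callable, as WebScrapingTool does."""
    fn = tools[tool_call["name"]].fn
    kwargs = json.loads(tool_call["arguments"])
    return fn(**kwargs)


# Stub standing in for WebScrapingTool, for illustration only.
class FakeTool:
    @staticmethod
    def fn(objective: str, url: str) -> str:
        return f"scraped {url} for: {objective}"


result = dispatch_tool_call(
    {"name": "web_scraping", "arguments": '{"objective": "markets", "url": "https://example.com"}'},
    {"web_scraping": FakeTool()},
)
print(result)  # scraped https://example.com for: markets
```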

Helper Functions

fetch_html

Location: prediction_market_agent.tools.web_scrape.markdown Internal helper with automatic retry logic.
Parameters:
  • url (str, required): URL to fetch
  • timeout (int, required): Request timeout in seconds
Returns:
  • Response: HTTP response object from the requests library

clean_soup

Location: prediction_market_agent.tools.web_scrape.structured_summary Cleans BeautifulSoup Tag objects for structured extraction.
Parameters:
  • soup (Tag, required): BeautifulSoup Tag object to clean
  • remove_a_links (bool): Whether to remove anchor elements
Returns:
  • Tag: Cleaned Tag object
Cleaning operations:
  • Removes all attributes except href
  • Removes noscript, script, style tags
  • Optionally removes anchor tags
  • Removes HTML comments
  • Removes empty elements

Error Handling

All scraping functions can raise:
  • requests.RequestException - Network or HTTP errors
  • requests.Timeout - Request timeout exceeded
  • requests.HTTPError - HTTP error responses (404, 403, etc.)

Exception Handler

Use the tool exception handler for graceful error handling:
from prediction_market_agent.tools.tool_exception_handler import tool_exception_handler
from prediction_market_agent.tools.web_scrape.structured_summary import web_scrape_structured
import requests

web_scrape_safe = tool_exception_handler(
    map_exception_to_output={
        requests.exceptions.HTTPError: "Couldn't reach the URL.",
        requests.exceptions.Timeout: "Request timed out.",
    }
)(web_scrape_structured)

result = web_scrape_safe(url="https://example.com")
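Conceptually, tool_exception_handler maps listed exception types to friendly output strings. A minimal stand-in decorator with the same shape (illustrative only; the real implementation may differ):

```python
import functools


def exception_handler(map_exception_to_output: dict):
    """Returns a decorator that converts the listed exception types
    into friendly strings. Exact subclasses not present in the map
    would still raise, which keeps the sketch simple."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except tuple(map_exception_to_output) as e:
                return map_exception_to_output[type(e)]

        return wrapper

    return decorator


@exception_handler({TimeoutError: "Request timed out."})
def flaky(url: str) -> str:
    raise TimeoutError


print(flaky("https://example.com"))  # Request timed out.
```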

Configuration

Environment Variables

  • OPENAI_API_KEY (str, required): Needed for the summarization features in the basic_summary and structured_summary modules

Database Cache

The markdown scraper uses @db_cache which requires:
  • SQLALCHEMY_DB_URL (str, optional): Database URL for caching; an in-memory cache is used if not set
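db_cache itself is backed by the database above; for intuition only, a time-based in-memory equivalent of the 1-day expiry might look like this (illustrative sketch, not the package's implementation):

```python
import functools
import time


def ttl_cache(seconds: float):
    """In-memory stand-in for @db_cache's 1-day expiry: memoizes by
    positional arguments and recomputes entries older than `seconds`."""

    def decorator(fn):
        store: dict = {}

        @functools.wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and now - hit[0] < seconds:
                return hit[1]  # fresh cache hit
            value = fn(*args)
            store[args] = (now, value)
            return value

        return wrapper

    return decorator


calls = []


@ttl_cache(seconds=86_400)  # one day
def fetch(url: str) -> str:
    calls.append(url)
    return f"content of {url}"


fetch("https://example.com")
fetch("https://example.com")
print(len(calls))  # 1 -- the second call is served from the cache
```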

Best Practices

Choose the Right Tool

  • Use markdown scraper for general content
  • Use basic_summary for long articles
  • Use structured for tables and hierarchical data

Handle Errors Gracefully

Always wrap scraping calls with error handlers or use tool_exception_handler

Respect Timeouts

Set appropriate timeouts based on expected page load times (default: 10s)

Cache Results

Use the markdown scraper’s built-in caching or implement your own for custom scrapers

Dependencies

pip install requests beautifulsoup4 markdownify tenacity langchain langchain-openai
