Overview
The web scraping module provides multiple functions for extracting content from URLs with different levels of processing. All functions are located in prediction_market_agent.tools.web_scrape.
Functions
web_scrape (Markdown)
Extracts web content and converts it to clean markdown format with automatic caching and retry logic. Location: prediction_market_agent.tools.web_scrape.markdown
The URL to scrape
Request timeout in seconds
Markdown-formatted text content, or None if the request fails or content is non-HTML
- Cached for 1 day using the @db_cache decorator
- Automatic retry with 3 attempts and 1-second delays
- Removes scripts, styles, images, and non-content elements
- User-agent spoofing to avoid bot detection
- Returns None for non-HTML content
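The retry behaviour listed above (3 attempts, 1-second delays) can be pictured with a small standard-library sketch. This is an illustration only, not the module's own code; `retry_fixed` and `flaky_fetch` are hypothetical names introduced here, and the `sleep` parameter exists only so the demo can record the waits instead of actually pausing.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_fixed(
    func: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds, waiting a fixed delay between attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            # Once every attempt is used up, let the last failure propagate.
            if attempt == attempts:
                raise
            sleep(delay_seconds)
    raise AssertionError("unreachable")


# Demo: a hypothetical fetcher that fails twice, then succeeds.
calls = {"count": 0}
waits: List[float] = []


def flaky_fetch() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


# Record the requested delays instead of sleeping for real.
result = retry_fixed(flaky_fetch, sleep=waits.append)
```

With the defaults, a call that keeps failing raises its last exception after the third attempt, which a caller can translate into a None return.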
Implementation Details
The function uses tenacity for retry logic and markdownify for HTML-to-markdown conversion.
Elements removed during scraping:
- <script>, <style>, <noscript> tags
- <link>, <head> sections
- <image> and <img> tags
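To make the effect of removing these elements concrete, here is a rough sketch that drops the same kinds of tags and keeps only the visible text. It uses Python's built-in `html.parser` as a stand-in and does not convert to markdown, so it is not how the module itself works; `VisibleTextExtractor` and `SKIPPED_CONTAINERS` are hypothetical names.

```python
from html.parser import HTMLParser

# Paired tags whose entire contents are dropped.
SKIPPED_CONTAINERS = {"script", "style", "noscript", "head"}


class VisibleTextExtractor(HTMLParser):
    """Collect text that is not nested inside a skipped container."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Void tags such as <link>, <img> and <image> carry no text,
        # so ignoring them here is enough to leave them out.
        if tag in SKIPPED_CONTAINERS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_CONTAINERS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._parts)


sample_html = (
    "<html><head><title>T</title><link rel='stylesheet' href='s.css'></head>"
    "<body><script>var a = 1;</script><h1>Title</h1><img src='a.png'>"
    "<p>Body text</p><noscript>Enable JS</noscript></body></html>"
)

extractor = VisibleTextExtractor()
extractor.feed(sample_html)
extractor.close()
visible_text = extractor.text()
```

For the sample page, only the heading and paragraph text survive; the title, script body, and noscript fallback are discarded.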
web_scrape (Basic Summary)
Scrapes a URL and automatically summarizes content exceeding 10,000 characters using an LLM. Location: prediction_market_agent.tools.web_scrape.basic_summary
The objective that defines what content to extract and summarize
The URL of the website to scrape
Extracted text content, summarized if longer than 10,000 characters
- Uses LangChain’s load_summarize_chain with map-reduce strategy
- Splits text into 10,000 character chunks with 500 character overlap
- Employs OpenAI’s GPT model (configurable via DEFAULT_OPENAI_MODEL)
- Temperature set to 0 for consistent results
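The chunking step above (10,000 character chunks, 500 character overlap) can be approximated with plain slicing. The sketch below is a simplification under that assumption: real text splitters typically also try to break on separators such as paragraphs, and `split_with_overlap` is a hypothetical helper, not a function from the module or from LangChain.

```python
def split_with_overlap(
    text: str, chunk_size: int = 10_000, overlap: int = 500
) -> list[str]:
    """Slice text into fixed-size chunks that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each new chunk advances
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Demo on a 25,000 character string of repeating digits.
long_text = "".join(str(i % 10) for i in range(25_000))
chunks = split_with_overlap(long_text)
chunk_lengths = [len(chunk) for chunk in chunks]
```

In a map-reduce summary, each chunk would be summarized on its own and the partial summaries combined; the overlap reduces the chance that a sentence cut at a chunk boundary is lost.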
The summary function uses the DEFAULT_OPENAI_MODEL from prediction_market_agent.utils and requires OPENAI_API_KEY to be set in the environment.
web_scrape_structured
Scrapes content while preserving HTML structure and hierarchy. Location: prediction_market_agent.tools.web_scrape.structured_summary
The URL to scrape (automatically prefixed with https:// if the protocol is missing)
Whether to remove anchor tags from the output
Structured text content with preserved hierarchy
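The automatic https:// prefixing mentioned for the URL parameter can be sketched as follows. How the module actually detects a missing protocol is not shown in this reference, so the check below is an assumption, and `ensure_https_prefix` is a hypothetical name.

```python
def ensure_https_prefix(url: str) -> str:
    """Return url unchanged if it already has an HTTP(S) scheme, else add https://."""
    url = url.strip()
    # Assumption: only http:// and https:// count as "protocol present".
    if url.startswith(("http://", "https://")):
        return url
    return "https://" + url


normalized = ensure_https_prefix("example.com/markets")
```

URLs that already carry a scheme pass through untouched, so the helper is safe to apply more than once.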
web_scrape_structured_and_summarized
Combines structured scraping with LLM-based summarization. Location: prediction_market_agent.tools.web_scrape.structured_summary
The objective defining what information to extract
The URL to scrape
Whether to remove anchor tags
Summarized structured content focused on the objective
WebScrapingTool Class
Function calling tool for microchain agents and LLM function calling. Location: prediction_market_agent.tools.web_scrape.basic_summary
Schema
Usage
Helper Functions
fetch_html
Location: prediction_market_agent.tools.web_scrape.markdown
Internal helper with automatic retry logic.
URL to fetch
Request timeout in seconds
HTTP response object from the requests library
clean_soup
Location: prediction_market_agent.tools.web_scrape.structured_summary
Cleans BeautifulSoup Tag objects for structured extraction.
BeautifulSoup Tag object to clean
Whether to remove anchor elements
Cleaned Tag object
- Removes all attributes except href
- Removes noscript, script, style tags
- Optionally removes anchor tags
- Removes HTML comments
- Removes empty elements
Error Handling
Exception Handler
Use the tool exception handler for graceful error handling.
Configuration
Environment Variables
Required for summarization features in the basic_summary and structured_summary modules
Database Cache
The markdown scraper uses @db_cache, which requires:
Database URL for caching (optional, uses in-memory cache if not set)
Best Practices
Choose the Right Tool
- Use markdown scraper for general content
- Use basic_summary for long articles
- Use structured for tables and hierarchical data
Handle Errors Gracefully
Always wrap scraping calls with error handlers or use tool_exception_handler
Respect Timeouts
Set appropriate timeouts based on expected page load times (default: 10s)
Cache Results
Use the markdown scraper’s built-in caching or implement your own for custom scrapers
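For a custom scraper, one way to implement your own caching is an in-memory cache with a one-day expiry, mirroring the 1-day lifetime described for the markdown scraper. This is a minimal sketch and not the @db_cache implementation; `ttl_cache`, `scrape`, and the injectable `clock` are hypothetical and exist only for illustration.

```python
import functools
import time
from typing import Callable, Dict, Optional, Tuple

ONE_DAY_SECONDS = 24 * 60 * 60


def ttl_cache(max_age_seconds: float, clock: Callable[[], float] = time.monotonic):
    """Cache results per URL and refetch once an entry is older than max_age_seconds."""

    def decorator(func: Callable[[str], Optional[str]]):
        store: Dict[str, Tuple[float, Optional[str]]] = {}

        @functools.wraps(func)
        def wrapper(url: str) -> Optional[str]:
            now = clock()
            cached = store.get(url)
            if cached is not None and now - cached[0] < max_age_seconds:
                return cached[1]  # still fresh, skip the fetch
            value = func(url)
            store[url] = (now, value)
            return value

        return wrapper

    return decorator


# Demo with a fake clock so expiry can be shown without waiting a day.
fake_now = [0.0]
fetch_count = {"n": 0}


@ttl_cache(ONE_DAY_SECONDS, clock=lambda: fake_now[0])
def scrape(url: str) -> Optional[str]:
    fetch_count["n"] += 1
    return f"content of {url}"


first = scrape("https://example.com")
second = scrape("https://example.com")  # served from the cache
fake_now[0] += ONE_DAY_SECONDS + 1
third = scrape("https://example.com")  # entry expired, fetched again
```

Note that this sketch also caches None results; whether failed requests should be cached is a design choice to make per scraper.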
Dependencies
See Also
- Search Tools API - For finding URLs to scrape
- LLM Utils API - For processing scraped content
- Web Scraping Guide - Detailed usage guide and examples