The Gnosis Prediction Market Agent provides multiple web scraping tools for extracting content from URLs. These tools offer different levels of processing, from raw text extraction to structured summaries, making them suitable for various use cases in market research and prediction analysis.
The basic web scraping tool extracts text from a URL and optionally summarizes it using GPT if the content exceeds 10,000 characters.
```python
from prediction_market_agent.tools.web_scrape.basic_summary import web_scrape

# Scrape and auto-summarize if needed
result = web_scrape(
    objective="Extract information about prediction markets",
    url="https://example.com/article",
)
```
Implementation Details
The web_scrape function uses:
- **BeautifulSoup** for HTML parsing
- **LangChain's map-reduce chain** for intelligent summarization
- **Recursive text splitting** with 10,000-character chunks and 500-character overlap
- **OpenAI GPT** for generating objective-focused summaries
```python
import bs4
import requests


def web_scrape(objective: str, url: str) -> str:
    response = requests.get(url)
    response.raise_for_status()
    soup = bs4.BeautifulSoup(response.content, "html.parser")
    text: str = soup.get_text()
    if len(text) > 10000:
        text = _summary(objective, text)
    return text
```
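The `_summary` helper is not shown above; it delegates to LangChain's map-reduce chain. As an illustration of the chunk arithmetic alone, a standalone splitter with the same 10,000-character chunks and 500-character overlap might look like this (a sketch, not the actual LangChain splitter):

```python
def split_text(text: str, chunk_size: int = 10_000, overlap: int = 500) -> list[str]:
    # Each chunk starts `chunk_size - overlap` characters after the previous one,
    # so consecutive chunks share `overlap` characters of context.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap gives the summarizer shared context at chunk boundaries, so sentences split across a boundary still appear whole in at least one chunk.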
This tool converts HTML content to clean markdown format with automatic retry logic and caching.
```python
from prediction_market_agent.tools.web_scrape.markdown import web_scrape

# Scrape and convert to markdown
markdown_content = web_scrape(
    url="https://example.com/page",
    timeout=10,
)
```
This function is cached for 1 day using db_cache and includes automatic retries with exponential backoff.
Key Features
- **Automatic retry:** Up to 3 attempts with 1-second delays
- **Database caching:** Results cached for 24 hours
- **Clean extraction:** Removes scripts, styles, images, and other non-content elements
- **Markdown conversion:** Uses markdownify for clean text formatting
- **User-agent spoofing:** Mimics Firefox browser to avoid bot detection
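The retry behaviour described above can be sketched as a small decorator. This is an illustration of the pattern, not the package's actual implementation; `with_retries` and its parameters are hypothetical names:

```python
import time
from functools import wraps


def with_retries(attempts: int = 3, delay: float = 1.0):
    # Retry a function up to `attempts` times, sleeping `delay` seconds
    # between failed tries, and re-raise the last exception if all fail.
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            last_exc = None
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    last_exc = exc
                    if attempt < attempts:
                        time.sleep(delay)
            raise last_exc
        return wrapper
    return decorator
```

Combined with a persistent cache such as db_cache, repeated calls for the same URL within the cache window never hit the network at all.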
To preserve document structure, use the structured scraping tool, which maintains hierarchy and optionally summarizes the content.
```python
from prediction_market_agent.tools.web_scrape.structured_summary import (
    web_scrape_structured,
    web_scrape_structured_and_summarized,
)

# Get structured content
structured = web_scrape_structured(
    url="https://example.com",
    remove_a_links=True,
)

# Get structured and summarized content
summary = web_scrape_structured_and_summarized(
    objective="Analyze market trends",
    url="https://example.com",
    remove_a_links=True,
)
```
Output Format
Unlike other scrapers that return plain text, this tool preserves hierarchical structure:
```
A Historical look at Gnosis, GNO's price
  GNO/USD Pair
    GNO USD 16 January 2021 106.76
    GNO USD 16 January 2022 398.11
```
This format is ideal for extracting structured data like tables, lists, and nested content.
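For example, the GNO price rows in the output above can be pulled into (date, price) pairs with a simple regex. This is an illustration of consuming the structured text; the pattern is specific to this example, not part of the package:

```python
import re


def extract_gno_prices(structured_text: str) -> list[tuple[str, float]]:
    # Match rows like "GNO USD 16 January 2021 106.76" and capture date + price.
    pattern = re.compile(r"GNO USD (\d{1,2} \w+ \d{4}) (\d+(?:\.\d+)?)")
    return [(date, float(price)) for date, price in pattern.findall(structured_text)]
```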
Cleaning Process
```python
from bs4 import Comment, Tag


def clean_soup(soup: Tag, remove_a_links: bool) -> Tag:
    # Remove all attributes except href
    for tag in soup.findAll(lambda x: len(x.attrs) > 0):
        tag.attrs = {k: v for k, v in tag.attrs.items() if k == "href"}

    # Remove unwanted tags
    tags_to_remove = ["noscript", "script", "style"]
    if remove_a_links:
        tags_to_remove.append("a")
    for element_name in tags_to_remove:
        for element in soup.select(element_name):
            element.extract()

    # Remove comments and empty elements
    for element in soup(text=lambda text: isinstance(text, Comment)):
        element.extract()
    for element in soup.find_all():
        if len(element.get_text(strip=True)) == 0:
            element.extract()

    return soup
```
The AdvancedAgent demonstrates real-world usage of web scraping for market predictions:
```python
from prediction_market_agent.tools.web_scrape.markdown import web_scrape
from prediction_market_agent_tooling.tools.google_utils import search_google_serper


class AdvancedAgent(DeployableTraderAgent):
    def answer_binary_market(self, market: AgentMarket) -> ProbabilisticAnswer | None:
        # Search for results on Google
        google_results = search_google_serper(market.question)

        # Filter out Manifold results
        google_results = [url for url in google_results if "manifold" not in url]
        if not google_results:
            return None

        # Scrape and truncate content
        contents = [
            scraped[:10000]
            for url in google_results[:5]
            if (scraped := web_scrape(url))
        ]
        if not contents:
            return None

        # Use LLM to analyze and predict
        probability, confidence = llm(market.question, contents)
        return ProbabilisticAnswer(
            confidence=confidence,
            p_yes=Probability(probability),
            reasoning="I asked Google and LLM to do it!",
        )
```
1. **Search**: Use Google to find relevant URLs for the market question
2. **Filter**: Remove duplicate or low-quality sources
3. **Scrape**: Extract content from top URLs using web_scrape
4. **Analyze**: Feed scraped content to LLM for probability estimation
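The scrape-and-truncate step in the agent above can be isolated into a small helper (`collect_contents` is a hypothetical name; the agent inlines this logic):

```python
def collect_contents(urls, scrape, max_urls=5, max_chars=10_000):
    # Scrape up to `max_urls` URLs, drop empty or failed results,
    # and truncate each page so the LLM prompt stays bounded.
    return [
        scraped[:max_chars]
        for url in urls[:max_urls]
        if (scraped := scrape(url))
    ]
```

Dropping falsy results via the walrus condition means a scraper that returns an empty string on failure never contributes a blank document to the prompt.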
The scraping tool can also be exposed to LLM function calling through a JSON schema:

```python
web_scraping_schema = {
    "type": "function",
    "function": {
        "name": "web_scraping",
        "parameters": {
            "type": "object",
            "properties": {
                "objective": {
                    "type": "string",
                    "description": "The objective that defines the content to be scraped from the website.",
                },
                "url": {
                    "type": "string",
                    "description": "The URL of the website to be scraped.",
                },
            },
            "required": ["objective", "url"],
        },
        "description": "Web scrape a URL to retrieve information relevant to the objective.",
    },
}
```
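When the model emits a call matching this schema, the agent has to route the tool name and JSON-encoded arguments back to the Python function. A minimal dispatcher (illustrative only, not part of the package) could look like:

```python
import json


def dispatch_tool_call(name: str, arguments_json: str, tools: dict) -> str:
    # Decode the model-supplied JSON arguments and invoke the registered tool.
    arguments = json.loads(arguments_json)
    return tools[name](**arguments)
```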
Any scraper can be wrapped to convert expected network errors into plain-text outputs:

```python
import requests

from prediction_market_agent.tools.tool_exception_handler import tool_exception_handler
from prediction_market_agent.tools.web_scrape.structured_summary import web_scrape_structured

# Wrap scraper with exception handler
web_scrape_handled = tool_exception_handler(
    map_exception_to_output={
        requests.exceptions.HTTPError: "Couldn't reach the URL.",
        requests.exceptions.Timeout: "Request timed out.",
    }
)(web_scrape_structured)

# Now safe to use without try/except
result = web_scrape_handled(url="https://example.com")
```
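The handler's behaviour can be approximated with a plain decorator. This is a sketch under the assumption that exact exception types are mapped to replacement strings; the real `tool_exception_handler` implementation may differ:

```python
def map_exceptions_to_output(mapping):
    # Return a decorator that swaps listed exception types for friendly strings,
    # so the wrapped tool never raises those exceptions to the caller.
    def decorator(fn):
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except tuple(mapping) as exc:
                return mapping[type(exc)]
        return wrapper
    return decorator
```

Returning a string instead of raising keeps the tool usable inside LLM agent loops, where an unhandled exception would abort the whole run rather than letting the model react to the failure.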