Overview
The utilities directory contains helper modules that provide core functionality for HTTP requests, HTML parsing, and ad blocking. These utilities are used throughout the extractors and API endpoints.
HTTP Client
The HTTP client (utils/http_client.py) uses cloudscraper to bypass Cloudflare and other anti-bot protections.
Module Structure
Fetching HTML
- 30-second timeout to prevent hanging requests
- Automatic Cloudflare bypass via cloudscraper
- Error logging for debugging
- Returns None on failure for easy error handling
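The behaviour listed above can be sketched as follows. This is a minimal sketch, not the module's actual code: the function name fetch_html, its scraper and timeout parameters, and the lazy _get_scraper helper are assumptions made for illustration. Only cloudscraper.create_scraper() and the requests-style get() call are real library APIs.

```python
import logging

logger = logging.getLogger(__name__)

_scraper = None  # shared scraper, created on first use


def _get_scraper():
    """Create the cloudscraper session once and reuse it afterwards (hypothetical helper)."""
    global _scraper
    if _scraper is None:
        import cloudscraper  # third-party; imported lazily so the sketch loads without it

        _scraper = cloudscraper.create_scraper()
    return _scraper


def fetch_html(url, scraper=None, timeout=30):
    """Return the page body as text, or None if the request fails."""
    if scraper is None:
        scraper = _get_scraper()
    try:
        response = scraper.get(url, timeout=timeout)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
        return response.text
    except Exception as exc:
        # Log with the URL so failures can be traced, then signal failure with None
        logger.error("Failed to fetch %s: %s", url, exc)
        return None
```

Callers then only need a single check: if the result is None, skip the page or return an error response.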
Fetching JSON
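A JSON helper would presumably mirror the HTML one. The sketch below assumes a hypothetical fetch_json function; the scraper is passed in explicitly to keep the example short, whereas the real module most likely uses its global instance.

```python
import logging

logger = logging.getLogger(__name__)


def fetch_json(url, scraper, timeout=30):
    """Return the decoded JSON body, or None on a request or decoding error."""
    try:
        response = scraper.get(url, timeout=timeout)
        response.raise_for_status()
        return response.json()  # raises ValueError if the body is not valid JSON
    except Exception as exc:
        logger.error("Failed to fetch JSON from %s: %s", url, exc)
        return None
```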
Accessing the Scraper Directly
For advanced use cases, access the global _scraper instance:
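A cloudscraper scraper is a subclass of requests.Session, so per-request options such as custom headers work as they do in requests. The helper name get_with_referer and the import path in the comments are assumptions for illustration.

```python
def get_with_referer(scraper, url, referer, timeout=30):
    """GET a URL with a Referer header and return the raw response object."""
    # Per-request headers are merged with the session's default headers
    return scraper.get(url, headers={"Referer": referer}, timeout=timeout)


# Assumed import path, based on the module location named above:
#     from utils.http_client import _scraper
#     response = get_with_referer(_scraper, "https://example.com/embed/1", "https://example.com/")
```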
Parser Utilities
The parser module (utils/parser.py) provides helper functions for safe HTML parsing.
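A helper of this kind might look like the sketch below. The signature, including the default parameter, is an assumption; the element is expected to expose get_text(), as BeautifulSoup tags do.

```python
def safe_text(element, default=""):
    """Return the stripped text of a parsed element, or `default` if it is None."""
    if element is None:
        return default
    return element.get_text(strip=True)


# Usage with BeautifulSoup (third-party, shown as comments):
#     from bs4 import BeautifulSoup
#     soup = BeautifulSoup("<h1> Title </h1>", "html.parser")
#     safe_text(soup.find("h1"))          # "Title"
#     safe_text(soup.find("h2"), "N/A")   # "N/A", because find() returned None
```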
Why Use safe_text?
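Direct text extraction can fail if elements are missing. With a parser such as BeautifulSoup, a lookup that finds nothing returns None, and reading its text raises an AttributeError. safe_text guards against that case.
Ad Blocker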
The ad blocker (utils/adblocker.py) uses EasyList rules to remove ads and tracking scripts from HTML.
Loading Ad Block Rules
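EasyList is a plain-text filter list. As a minimal sketch, the loader below keeps only simple domain rules of the form ||domain^ and ignores every other rule type (element hiding, options, exceptions). The function names are hypothetical, and the real module may support far more of the syntax.

```python
import re

# Matches plain domain-anchor rules such as "||ads.example.com^"
_DOMAIN_RULE = re.compile(r"\|\|([a-z0-9.-]+)\^")


def parse_blocked_domains(lines):
    """Collect blocked domains from an iterable of EasyList-style rule lines."""
    domains = set()
    for line in lines:
        rule = line.strip().lower()
        # Skip blanks, comments ("!"), the "[Adblock Plus ...]" header and "@@" exceptions
        if not rule or rule.startswith(("!", "[", "@@")):
            continue
        match = _DOMAIN_RULE.fullmatch(rule)
        if match:
            domains.add(match.group(1))
    return domains


def load_blocked_domains(path):
    """Read a downloaded filter list from disk and parse it."""
    with open(path, encoding="utf-8") as handle:
        return parse_blocked_domains(handle)
```

Parsing once at startup and keeping the resulting set in memory avoids re-reading the list on every request.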
Cleaning HTML
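Cleaning can be sketched as removing script and iframe elements whose src points at a blocked domain. To stay within the standard library, this example uses a regular expression as a stand-in for a real HTML parser, so it handles only well-formed, quoted src attributes. The names is_blocked and clean_html, and the blocked_domains set, are assumptions.

```python
import re
from urllib.parse import urlparse

# Matches <script> and <iframe> elements that carry a quoted src attribute
_EMBED = re.compile(
    r"<(script|iframe)\b[^>]*?\ssrc\s*=\s*[\"']([^\"']+)[\"'][^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)


def is_blocked(url, blocked_domains):
    """True if the URL's host is a blocked domain or one of its subdomains."""
    host = urlparse(url).hostname
    if host is None:  # relative URL: same site, keep it
        return False
    return any(host == d or host.endswith("." + d) for d in blocked_domains)


def clean_html(html, blocked_domains):
    """Remove script/iframe elements that load from blocked domains."""
    def drop_if_blocked(match):
        return "" if is_blocked(match.group(2), blocked_domains) else match.group(0)

    return _EMBED.sub(drop_if_blocked, html)
```

Elements from other hosts, such as the actual video player iframe, pass through unchanged.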
When to Use Ad Blocking
Use ad blocking when:
- Extracting iframe players (to avoid ad iframes)
- Scraping pages with heavy ad content
- Improving parsing reliability
Skip ad blocking when:
- Speed is critical (ad blocking adds processing overhead)
- The target site has minimal ads
- You need the original HTML structure
Creating Custom Utilities
To add new utilities:
1. Create Utility Module
2. Export from __init__.py
3. Use in Extractors
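The three steps can be sketched with a small example. The module name url_helpers and the function absolute_url are hypothetical; the export and import lines appear as comments so the block stays runnable as a single file.

```python
# Step 1: utils/url_helpers.py (hypothetical module name)
from urllib.parse import urljoin


def absolute_url(base_url, href):
    """Resolve a possibly relative link against the page it came from."""
    if not href:
        return None
    return urljoin(base_url, href.strip())


# Step 2: utils/__init__.py would re-export it:
#     from .url_helpers import absolute_url
#
# Step 3: an extractor could then import it from the package:
#     from utils import absolute_url
```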
Best Practices
Reuse HTTP Client Instance
Always use the global _scraper instance instead of creating new scrapers.
Handle Timeouts Appropriately
Set reasonable timeouts based on the operation:
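One way to keep timeouts consistent is a small lookup table. The operation names and values below are illustrative assumptions, not figures from the project; the 30-second default matches the HTML fetch timeout described earlier.

```python
# Hypothetical per-operation timeouts in seconds; tune them for the target site
TIMEOUTS = {
    "search": 10,     # small responses; fail fast
    "page": 30,       # full HTML pages
    "download": 120,  # large payloads
}
DEFAULT_TIMEOUT = 30


def timeout_for(operation):
    """Return the timeout for an operation, falling back to the default."""
    return TIMEOUTS.get(operation, DEFAULT_TIMEOUT)


# Usage with a requests-style client:
#     scraper.get(url, timeout=timeout_for("search"))
```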
Log Errors with Context
Include relevant context in error messages:
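A sketch of what that context might include: the URL, the component that made the request, and the exception type. The helper name log_fetch_error and the extractor field are assumptions.

```python
import logging

logger = logging.getLogger(__name__)


def log_fetch_error(url, extractor, exc):
    """Log a failed request together with the URL and the extractor that made it."""
    logger.error(
        "Fetch failed: url=%s extractor=%s error=%s: %s",
        url, extractor, type(exc).__name__, exc,
    )
```

Passing the values as logging arguments, rather than formatting the string up front, means the message is only built when the record is actually emitted.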
Use safe_text for Optional Fields
Use safe_text for any field that might be missing.
Common Patterns
Retry Failed Requests
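Because the fetch helpers return None on failure, a retry wrapper can simply loop until it gets a result. This is a minimal sketch with exponential backoff. The name fetch_with_retry and its parameters are assumptions, and fetch stands for any callable that takes a URL and returns None on failure.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-None result or attempts run out."""
    for attempt in range(attempts):
        result = fetch(url)
        if result is not None:
            return result
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return None
```

The sleep parameter exists only so the delay can be replaced in tests.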
Extract with Fallbacks
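When a site uses different markup across pages, trying several selectors in order is a common approach. The sketch below assumes a BeautifulSoup-style object with select_one(); the function name first_text and the selectors in the usage comment are hypothetical.

```python
def first_text(soup, selectors, default=""):
    """Return the text of the first selector that matches, else `default`."""
    for selector in selectors:
        element = soup.select_one(selector)
        if element is not None:
            text = element.get_text(strip=True)
            if text:  # skip elements that exist but are empty
                return text
    return default


# Usage (selectors are examples only):
#     title = first_text(soup, ["h1.title", "meta-title", "h1"], default="Unknown")
```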
Validate and Clean Data
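Scraped values are worth normalizing before they reach the API. The sketch below assumes items are dicts with title and url keys, which is an illustrative shape rather than the project's actual schema.

```python
from urllib.parse import urlparse


def clean_item(raw):
    """Return a normalized {'title', 'url'} dict, or None if the item is unusable."""
    title = " ".join((raw.get("title") or "").split())  # collapse runs of whitespace
    url = (raw.get("url") or "").strip()
    parsed = urlparse(url)
    # Require a non-empty title and an absolute http(s) URL
    if not title or parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None
    return {"title": title, "url": url}
```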
Performance Tips
Connection Pooling
The global _scraper instance automatically manages connection pooling for better performance.
Timeout Configuration
Use appropriate timeouts to prevent slow requests from blocking the server.
Selective Ad Blocking
Only use ad blocking when necessary, as it adds processing overhead.
Error Handling
Return None or default values instead of raising exceptions for better reliability.
Next Steps
Flask Setup
Learn about Flask application structure
Extractors
Create custom content extractors