Overview
Extractors are specialized modules that parse HTML content and extract structured data. Web Scrapping Hub uses a modular extractor pattern that makes it easy to add support for new content types.
Extractor Architecture
All extractors follow a consistent pattern:
- Fetch HTML: Use http_client.fetch_html() to retrieve the page
- Parse with BeautifulSoup: Create a soup object for DOM traversal
- Extract Data: Use CSS selectors to find and extract content
- Return Structured Data: Return consistent JSON-like dictionaries
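The four steps above might look like the following minimal sketch. It assumes the beautifulsoup4 package is installed. The fetch_html function here is a local stand-in that returns canned HTML, since the real signature of http_client.fetch_html() is not shown on this page; the extract_titles name and the .item / .title selectors are hypothetical.

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<div class="item"><a class="title" href="/watch/1">First</a></div>
<div class="item"><a class="title" href="/watch/2">Second</a></div>
"""


def fetch_html(url):
    # Stand-in for http_client.fetch_html(); returns canned HTML so the
    # sketch runs offline. The real helper's signature is an assumption.
    return SAMPLE_HTML


def extract_titles(url):
    html = fetch_html(url)                     # 1. fetch the HTML
    soup = BeautifulSoup(html, "html.parser")  # 2. parse it into a soup object
    results = []
    for link in soup.select(".item a.title"):  # 3. extract with CSS selectors
        results.append({
            "title": link.get_text(strip=True),
            "url": link.get("href"),
        })
    return results                             # 4. return structured data
```

Each result is a plain dictionary, so it can be serialized to JSON without further conversion.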
Available Extractors
Generic Extractor
The generic extractor (extractors/generic_extractor.py) handles content listings and movie information.
Listing Extraction
Movie Information Extraction
Series Extractor
The series extractor (extractors/serie_extractor.py) handles TV shows and anime series with multiple episodes.
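As an illustration of what "multiple episodes" can mean for the returned data, here is a hedged sketch that groups episodes under their season. The markup, the data-season / data-episode attributes, and the extract_seasons name are all assumptions made for the example; the actual serie_extractor.py may be organized differently.

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<div class="season" data-season="1">
  <a class="episode" data-episode="1" href="/s1/e1">Pilot</a>
  <a class="episode" data-episode="2" href="/s1/e2">Second</a>
</div>
<div class="season" data-season="2">
  <a class="episode" data-episode="1" href="/s2/e1">Return</a>
</div>
"""


def extract_seasons(html):
    soup = BeautifulSoup(html, "html.parser")
    seasons = []
    # Only match elements that carry the attribute we are about to read.
    for season in soup.select("div.season[data-season]"):
        episodes = [
            {
                "episode": int(ep["data-episode"]),
                "title": ep.get_text(strip=True),
                "url": ep.get("href"),
            }
            for ep in season.select("a.episode[data-episode]")
        ]
        seasons.append({
            "season": int(season["data-season"]),
            "episodes": episodes,
        })
    return seasons
```

The int() conversions assume numeric attribute values; a real site may need more forgiving parsing.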
Iframe Extractor
The iframe extractor (extractors/iframe_extractor.py) finds video player iframes in content pages.
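Finding player iframes could be sketched as below. This is not the project's implementation: the extract_iframes name, the data-src fallback, and the choice to resolve relative URLs with urljoin are assumptions for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_iframes(html, base_url):
    """Return absolute URLs for every iframe that declares a source."""
    soup = BeautifulSoup(html, "html.parser")
    sources = []
    for iframe in soup.find_all("iframe"):
        # Some pages defer loading and keep the real URL in data-src.
        src = iframe.get("src") or iframe.get("data-src")
        if not src:
            continue  # skip iframes without any usable source
        # Resolve relative and protocol-relative URLs against the page URL.
        sources.append(urljoin(base_url, src))
    return sources
```

Returning a list (possibly empty) rather than a single URL keeps the caller's handling uniform when a page embeds several players or none.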
Creating Custom Extractors
To create a new extractor, follow this pattern:
1. Create Extractor File
Create a new file in backend/extractors/:
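A new extractor file might start from a template like this one. The file name news_extractor.py, the extract_article function, and the selectors are hypothetical. The fetch function is passed in as a parameter so the sketch stays runnable on its own; in the project it would presumably be http_client.fetch_html.

```python
# Hypothetical file: backend/extractors/news_extractor.py
from bs4 import BeautifulSoup


def extract_article(url, fetch):
    """Return a dict describing one article page, or None on failure.

    `fetch` is any callable that maps a URL to an HTML string.
    """
    try:
        soup = BeautifulSoup(fetch(url), "html.parser")
    except Exception as exc:  # network failure or unusable response
        print(f"news_extractor: could not load {url}: {exc}")
        return None

    # select_one returns None when nothing matches, so check before use.
    title = soup.select_one("h1.article-title")
    body = soup.select_one("div.article-body")
    return {
        "url": url,
        "title": title.get_text(strip=True) if title else None,
        "body": body.get_text(" ", strip=True) if body else None,
    }
```

Passing fetch as an argument also makes the extractor easy to exercise with canned HTML, for example extract_article(url, lambda _: saved_html).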
2. Import in App
Add your extractor to app.py:
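Wiring an extractor into a Flask app could look roughly like this. The route path /api/article, the query parameter name, and the placeholder extract_article function are assumptions; in the real app.py the extractor would be imported from the extractors package rather than defined inline.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def extract_article(url):
    # Placeholder for an extractor imported from backend/extractors/;
    # defined inline only so this sketch runs on its own.
    return {"url": url, "title": "Example"}


@app.route("/api/article")
def article():
    url = request.args.get("url")
    if not url:
        # Reject requests that do not say which page to extract.
        return jsonify({"error": "missing 'url' query parameter"}), 400
    return jsonify(extract_article(url))
```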
3. Handle Edge Cases
Always handle common issues:
- Lazy-loaded images: Check multiple attributes (data-src, data-lazy-src, src)
- Missing elements: Use conditional checks before accessing text/attributes
- Encoding issues: BeautifulSoup detects the encoding automatically when given raw bytes, including with 'html.parser'
- Exceptions: Wrap extraction logic in try-except blocks
Best Practices
Use CSS Selectors Efficiently
Prefer specific CSS selectors over complex traversal:
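To make the contrast concrete, the sketch below finds the same links twice, once by chained traversal and once with a single selector. The markup and variable names are invented for the example.

```python
from bs4 import BeautifulSoup

HTML = """
<div id="content">
  <ul class="results">
    <li><a href="/a">A</a></li>
    <li><a href="/b">B</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Manual traversal: each find() can return None, so every step needs a guard.
container = soup.find("div", id="content")
result_list = container.find("ul", class_="results") if container else None
links_traversal = (
    [li.find("a") for li in result_list.find_all("li")] if result_list else []
)

# One CSS selector expresses the same path and yields [] when nothing matches.
links_selector = soup.select("#content ul.results > li > a")

hrefs = [a["href"] for a in links_selector]
```

The selector version states the expected document structure in one place, which tends to be easier to update when a site changes its markup.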
Handle Lazy Loading
Always check for lazy-loaded images:
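A small helper can centralize this check. The sketch below tries data-src, data-lazy-src, and src in that order; the image_url name and the decision to skip inline data: placeholders are assumptions, not something this page specifies.

```python
from bs4 import BeautifulSoup

# Checked in order: lazy-loading attributes first, plain src last.
LAZY_ATTRS = ("data-src", "data-lazy-src", "src")


def image_url(img):
    """Return the first usable image URL on an <img> tag, or None."""
    if img is None:
        return None
    for attr in LAZY_ATTRS:
        value = img.get(attr)
        # Skip empty values and inline placeholders such as 1x1 data: GIFs.
        if value and not value.startswith("data:"):
            return value
    return None
```

Usage: image_url(soup.select_one(".poster img")) returns a URL string or None, so callers never have to index into a possibly missing tag.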
Return Consistent Data Structures
Always return dictionaries with consistent keys:
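One way to enforce this is to build every result through a single function. The key names in ITEM_KEYS and the make_item helper are hypothetical; the point is only that absent values become None instead of missing keys.

```python
# Hypothetical schema; the project's real keys are not listed on this page.
ITEM_KEYS = ("title", "url", "image", "year")


def make_item(**fields):
    """Build a result dict that always contains every key in ITEM_KEYS."""
    unknown = set(fields) - set(ITEM_KEYS)
    if unknown:
        # Fail loudly on misspelled keys instead of silently dropping them.
        raise KeyError(f"unexpected keys: {sorted(unknown)}")
    return {key: fields.get(key) for key in ITEM_KEYS}
```

With this, make_item(title="A", url="/a") still includes image and year (both None), so consumers can read any key without checking for its presence.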
Log Errors Appropriately
Use descriptive error messages:
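A descriptive message usually names what failed and for which input. The sketch below uses the standard logging module; the logger name and the safe_extract wrapper are assumptions for illustration.

```python
import logging

logger = logging.getLogger("extractors")


def safe_extract(extract, url):
    """Run an extractor and log a descriptive error if it raises."""
    try:
        return extract(url)
    except Exception:
        # logger.exception records the message at ERROR level together
        # with the traceback of the exception being handled.
        logger.exception(
            "Extraction failed for url=%s using %s",
            url,
            getattr(extract, "__name__", repr(extract)),
        )
        return None
```

Including the URL and the extractor name in the message makes it possible to reproduce a failure from the log line alone.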
Testing Extractors
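A test against captured HTML could be sketched as follows, written as plain functions that pytest would collect. The extract_movie function, its selectors, and the inline SAVED_PAGE string are stand-ins; in practice the HTML would be read from a file saved from the target site.

```python
from bs4 import BeautifulSoup

# Stand-in for a page saved from the target site.
SAVED_PAGE = """
<html><body>
  <h1 class="title">Example Movie</h1>
  <span class="year">2021</span>
</body></html>
"""


def extract_movie(html):
    # Hypothetical extractor under test.
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h1.title")
    year = soup.select_one("span.year")
    return {
        "title": title.get_text(strip=True) if title else None,
        "year": year.get_text(strip=True) if year else None,
    }


def test_extracts_fields_from_saved_page():
    assert extract_movie(SAVED_PAGE) == {"title": "Example Movie", "year": "2021"}


def test_missing_elements_yield_none():
    # A page without the expected elements should not raise.
    assert extract_movie("<html></html>") == {"title": None, "year": None}
```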
Test your extractors with real HTML captured from the pages they target.
Next Steps
- Flask Setup: Learn about Flask application structure
- Utilities: Explore HTTP client and parsing utilities