Overview
The Generic Extractor module provides functions to parse HTML content and extract structured data for movies and series. It handles lazy-loaded images, metadata extraction, and robust fallback mechanisms.
Functions
extraer_listado
Extracts a list of movies or series from an HTML page containing article listings.
Signature
Parameters
The HTML content to parse, typically from a listing or catalog page
Returns
Returns a list of dictionaries, where each dictionary contains:
- The post ID from the data-id attribute
- The URL slug extracted from the article link
- The title of the movie or series
- The alt text from the poster image
- The poster image URL (handles lazy-loading)
- The release year
- The genres as a string
- The language (e.g., “Latino”, “Otro”)
- The content type: “pelicula” or “serie”
- The full URL to the content page
Example
Implementation Details
- Lazy Loading Handling: The function prioritizes the data-srcset, data-src, and data-lazy-src attributes over src to avoid placeholder images
- Fallback Mechanism: If a placeholder image is detected (containing data:image), it searches for a <noscript> tag with the actual image
- Content Type Detection: Determines if content is a movie or series based on CSS classes (movies or tvshows)
- Error Handling: Catches and logs parsing errors for individual articles without stopping the entire extraction
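The attribute priority, the <noscript> fallback, and the class-based type detection described above can be sketched as follows. This is a minimal illustration and not the module's actual code: the helper names (first_srcset_url, resolve_poster, content_type), the sample markup, and the choice to return None when neither class is present are all assumptions. It also assumes a parser such as html.parser that exposes the contents of <noscript> as regular tags.

```python
from bs4 import BeautifulSoup

# Attribute order follows the priority described above; src comes last.
IMAGE_ATTRS = ("data-srcset", "data-src", "data-lazy-src", "src")


def first_srcset_url(value):
    """Return the first URL from a srcset-style value such as 'a.jpg 1x, b.jpg 2x'."""
    candidates = value.split(",")[0].split()
    return candidates[0] if candidates else ""


def resolve_poster(article):
    """Pick the poster URL for one <article> tag (hypothetical helper)."""
    img = article.find("img")
    if img is None:
        return ""
    url = ""
    for attr in IMAGE_ATTRS:
        value = (img.get(attr) or "").strip()
        if value:
            url = first_srcset_url(value) if attr == "data-srcset" else value
            break
    if not url or "data:image" in url:
        # Placeholder detected: look for the real image inside <noscript>.
        # If none is found, the placeholder value is returned unchanged.
        noscript = article.find("noscript")
        inner = noscript.find("img") if noscript is not None else None
        if inner is not None and inner.get("src"):
            url = inner["src"].strip()
    return url


def content_type(article):
    """Map CSS classes to the content type; None if neither class is present."""
    classes = article.get("class") or []
    if "tvshows" in classes:
        return "serie"
    if "movies" in classes:
        return "pelicula"
    return None


# Hypothetical markup, made up for this sketch.
SAMPLE_HTML = """
<article class="item tvshows" data-id="42">
  <img src="data:image/gif;base64,R0lGODlhAQABAAAAACw=" alt="Example Show">
  <noscript><img src="https://example.com/poster.jpg" alt="Example Show"></noscript>
</article>
<article class="item movies" data-id="7">
  <img data-src="https://example.com/real.jpg" src="data:image/gif;base64,AAAA">
</article>
"""

for article in BeautifulSoup(SAMPLE_HTML, "html.parser").find_all("article"):
    print(article.get("data-id"), content_type(article), resolve_poster(article))
```

Checking the lazy-loading attributes before src matters because lazy-loading scripts usually leave a tiny inline data:image value in src until the real image is swapped in.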
extraer_info_pelicula
Extracts detailed information about a specific movie from its detail page.
Signature
Parameters
The HTML content of a movie detail page
Returns
Returns a dictionary containing:
- The movie title
- The movie synopsis/description
- The release date
- A list of genre names
- The poster image URL with multiple fallback options
Example
Implementation Details
- Multi-level Image Fallback:
  - Attempts to extract from lazy-loaded attributes (data-src, data-lazy-src)
  - Falls back to <noscript> tags if placeholder detected
  - Uses Open Graph (og:image) meta tags as secondary fallback
  - Uses Twitter Card (twitter:image) meta tags as final fallback
- Robust Element Selection: Uses CSS selectors and attribute-based selection for reliable parsing
- Graceful Degradation: Returns empty strings or lists when elements are not found
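The fallback chain above can be sketched roughly as follows. The function name detail_poster, the helpers, and the div.poster img selector are hypothetical and chosen only for illustration; the real detail page markup and the module's own selectors may differ.

```python
from bs4 import BeautifulSoup


def is_placeholder(url):
    """Treat empty values and inline data:image URIs as placeholders."""
    return not url or "data:image" in url


def meta_content(soup, key, value):
    """Return the content of the first <meta key="value"> tag, or ''."""
    tag = soup.find("meta", attrs={key: value})
    return (tag.get("content") or "").strip() if tag is not None else ""


def detail_poster(html, poster_selector="div.poster img"):
    """Resolve a poster URL; poster_selector is an assumed, hypothetical selector."""
    soup = BeautifulSoup(html, "html.parser")
    url = ""
    img = soup.select_one(poster_selector)
    if img is not None:
        # 1. Lazy-loaded attributes first, then the plain src.
        url = (img.get("data-src") or img.get("data-lazy-src") or img.get("src") or "").strip()
        if is_placeholder(url):
            # 2. The real image inside the next <noscript>, if there is one.
            noscript = img.find_next("noscript")
            inner = noscript.find("img") if noscript is not None else None
            url = (inner.get("src") or "").strip() if inner is not None else ""
    if is_placeholder(url):
        # 3. Open Graph metadata.
        url = meta_content(soup, "property", "og:image")
    if is_placeholder(url):
        # 4. Twitter Card metadata as the last resort.
        url = meta_content(soup, "name", "twitter:image")
    return url


# Hypothetical page with no poster element, only Open Graph metadata.
OG_ONLY = '<head><meta property="og:image" content="https://example.com/og.jpg"></head>'
print(detail_poster(OG_ONLY))
```

When every level fails the sketch returns an empty string, in line with the graceful degradation behavior described above.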
Use Cases
Scraping Movie Catalogs
Use extraer_listado to build a database of available movies and series from catalog pages.
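As a rough sketch of this use case, the helper below merges the results of several catalog pages into one dictionary and drops duplicates. The extractor is passed in as a callable so the example runs without the module; build_catalog, fake_extract, and the "id" key are assumptions, since the actual import path and dictionary keys are not shown here.

```python
def build_catalog(pages, extract, key="id"):
    """Merge the items of several listing pages into one dict keyed by post ID.

    `extract` is any callable that turns HTML into a list of dicts, for
    example extraer_listado. The key name "id" is an assumption.
    """
    catalog = {}
    for html in pages:
        for item in extract(html):
            item_id = item.get(key)
            # Keep the first occurrence of each ID; skip items without one.
            if item_id and item_id not in catalog:
                catalog[item_id] = item
    return catalog


def fake_extract(html):
    """Stand-in for extraer_listado so the sketch is self-contained."""
    return [{"id": token} for token in html.split()]


catalog = build_catalog(["1 2", "2 3"], fake_extract)
print(sorted(catalog))  # ['1', '2', '3']
```

In real use the pages would be the HTML of successive catalog pages and extract would be extraer_listado itself.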
Getting Movie Details
Use extraer_info_pelicula to fetch complete information about a specific movie.
Dependencies
- BeautifulSoup4: HTML parsing library
Error Handling
Both functions include error handling mechanisms:
- extraer_listado: Catches exceptions during article parsing and continues processing remaining articles
- extraer_info_pelicula: Returns default values (empty strings, empty lists) when elements are missing
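Both behaviors can be illustrated with a short sketch. The names parse_article, parse_listing, and text_or_default, as well as the "id" and "url" keys, are hypothetical and stand in for the module's real internals.

```python
import logging

from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)


def parse_article(article):
    """Parse one article; raises TypeError when the article has no <a> tag."""
    link = article.find("a")["href"]
    return {"id": article.get("data-id", ""), "url": link}


def parse_listing(html):
    """Parse every <article>, logging and skipping the ones that fail."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for article in soup.find_all("article"):
        try:
            results.append(parse_article(article))
        except Exception as exc:
            # One broken article must not abort the whole extraction.
            logger.warning("Skipping article %s: %s", article.get("data-id"), exc)
    return results


def text_or_default(element, default=""):
    """Return the element's text, or a default when the element is missing."""
    return element.get_text(strip=True) if element is not None else default


# Hypothetical listing: the second article has no link and is skipped.
LISTING_HTML = """
<article data-id="1"><a href="/peliculas/one/">One</a></article>
<article data-id="2"><span>No link here</span></article>
<article data-id="3"><a href="/series/three/">Three</a></article>
"""

print([item["id"] for item in parse_listing(LISTING_HTML)])  # ['1', '3']
```

Catching the exception inside the loop keeps the failure local to one article, while the default-value helper lets a detail parser return empty strings or lists instead of raising.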