Overview
Web Scraping Hub is built on a robust web scraping architecture that bypasses anti-bot protections, extracts structured data from HTML sources, and handles dynamic content. The system uses cloudscraper for Cloudflare bypass and BeautifulSoup for HTML parsing.
Core Technologies
Cloudscraper: bypasses Cloudflare and anti-bot challenges automatically
BeautifulSoup: parses HTML and extracts data with CSS selectors
Flask API: serves extracted data through RESTful endpoints
Cloudflare Bypass System
The Challenge
Modern content sites use Cloudflare protection to prevent automated scraping:
JavaScript challenges
Cookie validation
Browser fingerprinting
Rate limiting
CAPTCHA systems
The Solution: Cloudscraper
The platform uses cloudscraper to automatically handle these protections:
import cloudscraper

_scraper = cloudscraper.create_scraper()

def fetch_html(url):
    """
    Fetch the HTML of a URL using cloudscraper.
    Bypasses Cloudflare and anti-bot protections.
    Returns the response text, or None on error.
    """
    try:
        response = _scraper.get(url, timeout=30)
        if response.status_code == 200:
            return response.text
        print(f"[ERROR] fetch_html: status {response.status_code} para {url}")
    except Exception as e:
        print(f"[ERROR] fetch_html: {e}")
    return None

def fetch_json(url):
    """
    Fetch JSON from a URL using cloudscraper.
    Returns the parsed JSON object, or None on error.
    """
    try:
        response = _scraper.get(url, timeout=30)
        if response.status_code == 200:
            return response.json()
        print(f"[ERROR] fetch_json: status {response.status_code} para {url}")
    except Exception as e:
        print(f"[ERROR] fetch_json: {e}")
    return None
The _scraper instance is created once and reused for all requests, maintaining session cookies and browser fingerprints for consistent bypass.
How Cloudscraper Works
Initial Request
Cloudscraper sends a request that mimics a real browser
Challenge Detection
Detects if Cloudflare returned a challenge page
JavaScript Execution
Solves JavaScript challenges automatically
Cookie Management
Stores and sends required cookies for subsequent requests
Content Retrieval
Returns the actual page content after bypass
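The challenge-detection step can be sketched as a simple heuristic. The status codes and page markers below are assumptions for illustration, not taken from the project's code (cloudscraper performs its own detection internally):

```python
def looks_like_challenge(status_code, body):
    """Heuristic check for a Cloudflare challenge response.

    The markers below are commonly seen in challenge pages,
    but this list is an assumption, not the library's logic.
    """
    if status_code in (403, 503):
        markers = ("cf-chl", "Checking your browser", "challenge-platform")
        return any(m in body for m in markers)
    return False
```

When a response trips this kind of check, cloudscraper solves the JavaScript challenge and retries with the resulting cookies instead of returning the challenge page to the caller.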
The scraping system uses specialized extractors for different data types:
The extraer_listado() function parses HTML to extract catalog items:
generic_extractor.py:4-51
def extraer_listado(html):
    soup = BeautifulSoup(html, 'html.parser')
    articulos = soup.select('article.item')
    datos = []
    for articulo in articulos:
        try:
            poster = articulo.select_one('.poster')
            enlace = articulo.select_one('a')['href']
            id_post = articulo.get('data-id', 'N/A')
            slug = enlace.rstrip('/').split('/')[-1]
            titulo = poster.select_one('h3').text.strip()
            # Image extraction with lazy-loading handling
            img_tag = poster.select_one('img')
            imagen = ''
            if img_tag:
                # Priority: data-srcset > data-src > data-lazy-src > src
                imagen = img_tag.get('data-srcset') or img_tag.get('data-src') or \
                         img_tag.get('data-lazy-src') or img_tag.get('src', '')
                # Fall back to noscript if a placeholder is detected
                if 'data:image' in imagen:
                    noscript = articulo.select_one('noscript img')
                    if noscript:
                        imagen = noscript.get('src', imagen)
                # Clean srcset format
                if ',' in imagen:
                    imagen = imagen.split(',')[0].split(' ')[0]
            # Extract metadata
            year = poster.select_one('.data p').text.strip()
            generos = poster.select_one('.data span').text.strip()
            idioma = 'Latino' if poster.select_one('.audio .latino') else 'Otro'
            tipo = 'pelicula' if 'movies' in articulo.get('class', []) else \
                   'serie' if 'tvshows' in articulo.get('class', []) else 'Otro'
            datos.append({
                "id": id_post,
                "slug": slug,
                "titulo": titulo,
                "imagen": imagen,
                "year": year,
                "generos": generos,
                "idioma": idioma,
                "tipo": tipo,
                "url": enlace
            })
        except Exception as e:
            print(f"[ERROR] Falló al parsear un artículo: {e}")
    return datos
Modern websites use lazy loading to improve performance, requiring sophisticated image extraction:
generic_extractor.py:18-30
if img_tag:
    imagen = img_tag.get('data-srcset') or img_tag.get('data-src') or \
             img_tag.get('data-lazy-src') or img_tag.get('src', '')
    # If still a placeholder, look for noscript
    if 'data:image' in imagen:
        noscript = articulo.select_one('noscript img')
        if noscript:
            imagen = noscript.get('src', imagen)
    # Clean srcset (e.g., "image.jpg 300w, image2.jpg 600w")
    if ',' in imagen:
        imagen = imagen.split(',')[0].split(' ')[0]
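The same fallback chain can be isolated as a pure function for testing. Here a plain dict stands in for the BeautifulSoup tag's attributes, and the function name is ours, not the project's:

```python
def resolve_image(attrs, noscript_src=None):
    """Pick the best image URL from a tag's attribute dict.

    Mirrors the order used in extraer_listado():
    data-srcset > data-src > data-lazy-src > src, with a noscript
    fallback for data: URI placeholders and srcset cleanup.
    """
    imagen = (attrs.get('data-srcset') or attrs.get('data-src')
              or attrs.get('data-lazy-src') or attrs.get('src', ''))
    # Swap in the noscript image when only a placeholder was found
    if 'data:image' in imagen and noscript_src:
        imagen = noscript_src
    # Keep only the first URL of a srcset list
    if ',' in imagen:
        imagen = imagen.split(',')[0].split(' ')[0]
    return imagen
```

For example, `resolve_image({'data-srcset': 'a.jpg 300w, b.jpg 600w'})` yields `'a.jpg'`, and a `data:image` placeholder is replaced by the noscript URL when one is supplied.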
The extraer_info_pelicula() function extracts comprehensive movie details:
generic_extractor.py:54-90
def extraer_info_pelicula(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Title extraction
    titulo = soup.select_one('div.data h1')
    titulo = titulo.text.strip() if titulo else 'No encontrado'
    # Synopsis extraction
    sinopsis_div = soup.find('div', itemprop='description')
    sinopsis = sinopsis_div.find('p').text.strip() if sinopsis_div and sinopsis_div.find('p') else ''
    # Release date
    fecha_estreno = soup.find('span', itemprop='dateCreated')
    fecha_estreno = fecha_estreno.text.strip() if fecha_estreno else ''
    # Genres extraction
    generos_div = soup.find('div', class_='sgeneros')
    generos = [a.text.strip() for a in generos_div.find_all('a')] if generos_div else []
    # Poster image with fallbacks
    poster_img = soup.select_one('div.poster img')
    imagen_poster = ''
    if poster_img:
        imagen_poster = poster_img.get('data-src') or poster_img.get('data-lazy-src') or \
                        poster_img.get('src', '')
        # Noscript fallback
        if 'data:image' in imagen_poster:
            noscript = soup.select_one('div.poster noscript img')
            if noscript:
                imagen_poster = noscript.get('src', imagen_poster)
    # OG tags fallback
    if not imagen_poster or 'data:image' in imagen_poster:
        og_image = soup.find('meta', property='og:image')
        if og_image:
            imagen_poster = og_image.get('content', imagen_poster)
    return {
        'titulo': titulo,
        'sinopsis': sinopsis,
        'fecha_estreno': fecha_estreno,
        'generos': generos,
        'imagen_poster': imagen_poster
    }
The extractor uses multiple fallback mechanisms: direct attributes → noscript tags → Open Graph meta tags → Twitter Card meta tags.
Episode and Season Extraction
The series extractor handles complex episodic structures:
def extraer_episodios_serie(url):
    html = fetch_html(url)
    if not html:
        return {"info": {}, "episodios": []}
    soup = BeautifulSoup(html, 'html.parser')
    # Extract series metadata
    titulo = soup.select_one('div.data h1')
    titulo = titulo.text.strip() if titulo else ''
    sinopsis = soup.select_one('div[itemprop="description"].wp-content')
    sinopsis = sinopsis.text.strip() if sinopsis else ''
    generos_div = soup.find('div', class_='sgeneros')
    generos = [a.text.strip() for a in generos_div.find_all('a')] if generos_div else []
    # Extract poster with fallbacks
    poster_img = soup.select_one('div.poster img')
    imagen_poster = poster_img.get('data-src') or poster_img.get('src', '') if poster_img else ''
    # Extract episodes by season
    temporadas_divs = soup.select('#seasons .se-c')
    episodios_data = []
    fechas_episodios = []
    for temporada_div in temporadas_divs:
        num_temporada = int(temporada_div.get('data-season', 0))
        episodios = temporada_div.select('li')
        for episodio in episodios:
            try:
                enlace_episodio = episodio.select_one('a')['href']
                titulo_ep = episodio.select_one('.epst').text.strip()
                numerando = episodio.select_one('.numerando').text.strip()
                numero_ep = int(numerando.split('-')[-1].strip())
                fecha = episodio.select_one('.date').text.strip()
                img_ep = episodio.select_one('img')
                imagen = img_ep.get('data-src') or img_ep.get('src', '') if img_ep else ''
                episodios_data.append({
                    "temporada": num_temporada,
                    "episodio": numero_ep,
                    "titulo": titulo_ep,
                    "fecha": fecha,
                    "imagen": imagen,
                    "url": enlace_episodio
                })
                if fecha:
                    fechas_episodios.append(fecha)
            except Exception as e:
                print(f"⚠️ Error en episodio: {e}")
    # Use the first episode date as the series premiere date
    fecha_estreno = fechas_episodios[0] if fechas_episodios else ''
    info = {
        "titulo": titulo,
        "sinopsis": sinopsis,
        "generos": generos,
        "imagen_poster": imagen_poster,
        "fecha_estreno": fecha_estreno
    }
    return {"info": info, "episodios": episodios_data}
Season Structure Parsing
The extractor navigates complex HTML structures:
<div id="seasons">
  <div class="se-c" data-season="1">
    <ul class="episodios">
      <li>Episode 1</li>
      <li>Episode 2</li>
    </ul>
  </div>
  <div class="se-c" data-season="2">
    <!-- Season 2 episodes -->
  </div>
</div>
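The `.numerando` label combines season and episode (e.g. "1 - 5"), and the extractor keeps only the trailing number, taking the season from the container's data-season attribute instead. That parsing step, isolated as a standalone helper (the function name is hypothetical):

```python
def parse_episode_number(numerando):
    """Parse the episode number from a '.numerando' label like '1 - 5'.

    Mirrors the series extractor: split on '-' and keep the last
    field. The season number comes from data-season, not from here.
    """
    return int(numerando.split('-')[-1].strip())
```

This tolerates both "1 - 5" and "1-5" spacing, but would raise ValueError on a label with no digits, which the extractor's per-episode try/except absorbs.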
The iframe extractor finds embedded video players:
def extraer_iframe_reproductor(url):
    html = fetch_html(url)
    if not html:
        print(f"❌ Error al acceder a: {url}")
        return None
    # Clean ads before parsing
    html_limpio = clean_html_ads(html)
    soup = BeautifulSoup(html_limpio, 'html.parser')
    # Find the player iframe
    iframe = soup.select_one('.dooplay_player iframe')
    if iframe and iframe.get('src'):
        url_reproductor = iframe['src']
        return {
            "player_url": url_reproductor,
            "fuente": url_reproductor.split('/')[2],  # Extract domain
            "formato": "iframe"
        }
    else:
        print("⚠️ No se encontró iframe de reproducción.")
        return None
The iframe extractor requires ad cleaning before parsing to remove obfuscated ad overlays that might interfere with player detection.
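Deriving `fuente` with `url.split('/')[2]` assumes an absolute http(s) URL. A possible alternative, sketched here with a hypothetical function name, is `urllib.parse.urlparse`, which also copes with scheme-relative `//host/...` embed URLs:

```python
from urllib.parse import urlparse

def extraer_dominio(player_url):
    """Return the host of an embed URL.

    Equivalent to url.split('/')[2] for absolute http(s) URLs,
    but also handles scheme-relative '//host/path' sources.
    """
    return urlparse(player_url).netloc
```

For an absolute URL both approaches agree; for `//cdn.example.net/v/1`, the split-based version would return an empty string while urlparse still recovers the host.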
Ad Blocking System
The platform includes server-side ad removal:
def clean_html_ads(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Ad selectors to remove
    ad_selectors = [
        '[id*="ad"]', '[class*="ad"]',
        '[id*="banner"]', '[class*="banner"]',
        '[id*="sponsor"]', '[class*="sponsor"]',
        '.advertisement', '.publicity',
        'iframe[src*="ads"]', 'iframe[src*="doubleclick"]'
    ]
    for selector in ad_selectors:
        for element in soup.select(selector):
            element.decompose()
    return str(soup)
The cleaner targets:
Banner advertisements
Sponsor content
Ad iframes
Tracking scripts
Promotional overlays
Video pre-roll ads
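Note that `[id*="ad"]` and `[class*="ad"]` are CSS substring matches, so they can over-match: "ad" also occurs inside unrelated words such as "header". A quick illustration of the matching rule (the helper name is ours):

```python
def matches_substring(value, needle):
    """Emulate the CSS [attr*="needle"] substring test."""
    return needle in (value or '')

# Substring matching is broad: 'ad' also appears inside
# unrelated identifiers, so legitimate elements can be removed.
assert matches_substring('ad-banner', 'ad')   # intended match
assert matches_substring('header', 'ad')      # false positive ("he-ad-er")
assert not matches_substring('menu', 'ad')
```

This is a deliberate trade-off: aggressive removal is acceptable here because the extractor only needs the player iframe to survive the cleanup.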
Error Handling and Resilience
Graceful Degradation
All extractors include error handling:
try:
    # Extraction logic
    titulo = soup.select_one('h1').text.strip()
except Exception as e:
    print(f"[ERROR] Falló extracción: {e}")
    titulo = 'No disponible'  # Fallback value
Timeout Configuration
response = _scraper.get(url, timeout=30)
The 30-second timeout balances tolerating slow responses against avoiding indefinite hangs.
Retry Logic
The system includes implicit retry through React Query on the frontend:
Automatic retries: 3 attempts by default
Exponential backoff: increasing delays between retries
Error boundaries: catch and display extraction failures
Content Type Classification
The system automatically classifies content by analyzing CSS classes:
tipo = 'pelicula' if 'movies' in articulo.get('class', []) else \
       'serie' if 'tvshows' in articulo.get('class', []) else 'Otro'
Movies: the CSS class list contains movies
Series: the CSS class list contains tvshows
Other: neither class (typically anime)
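Extracted as a standalone function (the name is ours), the classification reads:

```python
def clasificar_tipo(css_classes):
    """Classify an article by its CSS class list, as the extractor does.

    `css_classes` is the list bs4 returns for articulo.get('class', []).
    """
    if 'movies' in css_classes:
        return 'pelicula'
    if 'tvshows' in css_classes:
        return 'serie'
    return 'Otro'
```

Because membership is checked against the whole class list, an article tagged `item movies` classifies as a movie regardless of the other classes present.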
CSS Selector Strategy
The extractors use specific, resilient selectors:
Title: div.data h1 (fallback: h1.entry-title)
Synopsis: div[itemprop="description"] (fallback: none)
Poster: div.poster img (fallback: OG meta tags)
Genres: div.sgeneros a (fallback: empty array)
Episodes: #seasons .se-c li (fallback: empty array)
Player: .dooplay_player iframe (fallback: none)
Selectors prioritize attribute selectors with semantic meaning (e.g., itemprop) over generic class names for better resilience.
Reusable Scraper Instance
_scraper = cloudscraper.create_scraper()
Creating the scraper once and reusing it:
Maintains session cookies
Preserves browser fingerprint
Reduces initialization overhead
Improves bypass success rate
Selective Parsing
Extractors only parse needed elements:
# Only select article elements, not the entire document
articulos = soup.select('article.item')
Early Returns
Functions return early on failures:
html = fetch_html(url)
if not html:
    return {"info": {}, "episodios": []}
Multi-Architecture Support
The scraping system works across different CPU architectures:
AMD64: standard x86-64 servers and desktops
ARM64: Raspberry Pi 4, Apple Silicon, cloud instances
ARMv7: older Raspberry Pi models
The Docker image compiles dependencies natively for each architecture, ensuring optimal performance.
Best Practices
Use specific selectors with semantic meaning
Implement multiple fallbacks for critical data
Test selectors against multiple pages
Document selector assumptions
Wrap extraction logic in try-except blocks
Log errors with context for debugging
Return structured empty data on failure
Use default values for optional fields
Debugging Scraping Issues
Common Issues and Solutions
Cloudflare Block
Symptom: requests return 403 or challenge pages
Solutions:
Update the cloudscraper library
Check if the IP is rate-limited
Add delays between requests
Rotate user agents

Selector Failure
Symptom: data extraction returns empty or None
Solutions:
Verify the selector against the current HTML
Check for site structure changes
Implement additional fallbacks
Log HTML snippets for inspection

Image Loading Issues
Symptom: placeholder images instead of real images
Solutions:
Check data attributes first
Parse noscript tags
Use the OG meta tag fallback
Verify the srcset parsing logic
Catalog Browsing: see how extracted data populates the catalog
Search: learn about search extraction
Video Player: understand player iframe extraction