Overview

Web Scraping Hub is built on a robust web scraping architecture that bypasses anti-bot protections, extracts structured data from HTML sources, and handles dynamic content. The system uses cloudscraper for Cloudflare bypass and BeautifulSoup for HTML parsing.

Core Technologies

Cloudscraper

Bypasses Cloudflare and anti-bot challenges automatically

BeautifulSoup

Parses HTML and extracts data with CSS selectors

Flask API

Serves extracted data through RESTful endpoints

Cloudflare Bypass System

The Challenge

Modern content sites use Cloudflare protection to prevent automated scraping:
  • JavaScript challenges
  • Cookie validation
  • Browser fingerprinting
  • Rate limiting
  • CAPTCHA systems

The Solution: Cloudscraper

The platform uses cloudscraper to automatically handle these protections:
http_client.py:1-33
import cloudscraper

_scraper = cloudscraper.create_scraper()

def fetch_html(url):
    """
    Obtiene el HTML de una URL usando cloudscraper.
    Supera Cloudflare y protecciones anti-bot.
    Retorna el texto de la respuesta o None en caso de error.
    """
    try:
        response = _scraper.get(url, timeout=30)
        if response.status_code == 200:
            return response.text
        print(f"[ERROR] fetch_html: status {response.status_code} para {url}")
    except Exception as e:
        print(f"[ERROR] fetch_html: {e}")
    return None

def fetch_json(url):
    """
    Obtiene JSON de una URL usando cloudscraper.
    Retorna el objeto JSON parseado o None en caso de error.
    """
    try:
        response = _scraper.get(url, timeout=30)
        if response.status_code == 200:
            return response.json()
        print(f"[ERROR] fetch_json: status {response.status_code} para {url}")
    except Exception as e:
        print(f"[ERROR] fetch_json: {e}")
    return None
The _scraper instance is created once and reused for all requests, maintaining session cookies and browser fingerprints for consistent bypass.

How Cloudscraper Works

  1. Initial Request: Cloudscraper sends a request that mimics a real browser
  2. Challenge Detection: Detects whether Cloudflare returned a challenge page
  3. JavaScript Execution: Solves JavaScript challenges automatically
  4. Cookie Management: Stores and sends the required cookies on subsequent requests
  5. Content Retrieval: Returns the actual page content after the bypass
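The defaults shown in http_client.py are what the project uses, but cloudscraper also accepts tuning options. A configuration sketch (the browser and delay values here are illustrative, not the project's settings):

```python
import cloudscraper

# Illustrative configuration only. browser= pins the emulated browser
# fingerprint used for challenge solving; delay= gives the JavaScript
# challenge solver extra time before submitting its answer.
scraper = cloudscraper.create_scraper(
    browser={"browser": "chrome", "platform": "linux", "desktop": True},
    delay=10,  # seconds to wait while solving an IUAM challenge
)
```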

Modular Extractor Architecture

The scraping system uses specialized extractors for different data types:

Generic Extractor

Handles catalog listings and basic content extraction.

File: backend/extractors/generic_extractor.py

Functions:
  • extraer_listado() - Extract catalog items
  • extraer_info_pelicula() - Extract movie metadata

Generic Extractor Deep Dive

Catalog Listing Extraction

The extraer_listado() function parses HTML to extract catalog items:
generic_extractor.py:4-51
def extraer_listado(html):
    soup = BeautifulSoup(html, 'html.parser')
    articulos = soup.select('article.item')
    datos = []
    
    for articulo in articulos:
        try:
            poster = articulo.select_one('.poster')
            enlace = articulo.select_one('a')['href']
            id_post = articulo.get('data-id', 'N/A')
            slug = enlace.rstrip('/').split('/')[-1]
            titulo = poster.select_one('h3').text.strip()
            
            # Image extraction with lazy-loading handling
            img_tag = poster.select_one('img')
            imagen = ''
            if img_tag:
                # Priority: data-srcset > data-src > data-lazy-src > src
                imagen = img_tag.get('data-srcset') or img_tag.get('data-src') or \
                         img_tag.get('data-lazy-src') or img_tag.get('src', '')
                
                # Fallback to noscript if placeholder detected
                if 'data:image' in imagen:
                    noscript = articulo.select_one('noscript img')
                    if noscript:
                        imagen = noscript.get('src', imagen)
                
                # Clean srcset format
                if ',' in imagen:
                    imagen = imagen.split(',')[0].split(' ')[0]
            
            # Extract metadata
            year = poster.select_one('.data p').text.strip()
            generos = poster.select_one('.data span').text.strip()
            idioma = 'Latino' if poster.select_one('.audio .latino') else 'Otro'
            tipo = 'pelicula' if 'movies' in articulo.get('class', []) else \
                   'serie' if 'tvshows' in articulo.get('class', []) else 'Otro'
            
            datos.append({
                "id": id_post,
                "slug": slug,
                "titulo": titulo,
                "imagen": imagen,
                "year": year,
                "generos": generos,
                "idioma": idioma,
                "tipo": tipo,
                "url": enlace
            })
        except Exception as e:
            print(f"[ERROR] Falló al parsear un artículo: {e}")
    
    return datos

Image Extraction Strategy

Modern websites use lazy loading to improve performance, requiring sophisticated image extraction:
  1. Check data attributes first - data-srcset, data-src, data-lazy-src
  2. Detect placeholders - Check if image is data:image base64 placeholder
  3. Search noscript fallback - Find <noscript> tag with real image
  4. Clean srcset format - Extract first URL from responsive image sets
  5. Fall back to src - Use standard src attribute if all else fails
generic_extractor.py:18-30
if img_tag:
    imagen = img_tag.get('data-srcset') or img_tag.get('data-src') or \
             img_tag.get('data-lazy-src') or img_tag.get('src', '')
    
    # If still a placeholder, look for noscript
    if 'data:image' in imagen:
        noscript = articulo.select_one('noscript img')
        if noscript:
            imagen = noscript.get('src', imagen)
    
    # Clean srcset (e.g., "image.jpg 300w, image2.jpg 600w")
    if ',' in imagen:
        imagen = imagen.split(',')[0].split(' ')[0]
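The five-step strategy can be factored into a standalone helper. This is a hypothetical pick_image_url() function (not part of the project code), operating on a plain dict of <img> attributes:

```python
def pick_image_url(attrs, noscript_src=None):
    """Resolve a usable image URL from lazy-loading attributes.

    attrs is a dict of the <img> tag's attributes; noscript_src is the
    src of a <noscript> fallback image, if one exists. Illustrative
    helper mirroring the strategy above.
    """
    # Steps 1 and 5: data attributes first, then plain src
    imagen = (attrs.get("data-srcset") or attrs.get("data-src")
              or attrs.get("data-lazy-src") or attrs.get("src", ""))

    # Steps 2 and 3: base64 placeholder detected, prefer the noscript image
    if "data:image" in imagen and noscript_src:
        imagen = noscript_src

    # Step 4: srcset format ("a.jpg 300w, b.jpg 600w"), keep the first URL
    if "," in imagen:
        imagen = imagen.split(",")[0].split(" ")[0]
    return imagen
```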

Movie Metadata Extraction

The extraer_info_pelicula() function extracts comprehensive movie details:
generic_extractor.py:54-90
def extraer_info_pelicula(html):
    soup = BeautifulSoup(html, 'html.parser')
    
    # Title extraction
    titulo = soup.select_one('div.data h1')
    titulo = titulo.text.strip() if titulo else 'No encontrado'
    
    # Synopsis extraction
    sinopsis_div = soup.find('div', itemprop='description')
    sinopsis = sinopsis_div.find('p').text.strip() if sinopsis_div and sinopsis_div.find('p') else ''
    
    # Release date
    fecha_estreno = soup.find('span', itemprop='dateCreated')
    fecha_estreno = fecha_estreno.text.strip() if fecha_estreno else ''
    
    # Genres extraction
    generos_div = soup.find('div', class_='sgeneros')
    generos = [a.text.strip() for a in generos_div.find_all('a')] if generos_div else []
    
    # Poster image with fallbacks
    poster_img = soup.select_one('div.poster img')
    imagen_poster = ''
    if poster_img:
        imagen_poster = poster_img.get('data-src') or poster_img.get('data-lazy-src') or \
                       poster_img.get('src', '')
        
        # Noscript fallback
        if 'data:image' in imagen_poster:
            noscript = soup.select_one('div.poster noscript img')
            if noscript:
                imagen_poster = noscript.get('src', imagen_poster)
    
    # OG Tags fallback
    if not imagen_poster or 'data:image' in imagen_poster:
        og_image = soup.find('meta', property='og:image')
        if og_image:
            imagen_poster = og_image.get('content', imagen_poster)
    
    return {
        'titulo': titulo,
        'sinopsis': sinopsis,
        'fecha_estreno': fecha_estreno,
        'generos': generos,
        'imagen_poster': imagen_poster
    }
The extractor chains multiple fallback mechanisms: direct attributes → noscript tags → Open Graph meta tags.
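The meta-tag end of the chain can be extended one step further with Twitter Card tags, which the excerpt above does not check. A minimal sketch (hypothetical helper, assuming BeautifulSoup is available):

```python
from bs4 import BeautifulSoup

def poster_from_meta(html, fallback=""):
    """Last-resort poster lookup: og:image first, then twitter:image.

    Illustrative only; the twitter:image check extends the excerpt above.
    """
    soup = BeautifulSoup(html, "html.parser")
    og = soup.find("meta", property="og:image")
    if og and og.get("content"):
        return og["content"]
    # "name" is a positional argument of find(), so pass it via attrs
    tw = soup.find("meta", attrs={"name": "twitter:image"})
    if tw and tw.get("content"):
        return tw["content"]
    return fallback
```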

Series Extractor Deep Dive

Episode and Season Extraction

The series extractor handles complex episodic structures:
serie_extractor.py:5-81
def extraer_episodios_serie(url):
    html = fetch_html(url)
    if not html:
        return {"info": {}, "episodios": []}
    
    soup = BeautifulSoup(html, 'html.parser')
    
    # Extract series metadata
    titulo = soup.select_one('div.data h1')
    titulo = titulo.text.strip() if titulo else ''
    
    sinopsis = soup.select_one('div[itemprop="description"].wp-content')
    sinopsis = sinopsis.text.strip() if sinopsis else ''
    
    generos_div = soup.find('div', class_='sgeneros')
    generos = [a.text.strip() for a in generos_div.find_all('a')] if generos_div else []
    
    # Extract poster with fallbacks
    poster_img = soup.select_one('div.poster img')
    imagen_poster = poster_img.get('data-src') or poster_img.get('src', '') if poster_img else ''
    
    # Extract episodes by season
    temporadas_divs = soup.select('#seasons .se-c')
    episodios_data = []
    fechas_episodios = []
    
    for temporada_div in temporadas_divs:
        num_temporada = int(temporada_div.get('data-season', 0))
        episodios = temporada_div.select('li')
        
        for episodio in episodios:
            try:
                enlace_episodio = episodio.select_one('a')['href']
                titulo_ep = episodio.select_one('.epst').text.strip()
                numerando = episodio.select_one('.numerando').text.strip()
                numero_ep = int(numerando.split('-')[-1].strip())
                fecha = episodio.select_one('.date').text.strip()
                
                img_ep = episodio.select_one('img')
                imagen = img_ep.get('data-src') or img_ep.get('src', '') if img_ep else ''
                
                episodios_data.append({
                    "temporada": num_temporada,
                    "episodio": numero_ep,
                    "titulo": titulo_ep,
                    "fecha": fecha,
                    "imagen": imagen,
                    "url": enlace_episodio
                })
                
                if fecha:
                    fechas_episodios.append(fecha)
            except Exception as e:
                print(f"⚠️ Error en episodio: {e}")
    
    # Use first episode date as series premiere date
    fecha_estreno = fechas_episodios[0] if fechas_episodios else ''
    
    info = {
        "titulo": titulo,
        "sinopsis": sinopsis,
        "generos": generos,
        "imagen_poster": imagen_poster,
        "fecha_estreno": fecha_estreno
    }
    
    return {"info": info, "episodios": episodios_data}

Season Structure Parsing

The extractor navigates complex HTML structures:
<div id="seasons">
  <div class="se-c" data-season="1">
    <ul class="episodios">
      <li>Episode 1</li>
      <li>Episode 2</li>
    </ul>
  </div>
  <div class="se-c" data-season="2">
    <!-- Season 2 episodes -->
  </div>
</div>
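This nesting maps directly onto the selectors used in extraer_episodios_serie(). A minimal sketch of walking the same structure with BeautifulSoup:

```python
from bs4 import BeautifulSoup

html = """
<div id="seasons">
  <div class="se-c" data-season="1">
    <ul class="episodios"><li>Episode 1</li><li>Episode 2</li></ul>
  </div>
  <div class="se-c" data-season="2">
    <ul class="episodios"><li>Episode 1</li></ul>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# data-season gives the season number; the <li> count gives the episodes
episodes_per_season = {
    int(div["data-season"]): len(div.select("li"))
    for div in soup.select("#seasons .se-c")
}
# episodes_per_season == {1: 2, 2: 1}
```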

Iframe Extractor Deep Dive

Player URL Extraction

The iframe extractor finds embedded video players:
iframe_extractor.py:6-23
def extraer_iframe_reproductor(url):
    html = fetch_html(url)
    if not html:
        print(f"❌ Error al acceder a: {url}")
        return None
    
    # Clean ads before parsing
    html_limpio = clean_html_ads(html)
    soup = BeautifulSoup(html_limpio, 'html.parser')
    
    # Find player iframe
    iframe = soup.select_one('.dooplay_player iframe')
    
    if iframe and iframe.get('src'):
        url_reproductor = iframe['src']
        return {
            "player_url": url_reproductor,
            "fuente": url_reproductor.split('/')[2],  # Extract domain
            "formato": "iframe"
        }
    else:
        print("⚠️ No se encontró iframe de reproducción.")
        return None
The iframe extractor requires ad cleaning before parsing to remove obfuscated ad overlays that might interfere with player detection.

Ad Blocking System

The platform includes server-side ad removal:
adblocker.py
def clean_html_ads(html):
    soup = BeautifulSoup(html, 'html.parser')
    
    # Ad selectors to remove
    ad_selectors = [
        '[id*="ad"]', '[class*="ad"]',
        '[id*="banner"]', '[class*="banner"]',
        '[id*="sponsor"]', '[class*="sponsor"]',
        '.advertisement', '.publicity',
        'iframe[src*="ads"]', 'iframe[src*="doubleclick"]'
    ]
    
    for selector in ad_selectors:
        for element in soup.select(selector):
            element.decompose()
    
    return str(soup)

The cleaner removes:
  • Banner advertisements
  • Sponsor content
  • Ad iframes
  • Tracking scripts
  • Promotional overlays
  • Video pre-roll ads
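The decompose-by-selector approach can be exercised on a tiny document. A trimmed reimplementation for illustration (two selectors instead of the full list above):

```python
from bs4 import BeautifulSoup

def clean_html_ads(html):
    # Same approach as adblocker.py, trimmed to two selectors
    soup = BeautifulSoup(html, "html.parser")
    for selector in ('[id*="banner"]', '[class*="sponsor"]'):
        for element in soup.select(selector):
            element.decompose()  # remove the node from the tree entirely
    return str(soup)

html = '<div id="top-banner">ad</div><p class="sponsor-box">ad</p><p>content</p>'
cleaned = clean_html_ads(html)
# cleaned == "<p>content</p>"
```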

Error Handling and Resilience

Graceful Degradation

All extractors include error handling:
try:
    # Extraction logic
    titulo = soup.select_one('h1').text.strip()
except Exception as e:
    print(f"[ERROR] Falló extracción: {e}")
    titulo = 'No disponible'  # Fallback value

Timeout Configuration

http_client.py:12
response = _scraper.get(url, timeout=30)
The 30-second timeout balances tolerance for slow responses against the risk of hanging indefinitely.

Retry Logic

The system includes implicit retry through React Query on the frontend:
  • Automatic retries: 3 attempts by default
  • Exponential backoff: Increasing delays between retries
  • Error boundaries: Catch and display extraction failures
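The retries described above live in the frontend, but the same pattern can be sketched server-side. A hypothetical fetch_with_retries() wrapper (not part of the project code; the sleep function is injectable so the backoff can be tested without waiting):

```python
import time

def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `attempts` times with exponential backoff."""
    for attempt in range(attempts):
        result = fetch(url)
        if result is not None:
            return result
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * (2 ** attempt))
    return None
```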

Content Type Classification

The system automatically classifies content by analyzing CSS classes:
generic_extractor.py:36
tipo = 'pelicula' if 'movies' in articulo.get('class', []) else \
       'serie' if 'tvshows' in articulo.get('class', []) else 'Otro'

Movies

CSS class contains movies

Series

CSS class contains tvshows

Other

Neither class (typically anime)
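The inline conditional from the extractor can be read as a small function. A hypothetical clasificar_tipo() helper, equivalent to the expression above:

```python
def clasificar_tipo(css_classes):
    """Classify an article by its CSS classes, mirroring generic_extractor.py."""
    if "movies" in css_classes:
        return "pelicula"
    if "tvshows" in css_classes:
        return "serie"
    return "Otro"
```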

CSS Selector Strategy

The extractors use specific, resilient selectors:
Target      Selector                        Fallback
Title       div.data h1                     h1.entry-title
Synopsis    div[itemprop="description"]     None
Poster      div.poster img                  OG meta tags
Genres      div.sgeneros a                  Empty array
Episodes    #seasons .se-c li               Empty array
Player      .dooplay_player iframe          None
Selectors prioritize attribute selectors with semantic meaning (e.g., itemprop) over generic class names for better resilience.
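The resilience argument can be demonstrated directly: the two HTML fragments below are hypothetical theme variants where class names change between releases but the itemprop attribute stays stable:

```python
from bs4 import BeautifulSoup

# Two made-up theme versions: the wrapper class differs, itemprop does not
theme_a = '<div class="entry-meta"><span itemprop="dateCreated">2024-01-01</span></div>'
theme_b = '<div class="post-info"><span itemprop="dateCreated">2024-01-01</span></div>'

for html in (theme_a, theme_b):
    soup = BeautifulSoup(html, "html.parser")
    # The semantic selector survives the class-name change
    assert soup.select_one('[itemprop="dateCreated"]').text == "2024-01-01"
```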

Performance Optimizations

Reusable Scraper Instance

http_client.py:3
_scraper = cloudscraper.create_scraper()
Creating the scraper once and reusing it:
  • Maintains session cookies
  • Preserves browser fingerprint
  • Reduces initialization overhead
  • Improves bypass success rate

Selective Parsing

Extractors only parse needed elements:
# Only select article elements, not entire document
articulos = soup.select('article.item')

Early Returns

Functions return early on failures:
html = fetch_html(url)
if not html:
    return {"info": {}, "episodios": []}

Multi-Architecture Support

The scraping system works across different CPU architectures:

AMD64

Standard x86-64 servers and desktops

ARM64

Raspberry Pi 4, Apple Silicon, cloud instances

ARMv7

Older Raspberry Pi models
The Docker image compiles dependencies natively for each architecture, ensuring optimal performance.

Best Practices

Selectors:
  • Use specific selectors with semantic meaning
  • Implement multiple fallbacks for critical data
  • Test selectors against multiple pages
  • Document selector assumptions

Error handling:
  • Wrap extraction logic in try-except blocks
  • Log errors with context for debugging
  • Return structured empty data on failure
  • Use default values for optional fields

Image extraction:
  • Check data attributes before src
  • Detect and handle lazy loading
  • Use noscript as fallback
  • Fall back to meta tags (OG, Twitter)

Performance:
  • Reuse scraper instances
  • Parse only required elements
  • Implement timeouts
  • Cache responses where appropriate
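Caching is the one practice above with no counterpart in the project code. A minimal TTL-cache sketch (hypothetical; the clock is injectable so expiry can be tested without waiting):

```python
import time

class ResponseCache:
    """Tiny time-to-live cache for fetched pages. Illustrative only."""

    def __init__(self, ttl=300, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store = {}

    def get(self, url):
        entry = self._store.get(url)
        # Return the cached HTML only while it is younger than ttl seconds
        if entry and self.clock() - entry[0] < self.ttl:
            return entry[1]
        return None

    def put(self, url, html):
        self._store[url] = (self.clock(), html)
```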

Debugging Scraping Issues

Common Issues and Solutions

Symptom: Requests return 403 or challenge pages

Solutions:
  • Update cloudscraper library
  • Check if IP is rate-limited
  • Add delays between requests
  • Rotate user agents
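"Add delays between requests" can be done with a small pacer object. A hypothetical RequestPacer (not part of the project; clock and sleep are injectable so the pacing logic can be tested instantly):

```python
import time

class RequestPacer:
    """Enforce a minimum gap between consecutive requests."""

    def __init__(self, min_gap=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_gap = min_gap
        self.clock = clock
        self.sleep = sleep
        self._last = None

    def wait(self):
        # Sleep just long enough that min_gap seconds separate requests
        now = self.clock()
        if self._last is not None:
            remaining = self.min_gap - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
        self._last = self.clock()
```

Calling pacer.wait() before each fetch_html() call would throttle the scraper to one request per min_gap seconds.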

Catalog Browsing

See how extracted data populates the catalog

Search

Learn about search extraction

Video Player

Understand player iframe extraction
