Overview

Web Scraping Hub is built on a robust web scraping architecture that bypasses anti-bot protections, extracts structured data from HTML sources, and handles dynamic content. The system uses cloudscraper for Cloudflare bypass and BeautifulSoup for HTML parsing.

Core Technologies

Cloudscraper

Bypasses Cloudflare and anti-bot challenges automatically

BeautifulSoup

Parses HTML and extracts data with CSS selectors

Flask API

Serves extracted data through RESTful endpoints

Cloudflare Bypass System

The Challenge

Modern content sites use Cloudflare protection to prevent automated scraping:
  • JavaScript challenges
  • Cookie validation
  • Browser fingerprinting
  • Rate limiting
  • CAPTCHA systems

The Solution: Cloudscraper

The platform uses cloudscraper to automatically handle these protections:
http_client.py:1-33
import cloudscraper

_scraper = cloudscraper.create_scraper()

def fetch_html(url):
    """
    Obtiene el HTML de una URL usando cloudscraper.
    Supera Cloudflare y protecciones anti-bot.
    Retorna el texto de la respuesta o None en caso de error.
    """
    try:
        response = _scraper.get(url, timeout=30)
        if response.status_code == 200:
            return response.text
        print(f"[ERROR] fetch_html: status {response.status_code} para {url}")
    except Exception as e:
        print(f"[ERROR] fetch_html: {e}")
    return None

def fetch_json(url):
    """
    Obtiene JSON de una URL usando cloudscraper.
    Retorna el objeto JSON parseado o None en caso de error.
    """
    try:
        response = _scraper.get(url, timeout=30)
        if response.status_code == 200:
            return response.json()
        print(f"[ERROR] fetch_json: status {response.status_code} para {url}")
    except Exception as e:
        print(f"[ERROR] fetch_json: {e}")
    return None
The _scraper instance is created once and reused for all requests, maintaining session cookies and browser fingerprints for consistent bypass.

How Cloudscraper Works

  1. Initial Request: Cloudscraper sends a request that mimics a real browser
  2. Challenge Detection: Detects whether Cloudflare returned a challenge page
  3. JavaScript Execution: Solves JavaScript challenges automatically
  4. Cookie Management: Stores and sends the required cookies on subsequent requests
  5. Content Retrieval: Returns the actual page content after the bypass
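The defaults shown in http_client.py are what the project uses, but cloudscraper also accepts tuning options. A configuration sketch (the browser and delay values here are illustrative, not the project's settings):

```python
import cloudscraper

# Illustrative configuration only. browser= pins the emulated browser
# fingerprint used for challenge solving; delay= gives the JavaScript
# challenge solver extra time before submitting its answer.
scraper = cloudscraper.create_scraper(
    browser={"browser": "chrome", "platform": "linux", "desktop": True},
    delay=10,  # seconds to wait while solving an IUAM challenge
)
```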

Modular Extractor Architecture

The scraping system uses specialized extractors for different data types:

Generic Extractor

Handles catalog listings and basic content extraction.

File: backend/extractors/generic_extractor.py

Functions:
  • extraer_listado() - Extract catalog items
  • extraer_info_pelicula() - Extract movie metadata

Generic Extractor Deep Dive

Catalog Listing Extraction

The extraer_listado() function parses HTML to extract catalog items:
generic_extractor.py:4-51
def extraer_listado(html):
    soup = BeautifulSoup(html, 'html.parser')
    articulos = soup.select('article.item')
    datos = []
    
    for articulo in articulos:
        try:
            poster = articulo.select_one('.poster')
            enlace = articulo.select_one('a')['href']
            id_post = articulo.get('data-id', 'N/A')
            slug = enlace.rstrip('/').split('/')[-1]
            titulo = poster.select_one('h3').text.strip()
            
            # Image extraction with lazy-loading handling
            img_tag = poster.select_one('img')
            imagen = ''
            if img_tag:
                # Priority: data-srcset > data-src > data-lazy-src > src
                imagen = img_tag.get('data-srcset') or img_tag.get('data-src') or \
                         img_tag.get('data-lazy-src') or img_tag.get('src', '')
                
                # Fallback to noscript if placeholder detected
                if 'data:image' in imagen:
                    noscript = articulo.select_one('noscript img')
                    if noscript:
                        imagen = noscript.get('src', imagen)
                
                # Clean srcset format
                if ',' in imagen:
                    imagen = imagen.split(',')[0].split(' ')[0]
            
            # Extract metadata
            year = poster.select_one('.data p').text.strip()
            generos = poster.select_one('.data span').text.strip()
            idioma = 'Latino' if poster.select_one('.audio .latino') else 'Otro'
            tipo = 'pelicula' if 'movies' in articulo.get('class', []) else \
                   'serie' if 'tvshows' in articulo.get('class', []) else 'Otro'
            
            datos.append({
                "id": id_post,
                "slug": slug,
                "titulo": titulo,
                "imagen": imagen,
                "year": year,
                "generos": generos,
                "idioma": idioma,
                "tipo": tipo,
                "url": enlace
            })
        except Exception as e:
            print(f"[ERROR] Falló al parsear un artículo: {e}")
    
    return datos

Image Extraction Strategy

Modern websites use lazy loading to improve performance, requiring sophisticated image extraction:
  1. Check data attributes first - data-srcset, data-src, data-lazy-src
  2. Detect placeholders - Check if image is data:image base64 placeholder
  3. Search noscript fallback - Find <noscript> tag with real image
  4. Clean srcset format - Extract first URL from responsive image sets
  5. Fall back to src - Use standard src attribute if all else fails
generic_extractor.py:18-30
if img_tag:
    imagen = img_tag.get('data-srcset') or img_tag.get('data-src') or \
             img_tag.get('data-lazy-src') or img_tag.get('src', '')
    
    # If still a placeholder, look for noscript
    if 'data:image' in imagen:
        noscript = articulo.select_one('noscript img')
        if noscript:
            imagen = noscript.get('src', imagen)
    
    # Clean srcset (e.g., "image.jpg 300w, image2.jpg 600w")
    if ',' in imagen:
        imagen = imagen.split(',')[0].split(' ')[0]
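The five-step strategy can be factored into a standalone helper. This is a hypothetical pick_image_url() function (not part of the project code), operating on a plain dict of <img> attributes:

```python
def pick_image_url(attrs, noscript_src=None):
    """Resolve a usable image URL from lazy-loading attributes.

    attrs is a dict of the <img> tag's attributes; noscript_src is the
    src of a <noscript> fallback image, if one exists. Illustrative
    helper mirroring the strategy above.
    """
    # Steps 1 and 5: data attributes first, then plain src
    imagen = (attrs.get("data-srcset") or attrs.get("data-src")
              or attrs.get("data-lazy-src") or attrs.get("src", ""))

    # Steps 2 and 3: base64 placeholder detected, prefer the noscript image
    if "data:image" in imagen and noscript_src:
        imagen = noscript_src

    # Step 4: srcset format ("a.jpg 300w, b.jpg 600w"), keep the first URL
    if "," in imagen:
        imagen = imagen.split(",")[0].split(" ")[0]
    return imagen
```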

Movie Metadata Extraction

The extraer_info_pelicula() function extracts comprehensive movie details:
generic_extractor.py:54-90
def extraer_info_pelicula(html):
    soup = BeautifulSoup(html, 'html.parser')
    
    # Title extraction
    titulo = soup.select_one('div.data h1')
    titulo = titulo.text.strip() if titulo else 'No encontrado'
    
    # Synopsis extraction
    sinopsis_div = soup.find('div', itemprop='description')
    sinopsis = sinopsis_div.find('p').text.strip() if sinopsis_div and sinopsis_div.find('p') else ''
    
    # Release date
    fecha_estreno = soup.find('span', itemprop='dateCreated')
    fecha_estreno = fecha_estreno.text.strip() if fecha_estreno else ''
    
    # Genres extraction
    generos_div = soup.find('div', class_='sgeneros')
    generos = [a.text.strip() for a in generos_div.find_all('a')] if generos_div else []
    
    # Poster image with fallbacks
    poster_img = soup.select_one('div.poster img')
    imagen_poster = ''
    if poster_img:
        imagen_poster = poster_img.get('data-src') or poster_img.get('data-lazy-src') or \
                       poster_img.get('src', '')
        
        # Noscript fallback
        if 'data:image' in imagen_poster:
            noscript = soup.select_one('div.poster noscript img')
            if noscript:
                imagen_poster = noscript.get('src', imagen_poster)
    
    # OG Tags fallback
    if not imagen_poster or 'data:image' in imagen_poster:
        og_image = soup.find('meta', property='og:image')
        if og_image:
            imagen_poster = og_image.get('content', imagen_poster)
    
    return {
        'titulo': titulo,
        'sinopsis': sinopsis,
        'fecha_estreno': fecha_estreno,
        'generos': generos,
        'imagen_poster': imagen_poster
    }
The extractor chains multiple fallback mechanisms: direct attributes → noscript tags → Open Graph meta tags.
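The meta-tag end of the chain can be extended one step further with Twitter Card tags, which the excerpt above does not check. A minimal sketch (hypothetical helper, assuming BeautifulSoup is available):

```python
from bs4 import BeautifulSoup

def poster_from_meta(html, fallback=""):
    """Last-resort poster lookup: og:image first, then twitter:image.

    Illustrative only; the twitter:image check extends the excerpt above.
    """
    soup = BeautifulSoup(html, "html.parser")
    og = soup.find("meta", property="og:image")
    if og and og.get("content"):
        return og["content"]
    # "name" is a positional argument of find(), so pass it via attrs
    tw = soup.find("meta", attrs={"name": "twitter:image"})
    if tw and tw.get("content"):
        return tw["content"]
    return fallback
```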

Series Extractor Deep Dive

Episode and Season Extraction

The series extractor handles complex episodic structures:
serie_extractor.py:5-81
def extraer_episodios_serie(url):
    html = fetch_html(url)
    if not html:
        return {"info": {}, "episodios": []}
    
    soup = BeautifulSoup(html, 'html.parser')
    
    # Extract series metadata
    titulo = soup.select_one('div.data h1')
    titulo = titulo.text.strip() if titulo else ''
    
    sinopsis = soup.select_one('div[itemprop="description"].wp-content')
    sinopsis = sinopsis.text.strip() if sinopsis else ''
    
    generos_div = soup.find('div', class_='sgeneros')
    generos = [a.text.strip() for a in generos_div.find_all('a')] if generos_div else []
    
    # Extract poster with fallbacks
    poster_img = soup.select_one('div.poster img')
    imagen_poster = poster_img.get('data-src') or poster_img.get('src', '') if poster_img else ''
    
    # Extract episodes by season
    temporadas_divs = soup.select('#seasons .se-c')
    episodios_data = []
    fechas_episodios = []
    
    for temporada_div in temporadas_divs:
        num_temporada = int(temporada_div.get('data-season', 0))
        episodios = temporada_div.select('li')
        
        for episodio in episodios:
            try:
                enlace_episodio = episodio.select_one('a')['href']
                titulo_ep = episodio.select_one('.epst').text.strip()
                numerando = episodio.select_one('.numerando').text.strip()
                numero_ep = int(numerando.split('-')[-1].strip())
                fecha = episodio.select_one('.date').text.strip()
                
                img_ep = episodio.select_one('img')
                imagen = img_ep.get('data-src') or img_ep.get('src', '') if img_ep else ''
                
                episodios_data.append({
                    "temporada": num_temporada,
                    "episodio": numero_ep,
                    "titulo": titulo_ep,
                    "fecha": fecha,
                    "imagen": imagen,
                    "url": enlace_episodio
                })
                
                if fecha:
                    fechas_episodios.append(fecha)
            except Exception as e:
                print(f"⚠️ Error en episodio: {e}")
    
    # Use first episode date as series premiere date
    fecha_estreno = fechas_episodios[0] if fechas_episodios else ''
    
    info = {
        "titulo": titulo,
        "sinopsis": sinopsis,
        "generos": generos,
        "imagen_poster": imagen_poster,
        "fecha_estreno": fecha_estreno
    }
    
    return {"info": info, "episodios": episodios_data}

Season Structure Parsing

The extractor navigates complex HTML structures:
<div id="seasons">
  <div class="se-c" data-season="1">
    <ul class="episodios">
      <li>Episode 1</li>
      <li>Episode 2</li>
    </ul>
  </div>
  <div class="se-c" data-season="2">
    <!-- Season 2 episodes -->
  </div>
</div>
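This nesting maps directly onto the selectors used in extraer_episodios_serie(). A minimal sketch of walking the same structure with BeautifulSoup:

```python
from bs4 import BeautifulSoup

html = """
<div id="seasons">
  <div class="se-c" data-season="1">
    <ul class="episodios"><li>Episode 1</li><li>Episode 2</li></ul>
  </div>
  <div class="se-c" data-season="2">
    <ul class="episodios"><li>Episode 1</li></ul>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# data-season gives the season number; the <li> count gives the episodes
episodes_per_season = {
    int(div["data-season"]): len(div.select("li"))
    for div in soup.select("#seasons .se-c")
}
# episodes_per_season == {1: 2, 2: 1}
```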

Iframe Extractor Deep Dive

Player URL Extraction

The iframe extractor finds embedded video players:
iframe_extractor.py:6-23
def extraer_iframe_reproductor(url):
    html = fetch_html(url)
    if not html:
        print(f"❌ Error al acceder a: {url}")
        return None
    
    # Clean ads before parsing
    html_limpio = clean_html_ads(html)
    soup = BeautifulSoup(html_limpio, 'html.parser')
    
    # Find player iframe
    iframe = soup.select_one('.dooplay_player iframe')
    
    if iframe and iframe.get('src'):
        url_reproductor = iframe['src']
        return {
            "player_url": url_reproductor,
            "fuente": url_reproductor.split('/')[2],  # Extract domain
            "formato": "iframe"
        }
    else:
        print("⚠️ No se encontró iframe de reproducción.")
        return None
The iframe extractor requires ad cleaning before parsing to remove obfuscated ad overlays that might interfere with player detection.

Ad Blocking System

The platform includes server-side ad removal:
adblocker.py
def clean_html_ads(html):
    soup = BeautifulSoup(html, 'html.parser')
    
    # Ad selectors to remove
    ad_selectors = [
        '[id*="ad"]', '[class*="ad"]',
        '[id*="banner"]', '[class*="banner"]',
        '[id*="sponsor"]', '[class*="sponsor"]',
        '.advertisement', '.publicity',
        'iframe[src*="ads"]', 'iframe[src*="doubleclick"]'
    ]
    
    for selector in ad_selectors:
        for element in soup.select(selector):
            element.decompose()
    
    return str(soup)

The cleaner removes:
  • Banner advertisements
  • Sponsor content
  • Ad iframes
  • Tracking scripts
  • Promotional overlays
  • Video pre-roll ads
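The decompose-by-selector approach can be exercised on a tiny document. A trimmed reimplementation for illustration (two selectors instead of the full list above):

```python
from bs4 import BeautifulSoup

def clean_html_ads(html):
    # Same approach as adblocker.py, trimmed to two selectors
    soup = BeautifulSoup(html, "html.parser")
    for selector in ('[id*="banner"]', '[class*="sponsor"]'):
        for element in soup.select(selector):
            element.decompose()  # remove the node from the tree entirely
    return str(soup)

html = '<div id="top-banner">ad</div><p class="sponsor-box">ad</p><p>content</p>'
cleaned = clean_html_ads(html)
# cleaned == "<p>content</p>"
```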

Error Handling and Resilience

Graceful Degradation

All extractors include error handling:
try:
    # Extraction logic
    titulo = soup.select_one('h1').text.strip()
except Exception as e:
    print(f"[ERROR] Falló extracción: {e}")
    titulo = 'No disponible'  # Fallback value

Timeout Configuration

http_client.py:12
response = _scraper.get(url, timeout=30)
The 30-second timeout balances tolerance for slow responses against the risk of hanging indefinitely.

Retry Logic

The system includes implicit retry through React Query on the frontend:
  • Automatic retries: 3 attempts by default
  • Exponential backoff: Increasing delays between retries
  • Error boundaries: Catch and display extraction failures
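The retries described above live in the frontend, but the same pattern can be sketched server-side. A hypothetical fetch_with_retries() wrapper (not part of the project code; the sleep function is injectable so the backoff can be tested without waiting):

```python
import time

def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `attempts` times with exponential backoff."""
    for attempt in range(attempts):
        result = fetch(url)
        if result is not None:
            return result
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * (2 ** attempt))
    return None
```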

Content Type Classification

The system automatically classifies content by analyzing CSS classes:
generic_extractor.py:36
tipo = 'pelicula' if 'movies' in articulo.get('class', []) else \
       'serie' if 'tvshows' in articulo.get('class', []) else 'Otro'

Movies

CSS class contains movies

Series

CSS class contains tvshows

Other

Neither class (typically anime)
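The inline conditional from the extractor can be read as a small function. A hypothetical clasificar_tipo() helper, equivalent to the expression above:

```python
def clasificar_tipo(css_classes):
    """Classify an article by its CSS classes, mirroring generic_extractor.py."""
    if "movies" in css_classes:
        return "pelicula"
    if "tvshows" in css_classes:
        return "serie"
    return "Otro"
```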

CSS Selector Strategy

The extractors use specific, resilient selectors:
Target      Selector                        Fallback
Title       div.data h1                     h1.entry-title
Synopsis    div[itemprop="description"]     None
Poster      div.poster img                  OG meta tags
Genres      div.sgeneros a                  Empty array
Episodes    #seasons .se-c li               Empty array
Player      .dooplay_player iframe          None
Selectors prioritize attribute selectors with semantic meaning (e.g., itemprop) over generic class names for better resilience.
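The resilience argument can be demonstrated directly: the two HTML fragments below are hypothetical theme variants where class names change between releases but the itemprop attribute stays stable:

```python
from bs4 import BeautifulSoup

# Two made-up theme versions: the wrapper class differs, itemprop does not
theme_a = '<div class="entry-meta"><span itemprop="dateCreated">2024-01-01</span></div>'
theme_b = '<div class="post-info"><span itemprop="dateCreated">2024-01-01</span></div>'

for html in (theme_a, theme_b):
    soup = BeautifulSoup(html, "html.parser")
    # The semantic selector survives the class-name change
    assert soup.select_one('[itemprop="dateCreated"]').text == "2024-01-01"
```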

Performance Optimizations

Reusable Scraper Instance

http_client.py:3
_scraper = cloudscraper.create_scraper()
Creating the scraper once and reusing it:
  • Maintains session cookies
  • Preserves browser fingerprint
  • Reduces initialization overhead
  • Improves bypass success rate

Selective Parsing

Extractors only parse needed elements:
# Only select article elements, not entire document
articulos = soup.select('article.item')

Early Returns

Functions return early on failures:
html = fetch_html(url)
if not html:
    return {"info": {}, "episodios": []}

Multi-Architecture Support

The scraping system works across different CPU architectures:

AMD64

Standard x86-64 servers and desktops

ARM64

Raspberry Pi 4, Apple Silicon, cloud instances

ARMv7

Older Raspberry Pi models
The Docker image compiles dependencies natively for each architecture, ensuring optimal performance.

Best Practices

Selectors:
  • Use specific selectors with semantic meaning
  • Implement multiple fallbacks for critical data
  • Test selectors against multiple pages
  • Document selector assumptions

Error handling:
  • Wrap extraction logic in try-except blocks
  • Log errors with context for debugging
  • Return structured empty data on failure
  • Use default values for optional fields

Image extraction:
  • Check data attributes before src
  • Detect and handle lazy loading
  • Use noscript as fallback
  • Fall back to meta tags (OG, Twitter)

Performance:
  • Reuse scraper instances
  • Parse only required elements
  • Implement timeouts
  • Cache responses where appropriate
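Caching is the one practice above with no counterpart in the project code. A minimal TTL-cache sketch (hypothetical; the clock is injectable so expiry can be tested without waiting):

```python
import time

class ResponseCache:
    """Tiny time-to-live cache for fetched pages. Illustrative only."""

    def __init__(self, ttl=300, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store = {}

    def get(self, url):
        entry = self._store.get(url)
        # Return the cached HTML only while it is younger than ttl seconds
        if entry and self.clock() - entry[0] < self.ttl:
            return entry[1]
        return None

    def put(self, url, html):
        self._store[url] = (self.clock(), html)
```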

Debugging Scraping Issues

Common Issues and Solutions

Symptom: Requests return 403 or challenge pages

Solutions:
  • Update cloudscraper library
  • Check if IP is rate-limited
  • Add delays between requests
  • Rotate user agents
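"Add delays between requests" can be done with a small pacer object. A hypothetical RequestPacer (not part of the project; clock and sleep are injectable so the pacing logic can be tested instantly):

```python
import time

class RequestPacer:
    """Enforce a minimum gap between consecutive requests."""

    def __init__(self, min_gap=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_gap = min_gap
        self.clock = clock
        self.sleep = sleep
        self._last = None

    def wait(self):
        # Sleep just long enough that min_gap seconds separate requests
        now = self.clock()
        if self._last is not None:
            remaining = self.min_gap - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
        self._last = self.clock()
```

Calling pacer.wait() before each fetch_html() call would throttle the scraper to one request per min_gap seconds.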

Catalog Browsing

See how extracted data populates the catalog

Search

Learn about search extraction

Video Player

Understand player iframe extraction
