System Architecture

Web Scraping Hub is a full-stack web application designed to provide streaming and catalog functionality for movies, series, and anime content. The system follows a client-server architecture with clear separation between frontend and backend components.

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                        Client Layer                          │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  React Frontend (Vite + TypeScript)                  │  │
│  │  - React Router for navigation                       │  │
│  │  - TanStack Query for state management               │  │
│  │  - TailwindCSS for styling                          │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
                            ↕ HTTP/REST
┌─────────────────────────────────────────────────────────────┐
│                        Server Layer                          │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Flask Backend (Python)                              │  │
│  │  - RESTful API endpoints                            │  │
│  │  - CORS enabled                                      │  │
│  │  - Static file serving                              │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                      Processing Layer                        │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Extractor System                                    │  │
│  │  - Generic Extractor (listings & info)              │  │
│  │  - Series Extractor (episodes & metadata)           │  │
│  │  - IFrame Extractor (player URLs)                   │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                       External Layer                         │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  HTTP Client (CloudScraper)                         │  │
│  │  - Cloudflare bypass                                │  │
│  │  - Ad blocking                                      │  │
│  │  - HTML/JSON fetching                               │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Tech Stack

Backend Technologies

Flask

Python web framework for building the REST API

CloudScraper

HTTP client with Cloudflare bypass capabilities

BeautifulSoup4

HTML parsing and web scraping library

AdblockParser

Ad blocking rules for clean HTML extraction

Frontend Technologies

React 18

Modern UI library with hooks and Suspense

TypeScript

Type-safe JavaScript for better developer experience

TanStack Query

Data fetching and caching solution

React Router

Client-side routing with lazy loading

TailwindCSS

Utility-first CSS framework

Vite

Fast build tool and dev server

Data Flow

Request Flow

  1. User Interaction: User navigates to a catalog page or searches for content
  2. Frontend Request: React component triggers API call via TanStack Query hook
  3. Backend Processing: Flask receives request and routes to appropriate handler
  4. Data Extraction: Extractor modules fetch and parse external content
  5. Response: Processed data returned as JSON to frontend
  6. UI Update: React components re-render with new data

Example: Movie Catalog Flow
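
As a rough illustration of steps 3 to 5 for a movie catalog request, the sketch below wires a handler to an extractor. Every name in it (`ListingParser`, `extract_listing`, `handle_catalog`, `fake_fetch`) and the `<a class="item">` markup are assumptions made for the example. The real project uses Flask routes, CloudScraper, and BeautifulSoup4; here they are replaced by a plain function, a stub fetcher, and the standard-library `html.parser` so the code runs on its own.

```python
import json
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collects title/url pairs from <a class="item"> tags (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._href = None  # href of the <a class="item"> currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "item":
            self._href = attrs.get("href")

    def handle_data(self, data):
        # The first non-empty text inside the link is taken as the title.
        if self._href is not None and data.strip():
            self.items.append({"title": data.strip(), "url": self._href})
            self._href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


def extract_listing(html):
    """Step 4: parse fetched HTML into a list of catalog entries."""
    parser = ListingParser()
    parser.feed(html)
    return parser.items


def handle_catalog(section_url, fetch_html):
    """Steps 3-5: fetch the page, extract the listing, return a JSON body."""
    html = fetch_html(section_url)
    return json.dumps({"items": extract_listing(html)})


def fake_fetch(url):
    # Stand-in for the CloudScraper-backed HTTP client.
    return '<a class="item" href="/peliculas/demo">Demo Movie</a>'


response_body = handle_catalog("https://example.invalid/peliculas", fake_fetch)
```

Passing the fetcher in as an argument keeps the handler testable without network access; in the application itself the same role would be played by the shared HTTP client.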

Key Design Principles

Separation of Concerns

The architecture maintains clear boundaries between layers:
  • Presentation Layer: React components handle UI rendering
  • Business Logic Layer: Flask routes and extractors process data
  • Data Access Layer: HTTP client handles external requests

Modularity

Each component is self-contained and can be modified independently:
  • Extractors are modular and can be extended for new sources
  • Frontend hooks are organized by domain (API, UI, utils)
  • Backend utilities are separated by function
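
One way extractors can stay modular is a small registry behind a shared interface, so a new source only needs a new class. The sketch below is an assumption about how that could look, not the project's actual code: `Extractor`, `register`, `EXTRACTORS`, and `run_extractor` are hypothetical names, and the extractor bodies are placeholders.

```python
from abc import ABC, abstractmethod


class Extractor(ABC):
    """Hypothetical common interface for extractor modules."""

    @abstractmethod
    def extract(self, html: str) -> dict:
        ...


# Registry mapping a short name to an extractor class.
EXTRACTORS: dict[str, type[Extractor]] = {}


def register(name):
    """Class decorator that adds an extractor to the registry under `name`."""
    def decorator(cls):
        EXTRACTORS[name] = cls
        return cls
    return decorator


@register("generic")
class GenericExtractor(Extractor):
    def extract(self, html):
        # Placeholder logic: a real extractor would parse listings and info.
        return {"kind": "listing", "length": len(html)}


@register("iframe")
class IframeExtractor(Extractor):
    def extract(self, html):
        # Placeholder logic: a real extractor would resolve player URLs.
        return {"kind": "player", "has_iframe": "<iframe" in html}


def run_extractor(name, html):
    """Look up an extractor by name and run it on the given HTML."""
    return EXTRACTORS[name]().extract(html)
```

With this shape, adding a source means adding one decorated class; callers that go through `run_extractor` do not change.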

Performance Optimization

  • Lazy Loading: Frontend pages load on demand
  • Caching: TanStack Query caches API responses (5 min stale time)
  • Image Optimization: Lazy loading for catalog images
  • Preloading: First catalog image preloaded for better LCP

Error Handling

  • Backend returns consistent error responses
  • Frontend uses Error Boundaries for graceful degradation
  • Retry logic built into TanStack Query (3 retries)
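
The source does not show what a "consistent error response" looks like, so the following is only a sketch of one possible envelope. The field names (`ok`, `error`, `code`, `message`), the status codes chosen, and the helpers `error_response`, `success_response`, and `safe_call` are all assumptions for illustration.

```python
import json


def error_response(status, code, message):
    """Return (body, status) with the same JSON shape for every failure."""
    body = {"ok": False, "error": {"code": code, "message": message}}
    return json.dumps(body), status


def success_response(data):
    """Return (body, status) for a successful call."""
    return json.dumps({"ok": True, "data": data}), 200


def safe_call(handler, *args):
    """Run a handler and map exceptions onto the shared error shape."""
    try:
        return success_response(handler(*args))
    except ValueError as exc:
        # Treat ValueError as a problem with the caller's input.
        return error_response(400, "bad_request", str(exc))
    except Exception:
        # Do not leak internal details to the client.
        return error_response(502, "upstream_error", "Failed to fetch content")
```

A fixed shape like this lets the frontend decide whether a retry is worthwhile by inspecting the status and `code` rather than parsing free-form messages.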

Deployment Architecture

The application supports multiple deployment modes:

Development Mode

  • Backend runs on port 1234
  • Frontend dev server on port 5173
  • CORS enabled for cross-origin requests

Production Mode

  • Backend serves both API and static frontend files
  • Single port (1234) for entire application
  • Docker support for containerized deployment

Docker Deployment

Multi-architecture support:
  • AMD64
  • ARM64
  • ARMv7

The system is optimized for deployment on CasaOS but works on any Docker-compatible platform.

Configuration

Configuration is centralized in backend/config.py:
APP_VERSION = "1.4.8"
BASE_URL = "https://sololatino.net"

TARGET_URLS = [
    {"nombre": "Películas", "url": f"{BASE_URL}/peliculas"},
    {"nombre": "Series", "url": f"{BASE_URL}/series"},
    {"nombre": "Anime", "url": f"{BASE_URL}/animes"},
    # ... more sections
]
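
A short sketch of how other modules might consume this configuration. The values are copied from the snippet above so the example is self-contained (the elided sections are left out); `SECTIONS_BY_NAME` and `section_url` are hypothetical helpers, not part of `backend/config.py` as shown.

```python
BASE_URL = "https://sololatino.net"

TARGET_URLS = [
    {"nombre": "Películas", "url": f"{BASE_URL}/peliculas"},
    {"nombre": "Series", "url": f"{BASE_URL}/series"},
    {"nombre": "Anime", "url": f"{BASE_URL}/animes"},
]

# Index the sections by name once so lookups do not rescan the list.
SECTIONS_BY_NAME = {entry["nombre"]: entry["url"] for entry in TARGET_URLS}


def section_url(nombre):
    """Return the catalog URL for a section name, or None if it is unknown."""
    return SECTIONS_BY_NAME.get(nombre)
```

Because every URL is derived from `BASE_URL`, pointing the application at a different host should only require changing that one constant.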

Security Considerations

  • CORS: Configured to allow frontend communication
  • Ad Blocking: EasyList rules filter ad-related content out of extracted HTML, reducing exposure to malicious scripts
  • Anti-Bot Handling: CloudScraper handles the source site's anti-bot protections on outbound requests
  • Input Validation: URL parameters sanitized before processing
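
The source does not describe how URL parameters are sanitized. As one possible check, the sketch below accepts only absolute HTTPS URLs whose host matches the configured site; `ALLOWED_HOST`, the 2048-character limit, and `is_allowed_url` are assumptions chosen for the example.

```python
from urllib.parse import urlparse

ALLOWED_HOST = "sololatino.net"  # assumed to match the host of BASE_URL
MAX_URL_LENGTH = 2048  # arbitrary upper bound for this sketch


def is_allowed_url(raw):
    """Accept only absolute https URLs on the configured host or its subdomains."""
    if not isinstance(raw, str) or len(raw) > MAX_URL_LENGTH:
        return False
    try:
        parsed = urlparse(raw.strip())
        host = (parsed.hostname or "").lower()
    except ValueError:
        # urlparse can raise on malformed input such as a broken IPv6 literal.
        return False
    if parsed.scheme != "https":
        return False
    return host == ALLOWED_HOST or host.endswith("." + ALLOWED_HOST)
```

Comparing the parsed hostname, rather than searching the raw string, avoids accepting look-alike inputs such as `https://sololatino.net.evil.example/`.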

Scalability

The architecture supports horizontal scaling via Docker containers and leaves room for future additions:
  • Caching layer (Redis/Memcached)
  • Database integration for user data
  • CDN integration for static assets
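
To make the caching item concrete, here is a minimal in-process stand-in for the kind of interface a Redis- or Memcached-backed layer might expose. Nothing like this is described as existing in the project: `TTLCache` and `get_or_compute` are hypothetical, and the 300-second lifetime in the usage example is borrowed from the frontend's 5-minute stale time purely for illustration.

```python
import time


class TTLCache:
    """Tiny time-based cache; a sketch, not a replacement for Redis or Memcached."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for `key`, recomputing it once it has expired."""
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = compute()
        self._store[key] = (now + self._ttl, value)
        return value
```

Usage might look like `cache = TTLCache(300)` followed by `cache.get_or_compute(url, lambda: fetch(url))`, where `fetch` is whatever function performs the scrape. An in-process cache is not shared between containers, which is why an external store is the natural choice once the backend is scaled horizontally.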
