System Architecture
Web Scraping Hub is a full-stack web application designed to provide streaming and catalog functionality for movies, series, and anime content. The system follows a client-server architecture with clear separation between frontend and backend components.
Architecture Diagram
Tech Stack
Backend Technologies
- Flask: Python web framework for building the REST API
- CloudScraper: HTTP client with Cloudflare bypass capabilities
- BeautifulSoup4: HTML parsing and web scraping library
- AdblockParser: Ad blocking rules for clean HTML extraction
Frontend Technologies
- React 18: Modern UI library with hooks and Suspense
- TypeScript: Type-safe JavaScript for better developer experience
- TanStack Query: Data fetching and caching solution
- React Router: Client-side routing with lazy loading
- TailwindCSS: Utility-first CSS framework
- Vite: Fast build tool and dev server
Data Flow
Request Flow
- User Interaction: User navigates to a catalog page or searches for content
- Frontend Request: React component triggers API call via TanStack Query hook
- Backend Processing: Flask receives request and routes to appropriate handler
- Data Extraction: Extractor modules fetch and parse external content
- Response: Processed data returned as JSON to frontend
- UI Update: React components re-render with new data
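The backend half of this flow (backend processing, data extraction, and the JSON response) can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the `/api/catalog` path, the `extract_titles` helper, the `h2.title` markup, and the inline HTML standing in for a fetched page are all hypothetical, and the standard-library `html.parser` is used in place of BeautifulSoup4 so the sketch runs without dependencies.

```python
import json
from html.parser import HTMLParser

# Hypothetical page content; a real extractor would fetch this over HTTP.
SAMPLE_HTML = """
<div class="catalog">
  <h2 class="title">First Movie</h2>
  <h2 class="title">Second Movie</h2>
  <h2 class="ad">Sponsored</h2>
</div>
"""


class TitleParser(HTMLParser):
    """Collects the text of every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and dict(attrs).get("class") == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())


def extract_titles(html):
    """Data extraction step: parse raw HTML into a list of titles."""
    parser = TitleParser()
    parser.feed(html)
    return parser.titles


def handle_request(path):
    """Backend processing step: route a path and return (status, JSON body)."""
    if path == "/api/catalog":
        items = [{"title": t} for t in extract_titles(SAMPLE_HTML)]
        return 200, json.dumps({"items": items})
    return 404, json.dumps({"error": "not found"})
```

Calling `handle_request("/api/catalog")` returns a status code and a JSON string, which is the shape a frontend data-fetching hook would consume before triggering a re-render.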
Example: Movie Catalog Flow
Key Design Principles
Separation of Concerns
The architecture maintains clear boundaries between layers:
- Presentation Layer: React components handle UI rendering
- Business Logic Layer: Flask routes and extractors process data
- Data Access Layer: HTTP client handles external requests
Modularity
Each component is self-contained and can be modified independently:
- Extractors are modular and can be extended for new sources
- Frontend hooks are organized by domain (API, UI, utils)
- Backend utilities are separated by function
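One way to get this kind of extensibility is a small registry that maps a source name to an extractor class, so a new source is added by registering one more class. The sketch below is an assumption about how such modularity could look; `BaseExtractor`, `register`, `get_extractor`, and the two example extractors are hypothetical names, not the project's actual modules.

```python
from abc import ABC, abstractmethod

# Maps a source name to the extractor class that handles it.
EXTRACTORS = {}


def register(name):
    """Class decorator that adds an extractor to the registry."""
    def decorator(cls):
        EXTRACTORS[name] = cls
        return cls
    return decorator


class BaseExtractor(ABC):
    @abstractmethod
    def extract(self, html):
        """Return a list of item dicts parsed from raw HTML."""


@register("movies")
class MovieExtractor(BaseExtractor):
    def extract(self, html):
        # Placeholder parsing: one item per non-empty line.
        return [{"type": "movie", "title": line.strip()}
                for line in html.splitlines() if line.strip()]


@register("anime")
class AnimeExtractor(BaseExtractor):
    def extract(self, html):
        # Same placeholder parsing, tagged with a different type.
        return [{"type": "anime", "title": line.strip()}
                for line in html.splitlines() if line.strip()]


def get_extractor(name):
    """Look up and instantiate the extractor for a source name."""
    try:
        return EXTRACTORS[name]()
    except KeyError:
        raise ValueError(f"no extractor registered for {name!r}") from None
```

With this layout, route handlers depend only on `get_extractor` and the `extract` method, so adding a source does not require touching existing extractors.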
Performance Optimization
- Lazy Loading: Frontend pages load on demand
- Caching: TanStack Query caches API responses (5 min stale time)
- Image Optimization: Lazy loading for catalog images
- Preloading: First catalog image preloaded for better LCP
Error Handling
- Backend returns consistent error responses
- Frontend uses Error Boundaries for graceful degradation
- Retry logic built into TanStack Query (3 retries)
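A consistent error response usually means every failure shares one JSON shape, so the frontend can handle errors uniformly. The envelope below is only a sketch: the field names (`success`, `error`, `code`, `message`), the status codes chosen, and the `ExtractionError` class are assumptions, not the project's documented format.

```python
import json


class ExtractionError(Exception):
    """Hypothetical error raised when an external page cannot be parsed."""


def error_response(status, code, message):
    """Build a (status, JSON body) pair with a fixed error shape."""
    body = {"success": False, "error": {"code": code, "message": message}}
    return status, json.dumps(body)


def safe_handle(handler, *args):
    """Run a handler and convert known failures into the error envelope."""
    try:
        return 200, json.dumps({"success": True, "data": handler(*args)})
    except ExtractionError as exc:
        return error_response(502, "extraction_failed", str(exc))
    except ValueError as exc:
        return error_response(400, "bad_request", str(exc))


def failing_handler(url):
    """Example handler that always fails, to show the error path."""
    raise ExtractionError(f"could not parse {url}")
```

Because success and failure bodies share the `success` flag, a client can branch on one field before reading `data` or `error`.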
Deployment Architecture
The application supports multiple deployment modes:
Development Mode
- Backend runs on port 1234
- Frontend dev server on port 5173
- CORS enabled for cross-origin requests
Production Mode
- Backend serves both API and static frontend files
- Single port (1234) for entire application
- Docker support for containerized deployment
Docker Deployment
The system is optimized for deployment on CasaOS but works on any Docker-compatible platform.
Configuration
Configuration is centralized in backend/config.py.
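The contents of the configuration file are not reproduced here, so the following is only a guess at what a centralized configuration module might look like. Port 1234 and the dev-server origin on port 5173 come from the deployment notes above; the `Config` fields, `load_config`, and the environment variable names (`PORT`, `DEBUG`, `CORS_ORIGINS`) are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Port 1234 and the dev-server origin come from the deployment notes;
    # every field and environment variable name here is hypothetical.
    port: int = 1234
    debug: bool = False
    cors_origins: tuple = ("http://localhost:5173",)


def load_config(env=None):
    """Build a Config, letting environment variables override defaults."""
    env = os.environ if env is None else env
    origins = env.get("CORS_ORIGINS")
    return Config(
        port=int(env.get("PORT", Config.port)),
        debug=env.get("DEBUG", "0") == "1",
        cors_origins=tuple(origins.split(",")) if origins else Config.cors_origins,
    )
```

Keeping defaults in one frozen dataclass means the rest of the backend reads settings from a single immutable object instead of scattered `os.environ` lookups.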
Security Considerations
- CORS: Configured to allow frontend communication
- Ad Blocking: EasyList rules filter ad-related elements, including ad scripts, out of extracted HTML
- Anti-Bot Handling: CloudScraper handles Cloudflare anti-bot protections on outbound requests
- Input Validation: URL parameters sanitized before processing
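As one possible reading of the input-validation point, the sketch below checks a user-supplied URL parameter against an allowlist before the backend would fetch it. `ALLOWED_HOSTS` and `validate_source_url` are hypothetical names, and `example.com` is a placeholder host, not one of the project's sources.

```python
from urllib.parse import urlsplit

# Placeholder allowlist; a real deployment would list the scraped sources.
ALLOWED_HOSTS = {"example.com", "www.example.com"}


def validate_source_url(raw):
    """Return the cleaned URL, or raise ValueError if it is not acceptable."""
    if not isinstance(raw, str) or not raw.strip():
        raise ValueError("url parameter is required")
    parts = urlsplit(raw.strip())
    if parts.scheme not in ("http", "https"):
        raise ValueError("only http and https URLs are allowed")
    if parts.hostname not in ALLOWED_HOSTS:
        raise ValueError("host is not on the allowlist")
    # Drop any fragment; keep path and query as given.
    return parts._replace(fragment="").geturl()
```

Checking the parsed hostname rather than the raw string matters here: a URL such as `https://example.com@evil.test/` contains the allowed name but resolves to a different host, and it is rejected by this check.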
Scalability
The current architecture leaves room for:
- Horizontal scaling via Docker containers
- An added caching layer (Redis/Memcached)
- Database integration for user data
- CDN integration for static assets
Next Steps
Explore detailed documentation for each architectural component: