Overview
Universal Novel Scraper uses a multi-process architecture where Electron controls a Chromium browser to scrape content like a real user, then sends that data to a Python FastAPI backend for processing and EPUB generation. This page documents the complete scraping workflow from start to finish.
Architecture Diagram
Component Responsibilities
React Frontend
- User interface
- Download manager UI
- Library browser
- Settings & controls
Electron Main
- Browser window management
- IPC message routing
- Provider loading
- Scraping orchestration
Chromium Browser
- Page loading
- JavaScript execution
- DOM manipulation
- Cloudflare bypass
Provider Scripts
- Site-specific selectors
- Content extraction
- Navigation logic
Python Backend
- Chapter storage
- Progress tracking
- EPUB compilation
- File management
File System
- EPUB storage
- Job history
- Progress logs
Complete Scraping Flow
Phase 1: Novel Discovery
Phase 2: Novel Metadata Extraction
Extract Metadata
Provider script extracts:
- Novel description/synopsis
- Complete chapter list with titles and URLs
- Author name (if available)
- Cover image URL
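The metadata payload a provider returns might be shaped like the sketch below. The field names (`title`, `author`, `chapters`, and so on) are illustrative assumptions, not the actual UNS schema:

```python
# Illustrative shape of the metadata a provider script extracts.
# Field names are assumptions; only the listed contents (description,
# chapter list, author, cover URL) come from this page.
metadata = {
    "title": "Example Novel",
    "author": "Unknown",  # optional; not every site exposes it
    "description": "A synopsis of the novel...",
    "cover_url": "https://example.com/cover.jpg",
    "chapters": [  # complete chapter list with titles and URLs
        {"title": "Chapter 1", "url": "https://example.com/novel/1"},
        {"title": "Chapter 2", "url": "https://example.com/novel/2"},
    ],
}

def validate_metadata(meta: dict) -> bool:
    """Check the minimal fields later phases would need."""
    return bool(meta.get("title")) and len(meta.get("chapters", [])) > 0
```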
Phase 3: Scraping Initialization
User Configures Scrape
User sets scraping options:
- Start chapter (default: 1)
- End chapter (default: last)
- Enable Cloudflare bypass checkbox
- Show browser window toggle
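The documented defaults (start at chapter 1, end at the last chapter) can be resolved into a concrete range like this sketch; the function name is illustrative:

```python
def resolve_range(total_chapters: int, start=None, end=None):
    """Apply the documented defaults: start=1, end=last chapter."""
    start = 1 if start is None else start
    end = total_chapters if end is None else end
    if not (1 <= start <= end <= total_chapters):
        raise ValueError(f"invalid range {start}-{end} for {total_chapters} chapters")
    return start, end
```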
Phase 4: Chapter Scraping Loop
This is the core of the scraping process.
Cloudflare Detection
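A Cloudflare challenge page can be recognized heuristically from its HTML. The marker strings below are common Cloudflare interstitial phrases, but UNS's actual detection logic may differ (shown in Python for brevity; the real check runs on the Electron side):

```python
# Common strings seen on Cloudflare interstitial pages (assumed markers,
# not an exhaustive or official list).
CHALLENGE_MARKERS = (
    "just a moment",
    "checking your browser",
    "cf-challenge",
)

def looks_like_cloudflare_challenge(html: str) -> bool:
    """Heuristic: does this page look like a Cloudflare challenge
    rather than real chapter content?"""
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)
```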
Phase 5: EPUB Generation
Data Flow Summary
Request Flow
File Storage Structure
jobs_history.json Format
Progress File (.jsonl) Format
- Chapter title (string)
- Array of paragraphs (array of strings)
State Management
Global State Variables
State Transitions
Error Handling
Chapter Scraping Errors
- No content found: Provider selectors don’t match page structure
- Timeout: Page takes too long to load
- Network error: Connection issues
- Cloudflare timeout: Challenge not solved within 60 seconds
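The four failure categories above could be bucketed from a raised exception roughly like this sketch; UNS's real categorization is internal and may work differently:

```python
def classify_scrape_error(exc: Exception) -> str:
    """Bucket a chapter-scraping failure into the categories listed above.
    (Illustrative sketch, not UNS's actual logic.)"""
    msg = str(exc).lower()
    if "cloudflare" in msg:
        return "cloudflare-timeout"   # challenge not solved in time
    if isinstance(exc, TimeoutError):
        return "timeout"              # page took too long to load
    if isinstance(exc, (ConnectionError, OSError)):
        return "network-error"        # connection issues
    return "no-content"               # selectors found nothing usable
```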
Backend Errors
- 400: Missing required fields
- 404: File/job not found
- 500: Internal server error
Performance Characteristics
Scraping Speed
| Configuration | Chapters/Minute | Notes |
|---|---|---|
| Normal mode | 60-120 | 100-500ms delay between chapters |
| Cloudflare bypass | 15-40 | 1500-4000ms delay between chapters |
| Show browser | 50-100 | Similar to normal, slight overhead |
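The table's rates follow from per-chapter time plus the configured delay, and give a quick wall-clock estimate for a whole scrape (a rough sketch; `avg_fetch_ms` is an assumed parameter, not a UNS setting):

```python
def chapters_per_minute(avg_fetch_ms: float, delay_ms: float) -> float:
    """Throughput implied by average fetch time plus the inter-chapter delay."""
    return 60_000 / (avg_fetch_ms + delay_ms)

def estimate_minutes(num_chapters: int, rate: float) -> float:
    """Rough wall-clock estimate for scraping num_chapters at a given rate."""
    return num_chapters / rate
```

For example, a 200 ms fetch with a 300 ms delay yields the table's normal-mode upper bound of 120 chapters/minute.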
Resource Usage
- Memory: ~200-400 MB (Electron + Chromium + Python)
- CPU: Low (mostly waiting for network)
- Disk I/O: Minimal (small JSON writes, final EPUB write)
- Network: Depends on chapter size, typically 50-200 KB per chapter
Pause and Resume
UNS supports pausing and resuming scrapes.
Best Practices
Check Progress Files
During development, inspect the .jsonl progress files to debug extraction issues.
Monitor Backend Logs
Run the backend manually to see detailed logs.
Test with Small Ranges
When testing providers, scrape only 2-3 chapters first.
Use Show Browser
Enable “Show Browser” to watch real-time scraping and debug issues.
Troubleshooting Flow
When scraping fails, trace through the pipeline:
Page won't load
- Check if URL is valid
- Test URL in regular browser
- Check for Cloudflare (enable bypass)
- Check internet connection
No content extracted
- Enable “Show Browser” to see actual page
- Open DevTools on scraper window
- Test provider script in console
- Check if selectors match page HTML
- Verify page isn’t login-protected
Backend errors
- Check Python backend is running (port 8000)
- Review backend console logs
- Check file permissions on output directory
- Verify disk space available
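Checking that the backend answers on port 8000 can be scripted; this sketch probes FastAPI's auto-generated `/docs` route (an assumption — if UNS disables it, probe another known endpoint):

```python
from urllib.request import urlopen
from urllib.error import HTTPError

def backend_alive(base_url: str = "http://127.0.0.1:8000", timeout: float = 2.0) -> bool:
    """True if something answers HTTP at the backend's address."""
    try:
        urlopen(base_url + "/docs", timeout=timeout).close()
        return True
    except HTTPError:
        return True   # the server responded, even if with an error status
    except OSError:
        return False  # connection refused or timed out
```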
EPUB generation fails
- Check progress file exists and has content
- Verify ebooklib is installed
- Check for invalid characters in novel/chapter titles
- Review backend logs for stack trace
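Invalid characters in novel or chapter titles can be stripped before EPUB generation with a helper like this hypothetical sketch (UNS's actual cleaning rules are not documented here):

```python
import re

def sanitize_title(title: str) -> str:
    """Strip characters that commonly break file names or EPUB metadata.
    (Hypothetical helper, not UNS's actual implementation.)"""
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "", title)
    return cleaned.strip() or "untitled"
```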
Next Steps
Cloudflare Bypass
Learn how to handle Cloudflare challenges
Provider System
Create custom providers for new websites
Quick Start
Back to basics: Using UNS as a user
Troubleshooting
Common issues and solutions
