
Overview

The backend configuration is managed through config.py, which contains all core settings for the Web Scraping Hub application. This file controls application versioning, scraping targets, and API endpoints.

Configuration File Location

The main configuration file is located at:
backend/config.py

Application Configuration

Version Settings

APP_VERSION
string
default:"1.4.8"
Current version of the application. This is used for version checking and update notifications.
GITHUB_VERSION_URL
string
URL to check for the latest version available on GitHub.
GITHUB_VERSION_URL = "https://raw.githubusercontent.com/UnfairAdventage/Web-Scrapping/refs/heads/main/CurrentVersion"
GITHUB_CHANGES_URL
string
URL to fetch the changelog from GitHub.
GITHUB_CHANGES_URL = "https://raw.githubusercontent.com/UnfairAdventage/Web-Scrapping/refs/heads/main/Changes"
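The source does not show how the fetched version is compared with APP_VERSION. The following is a minimal sketch, assuming the remote CurrentVersion file holds a plain dotted version string such as "1.4.9". The helper names parse_version and is_update_available are hypothetical and are not taken from the application's code.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '1.4.8' into (1, 4, 8)."""
    return tuple(int(part) for part in text.strip().split("."))


def is_update_available(local: str, remote: str) -> bool:
    """Return True when the remote version is strictly newer than the local one."""
    return parse_version(remote) > parse_version(local)


if __name__ == "__main__":
    APP_VERSION = "1.4.8"  # value documented above
    # "1.4.10" stands in for the text fetched from GITHUB_VERSION_URL.
    print(is_update_available(APP_VERSION, "1.4.10"))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes: as strings, "1.4.10" sorts before "1.4.8".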

Scraping Configuration

Base URL

BASE_URL
string
default:"https://sololatino.net"
The base URL for all scraping operations. All target URLs are constructed relative to this base.
BASE_URL = "https://sololatino.net"
Changing BASE_URL affects every scraping operation. Make sure the new site uses the same page structure and URL patterns as the default.

Target URLs Configuration

TARGET_URLS
list[dict]
List of scraping targets. This defines all content categories available in the application. Each target is a dictionary with:
  • nombre (string): Display name for the section
  • url (string): Full URL to scrape
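A quick sanity check can catch malformed entries before the scraper runs. This is a sketch only: validate_targets is a hypothetical helper, and the two entries are placeholder paths, not the real ones from backend/config.py.

```python
from urllib.parse import urlparse

BASE_URL = "https://sololatino.net"  # default documented above

# Placeholder entries for illustration; the real paths live in backend/config.py.
TARGET_URLS = [
    {"nombre": "Example A", "url": f"{BASE_URL}/example-a"},
    {"nombre": "Example B", "url": f"{BASE_URL}/example-b"},
]


def validate_targets(targets):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for index, target in enumerate(targets):
        # Every entry needs both documented keys.
        missing = {"nombre", "url"} - set(target)
        if missing:
            problems.append(f"entry {index}: missing keys {sorted(missing)}")
            continue
        # The url must be absolute, since it is fetched as-is.
        parsed = urlparse(target["url"])
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"entry {index}: not an absolute http(s) URL: {target['url']!r}")
    return problems


if __name__ == "__main__":
    for problem in validate_targets(TARGET_URLS):
        print(problem)
```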

Flask Server Configuration

The Flask server is configured in backend/app.py with the following settings:

Server Settings

host
string
default:"0.0.0.0"
Server host address. 0.0.0.0 binds to all network interfaces, so the server accepts connections from other machines as well as from localhost.
app.run(debug=True, host='0.0.0.0', port="1234")
port
string
default:"1234"
Port number for the Flask server. The source passes it as a string; Flask converts it to an integer before binding.
debug
boolean
default:"true"
Enables Flask debug mode for development. Set it to False in production.
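One way to avoid editing backend/app.py for each environment is to read these three settings from environment variables. This is a sketch, not the application's method: the variable names APP_HOST, APP_PORT, and APP_DEBUG and the helper load_server_settings are assumptions. Unlike the documented default, it leaves debug off unless explicitly enabled.

```python
import os


def load_server_settings(env=None):
    """Build keyword arguments for app.run() from environment variables.

    APP_HOST, APP_PORT and APP_DEBUG are hypothetical names chosen for this
    sketch; they are not defined by the application.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("APP_HOST", "0.0.0.0"),
        "port": int(env.get("APP_PORT", "1234")),
        # Debug stays off unless explicitly switched on.
        "debug": env.get("APP_DEBUG", "false").lower() in ("1", "true", "yes"),
    }


if __name__ == "__main__":
    print(load_server_settings())
```

In backend/app.py the result could then be passed as app.run(**load_server_settings()).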

CORS Configuration

from flask_cors import CORS
CORS(app)
CORS is enabled to allow frontend-backend communication during development. Called with no extra arguments, CORS(app) accepts cross-origin requests from any origin on every route.

Caching Configuration

SEND_FILE_MAX_AGE_DEFAULT
integer
default:"0"
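Cache lifetime, in seconds, that Flask advertises to browsers for the files it serves. A value of 0 means browsers do not reuse a cached copy without checking with the server first.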
app.config['SEND_FILE_MAX_AGE_DEFAULT'] = 0
ETAG_DISABLED
boolean
default:"true"
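Intended to disable ETags for responses. ETAG_DISABLED is not one of Flask's built-in configuration keys, so it only has an effect if application code reads it.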
app.config['ETAG_DISABLED'] = True

HTTP Client Configuration

The application uses cloudscraper to bypass anti-bot protection:
import cloudscraper
_scraper = cloudscraper.create_scraper()

Request Timeouts

timeout
integer
default:"30"
Default timeout for HTTP requests in seconds.
response = _scraper.get(url, timeout=30)
The timeout for version checks is reduced to 5 seconds to prevent blocking during startup.
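The two timeouts can be kept in one place with a small wrapper. This is a sketch under stated assumptions: fetch_text, DEFAULT_TIMEOUT, and VERSION_TIMEOUT are hypothetical names, and the function accepts any object with a requests-style get(url, timeout=...) method, which includes the scraper returned by cloudscraper.create_scraper().

```python
DEFAULT_TIMEOUT = 30  # seconds, for scraping requests
VERSION_TIMEOUT = 5   # seconds, for the startup version check


def fetch_text(session, url, timeout=DEFAULT_TIMEOUT):
    """Return the response body, or None if the request fails or times out.

    `session` is any object exposing a requests-style .get(url, timeout=...).
    """
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # treat HTTP 4xx/5xx as failures
    except Exception:
        # Broad catch keeps this sketch dependency-free; real code would
        # normally catch requests.exceptions.RequestException instead.
        return None
    return response.text
```

A version check would then call fetch_text(_scraper, GITHUB_VERSION_URL, timeout=VERSION_TIMEOUT) and treat None as "could not check".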

Custom Configuration Example

To customize the configuration for your needs:
# backend/config.py

# APP CONFIG
APP_VERSION = "1.5.0"
GITHUB_VERSION_URL = "https://your-repo.com/version"
GITHUB_CHANGES_URL = "https://your-repo.com/changes"

# SCRAPING CONFIG
BASE_URL = "https://your-target-site.com"

TARGET_URLS = [
    {"nombre": "Movies", "url": f"{BASE_URL}/movies"},
    {"nombre": "TV Shows", "url": f"{BASE_URL}/tv-shows"},
    {"nombre": "Documentaries", "url": f"{BASE_URL}/docs"},
]

Modifying Server Settings

To change the server port or host, edit backend/app.py:
if __name__ == '__main__':
    app.run(
        debug=False,  # Set to False in production
        host='0.0.0.0',  # Allow external connections
        port=8080  # Your custom port (an int; Flask also accepts a numeric string)
    )
Always set debug=False in production. Debug mode enables the interactive debugger, which can execute arbitrary code on the server and exposes stack traces and source code to anyone who triggers an error.

Configuration Validation

To verify that your configuration works, start the server and query the API (the commands below assume the default port 1234):
# Test the API endpoints
curl http://localhost:1234/api/secciones

# Check version information
curl http://localhost:1234/api/version
