
Overview

Target URLs define the content sources that Web Scraping Hub will scrape. Each target represents a category of content (movies, series, anime, etc.) and is configured in the TARGET_URLS list in config.py.

Default Configuration

The default target URLs are configured for the SoloLatino.net platform:
BASE_URL = "https://sololatino.net"

TARGET_URLS = [
    {"nombre": "Películas", "url": f"{BASE_URL}/peliculas"},
    {"nombre": "Series", "url": f"{BASE_URL}/series"},
    {"nombre": "Anime", "url": f"{BASE_URL}/animes"},
    {"nombre": "Peliculas de Anime", "url": f"{BASE_URL}/genero/anime"},
    {"nombre": "Caricaturas", "url": f"{BASE_URL}/genre_series/toons"},
    {"nombre": "K-Drama", "url": f"{BASE_URL}/genre_series/kdramas/"},
    {"nombre": "Amazon", "url": f"{BASE_URL}/network/amazon"},
    {"nombre": "Apple TV", "url": f"{BASE_URL}/network/apple-tv"},
    {"nombre": "Disney", "url": f"{BASE_URL}/network/disney"},
    {"nombre": "HBO", "url": f"{BASE_URL}/network/hbo"},
    {"nombre": "HBO Max", "url": f"{BASE_URL}/network/hbo-max"},
    {"nombre": "Hulu", "url": f"{BASE_URL}/network/hulu"},
    {"nombre": "Netflix", "url": f"{BASE_URL}/network/netflix"},
]

Target URL Structure

Each target URL is a dictionary with the following fields:
nombre (string, required)
The display name for this content category. This name appears in the UI and is used in API endpoints. Examples: "Películas", "Series", "Netflix"
url (string, required)
The full URL to scrape for this category. Typically constructed from BASE_URL. Example: f"{BASE_URL}/peliculas"
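
Because both fields are required, a quick sanity check before starting the server can catch malformed entries. The following is a minimal sketch, not part of the project: validate_targets is a hypothetical helper, and the rules it applies (non-empty unique nombre, absolute http(s) url) are assumptions drawn from the field descriptions above.

```python
from urllib.parse import urlparse


def validate_targets(targets):
    """Return a list of problems found in a TARGET_URLS-style list (hypothetical helper)."""
    problems = []
    seen = set()
    for i, target in enumerate(targets):
        nombre = target.get("nombre")
        url = target.get("url")
        # 'nombre' must be a non-empty string and should not repeat.
        if not isinstance(nombre, str) or not nombre.strip():
            problems.append(f"entry {i}: 'nombre' must be a non-empty string")
        elif nombre in seen:
            problems.append(f"entry {i}: duplicate nombre {nombre!r}")
        else:
            seen.add(nombre)
        # 'url' must be an absolute http(s) URL with a host.
        parsed = urlparse(url) if isinstance(url, str) else None
        if parsed is None or parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"entry {i}: 'url' must be an absolute http(s) URL")
    return problems


BASE_URL = "https://sololatino.net"
sample = [
    {"nombre": "Películas", "url": f"{BASE_URL}/peliculas"},
    {"nombre": "Series", "url": "/series"},  # relative URL, so it is flagged
]
print(validate_targets(sample))
```

An empty list means every entry passed these checks; it says nothing about whether the remote page exists.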

Content Categories

Media Type Categories

These categories organize content by media type:
| Category | Name | URL Pattern |
|---|---|---|
| Movies | Películas | /peliculas |
| TV Series | Series | /series |
| Anime Series | Anime | /animes |
| Anime Movies | Peliculas de Anime | /genero/anime |
| Cartoons | Caricaturas | /genre_series/toons |
| Korean Dramas | K-Drama | /genre_series/kdramas/ |

Streaming Platform Categories

These categories filter content by streaming platform:
| Platform | URL Pattern |
|---|---|
| Amazon | /network/amazon |
| Apple TV | /network/apple-tv |
| Disney | /network/disney |
| HBO | /network/hbo |
| HBO Max | /network/hbo-max |
| Hulu | /network/hulu |
| Netflix | /network/netflix |

Adding New Target URLs

To add a new content category:
TARGET_URLS = [
    # ... existing targets ...
    {"nombre": "Paramount+", "url": f"{BASE_URL}/network/paramount"},
    {"nombre": "Documentales", "url": f"{BASE_URL}/documentales"},
]
1. Add to TARGET_URLS: add a new dictionary to the TARGET_URLS list in config.py.
2. Restart the Server: restart the Flask backend to apply the changes.
3. Verify in API: check that the new section appears in /api/secciones.
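
The verification step can also be scripted. This is a rough sketch under stated assumptions: the helper names are hypothetical, the response is assumed to have the {"secciones": [...]} shape documented for /api/secciones, and the backend address passed in (for example http://localhost:5000) is an assumption about a local setup rather than something this page specifies.

```python
import json
from urllib.request import urlopen


def seccion_presente(payload, nombre):
    """Return True if `nombre` is listed in a /api/secciones-style payload."""
    return nombre in payload.get("secciones", [])


def comprobar_seccion(api_base, nombre):
    """Fetch /api/secciones from a running backend and look for `nombre`.

    `api_base` is whatever address the Flask backend is served on (an assumption,
    e.g. "http://localhost:5000").
    """
    with urlopen(f"{api_base}/api/secciones") as resp:
        payload = json.load(resp)
    return seccion_presente(payload, nombre)
```

With the backend running, comprobar_seccion(api_base, "Paramount+") returning False after a restart usually points back to config.py.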

Pagination Handling

The application automatically appends a page segment to each target URL when paginating:
# Most categories append "/page/<n>" to the target URL.
# The K-Drama target already ends with "/", so only "page/<n>" is appended.
if seccion_real == 'K-Drama':
    url = f"{url}page/{pagina}"
else:
    url = f"{url}/page/{pagina}"
The K-Drama target URL in TARGET_URLS ends with a trailing slash (/genre_series/kdramas/), so the page segment is appended without an extra slash to avoid producing a double slash.
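
The K-Drama special case above depends on the section name. One alternative, shown here only as a sketch and not as the project's implementation, is to strip any trailing slash before appending the page segment, so that targets with and without a trailing slash produce the same shape of URL. build_page_url is a hypothetical helper name.

```python
def build_page_url(base, pagina):
    """Append /page/<n> to a section URL, tolerating a trailing slash on `base`."""
    return f"{base.rstrip('/')}/page/{pagina}"


print(build_page_url("https://sololatino.net/peliculas", 2))
print(build_page_url("https://sololatino.net/genre_series/kdramas/", 2))
```

For the K-Drama target this yields the same URL as the name-based branch, while also covering any future target that happens to be configured with a trailing slash.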

URL Normalization

The backend normalizes section names to handle case-insensitive and accent-insensitive matching:
import unicodedata

def normaliza(texto):
    # Decompose accented characters, drop the non-ASCII marks, then lowercase.
    return (
        unicodedata.normalize('NFKD', texto)
        .encode('ascii', 'ignore')
        .decode('ascii')
        .lower()
    )
This allows users to request "peliculas", "Películas", or "PELICULAS" and get the same results.
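
To illustrate how normalized names can be matched against the configured targets, here is a minimal sketch. normaliza follows the function shown above; buscar_seccion is a hypothetical lookup helper, and the actual matching code in the backend may be organized differently.

```python
import unicodedata


def normaliza(texto):
    # Decompose accented characters, drop the non-ASCII marks, then lowercase.
    return (
        unicodedata.normalize("NFKD", texto)
        .encode("ascii", "ignore")
        .decode("ascii")
        .lower()
    )


def buscar_seccion(nombre, targets):
    """Return the first target whose normalized name matches `nombre`, else None."""
    clave = normaliza(nombre)
    for target in targets:
        if normaliza(target["nombre"]) == clave:
            return target
    return None


targets = [
    {"nombre": "Películas", "url": "https://sololatino.net/peliculas"},
    {"nombre": "K-Drama", "url": "https://sololatino.net/genre_series/kdramas/"},
]
print(buscar_seccion("PELICULAS", targets))
```

Note that normalization only removes accents and case; punctuation such as the hyphen in "K-Drama" is preserved, so "kdrama" would not match.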

Custom Target URL Example

For a different scraping source:
# Custom configuration for a different site
BASE_URL = "https://mystreaming.com"

TARGET_URLS = [
    {"nombre": "Action Movies", "url": f"{BASE_URL}/genre/action"},
    {"nombre": "Comedy Series", "url": f"{BASE_URL}/series/comedy"},
    {"nombre": "Thriller", "url": f"{BASE_URL}/thriller"},
    {"nombre": "Kids Content", "url": f"{BASE_URL}/kids"},
]
When using a custom site, ensure the HTML structure matches the expected format for the extractors. You may need to modify the extractor functions in backend/extractors/.

API Integration

Target URLs are exposed through the API:

List All Sections

GET /api/secciones
Response:
{
  "secciones": [
    "Películas",
    "Series",
    "Anime",
    "Peliculas de Anime",
    "Caricaturas",
    "K-Drama",
    "Amazon",
    "Apple TV",
    "Disney",
    "HBO",
    "HBO Max",
    "Hulu",
    "Netflix"
  ]
}

Get Content from Section

GET /api/listado?seccion=Películas&pagina=1
Response:
{
  "resultados": [...],
  "seccion": "Películas",
  "pagina": 1
}
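
Section names such as "Películas" or "Apple TV" contain characters that should be percent-encoded in a query string. The sketch below builds the listing URL with the standard library; listado_url is a hypothetical helper, and the base address used in the example is an assumption about where the backend is served.

```python
from urllib.parse import urlencode


def listado_url(api_base, seccion, pagina=1):
    """Build a /api/listado URL with properly encoded query parameters."""
    query = urlencode({"seccion": seccion, "pagina": pagina})
    return f"{api_base}/api/listado?{query}"


# "http://localhost:5000" is an assumed local address, not taken from this page.
print(listado_url("http://localhost:5000", "Películas", 1))
print(listado_url("http://localhost:5000", "Apple TV", 2))
```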

URL Validation

To test a target URL configuration:
# Test in Python console
from backend.utils.http_client import fetch_html
from backend.extractors.generic_extractor import extraer_listado

url = "https://sololatino.net/peliculas"
html = fetch_html(url)
if html:
    resultados = extraer_listado(html)
    print(f"Found {len(resultados)} items")

Special URL Patterns

Search URLs

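Search functionality uses two URL patterns that differ from the section URLs:
from urllib.parse import quote_plus

# API search endpoint (the nonce is the value given here; it may need to be refreshed if the site rotates it)
url = f"https://sololatino.net/wp-json/dooplay/search/?keyword={quote_plus(query)}&nonce=84428a202e"

# Deep search endpoint
url = f"https://sololatino.net/?s={quote_plus(query)}"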
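
The two search patterns can be wrapped in small builders so the query is always encoded and the nonce is passed in rather than hard-coded. This is a sketch only: both function names are hypothetical, and how a fresh nonce is obtained is outside what this page describes.

```python
from urllib.parse import quote_plus, urlencode

BASE_URL = "https://sololatino.net"


def api_search_url(query, nonce, base_url=BASE_URL):
    """Build the JSON search endpoint URL; `nonce` must be supplied by the caller."""
    params = urlencode({"keyword": query, "nonce": nonce})
    return f"{base_url}/wp-json/dooplay/search/?{params}"


def deep_search_url(query, base_url=BASE_URL):
    """Build the site-wide search URL with the query percent-encoded."""
    return f"{base_url}/?s={quote_plus(query)}"


print(api_search_url("one piece", "84428a202e"))
print(deep_search_url("el señor"))
```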

Content-Specific URLs

When accessing specific content:
# Movie URL pattern
url = f"{BASE_URL}/peliculas/{slug}"

# Series URL pattern
url = f"{BASE_URL}/series/{slug}"

# Anime URL pattern
url = f"{BASE_URL}/animes/{slug}"

Troubleshooting

  • Verify the target URL is properly added to TARGET_URLS
  • Restart the Flask backend server
  • Check for syntax errors in config.py
  • Verify the URL is accessible
  • Check if the site structure has changed
  • Ensure the extractor matches the HTML structure
  • Test with fetch_html() function
  • Check if the URL pattern matches the site’s pagination format
  • Some sections may use different pagination patterns
  • Verify the URL construction in app.py
