Testing Guide - Web Scraping Hub

Overview

Web Scraping Hub includes a comprehensive test suite for the backend API and extractors. Tests are written using Python’s unittest framework and can also be run with pytest.

Test Location

All backend tests are located in:

backend/tests/
├── test_api.py           # API endpoint tests with mocking
├── test_api_real.py      # Real API integration tests
├── test_extractors.py    # Extractor unit tests
└── test_lazy_images.py   # Lazy image loading tests

Running Tests

Prerequisites

Ensure you have the development environment set up:

cd backend
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

Run All Tests

Using pytest (Recommended)

cd backend
pytest tests/

With verbose output:

pytest tests/ -v

With coverage report (the --cov options are provided by the pytest-cov plugin, which must be installed):

pytest tests/ --cov=backend --cov-report=html

Using unittest

cd backend
python -m unittest discover tests

Run with verbose output:

python -m unittest discover tests -v

Run Specific Test Files

# Run only API tests
pytest tests/test_api.py

# Run only extractor tests
pytest tests/test_extractors.py

# Run only lazy image tests
pytest tests/test_lazy_images.py

Run Specific Test Cases

# Run a specific test class
pytest tests/test_api.py::TestAPIListado

# Run a specific test method
pytest tests/test_api.py::TestAPIListado::test_busqueda_exitosa

Test Structure

API Tests (`test_api.py`)

Tests Flask API endpoints using mocked responses:

import unittest
from unittest.mock import patch
from backend.app import app

class TestAPIListado(unittest.TestCase):
    def setUp(self):
        """Set up test client before each test"""
        self.app = app.test_client()
        self.app.testing = True

    @patch('backend.app.fetch_json')
    def test_busqueda_exitosa(self, mock_fetch_json):
        """Test successful search functionality"""
        # Mock the external API response
        mock_fetch_json.return_value = {
            "123": {
                "title": "Pelicula de Prueba",
                "url": "https://sololatino.net/peliculas/prueba-slug",
                "img": "img.jpg",
                "type": "pelicula",
                "extra": {"date": "2023"}
            }
        }
        
        # Make request
        response = self.app.get('/api/listado?busqueda=prueba')
        data = response.get_json()
        
        # Assertions
        self.assertEqual(response.status_code, 200)
        self.assertEqual(data['seccion'], 'Busqueda')
        self.assertEqual(len(data['resultados']), 1)

Key test cases:

✅ test_busqueda_exitosa - Successful search results
✅ test_listado_seccion_exitosa - Section listing with pagination
✅ test_seccion_no_encontrada - 404 error handling

Extractor Tests (`test_extractors.py`)

Unit tests for HTML parsing and extraction logic:

import unittest
from extractors.generic_extractor import extraer_listado, extraer_info_pelicula

class TestExtractors(unittest.TestCase):
    
    def test_extraer_listado(self):
        """Test catalog listing extraction"""
        html_dummy = """
        <article class="item movies" data-id="123">
            <div class="poster">
                <img src="img.jpg" alt="Pelicula Prueba">
                <div class="data">
                    <h3>Pelicula Prueba</h3>
                    <p>2023</p>
                </div>
                <a href="https://sololatino.net/peliculas/prueba-slug"></a>
            </div>
        </article>
        """
        resultados = extraer_listado(html_dummy)
        
        self.assertEqual(len(resultados), 1)
        self.assertEqual(resultados[0]['titulo'], 'Pelicula Prueba')
        self.assertEqual(resultados[0]['slug'], 'prueba-slug')
        self.assertEqual(resultados[0]['year'], '2023')

Key test cases:

✅ test_extraer_listado - Catalog item extraction
✅ test_extraer_info_pelicula - Movie detail extraction (validates title, slug, year, type, synopsis, genres, images)

Lazy Image Tests (`test_lazy_images.py`)

Tests for handling lazy-loaded images with various fallback strategies:

import unittest
from extractors.generic_extractor import extraer_listado

class TestLazyImages(unittest.TestCase):
    
    PLACEHOLDER = "data:image/gif;base64,R0lGODdhAQABAPAAAMPDwwAAACwAAAAAAQABAAACAkQBADs="

    def test_generic_listado_lazy_image(self):
        """Test data-src attribute extraction"""
        html = f"""
        <article class="item movies">
            <div class="poster">
                <img src="{self.PLACEHOLDER}" 
                     data-src="https://real-image.com/poster.jpg" 
                     alt="Test">
                <a href="/peliculas/test"></a>
            </div>
        </article>
        """
        resultados = extraer_listado(html)
        img = resultados[0]['imagen']
        
        # Should extract real image, not placeholder
        self.assertNotEqual(img, self.PLACEHOLDER)
        self.assertEqual(img, "https://real-image.com/poster.jpg")

Key test cases:

✅ test_generic_listado_lazy_image - data-src extraction
✅ test_generic_listado_noscript_fallback - noscript fallback
✅ test_generic_info_lazy_image - data-lazy-src extraction
✅ test_generic_info_og_fallback - OpenGraph fallback
✅ test_serie_lazy_image - Series image extraction
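
The attribute-based part of these fallbacks can be sketched with the standard library alone. This is an illustration, not the project's extractor code: the function name resolve_image, the attribute order in LAZY_ATTRS, and the rule of skipping data: URIs are all assumptions, and the noscript and OpenGraph fallbacks are left out.

```python
from html.parser import HTMLParser

# Assumed lookup order: lazy-loading attributes first, plain src last.
LAZY_ATTRS = ("data-src", "data-lazy-src", "src")


class _FirstImage(HTMLParser):
    """Collect the attributes of the first <img> tag in the document."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        if tag == "img" and self.attrs is None:
            self.attrs = dict(attrs)


def resolve_image(html: str):
    """Return the first usable image URL, or None if there is none."""
    parser = _FirstImage()
    parser.feed(html)
    attrs = parser.attrs or {}
    for name in LAZY_ATTRS:
        value = attrs.get(name)
        # Inline data: URIs are treated as placeholders and skipped.
        if value and not value.startswith("data:"):
            return value
    return None
```

A test for a helper like this would assert that a placeholder src is ignored whenever a data-src value is present, mirroring test_generic_listado_lazy_image above.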

Real API Tests (`test_api_real.py`)

Integration tests that make actual HTTP requests to the target website. These tests:

Verify real-world scraping functionality
Test Cloudflare bypass
Validate actual HTML structure
May be slower and require an internet connection

Real API tests depend on external websites and may fail if the target site changes structure or is unavailable.
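
Because these tests depend on the network, one option is to make them opt-in so that ordinary runs stay fast and deterministic. The sketch below uses unittest.skipUnless with an environment variable. The variable name RUN_REAL_API_TESTS and the class TestRealAPI are hypothetical and not part of the project.

```python
import os
import unittest

# Hypothetical opt-in switch; the project does not define this variable.
RUN_REAL_TESTS = os.environ.get("RUN_REAL_API_TESTS") == "1"


@unittest.skipUnless(RUN_REAL_TESTS, "set RUN_REAL_API_TESTS=1 to run network tests")
class TestRealAPI(unittest.TestCase):
    def test_placeholder(self):
        # A real test would request the live site here.
        self.assertTrue(True)


def run_suite() -> unittest.TestResult:
    """Run the class above and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestRealAPI)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

With the variable unset, the test is reported as skipped rather than failed, so an unavailable target site does not break a local or CI run.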

Writing New Tests

Test Structure Template

import sys
import os
import unittest
from unittest.mock import patch, MagicMock

# Add backend to path
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from backend.app import app
from backend.extractors.generic_extractor import extraer_listado

class TestNewFeature(unittest.TestCase):
    
    def setUp(self):
        """Set up test fixtures before each test"""
        self.app = app.test_client()
        self.app.testing = True
    
    def tearDown(self):
        """Clean up after each test"""
        pass
    
    def test_feature_success(self):
        """Test successful feature execution"""
        # Arrange
        expected_result = "success"
        
        # Act
        result = some_function()
        
        # Assert
        self.assertEqual(result, expected_result)
    
    def test_feature_failure(self):
        """Test feature error handling"""
        with self.assertRaises(ValueError):
            some_function(invalid_input)

if __name__ == '__main__':
    unittest.main()

Testing Best Practices

1. Use Descriptive Test Names

# Good
def test_extraer_listado_returns_empty_list_for_invalid_html(self):
    pass

# Bad
def test_extractor(self):
    pass

2. Follow AAA Pattern

def test_example(self):
    # Arrange - Set up test data
    html = "<div>test</div>"
    
    # Act - Execute the function
    result = parse_html(html)
    
    # Assert - Verify the outcome
    self.assertEqual(result, expected)

3. Mock External Dependencies

@patch('backend.utils.http_client.fetch_html')
def test_with_mock(self, mock_fetch):
    mock_fetch.return_value = "<html>mocked</html>"
    result = function_that_uses_fetch()
    self.assertTrue(mock_fetch.called)
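
A self-contained variant of the same idea is sketched below. HttpClient and page_title are hypothetical stand-ins rather than project code. The example uses patch.object and checks the call arguments with assert_called_once_with, which is stricter than checking mock.called.

```python
import unittest
from unittest.mock import patch


class HttpClient:
    """Hypothetical stand-in for a real HTTP helper."""

    @staticmethod
    def fetch_html(url: str) -> str:
        raise RuntimeError("network access is not available in unit tests")


def page_title(url: str) -> str:
    """Return the text between <title> and </title>."""
    html = HttpClient.fetch_html(url)
    start = html.index("<title>") + len("<title>")
    return html[start:html.index("</title>")]


class TestPageTitle(unittest.TestCase):
    @patch.object(HttpClient, "fetch_html")
    def test_title_is_parsed_from_mocked_html(self, mock_fetch):
        mock_fetch.return_value = "<html><title>Mocked</title></html>"

        self.assertEqual(page_title("https://example.com"), "Mocked")
        # Verifies both that the mock was used and what it was called with.
        mock_fetch.assert_called_once_with("https://example.com")
```

The un-mocked helper raises on purpose, so a test that forgets to patch it fails loudly instead of silently touching the network.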

4. Test Edge Cases

def test_empty_input(self):
    """Test behavior with empty input"""
    result = extraer_listado("")
    self.assertEqual(result, [])

def test_malformed_html(self):
    """Test handling of malformed HTML"""
    result = extraer_listado("<div>unclosed")
    self.assertIsNotNone(result)
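
When many edge cases share one assertion, unittest's subTest keeps them in a single method while still reporting each failing input separately. The sketch below applies it to a hypothetical slug_from_url helper, which is an assumption for illustration and not the project's slug logic.

```python
import unittest


def slug_from_url(url: str) -> str:
    """Hypothetical helper: return the last non-empty path segment."""
    parts = [p for p in url.split("/") if p]
    return parts[-1] if parts else ""


class TestSlugEdgeCases(unittest.TestCase):
    def test_edge_cases(self):
        cases = [
            ("https://sololatino.net/peliculas/prueba-slug", "prueba-slug"),
            ("/peliculas/test/", "test"),  # trailing slash
            ("", ""),                      # empty input
            ("///", ""),                   # separators only
        ]
        for url, expected in cases:
            # Each failing case is reported with its own url value.
            with self.subTest(url=url):
                self.assertEqual(slug_from_url(url), expected)
```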

5. Use setUp and tearDown

class TestWithSetup(unittest.TestCase):
    
    def setUp(self):
        """Run before each test"""
        self.test_data = load_test_data()
    
    def tearDown(self):
        """Run after each test"""
        cleanup_test_data()
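
The template above relies on load_test_data and cleanup_test_data, which are placeholders. A runnable sketch of the same pattern, assuming the test data is an HTML file written to a temporary directory, could look like this. The file name and class name are illustrative only.

```python
import tempfile
import unittest
from pathlib import Path


class TestWithTempFixture(unittest.TestCase):
    def setUp(self):
        """Create a fresh temporary directory and fixture file per test."""
        self._tmp = tempfile.TemporaryDirectory()
        self.fixture = Path(self._tmp.name) / "listing.html"
        self.fixture.write_text(
            "<article class='item movies'></article>", encoding="utf-8"
        )

    def tearDown(self):
        """Remove the temporary directory after each test."""
        self._tmp.cleanup()

    def test_fixture_is_readable(self):
        self.assertIn("item movies", self.fixture.read_text(encoding="utf-8"))
```

Note that tearDown is not called if setUp itself raises. Registering the cleanup with self.addCleanup(self._tmp.cleanup) right after creating the directory covers that case as well.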

Test Coverage

Generate a coverage report (requires the pytest-cov plugin):

pytest tests/ --cov=backend --cov-report=html

View the report:

open htmlcov/index.html  # macOS
xdg-open htmlcov/index.html  # Linux
start htmlcov/index.html  # Windows

Current Coverage Areas

✅ API endpoints (/api/listado, /api/pelicula/*, /api/serie/*)
✅ Generic extractor (listings, details)
✅ Serie extractor (episodes, seasons)
✅ Lazy image loading (multiple fallback strategies)
✅ Error handling (404s, invalid sections)

Areas for Improvement

⚠️ Frontend component testing (consider adding Jest/Vitest)
⚠️ iframe extractor tests
⚠️ Integration tests for complete workflows
⚠️ Performance/load testing

Continuous Testing

Watch Mode

pytest has no built-in watch flag. To rerun tests automatically on file changes, install the pytest-watch plugin and use its ptw command:

pip install pytest-watch
ptw tests/

Pre-commit Hook

Create .git/hooks/pre-commit:

#!/bin/bash
cd backend
python -m pytest tests/
if [ $? -ne 0 ]; then
    echo "Tests failed. Commit aborted."
    exit 1
fi

Make it executable:

chmod +x .git/hooks/pre-commit

Debugging Tests

Print Debug Output

pytest tests/ -s  # Don't capture stdout

Run Specific Test with Debugging

if __name__ == '__main__':
    unittest.main(verbosity=2)

Use pdb for Interactive Debugging

import pdb

def test_with_debugger(self):
    result = some_function()
    pdb.set_trace()  # Debugger will pause here
    self.assertEqual(result, expected)

Test Data

Sample HTML Fixtures

Create reusable test fixtures:

# tests/fixtures.py
SAMPLE_LISTING_HTML = """
<article class="item movies" data-id="123">
    <div class="poster">
        <img src="image.jpg" alt="Movie Title">
        <div class="data">
            <h3>Movie Title</h3>
            <p>2023</p>
        </div>
        <a href="/peliculas/movie-slug"></a>
    </div>
</article>
"""

# tests/test_extractors.py
from tests.fixtures import SAMPLE_LISTING_HTML

def test_with_fixture(self):
    result = extraer_listado(SAMPLE_LISTING_HTML)
    self.assertEqual(len(result), 1)

Common Test Scenarios

Testing API Endpoints

def test_api_endpoint(self):
    response = self.app.get('/api/listado?seccion=Peliculas&pagina=1')
    self.assertEqual(response.status_code, 200)
    data = response.get_json()
    self.assertIn('resultados', data)
    self.assertIsInstance(data['resultados'], list)

Testing Extractors

def test_extractor_returns_correct_structure(self):
    html = "<html>...</html>"
    result = extraer_listado(html)
    
    self.assertIsInstance(result, list)
    if result:
        item = result[0]
        self.assertIn('titulo', item)
        self.assertIn('slug', item)
        self.assertIn('imagen', item)

Testing Error Handling

def test_invalid_section_returns_404(self):
    response = self.app.get('/api/listado?seccion=InvalidSection')
    self.assertEqual(response.status_code, 404)
    data = response.get_json()
    self.assertIn('error', data)

Always run tests before committing changes to ensure you haven’t broken existing functionality.

Frontend Testing

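While the current test suite focuses on the backend, you can add frontend tests. The toBeInTheDocument matcher used below comes from @testing-library/jest-dom, and rendering components requires a DOM environment such as jsdom, so install both alongside Vitest:

cd frontend/project
npm install --save-dev vitest jsdom @testing-library/react @testing-library/jest-dom

Create frontend/project/src/tests/App.test.tsx:

// @vitest-environment jsdom
import '@testing-library/jest-dom/vitest'
import { describe, it, expect } from 'vitest'
import { render, screen } from '@testing-library/react'
import App from '../App'

describe('App', () => {
  it('renders without crashing', () => {
    render(<App />)
    expect(screen.getByText(/Web Scraping Hub/i)).toBeInTheDocument()
  })
})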

Frontend testing infrastructure is not yet fully configured. This is a good area for contribution!
