
Overview

The scraping module extracts job posting data from LinkedIn. It follows the Strategy pattern, so scraper implementations can be swapped behind a common interface.

ScraperStrategy

Abstract base class that defines the interface for all scraper implementations.

Class Definition

from abc import ABC, abstractmethod
from typing import Dict

class ScraperStrategy(ABC):
    """Base interface for any job-posting scraper."""

Methods

extraer_datos()

Abstract method that must be implemented by all scraper strategies.
@abstractmethod
def extraer_datos(self, termino_busqueda: str) -> Dict:
    pass
Parameters:
  • termino_busqueda (str, required): The search term to query for job postings
Returns:
Dict containing the extraction results, with the following structure:
  • exito (bool): Whether the extraction was successful
  • titulo_oferta (str): Job posting title
  • url (str): URL of the job posting
  • habilidades_brutas (List[str]): Raw extracted skills/requirements
  • mensaje (str): Error message if extraction failed
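
To illustrate the contract above, here is a minimal sketch of a concrete strategy that returns canned data. `ScraperFalso`, `buscar`, and the sample values are hypothetical and not part of the module; the base class is restated only so the block runs on its own.

```python
from abc import ABC, abstractmethod
from typing import Dict


class ScraperStrategy(ABC):
    """Base interface, restated here so the sketch is self-contained."""

    @abstractmethod
    def extraer_datos(self, termino_busqueda: str) -> Dict:
        pass


class ScraperFalso(ScraperStrategy):
    """Hypothetical strategy that returns canned data (for illustration only)."""

    def extraer_datos(self, termino_busqueda: str) -> Dict:
        if not termino_busqueda.strip():
            # Failure shape: only exito and mensaje are populated.
            return {"exito": False, "mensaje": "Empty search term"}
        # Success shape: the four documented keys.
        return {
            "exito": True,
            "titulo_oferta": f"{termino_busqueda} (sample)",
            "url": "https://example.com/jobs/1",
            "habilidades_brutas": ["Python", "SQL"],
        }


def buscar(scraper: ScraperStrategy, termino: str) -> Dict:
    # Callers depend only on the abstract interface, so strategies are interchangeable.
    return scraper.extraer_datos(termino)
```

Because `buscar` accepts any `ScraperStrategy`, a stub like this could stand in for a real scraper in tests without opening a browser.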

LinkedInScraper

Concrete implementation of ScraperStrategy for extracting job data from LinkedIn using Selenium WebDriver.

Class Definition

from selenium import webdriver
from bs4 import BeautifulSoup
from .base_scraper import ScraperStrategy

class LinkedInScraper(ScraperStrategy):
    """LinkedIn-specific implementation for extracting job data."""

Constructor

def __init__(self):
    self.driver = None
Creates a LinkedInScraper instance with driver set to None. The WebDriver itself is created only when extraction begins.

Methods

extraer_datos()

Navigates to LinkedIn, handles modals, and extracts job posting data.
def extraer_datos(self, termino_busqueda: str) -> Dict:
Parameters:
  • termino_busqueda (str, required): Search term for job postings. The word “linkedin” is removed automatically if present.
Returns:
On success, a Dict containing:
  • exito (bool): True if extraction succeeded
  • titulo_oferta (str): Extracted job title
  • url (str): Final URL of the job posting
  • habilidades_brutas (List[str]): Raw list of skills and requirements extracted from the job description
On failure:
  • exito (bool): False
  • mensaje (str): Error description
Process Flow:
  1. Constructs LinkedIn search URL with encoded search term
  2. Initializes Chrome browser with anti-detection configuration
  3. Handles cookie consent and modal dismissals
  4. Navigates to the first job posting in search results
  5. Expands the full job description (“Show more” button)
  6. Extracts job title and description using BeautifulSoup
  7. Parses requirements from <li> elements, paragraphs, or raw text
  8. Returns raw extracted data without cleaning
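
Step 7 describes a fallback chain: list items first, then paragraphs, then raw text. The sketch below shows one way such a chain could work. It uses the standard library's `html.parser` instead of BeautifulSoup purely to stay dependency-free, and the names `_RecolectorTexto` and `extraer_requisitos` are assumptions, not the module's actual code.

```python
from html.parser import HTMLParser
from typing import List


class _RecolectorTexto(HTMLParser):
    """Collects the text of <li> and <p> elements, plus every raw text line."""

    def __init__(self):
        super().__init__()
        self.items: List[str] = []      # text of each <li>
        self.parrafos: List[str] = []   # text of each <p>
        self.todo: List[str] = []       # every non-empty text line, in order
        self._pila: List[str] = []      # currently open <li>/<p> tags
        self._buffer: List[str] = []    # text pieces of the innermost open tag

    def handle_starttag(self, tag, attrs):
        if tag in ("li", "p"):
            self._pila.append(tag)
            self._buffer = []

    def handle_endtag(self, tag):
        if self._pila and self._pila[-1] == tag:
            self._pila.pop()
            # Join the pieces and collapse runs of whitespace.
            texto = " ".join("".join(self._buffer).split())
            if texto:
                (self.items if tag == "li" else self.parrafos).append(texto)
            self._buffer = []

    def handle_data(self, data):
        self.todo.extend(l.strip() for l in data.splitlines() if l.strip())
        if self._pila:
            self._buffer.append(data)


def extraer_requisitos(html: str) -> List[str]:
    recolector = _RecolectorTexto()
    recolector.feed(html)
    recolector.close()
    # Prefer list items, then paragraphs, then raw text lines.
    return recolector.items or recolector.parrafos or recolector.todo
```

The result is returned as-is, matching step 8: no cleaning or filtering happens at this stage.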
Example Usage:
from scraping.linkedin_scraper import LinkedInScraper

scraper = LinkedInScraper()
resultado = scraper.extraer_datos("Python Developer")

if resultado['exito']:
    print(f"Title: {resultado['titulo_oferta']}")
    print(f"URL: {resultado['url']}")
    print(f"Skills found: {len(resultado['habilidades_brutas'])}")
else:
    print(f"Error: {resultado['mensaje']}")

_iniciar_navegador()

Private method that initializes the Selenium Chrome WebDriver with anti-detection settings.
def _iniciar_navegador(self):
Configuration:
  • Sets custom user agent to mimic real browser
  • Disables automation detection features
  • Maximizes browser window
  • Uses ChromeDriverManager for automatic driver management
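
The configuration listed above could look roughly like the following sketch. The user-agent string and the specific Chrome flags are assumptions chosen for illustration, and `iniciar_navegador` is a standalone hypothetical function; the module's real values are not shown in this reference. It requires the `selenium` and `webdriver-manager` packages at call time.

```python
# Assumed values for this sketch; the module's actual settings may differ.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
ARGUMENTOS_CHROME = [
    f"--user-agent={USER_AGENT}",
    "--disable-blink-features=AutomationControlled",
    "--start-maximized",
]


def iniciar_navegador():
    # Imported lazily so the constants above can be inspected without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    opciones = webdriver.ChromeOptions()
    for argumento in ARGUMENTOS_CHROME:
        opciones.add_argument(argumento)
    # Drop the automation switch and extension that Chrome adds under WebDriver.
    opciones.add_experimental_option("excludeSwitches", ["enable-automation"])
    opciones.add_experimental_option("useAutomationExtension", False)

    # ChromeDriverManager downloads a matching driver binary and returns its path.
    servicio = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=servicio, options=opciones)
```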

_cerrar_navegador()

Private method that safely closes the WebDriver instance.
def _cerrar_navegador(self):
Automatically called in the finally block of extraer_datos() to ensure cleanup.
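
The cleanup guarantee can be sketched as follows. `ScraperConLimpieza` and its `fabrica_driver` argument are hypothetical stand-ins that show only the driver lifecycle; the placeholder URL and the swallowed exception in `_cerrar_navegador` are assumptions about reasonable behavior, not a copy of the module's code.

```python
from typing import Callable, Dict


class ScraperConLimpieza:
    """Hypothetical skeleton showing only the driver lifecycle."""

    def __init__(self, fabrica_driver: Callable):
        self.driver = None
        # Callable returning a WebDriver-like object with get(), quit(), current_url.
        self._fabrica_driver = fabrica_driver

    def _cerrar_navegador(self) -> None:
        if self.driver is not None:
            try:
                self.driver.quit()
            except Exception:
                pass  # best effort: a failing quit() should not mask the result
            finally:
                self.driver = None

    def extraer_datos(self, termino_busqueda: str) -> Dict:
        try:
            self.driver = self._fabrica_driver()
            self.driver.get(f"https://example.com/search?q={termino_busqueda}")
            return {"exito": True, "url": self.driver.current_url}
        except Exception as error:
            return {"exito": False, "mensaje": str(error)}
        finally:
            # Runs on success and on failure, so the browser never leaks.
            self._cerrar_navegador()
```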

Important Notes

The scraper returns raw, unfiltered data in habilidades_brutas. Use the TextCleaner utility class to clean and filter this data before analysis.
LinkedIn may show cookie consent dialogs or login modals. The scraper automatically handles these with configurable wait times and fallback ESC key presses.
The search URL is hardcoded to filter for jobs in México. Modify line 46 in linkedin_scraper.py to change the location parameter.
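
One way to avoid a hardcoded location is to build the search URL from parameters. The sketch below assumes LinkedIn's public jobs search path and its `keywords` and `location` query parameters; `BASE_URL` and `construir_url` are hypothetical names, and the URL actually built in linkedin_scraper.py may differ.

```python
import re
from urllib.parse import urlencode

BASE_URL = "https://www.linkedin.com/jobs/search/"  # assumed endpoint


def construir_url(termino_busqueda: str, ubicacion: str = "México") -> str:
    # Drop the word "linkedin" (any casing) and collapse leftover whitespace.
    termino = re.sub(r"\blinkedin\b", "", termino_busqueda, flags=re.IGNORECASE)
    termino = " ".join(termino.split())
    # urlencode percent-encodes non-ASCII characters and turns spaces into "+".
    consulta = urlencode({"keywords": termino, "location": ubicacion})
    return f"{BASE_URL}?{consulta}"
```

Calling `construir_url("data analyst", ubicacion="Colombia")` would then target a different country without editing the source.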
