
Overview

The LinkedIn Job Analyzer uses Selenium WebDriver with Chrome to automate job data extraction from LinkedIn. The scraper navigates to LinkedIn job search pages, handles authentication modals, and extracts structured job information using BeautifulSoup.

Architecture

The scraping system follows the Strategy Pattern with two main components:
  • ScraperStrategy (base_scraper.py:4) - Abstract base class defining the scraper interface
  • LinkedInScraper (linkedin_scraper.py:15) - Concrete implementation for LinkedIn
from abc import ABC, abstractmethod
from typing import Dict

class ScraperStrategy(ABC):
    """Interfaz base para cualquier scraper de vacantes."""
    
    @abstractmethod
    def extraer_datos(self, termino_busqueda: str) -> Dict:
        """Debe retornar un diccionario con el éxito, título, url y habilidades brutas."""
        pass
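The pattern can be exercised with a stub strategy. This is a minimal sketch: `FakeScraper` and `analizar` are illustrative names, not part of the project, and the stub returns canned data instead of driving a browser.

```python
from abc import ABC, abstractmethod
from typing import Dict

class ScraperStrategy(ABC):
    """Base interface for any job-posting scraper."""

    @abstractmethod
    def extraer_datos(self, termino_busqueda: str) -> Dict:
        ...

# Hypothetical caller: any concrete strategy can be swapped in
# without changing the code that consumes it.
def analizar(scraper: ScraperStrategy, termino: str) -> Dict:
    return scraper.extraer_datos(termino)

class FakeScraper(ScraperStrategy):
    """Stub strategy used here only to illustrate the interface."""

    def extraer_datos(self, termino_busqueda: str) -> Dict:
        return {
            'exito': True,
            'titulo_oferta': termino_busqueda,
            'url': 'https://example.com',
            'habilidades_brutas': [],
        }

resultado = analizar(FakeScraper(), "python developer")
```

Because callers depend only on the `ScraperStrategy` interface, a scraper for another job board could be added without touching the analysis code.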

Browser Automation Setup

The scraper initializes Chrome with anti-detection configurations to avoid bot detection:
linkedin_scraper.py:21-32
def _iniciar_navegador(self):
    """Inicia el navegador Selenium (Código original de logica.py)"""
    options = webdriver.ChromeOptions()
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"        
    options.add_argument(f'user-agent={user_agent}')
    options.add_argument('--start-maximized')
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    servicio = ChromeService(ChromeDriverManager().install())
    self.driver = webdriver.Chrome(service=servicio, options=options) 
    self.driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")

Anti-Detection Features

  • Sets a realistic Chrome user-agent string to mimic a regular browser
  • Excludes the enable-automation switch and disables the useAutomationExtension option
  • Executes JavaScript so that navigator.webdriver returns undefined

Data Extraction Process

The scraper follows a multi-step process:

1. URL Construction

linkedin_scraper.py:44-46
termino_limpio = termino_busqueda.lower().replace("linkedin", "").strip()
termino_codificado = quote_plus(termino_limpio)
url = f"https://www.linkedin.com/jobs/search/?keywords={termino_codificado}&location=M%C3%A9xico"
The search term is cleaned, URL-encoded, and formatted into a LinkedIn jobs search URL targeting Mexico.
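For example, the cleaning and encoding steps transform a raw query as follows (the sample search term is illustrative):

```python
from urllib.parse import quote_plus

termino_busqueda = "LinkedIn Python Developer"

# Lowercase, drop the word "linkedin", and trim surrounding whitespace
termino_limpio = termino_busqueda.lower().replace("linkedin", "").strip()

# URL-encode: spaces become "+" via quote_plus
termino_codificado = quote_plus(termino_limpio)  # "python+developer"

url = f"https://www.linkedin.com/jobs/search/?keywords={termino_codificado}&location=M%C3%A9xico"
```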

2. Cookie and Modal Handling

The scraper automatically handles LinkedIn’s cookie consent banners and popup modals:
linkedin_scraper.py:54-68
# Cookie consent handling
try:
    cookie_wait = WebDriverWait(self.driver, 3)
    css_selector = "button[data-tracking-control-name='cookie-consent-accept'], button.artdeco-global-banner__accept"
    cookie_button = cookie_wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, css_selector)))
    cookie_button.click()
except Exception:
    pass  # Banner not present; continue

# Modal dismissal
try:
    wait_popup = WebDriverWait(self.driver, 3)
    close_button = wait_popup.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.modal__dismiss, button[aria-label='Dismiss']")))
    close_button.click()
except Exception:
    # Fall back to pressing Escape if no dismiss button is found
    try:
        self.driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.ESCAPE)
    except Exception:
        pass

3. First Job Selection

The scraper locates and navigates to the first job listing:
linkedin_scraper.py:70-84
if "/jobs/search/" in self.driver.current_url:
    print("[SCRAPER] Buscando enlace del primer empleo...")
    try:
        enlace_empleo = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 
            "ul.jobs-search__results-list li a.base-card__full-link," + 
            "a.job-search-card__url-overlay," +
            "div.job-search-card a"
        )))
        url_oferta_especifica = enlace_empleo.get_attribute("href")
        
        self.driver.get(url_oferta_especifica)
        time.sleep(3)

4. Description Expansion

LinkedIn truncates job descriptions by default. The scraper clicks “Show more” to reveal the full content.
linkedin_scraper.py:86-97
try:
    boton_ver_mas_selector = (By.CSS_SELECTOR,
        "button.show-more-less-html__button,"
        "button[aria-label='Show more, visually expands previously read content'],"
        "button.description__footer-button"
    )
    boton_ver_mas = WebDriverWait(self.driver, 5).until(EC.element_to_be_clickable(boton_ver_mas_selector))
    self.driver.execute_script("arguments[0].click();", boton_ver_mas)
    time.sleep(1)
except Exception:
    pass  # Description already fully expanded or button not present

5. HTML Parsing with BeautifulSoup

Once the page is fully loaded, BeautifulSoup extracts structured data:
linkedin_scraper.py:99-127
html = self.driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# Extract job title (fallback chain across LinkedIn's known layouts)
titulo = "Oferta sin título"
titulo_elem = soup.find('h1', class_=re.compile('top-card-layout__title|jobs-top-card__job-title'))
if not titulo_elem:
    titulo_elem = soup.find('a', class_=re.compile('top-card-layout__title'))
if titulo_elem:
    titulo = titulo_elem.get_text(strip=True)

# Extract skills and requirements
habilidades = []
description_container = soup.find('div', class_=re.compile(r'show-more-less-html__markup|description__text|job-description-content'))

if description_container:
    items = description_container.find_all('li')
    for item in items:
        habilidades.append(item.get_text(strip=True))
    
    if len(habilidades) < 3:
        parrafos = description_container.find_all('p')
        for p in parrafos:
            texto = p.get_text(strip=True)
            if len(texto) > 10: 
                habilidades.append(texto)
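The extraction and fallback logic can be seen on a small, self-contained sample (the HTML below is a made-up fragment mimicking LinkedIn's description container, not real page markup):

```python
import re
from bs4 import BeautifulSoup

# Illustrative sample only: a description with two <li> items and one <p>
html = """
<div class="show-more-less-html__markup">
  <p>We are hiring a backend engineer.</p>
  <ul><li>Python</li><li>Django</li></ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
container = soup.find('div', class_=re.compile(r'show-more-less-html__markup|description__text|job-description-content'))

# Primary path: collect bullet-point requirements
habilidades = [li.get_text(strip=True) for li in container.find_all('li')]

# Fallback: with fewer than 3 bullets, also harvest substantial paragraphs
if len(habilidades) < 3:
    habilidades += [p.get_text(strip=True) for p in container.find_all('p')
                    if len(p.get_text(strip=True)) > 10]
```

Here the two `<li>` items trigger the fallback, so the paragraph text is appended as a third raw entry.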

Return Format

The extraer_datos method returns a dictionary with the following structure:
linkedin_scraper.py:132-137
return {
    'exito': True,
    'titulo_oferta': titulo,
    'url': self.driver.current_url,
    'habilidades_brutas': habilidades 
}
  • exito (boolean): Indicates whether the scraping operation succeeded
  • titulo_oferta (string): The extracted job title
  • url (string): The final URL of the job listing
  • habilidades_brutas (list[str]): Raw list of extracted skills and requirements (unprocessed)
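A caller would typically branch on `exito` before using the payload. This sketch uses a hard-coded sample dictionary in place of a live scrape:

```python
# Hypothetical result, shaped like the dictionary extraer_datos() returns
resultado = {
    'exito': True,
    'titulo_oferta': 'Backend Developer',
    'url': 'https://www.linkedin.com/jobs/view/123',
    'habilidades_brutas': ['Python', 'SQL', 'Docker'],
}

if resultado['exito']:
    # Success path: title and raw skills are guaranteed to be present
    resumen = f"{resultado['titulo_oferta']}: {len(resultado['habilidades_brutas'])} raw skills"
else:
    # Failure path: only 'mensaje' is guaranteed (see Error Handling below)
    resumen = resultado.get('mensaje', 'Unknown error')
```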

Error Handling

The scraper wraps the entire extraction in a catch-all handler that reports the failure and guarantees cleanup:
linkedin_scraper.py:139-148
except Exception as e:
    import traceback
    traceback.print_exc()
    return {
        'exito': False,
        'mensaje': f"Error al extraer: {str(e)}"
    }

finally:
    self._cerrar_navegador()
The browser is always closed in the finally block to prevent resource leaks.
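The try/except/finally flow can be demonstrated with a simplified stub. `MiniScraper` mirrors the documented method names but replaces Selenium with a boolean flag, and the raised error simulates a scraping failure:

```python
class MiniScraper:
    """Stub mirroring the scraper's cleanup flow, without Selenium."""

    def __init__(self):
        self.navegador_abierto = False

    def _iniciar_navegador(self):
        self.navegador_abierto = True

    def _cerrar_navegador(self):
        self.navegador_abierto = False

    def extraer_datos(self, termino):
        try:
            self._iniciar_navegador()
            raise RuntimeError("simulated scraping failure")
        except Exception as e:
            return {'exito': False, 'mensaje': f"Error al extraer: {e}"}
        finally:
            # Runs even after the early return above, so the
            # browser is always closed
            self._cerrar_navegador()

s = MiniScraper()
res = s.extraer_datos("python")
```

Even though `extraer_datos` returns from inside the except block, the finally clause still runs, leaving no browser open.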

Best Practices

Wait Strategies

Uses explicit waits (WebDriverWait) instead of implicit waits for reliable element detection

Multiple Selectors

Targets multiple CSS selectors to handle LinkedIn’s varying page structures

Graceful Degradation

Wraps optional steps in try-except blocks to continue even if non-critical elements are missing

Resource Cleanup

Always closes the browser in the finally block to prevent zombie processes
