
Overview

The LinkedIn Job Analyzer uses Selenium WebDriver with Chrome to automate job data extraction from LinkedIn. The scraper navigates to LinkedIn job search pages, handles authentication modals, and extracts structured job information using BeautifulSoup.

Architecture

The scraping system follows the Strategy Pattern with two main components:
  • ScraperStrategy (base_scraper.py:4) - Abstract base class defining the scraper interface
  • LinkedInScraper (linkedin_scraper.py:15) - Concrete implementation for LinkedIn
from abc import ABC, abstractmethod
from typing import Dict

class ScraperStrategy(ABC):
    """Interfaz base para cualquier scraper de vacantes."""
    
    @abstractmethod
    def extraer_datos(self, termino_busqueda: str) -> Dict:
        """Debe retornar un diccionario con el éxito, título, url y habilidades brutas."""
        pass
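The pattern can be exercised with a stub strategy. This is a minimal sketch: `FakeScraper` and `analizar` are illustrative names, not part of the project, and the stub returns canned data instead of driving a browser.

```python
from abc import ABC, abstractmethod
from typing import Dict

class ScraperStrategy(ABC):
    """Base interface for any job-posting scraper."""

    @abstractmethod
    def extraer_datos(self, termino_busqueda: str) -> Dict:
        ...

# Hypothetical caller: any concrete strategy can be swapped in
# without changing the code that consumes it.
def analizar(scraper: ScraperStrategy, termino: str) -> Dict:
    return scraper.extraer_datos(termino)

class FakeScraper(ScraperStrategy):
    """Stub strategy used here only to illustrate the interface."""

    def extraer_datos(self, termino_busqueda: str) -> Dict:
        return {
            'exito': True,
            'titulo_oferta': termino_busqueda,
            'url': 'https://example.com',
            'habilidades_brutas': [],
        }

resultado = analizar(FakeScraper(), "python developer")
```

Because callers depend only on the `ScraperStrategy` interface, a scraper for another job board could be added without touching the analysis code.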

Browser Automation Setup

The scraper initializes Chrome with anti-detection configurations to avoid bot detection:
linkedin_scraper.py:21-32
def _iniciar_navegador(self):
    """Inicia el navegador Selenium (Código original de logica.py)"""
    options = webdriver.ChromeOptions()
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"        
    options.add_argument(f'user-agent={user_agent}')
    options.add_argument('--start-maximized')
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    servicio = ChromeService(ChromeDriverManager().install())
    self.driver = webdriver.Chrome(service=servicio, options=options) 
    self.driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")

Anti-Detection Features

  • Sets a realistic Chrome user-agent string to mimic a regular browser
  • Excludes the enable-automation switch and disables the useAutomationExtension option
  • Executes JavaScript so that navigator.webdriver returns undefined

Data Extraction Process

The scraper follows a multi-step process:

1. URL Construction

linkedin_scraper.py:44-46
termino_limpio = termino_busqueda.lower().replace("linkedin", "").strip()
termino_codificado = quote_plus(termino_limpio)
url = f"https://www.linkedin.com/jobs/search/?keywords={termino_codificado}&location=M%C3%A9xico"
The search term is cleaned, URL-encoded, and formatted into a LinkedIn jobs search URL targeting Mexico.
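For example, the cleaning and encoding steps transform a raw query as follows (the sample search term is illustrative):

```python
from urllib.parse import quote_plus

termino_busqueda = "LinkedIn Python Developer"

# Lowercase, drop the word "linkedin", and trim surrounding whitespace
termino_limpio = termino_busqueda.lower().replace("linkedin", "").strip()

# URL-encode: spaces become "+" via quote_plus
termino_codificado = quote_plus(termino_limpio)  # "python+developer"

url = f"https://www.linkedin.com/jobs/search/?keywords={termino_codificado}&location=M%C3%A9xico"
```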

2. Cookie and Modal Handling

The scraper automatically handles LinkedIn’s cookie consent banners and popup modals:
linkedin_scraper.py:54-68
# Cookie consent handling
try:
    cookie_wait = WebDriverWait(self.driver, 3)
    css_selector = "button[data-tracking-control-name='cookie-consent-accept'], button.artdeco-global-banner__accept"
    cookie_button = cookie_wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, css_selector)))
    cookie_button.click()
except Exception:
    pass  # Banner not present; continue

# Modal dismissal
try:
    wait_popup = WebDriverWait(self.driver, 3)
    close_button = wait_popup.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.modal__dismiss, button[aria-label='Dismiss']")))
    close_button.click()
except Exception:
    # Fall back to pressing Escape if no dismiss button is found
    try:
        self.driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.ESCAPE)
    except Exception:
        pass

3. First Job Selection

The scraper locates and navigates to the first job listing:
linkedin_scraper.py:70-84
if "/jobs/search/" in self.driver.current_url:
    print("[SCRAPER] Buscando enlace del primer empleo...")
    try:
        enlace_empleo = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 
            "ul.jobs-search__results-list li a.base-card__full-link," + 
            "a.job-search-card__url-overlay," +
            "div.job-search-card a"
        )))
        url_oferta_especifica = enlace_empleo.get_attribute("href")
        
        self.driver.get(url_oferta_especifica)
        time.sleep(3)

4. Description Expansion

LinkedIn truncates job descriptions by default. The scraper clicks “Show more” to reveal the full content.
linkedin_scraper.py:86-97
try:
    boton_ver_mas_selector = (By.CSS_SELECTOR,
        "button.show-more-less-html__button,"
        "button[aria-label='Show more, visually expands previously read content'],"
        "button.description__footer-button"
    )
    boton_ver_mas = WebDriverWait(self.driver, 5).until(EC.element_to_be_clickable(boton_ver_mas_selector))
    self.driver.execute_script("arguments[0].click();", boton_ver_mas)
    time.sleep(1)
except Exception:
    pass  # Description already fully expanded or button not present

5. HTML Parsing with BeautifulSoup

Once the page is fully loaded, BeautifulSoup extracts structured data:
linkedin_scraper.py:99-127
html = self.driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# Extract job title (fallback chain across LinkedIn's known layouts)
titulo = "Oferta sin título"
titulo_elem = soup.find('h1', class_=re.compile('top-card-layout__title|jobs-top-card__job-title'))
if not titulo_elem:
    titulo_elem = soup.find('a', class_=re.compile('top-card-layout__title'))
if titulo_elem:
    titulo = titulo_elem.get_text(strip=True)

# Extract skills and requirements
habilidades = []
description_container = soup.find('div', class_=re.compile(r'show-more-less-html__markup|description__text|job-description-content'))

if description_container:
    items = description_container.find_all('li')
    for item in items:
        habilidades.append(item.get_text(strip=True))
    
    if len(habilidades) < 3:
        parrafos = description_container.find_all('p')
        for p in parrafos:
            texto = p.get_text(strip=True)
            if len(texto) > 10: 
                habilidades.append(texto)
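The extraction and fallback logic can be seen on a small, self-contained sample (the HTML below is a made-up fragment mimicking LinkedIn's description container, not real page markup):

```python
import re
from bs4 import BeautifulSoup

# Illustrative sample only: a description with two <li> items and one <p>
html = """
<div class="show-more-less-html__markup">
  <p>We are hiring a backend engineer.</p>
  <ul><li>Python</li><li>Django</li></ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
container = soup.find('div', class_=re.compile(r'show-more-less-html__markup|description__text|job-description-content'))

# Primary path: collect bullet-point requirements
habilidades = [li.get_text(strip=True) for li in container.find_all('li')]

# Fallback: with fewer than 3 bullets, also harvest substantial paragraphs
if len(habilidades) < 3:
    habilidades += [p.get_text(strip=True) for p in container.find_all('p')
                    if len(p.get_text(strip=True)) > 10]
```

Here the two `<li>` items trigger the fallback, so the paragraph text is appended as a third raw entry.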

Return Format

The extraer_datos method returns a dictionary with the following structure:
linkedin_scraper.py:132-137
return {
    'exito': True,
    'titulo_oferta': titulo,
    'url': self.driver.current_url,
    'habilidades_brutas': habilidades 
}
  • exito (boolean): Indicates whether the scraping operation succeeded
  • titulo_oferta (string): The extracted job title
  • url (string): The final URL of the job listing
  • habilidades_brutas (list[str]): Raw list of extracted skills and requirements (unprocessed)
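A caller would typically branch on `exito` before using the payload. This sketch uses a hard-coded sample dictionary in place of a live scrape:

```python
# Hypothetical result, shaped like the dictionary extraer_datos() returns
resultado = {
    'exito': True,
    'titulo_oferta': 'Backend Developer',
    'url': 'https://www.linkedin.com/jobs/view/123',
    'habilidades_brutas': ['Python', 'SQL', 'Docker'],
}

if resultado['exito']:
    # Success path: title and raw skills are guaranteed to be present
    resumen = f"{resultado['titulo_oferta']}: {len(resultado['habilidades_brutas'])} raw skills"
else:
    # Failure path: only 'mensaje' is guaranteed (see Error Handling below)
    resumen = resultado.get('mensaje', 'Unknown error')
```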

Error Handling

The scraper wraps the entire extraction in a catch-all handler that reports the failure and guarantees cleanup:
linkedin_scraper.py:139-148
except Exception as e:
    import traceback
    traceback.print_exc()
    return {
        'exito': False,
        'mensaje': f"Error al extraer: {str(e)}"
    }

finally:
    self._cerrar_navegador()
The browser is always closed in the finally block to prevent resource leaks.
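The try/except/finally flow can be demonstrated with a simplified stub. `MiniScraper` mirrors the documented method names but replaces Selenium with a boolean flag, and the raised error simulates a scraping failure:

```python
class MiniScraper:
    """Stub mirroring the scraper's cleanup flow, without Selenium."""

    def __init__(self):
        self.navegador_abierto = False

    def _iniciar_navegador(self):
        self.navegador_abierto = True

    def _cerrar_navegador(self):
        self.navegador_abierto = False

    def extraer_datos(self, termino):
        try:
            self._iniciar_navegador()
            raise RuntimeError("simulated scraping failure")
        except Exception as e:
            return {'exito': False, 'mensaje': f"Error al extraer: {e}"}
        finally:
            # Runs even after the early return above, so the
            # browser is always closed
            self._cerrar_navegador()

s = MiniScraper()
res = s.extraer_datos("python")
```

Even though `extraer_datos` returns from inside the except block, the finally clause still runs, leaving no browser open.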

Best Practices

Wait Strategies

Uses explicit waits (WebDriverWait) instead of implicit waits for reliable element detection

Multiple Selectors

Targets multiple CSS selectors to handle LinkedIn’s varying page structures

Graceful Degradation

Wraps optional steps in try-except blocks to continue even if non-critical elements are missing

Resource Cleanup

Always closes the browser in the finally block to prevent zombie processes
