Overview
The LinkedIn Job Analyzer uses Selenium WebDriver with Chrome to automate job data extraction from LinkedIn. The scraper navigates to LinkedIn job search pages, handles authentication modals, and extracts structured job information using BeautifulSoup.
Architecture
The scraping system follows the Strategy Pattern with two main components:
ScraperStrategy (base_scraper.py:4) - Abstract base class defining the scraper interface
LinkedInScraper (linkedin_scraper.py:15) - Concrete implementation for LinkedIn
Base Strategy Interface
LinkedIn Implementation
from abc import ABC, abstractmethod
from typing import Dict

class ScraperStrategy(ABC):
    """Base interface for any job-posting scraper."""

    @abstractmethod
    def extraer_datos(self, termino_busqueda: str) -> Dict:
        """Must return a dictionary with the success flag, title, URL, and raw skills."""
        pass
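To illustrate the pattern, any concrete scraper can be swapped in behind this interface. The sketch below uses a hypothetical StubScraper (not part of the project; the real code plugs in LinkedInScraper) to show how a caller depends only on ScraperStrategy:

```python
from abc import ABC, abstractmethod
from typing import Dict

class ScraperStrategy(ABC):
    """Base interface for any job-posting scraper."""
    @abstractmethod
    def extraer_datos(self, termino_busqueda: str) -> Dict:
        pass

# Hypothetical stand-in used only for this example.
class StubScraper(ScraperStrategy):
    def extraer_datos(self, termino_busqueda: str) -> Dict:
        return {'exito': True,
                'titulo_oferta': f'Stub result for {termino_busqueda}',
                'url': 'https://example.com',
                'habilidades_brutas': []}

def analizar(scraper: ScraperStrategy, termino: str) -> Dict:
    # The caller never needs to know which concrete scraper it is using.
    return scraper.extraer_datos(termino)

resultado = analizar(StubScraper(), "python developer")
```

Swapping scrapers (say, for another job board) then only requires a new subclass; the calling code is untouched.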
Browser Automation Setup
The scraper initializes Chrome with anti-detection configurations to avoid bot detection:
linkedin_scraper.py:21-32
def _iniciar_navegador(self):
    """Starts the Selenium browser (original code from logica.py)."""
    options = webdriver.ChromeOptions()
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    options.add_argument(f'user-agent={user_agent}')
    options.add_argument('--start-maximized')
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    servicio = ChromeService(ChromeDriverManager().install())
    self.driver = webdriver.Chrome(service=servicio, options=options)
    self.driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
Anti-Detection Features
User Agent Spoofing
Sets a realistic Chrome user agent string to mimic a regular browser
Automation Flags Disabled
Removes the --enable-automation switch and disables the useAutomationExtension option
WebDriver Property Override
Executes JavaScript to make navigator.webdriver return undefined
Scraping Process
The extraer_datos method follows a multi-step process:
1. URL Construction
linkedin_scraper.py:44-46
termino_limpio = termino_busqueda.lower().replace("linkedin", "").strip()
termino_codificado = quote_plus(termino_limpio)
url = f"https://www.linkedin.com/jobs/search/?keywords={termino_codificado}&location=M%C3%A9xico"
The search term is cleaned, URL-encoded, and formatted into a LinkedIn jobs search URL targeting Mexico.
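The same cleaning and encoding steps can be exercised in isolation with the standard library. The helper name below is illustrative, not part of the source:

```python
from urllib.parse import quote_plus

def construir_url(termino_busqueda: str) -> str:
    # Lowercase the query, strip the word "linkedin", and trim whitespace.
    termino_limpio = termino_busqueda.lower().replace("linkedin", "").strip()
    # quote_plus turns spaces into '+' and escapes unsafe characters.
    termino_codificado = quote_plus(termino_limpio)
    return f"https://www.linkedin.com/jobs/search/?keywords={termino_codificado}&location=M%C3%A9xico"

url = construir_url("LinkedIn Python Developer")
# → https://www.linkedin.com/jobs/search/?keywords=python+developer&location=M%C3%A9xico
```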
2. Cookie & Modal Handling
The scraper automatically handles LinkedIn’s cookie consent banners and popup modals:
linkedin_scraper.py:54-68
# Cookie consent handling
try:
    cookie_wait = WebDriverWait(self.driver, 3)
    css_selector = "button[data-tracking-control-name='cookie-consent-accept'], button.artdeco-global-banner__accept"
    cookie_button = cookie_wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, css_selector)))
    cookie_button.click()
except:
    pass

# Modal dismissal
try:
    wait_popup = WebDriverWait(self.driver, 3)
    close_button = wait_popup.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.modal__dismiss, button[aria-label='Dismiss']")))
    close_button.click()
except:
    try:
        self.driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.ESCAPE)
    except:
        pass
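Both blocks share the same "attempt, and continue on failure" shape. That shape could be factored into a small helper; intentar below is a hypothetical refactoring sketch, not code from linkedin_scraper.py:

```python
from typing import Callable

def intentar(paso: Callable[[], None]) -> bool:
    """Run an optional step; return True if it succeeded, False if it raised."""
    try:
        paso()
        return True
    except Exception:
        return False

# Usage sketch: each optional UI interaction becomes one call, e.g.
#   intentar(lambda: cookie_button.click())
#   intentar(lambda: close_button.click()) or intentar(press_escape)

def paso_fallido():
    raise RuntimeError("element not found")

ok = intentar(lambda: None)      # step succeeds
fallo = intentar(paso_fallido)   # exception swallowed
```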
3. First Job Selection
The scraper locates and navigates to the first job listing:
linkedin_scraper.py:70-84
if "/jobs/search/" in self .driver.current_url:
print ( "[SCRAPER] Buscando enlace del primer empleo..." )
try :
enlace_empleo = wait.until( EC .presence_of_element_located((By. CSS_SELECTOR ,
"ul.jobs-search__results-list li a.base-card__full-link," +
"a.job-search-card__url-overlay," +
"div.job-search-card a"
)))
url_oferta_especifica = enlace_empleo.get_attribute( "href" )
self .driver.get(url_oferta_especifica)
time.sleep( 3 )
4. Description Expansion
LinkedIn truncates job descriptions by default. The scraper clicks “Show more” to reveal the full content.
linkedin_scraper.py:86-97
try:
    boton_ver_mas_selector = (By.CSS_SELECTOR,
        "button.show-more-less-html__button," +
        "button[aria-label='Show more, visually expands previously read content']," +
        "button.description__footer-button"
    )
    boton_ver_mas = WebDriverWait(self.driver, 5).until(EC.element_to_be_clickable(boton_ver_mas_selector))
    self.driver.execute_script("arguments[0].click();", boton_ver_mas)
    time.sleep(1)
except:
    pass
5. HTML Parsing with BeautifulSoup
Once the page is fully loaded, BeautifulSoup extracts structured data:
linkedin_scraper.py:99-127
html = self.driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# Extract job title
titulo = "Oferta sin título"
titulo_elem = soup.find('h1', class_=re.compile('top-card-layout__title|jobs-top-card__job-title'))
if not titulo_elem:
    titulo_elem = soup.find('a', class_=re.compile('top-card-layout__title'))
if titulo_elem:
    titulo = titulo_elem.get_text(strip=True)

# Extract skills and requirements
habilidades = []
description_container = soup.find('div', class_=re.compile(r'show-more-less-html__markup|description__text|job-description-content'))
if description_container:
    items = description_container.find_all('li')
    for item in items:
        habilidades.append(item.get_text(strip=True))
    if len(habilidades) < 3:
        parrafos = description_container.find_all('p')
        for p in parrafos:
            texto = p.get_text(strip=True)
            if len(texto) > 10:
                habilidades.append(texto)
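BeautifulSoup does the heavy lifting here, but the fallback heuristic itself (prefer list items, and add long paragraphs only when fewer than three list items are found) can be demonstrated with just the standard library's html.parser. This is an illustrative re-implementation, not the project's code:

```python
from html.parser import HTMLParser

class DescriptionExtractor(HTMLParser):
    """Collect the text of <li> and <p> elements from a description fragment."""
    def __init__(self):
        super().__init__()
        self._stack = []               # open li/p elements being accumulated
        self.items, self.parrafos = [], []

    def handle_starttag(self, tag, attrs):
        if tag in ('li', 'p'):
            self._stack.append([tag, ''])

    def handle_data(self, data):
        if self._stack:
            self._stack[-1][1] += data

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            etiqueta, texto = self._stack.pop()
            (self.items if etiqueta == 'li' else self.parrafos).append(texto.strip())

def extraer_habilidades(html: str) -> list:
    parser = DescriptionExtractor()
    parser.feed(html)
    habilidades = list(parser.items)
    if len(habilidades) < 3:           # same threshold as the scraper
        habilidades += [t for t in parser.parrafos if len(t) > 10]
    return habilidades

html = "<div><li>Python</li><li>SQL</li><p>Experience with REST APIs required.</p></div>"
# Only two <li> items here, so the long paragraph is appended as a fallback.
```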
The extraer_datos method returns a dictionary with the following structure:
linkedin_scraper.py:132-137
return {
    'exito': True,
    'titulo_oferta': titulo,
    'url': self.driver.current_url,
    'habilidades_brutas': habilidades
}
exito: Indicates whether the scraping operation succeeded
titulo_oferta: The extracted job title
url: The final URL of the job listing
habilidades_brutas: Raw list of extracted skills and requirements (unprocessed)
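A caller can branch on the exito flag to handle both outcomes. The processing function below is hypothetical; only the dictionary shape comes from the documented contract:

```python
from typing import Dict

def procesar_resultado(resultado: Dict) -> str:
    # Success: summarize the offer; failure: surface the error message.
    if resultado.get('exito'):
        n = len(resultado.get('habilidades_brutas', []))
        return f"{resultado['titulo_oferta']}: {n} raw skill entries"
    return f"Scrape failed: {resultado.get('mensaje', 'unknown error')}"

exito = procesar_resultado({'exito': True, 'titulo_oferta': 'Data Engineer',
                            'url': 'https://example.com',
                            'habilidades_brutas': ['SQL', 'Python']})
fallo = procesar_resultado({'exito': False, 'mensaje': 'Error al extraer: timeout'})
```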
Error Handling
The scraper wraps the entire extraction in a top-level exception handler with guaranteed cleanup:
linkedin_scraper.py:139-148
except Exception as e:
    import traceback
    traceback.print_exc()
    return {
        'exito': False,
        'mensaje': f"Error al extraer: {str(e)}"
    }
finally:
    self._cerrar_navegador()
The browser is always closed in the finally block to prevent resource leaks.
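That guarantee can be sketched with a stand-in driver; FakeDriver and scrape below are purely illustrative, showing that the finally block runs on both the success and the failure path:

```python
class FakeDriver:
    """Stand-in for the Selenium driver, recording whether quit() ran."""
    def __init__(self):
        self.closed = False
    def quit(self):
        self.closed = True

def scrape(driver, should_fail: bool) -> dict:
    try:
        if should_fail:
            raise RuntimeError("page layout changed")
        return {'exito': True}
    except Exception as e:
        return {'exito': False, 'mensaje': f"Error al extraer: {e}"}
    finally:
        driver.quit()  # executes before either return statement hands back control

d1, d2 = FakeDriver(), FakeDriver()
ok = scrape(d1, should_fail=False)
err = scrape(d2, should_fail=True)
# Both drivers end up closed regardless of outcome.
```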
Best Practices
Wait Strategies: Uses explicit waits (WebDriverWait) instead of implicit waits for reliable element detection.
Multiple Selectors: Targets multiple CSS selectors to handle LinkedIn’s varying page structures.
Graceful Degradation: Wraps optional steps in try-except blocks so scraping continues even if non-critical elements are missing.
Resource Cleanup: Always closes the browser in the finally block to prevent zombie processes.