Overview
The TextCleaner utility class provides static methods for processing raw text extracted from job postings. It removes noise, filters out irrelevant content, and deduplicates entries to produce clean, analysis-ready skill lists.
TextCleaner
Class Definition
import re
from typing import List

class TextCleaner:
    """Utility class responsible for processing, cleaning, and filtering text extracted from job postings."""

This is a utility class with only static methods. No instantiation is required.
Methods
limpiar_habilidades()
Static method that cleans, filters, and deduplicates a list of raw skill strings.
@staticmethod
def limpiar_habilidades(habilidades: List[str]) -> List[str]:

Parameter habilidades: raw list of strings extracted from job postings, typically the habilidades_brutas field returned by scrapers.
Returns: a cleaned and filtered list of unique skills/requirements, ready for AI analysis or export.
Cleaning Process
The method applies a multi-stage cleaning pipeline:
Normalize Whitespace
Collapses runs of spaces, tabs, and newlines into single spaces using a regex: limpia = re.sub(r'\s+', ' ', habilidad).strip()
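Run in isolation, this step behaves as below (a minimal sketch using only the regex shown above; the sample string is illustrative):

```python
import re

# Raw string with tabs, newlines, and repeated spaces
habilidad = "  Django\tframework \n\n  "
limpia = re.sub(r'\s+', ' ', habilidad).strip()
print(limpia)  # Django framework
```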
Remove Special Characters
Strips unwanted characters while preserving technical symbols: limpia = re.sub(r'[^\w\s\+\#\.\-_,\(\)]', '', limpia)
Preserved characters:
Word characters: \w (letters, digits, underscore)
Spaces: \s
Technical symbols: + # . - _
Delimiters: , ( )
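Applied on its own, the filter behaves as below (a standalone sketch; the sample inputs are illustrative):

```python
import re

# Characters outside \w, \s, and + # . - _ , ( ) are stripped
pattern = r'[^\w\s\+\#\.\-_,\(\)]'
print(re.sub(pattern, '', "C# (.NET)"))      # C# (.NET)   -- all symbols preserved
print(re.sub(pattern, '', "Price: $120k!"))  # Price 120k  -- colon, $, ! removed
```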
Length Validation
Filters out strings that are too short or too long:
if len(limpia) < 3 or len(limpia) > 250:
    continue
Minimum: 3 characters (removes empty/trivial strings)
Maximum: 250 characters (removes entire paragraphs)
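In isolation, the check keeps only strings of 3 to 250 characters (a minimal sketch; the candidates are illustrative):

```python
# Strings shorter than 3 or longer than 250 characters are dropped
candidates = ["ab", "SQL", "x" * 300]
kept = [s for s in candidates if 3 <= len(s) <= 250]
print(kept)  # ['SQL']
```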
Exclude Common UI Text
Removes words commonly found in page UI but not in actual skills:
palabras_excluir = [
    'click', 'show', 'more', 'less', 'see', 'view',
    'apply', 'job', 'description', 'ver', 'más'
]
Case-insensitive matching prevents “Click”, “SHOW”, etc. from passing through.
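The exact matching rule lives in text_cleaner.py; one plausible case-insensitive, whole-word check could look like this (es_ruido is a hypothetical helper, not part of the class):

```python
palabras_excluir = ['click', 'show', 'more', 'less', 'see', 'view',
                    'apply', 'job', 'description', 'ver', 'más']

def es_ruido(texto):
    # Hypothetical helper: flag strings containing any excluded word,
    # matched case-insensitively against whole words
    return any(p in texto.lower().split() for p in palabras_excluir)

print(es_ruido("Click here to APPLY"))  # True
print(es_ruido("Python programming"))   # False
```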
Deduplicate
Uses a case-insensitive set to track seen values:
if limpia.lower() not in vistas:
    vistas.add(limpia.lower())
    limpiadas.append(limpia)
This preserves the original casing of the first occurrence while preventing duplicates.
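The dedup step on its own (a minimal sketch with illustrative input):

```python
vistas = set()
limpiadas = []
for limpia in ["Python", "python", "PYTHON", "Django"]:
    if limpia.lower() not in vistas:
        vistas.add(limpia.lower())   # track the lowercase key
        limpiadas.append(limpia)     # keep the first occurrence's casing
print(limpiadas)  # ['Python', 'Django']
```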
Example Usage
Basic Cleaning
from utilidades.text_cleaner import TextCleaner
# Raw extracted data from scraper
habilidades_brutas = [
    "Python programming  ",
    "  Django framework \n\n  ",
    "PostgreSQL database",
    "C# and .NET Core",
    "React.js + TypeScript",
    "python programming",   # Duplicate (different case)
    "click here to apply",  # UI noise
    "5+ years experience",
    "ab",                   # Too short
    "see more",             # Excluded word
    "Strong communication skills and ability to work in agile teams with distributed members across multiple time zones while managing stakeholder expectations and delivering high-quality solutions..."  # Too long
]
limpiadas = TextCleaner.limpiar_habilidades(habilidades_brutas)
print(limpiadas)
# Output:
# [
# 'Python programming',
# 'Django framework',
# 'PostgreSQL database',
# 'C# and .NET Core',
# 'React.js + TypeScript',
# '5+ years experience'
# ]
Integration with Scraper
from scraping.linkedin_scraper import LinkedInScraper
from utilidades.text_cleaner import TextCleaner
# Extract raw data
scraper = LinkedInScraper()
resultado = scraper.extraer_datos("Backend Developer")
if resultado['exito']:
    # Clean the raw skills
    habilidades_brutas = resultado['habilidades_brutas']
    habilidades_limpias = TextCleaner.limpiar_habilidades(habilidades_brutas)
    print(f"Raw: {len(habilidades_brutas)} items")
    print(f"Cleaned: {len(habilidades_limpias)} items")
    print(f"Removed: {len(habilidades_brutas) - len(habilidades_limpias)} items")
Technical Symbols Preservation
The cleaner is designed to preserve programming language symbols:
# These pass through successfully:
texts = [
    "C++",          # Plus signs
    "C#",           # Hash symbol
    "Node.js",      # Dots
    "Vue.js",       # Dots
    "ASP.NET",      # Dots
    "Objective-C",  # Hyphens
    "snake_case",   # Underscores
    "API (REST)",   # Parentheses
    "Git, GitHub"   # Commas
]
result = TextCleaner.limpiar_habilidades(texts)
print(result)  # All preserved
Customization
You can modify the cleaning behavior by editing the source file:
Adjusting Length Limits
# In text_cleaner.py, line 34
if len(limpia) < 3 or len(limpia) > 250:  # Change these values
    continue
Decrease minimum (< 3) to allow abbreviations like “AI”, “ML”
Increase maximum (> 250) to preserve longer requirement descriptions
Adding Excluded Words
# In text_cleaner.py, lines 20-23
palabras_excluir = [
    'click', 'show', 'more', 'less', 'see', 'view',
    'apply', 'job', 'description', 'ver', 'más',
    # Add your custom exclusions:
    'sponsored', 'promoted', 'featured'
]
Adjusting Character Filtering
# In text_cleaner.py, line 31
# Current regex preserves: a-zA-Z0-9_, whitespace, and + # . - _ , ( )
limpia = re.sub(r'[^\w\s\+\#\.\-_,\(\)]', '', limpia)

# Example: also preserve ampersand and asterisk
limpia = re.sub(r'[^\w\s\+\#\.\-_,\(\)\&\*]', '', limpia)
Time Complexity: O(n), where n is the number of input strings. Each string is processed once in a single pass through the list.
Space Complexity: O(m), where m is the number of unique cleaned strings. A set is maintained for deduplication tracking.
Typical Performance:
Input: 100 raw strings
Processing time: < 5ms
Output: 40-70 cleaned strings (depending on duplication)
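To check these numbers on your own machine, you can time a pass directly. The limpiar_simple function below is a stand-in that re-implements only the normalize, length-check, and dedupe steps, so this sketch runs without the project installed; substitute TextCleaner.limpiar_habilidades to measure the real method.

```python
import re
import timeit

def limpiar_simple(items):
    # Stand-in for TextCleaner.limpiar_habilidades: whitespace
    # normalization, length validation, and case-insensitive dedup
    vistas, out = set(), []
    for h in items:
        limpia = re.sub(r'\s+', ' ', h).strip()
        if 3 <= len(limpia) <= 250 and limpia.lower() not in vistas:
            vistas.add(limpia.lower())
            out.append(limpia)
    return out

raw = ["Python ", " python", "Django\n", "SQL"] * 25  # 100 raw strings
ms = timeit.timeit(lambda: limpiar_simple(raw), number=1000) / 1000 * 1000
print(f"~{ms:.3f} ms per 100-item pass")
```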
Common Patterns
Pattern 1: Pipeline with Scraper and AI
from scraping.linkedin_scraper import LinkedInScraper
from utilidades.text_cleaner import TextCleaner
from inteligencia_artificial.gpt_analyzer import AIAnalyzer
# Extract → Clean → Analyze
scraper = LinkedInScraper()
analyzer = AIAnalyzer()
raw_data = scraper.extraer_datos("DevOps Engineer")
if raw_data['exito']:
    clean_skills = TextCleaner.limpiar_habilidades(raw_data['habilidades_brutas'])
    summary = analyzer.generar_resumen(raw_data['titulo_oferta'], clean_skills)
    print(summary)
Pattern 2: Quality Metrics
from utilidades.text_cleaner import TextCleaner
def analyze_cleaning_impact(raw_skills):
    clean_skills = TextCleaner.limpiar_habilidades(raw_skills)
    removed = len(raw_skills) - len(clean_skills)
    # Removal covers duplicates, noise, and length failures, not only duplicates
    removal_rate = (removed / len(raw_skills)) * 100
    print(f"Original: {len(raw_skills)}")
    print(f"Cleaned: {len(clean_skills)}")
    print(f"Removed: {removed} ({removal_rate:.1f}%)")
    print("\nFirst 5 cleaned:")
    for skill in clean_skills[:5]:
        print(f"  - {skill}")
# Usage
raw = scraper.extraer_datos("Software Engineer")['habilidades_brutas']
analyze_cleaning_impact(raw)
Best Practices
Always Clean Before Analysis
Raw scraped data contains noise that degrades AI analysis quality. Always run limpiar_habilidades() before passing data to GPT or other analyzers.
# ❌ Bad: skip cleaning
summary = analyzer.generar_resumen(titulo, raw_skills)

# ✅ Good: clean first
clean = TextCleaner.limpiar_habilidades(raw_skills)
summary = analyzer.generar_resumen(titulo, clean)
Avoid Redundant Cleaning
The JobService facade already calls TextCleaner internally, so don't clean again:
# ❌ Redundant
result = service.procesar_busqueda("Developer")
cleaned = TextCleaner.limpiar_habilidades(result['habilidades'])

# ✅ Already cleaned
result = service.procesar_busqueda("Developer")
skills = result['habilidades']  # Already cleaned by the service
Monitor Cleaning Effectiveness
Track how much data is being filtered out. High removal rates may indicate:
Scraper extracting too much UI text
Need to adjust exclusion words
Need to tune length limits
original = len(raw_skills)
cleaned = len(TextCleaner.limpiar_habilidades(raw_skills))
removal_rate = (1 - cleaned / original) * 100
if removal_rate > 70:
    print("⚠️ Warning: removing >70% of data")
Dependencies
import re
from typing import List
No external packages required. Uses only Python standard library.