
Overview

The TextCleaner utility class provides static methods for processing raw text extracted from job postings. It removes noise, filters out irrelevant content, and deduplicates entries to produce clean, analysis-ready skill lists.

TextCleaner

Class Definition

import re
from typing import List

class TextCleaner:
    """Clase de utilidad responsable de procesar, limpiar y filtrar el texto extraído de las vacantes."""
This is a utility class with only static methods. No instantiation is required.

Methods

limpiar_habilidades()

Static method that cleans, filters, and deduplicates a list of raw skill strings.
@staticmethod
def limpiar_habilidades(habilidades: List[str]) -> List[str]:
Parameters
  • habilidades (List[str], required): Raw list of strings extracted from job postings. Typically the habilidades_brutas field returned by scrapers.

Returns
  • List[str]: Cleaned and filtered list of unique skills and requirements, ready for AI analysis or export.

Cleaning Process

The method applies a multi-stage cleaning pipeline:
1. Normalize Whitespace

Collapses multiple spaces, tabs, and newlines into single spaces using regex:
limpia = re.sub(r'\s+', ' ', habilidad).strip()
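A quick illustration of this step on its own:

```python
import re

habilidad = "  Python\t\tprogramming \n bootcamp  "
limpia = re.sub(r'\s+', ' ', habilidad).strip()
print(limpia)  # Python programming bootcamp
```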
2. Remove Special Characters

Strips unwanted characters while preserving technical symbols. Note the escaped hyphen (\-), which prevents Python from interpreting .-_ as a character range:
limpia = re.sub(r'[^\w\s+#.\-_,()]', '', limpia)
Preserved characters:
  • Word characters: \w (letters, digits, underscore)
  • Spaces: \s
  • Technical symbols: +, #, ., -, _
  • Delimiters: ,, (, )
3. Length Validation

Filters out strings that are too short or too long:
if len(limpia) < 3 or len(limpia) > 250:
    continue
  • Minimum: 3 characters (removes empty/trivial strings)
  • Maximum: 250 characters (removes entire paragraphs)
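A small helper (hypothetical name, mirroring the guard above) makes the boundaries easy to check:

```python
def longitud_valida(texto: str) -> bool:
    # Same bounds as the guard above: at least 3, at most 250 characters
    return 3 <= len(texto) <= 250

print(longitud_valida("ab"))       # False: too short
print(longitud_valida("Python"))   # True
print(longitud_valida("x" * 300))  # False: too long
```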
4. Exclude Common UI Text

Removes words commonly found in page UI but not actual skills:
palabras_excluir = [
    'click', 'show', 'more', 'less', 'see', 'view',
    'apply', 'job', 'description', 'ver', 'más'
]
Case-insensitive matching prevents “Click”, “SHOW”, etc. from passing through.
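The snippets above don't show the comparison itself; one plausible implementation (whole-word, case-insensitive matching is an assumption here; the source may use substring matching instead) is:

```python
palabras_excluir = [
    'click', 'show', 'more', 'less', 'see', 'view',
    'apply', 'job', 'description', 'ver', 'más'
]

def contiene_palabra_excluida(texto: str) -> bool:
    # Lowercase and split into words, then test against the exclusion list
    return any(palabra in palabras_excluir for palabra in texto.lower().split())

print(contiene_palabra_excluida("Click here"))   # True
print(contiene_palabra_excluida("JavaScript"))   # False
```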
5. Deduplicate

Uses a case-insensitive set to track seen values:
if limpia.lower() not in vistas:
    vistas.add(limpia.lower())
    limpiadas.append(limpia)
This preserves the original casing of the first occurrence while preventing duplicates.
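Putting the five stages together, a minimal end-to-end sketch reconstructed from the snippets above (the exclusion-matching strategy is an assumption; consult the actual text_cleaner.py for the authoritative version):

```python
import re
from typing import List

class TextCleaner:
    """Utility class for processing, cleaning, and filtering text from job postings."""

    @staticmethod
    def limpiar_habilidades(habilidades: List[str]) -> List[str]:
        palabras_excluir = [
            'click', 'show', 'more', 'less', 'see', 'view',
            'apply', 'job', 'description', 'ver', 'más'
        ]
        vistas = set()
        limpiadas = []
        for habilidad in habilidades:
            # 1. Normalize whitespace
            limpia = re.sub(r'\s+', ' ', habilidad).strip()
            # 2. Strip unwanted characters (hyphen escaped to avoid a range)
            limpia = re.sub(r'[^\w\s+#.\-_,()]', '', limpia)
            # 3. Length validation
            if len(limpia) < 3 or len(limpia) > 250:
                continue
            # 4. Exclude common UI words (word-level matching is an assumption)
            if any(p in palabras_excluir for p in limpia.lower().split()):
                continue
            # 5. Case-insensitive deduplication, preserving first-seen casing
            if limpia.lower() not in vistas:
                vistas.add(limpia.lower())
                limpiadas.append(limpia)
        return limpiadas

print(TextCleaner.limpiar_habilidades(
    ["  Python  ", "python", "click here", "ab", "C# and .NET"]
))  # ['Python', 'C# and .NET']
```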

Example Usage

Basic Cleaning

from utilidades.text_cleaner import TextCleaner

# Raw extracted data from scraper
habilidades_brutas = [
    "Python programming   ",
    "   Django framework\n\n",
    "PostgreSQL database",
    "C# and .NET Core",
    "React.js + TypeScript",
    "python programming",  # Duplicate (different case)
    "click here to apply",  # UI noise
    "5+ years experience",
    "ab",  # Too short
    "see more",  # Excluded word
    "Strong communication skills and ability to work in agile teams with distributed members across multiple time zones while managing stakeholder expectations and delivering high-quality solutions..." # Too long
]

limpiadas = TextCleaner.limpiar_habilidades(habilidades_brutas)

print(limpiadas)
# Output:
# [
#     'Python programming',
#     'Django framework',
#     'PostgreSQL database',
#     'C# and .NET Core',
#     'React.js + TypeScript',
#     '5+ years experience'
# ]

Integration with Scraper

from scraping.linkedin_scraper import LinkedInScraper
from utilidades.text_cleaner import TextCleaner

# Extract raw data
scraper = LinkedInScraper()
resultado = scraper.extraer_datos("Backend Developer")

if resultado['exito']:
    # Clean the raw skills
    habilidades_brutas = resultado['habilidades_brutas']
    habilidades_limpias = TextCleaner.limpiar_habilidades(habilidades_brutas)
    
    print(f"Raw: {len(habilidades_brutas)} items")
    print(f"Cleaned: {len(habilidades_limpias)} items")
    print(f"Removed: {len(habilidades_brutas) - len(habilidades_limpias)} items")

Technical Symbols Preservation

The cleaner is designed to preserve programming language symbols:
# These pass through successfully:
texts = [
    "C++",           # Plus signs
    "C#",            # Hash symbol
    "Node.js",       # Dots
    "Vue.js",        # Dots
    "ASP.NET",       # Dots
    "Objective-C",   # Hyphens
    "snake_case",    # Underscores
    "API (REST)",    # Parentheses
    "Git, GitHub"    # Commas
]

result = TextCleaner.limpiar_habilidades(texts)
print(result)  # All preserved

Customization

You can modify the cleaning behavior by editing the source file:

Adjusting Length Limits

# In text_cleaner.py, line 34
if len(limpia) < 3 or len(limpia) > 250:  # Change these values
    continue
  • Decrease minimum (< 3) to allow abbreviations like “AI”, “ML”
  • Increase maximum (> 250) to preserve longer requirement descriptions

Adding Excluded Words

# In text_cleaner.py, lines 20-23
palabras_excluir = [
    'click', 'show', 'more', 'less', 'see', 'view',
    'apply', 'job', 'description', 'ver', 'más',
    # Add your custom exclusions:
    'sponsored', 'promoted', 'featured'
]

Adjusting Character Filtering

# In text_cleaner.py, line 31
# Current regex preserves: a-zA-Z0-9_ spaces + # . - _ , ( )
# (the hyphen must be escaped so it isn't treated as a character range)
limpia = re.sub(r'[^\w\s+#.\-_,()]', '', limpia)

# Example: also preserve ampersand and asterisk
limpia = re.sub(r'[^\w\s+#.\-_,()&*]', '', limpia)

Performance Characteristics

Time Complexity

O(n), where n is the number of input strings. Each string is processed once in a single pass through the list.

Space Complexity

O(m), where m is the number of unique cleaned strings. Maintains a set for deduplication tracking.
Typical Performance:
  • Input: 100 raw strings
  • Processing time: < 5ms
  • Output: 40-70 cleaned strings (depending on duplication)

Common Patterns

Pattern 1: Pipeline with Scraper and AI

from scraping.linkedin_scraper import LinkedInScraper
from utilidades.text_cleaner import TextCleaner
from inteligencia_artificial.gpt_analyzer import AIAnalyzer

# Extract → Clean → Analyze
scraper = LinkedInScraper()
analyzer = AIAnalyzer()

raw_data = scraper.extraer_datos("DevOps Engineer")
if raw_data['exito']:
    clean_skills = TextCleaner.limpiar_habilidades(raw_data['habilidades_brutas'])
    summary = analyzer.generar_resumen(raw_data['titulo_oferta'], clean_skills)
    print(summary)

Pattern 2: Quality Metrics

from utilidades.text_cleaner import TextCleaner

def analyze_cleaning_impact(raw_skills):
    clean_skills = TextCleaner.limpiar_habilidades(raw_skills)
    
    removed = len(raw_skills) - len(clean_skills)
    duplicate_rate = (removed / len(raw_skills)) * 100
    
    print(f"Original: {len(raw_skills)}")
    print(f"Cleaned: {len(clean_skills)}")
    print(f"Removed: {removed} ({duplicate_rate:.1f}%)")
    print(f"\nFirst 5 cleaned:")
    for skill in clean_skills[:5]:
        print(f"  - {skill}")

# Usage (requires the scraper from scraping.linkedin_scraper)
from scraping.linkedin_scraper import LinkedInScraper

scraper = LinkedInScraper()
raw = scraper.extraer_datos("Software Engineer")['habilidades_brutas']
analyze_cleaning_impact(raw)

Best Practices

Raw scraped data contains noise that degrades AI analysis quality. Always run limpiar_habilidades() before passing data to GPT or other analyzers.
# ❌ Bad: Skip cleaning
summary = analyzer.generar_resumen(titulo, raw_skills)

# ✅ Good: Clean first
clean = TextCleaner.limpiar_habilidades(raw_skills)
summary = analyzer.generar_resumen(titulo, clean)
The JobService facade already calls TextCleaner internally. Don’t clean again:
# ❌ Redundant
result = service.procesar_busqueda("Developer")
cleaned = TextCleaner.limpiar_habilidades(result['habilidades'])

# ✅ Already cleaned
result = service.procesar_busqueda("Developer")
skills = result['habilidades']  # Already cleaned by service
Track how much data is being filtered out. High removal rates may indicate:
  • Scraper extracting too much UI text
  • Need to adjust exclusion words
  • Need to tune length limits
original = len(raw_skills)
cleaned = len(TextCleaner.limpiar_habilidades(raw_skills))
removal_rate = (1 - cleaned/original) * 100

if removal_rate > 70:
    print("⚠️ Warning: Removing >70% of data")

Dependencies

import re
from typing import List
No external packages required. Uses only Python standard library.
