
Overview

The TextCleaner utility class provides static methods for processing raw text extracted from job postings. It removes noise, filters out irrelevant content, and deduplicates entries to produce clean, analysis-ready skill lists.

TextCleaner

Class Definition

import re
from typing import List

class TextCleaner:
    """Clase de utilidad responsable de procesar, limpiar y filtrar el texto extraído de las vacantes."""
This is a utility class with only static methods. No instantiation is required.

Methods

limpiar_habilidades()

Static method that cleans, filters, and deduplicates a list of raw skill strings.
@staticmethod
def limpiar_habilidades(habilidades: List[str]) -> List[str]:
Parameters
  • habilidades (List[str], required): Raw list of strings extracted from job postings. Typically the habilidades_brutas field returned by scrapers.

Returns
  • List[str]: Cleaned and filtered list of unique skills and requirements, ready for AI analysis or export.

Cleaning Process

The method applies a multi-stage cleaning pipeline:
1. Normalize Whitespace

Collapses multiple spaces, tabs, and newlines into single spaces using regex:
limpia = re.sub(r'\s+', ' ', habilidad).strip()
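A quick illustration of this step on its own:

```python
import re

habilidad = "  Python\t\tprogramming \n bootcamp  "
limpia = re.sub(r'\s+', ' ', habilidad).strip()
print(limpia)  # Python programming bootcamp
```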
2. Remove Special Characters

Strips unwanted characters while preserving technical symbols. Note the escaped hyphen (\-), which prevents Python from interpreting .-_ as a character range:
limpia = re.sub(r'[^\w\s+#.\-_,()]', '', limpia)
Preserved characters:
  • Word characters: \w (letters, digits, underscore)
  • Spaces: \s
  • Technical symbols: +, #, ., -, _
  • Delimiters: ,, (, )
3. Length Validation

Filters out strings that are too short or too long:
if len(limpia) < 3 or len(limpia) > 250:
    continue
  • Minimum: 3 characters (removes empty/trivial strings)
  • Maximum: 250 characters (removes entire paragraphs)
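A small helper (hypothetical name, mirroring the guard above) makes the boundaries easy to check:

```python
def longitud_valida(texto: str) -> bool:
    # Same bounds as the guard above: at least 3, at most 250 characters
    return 3 <= len(texto) <= 250

print(longitud_valida("ab"))       # False: too short
print(longitud_valida("Python"))   # True
print(longitud_valida("x" * 300))  # False: too long
```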
4. Exclude Common UI Text

Removes words commonly found in page UI but not actual skills:
palabras_excluir = [
    'click', 'show', 'more', 'less', 'see', 'view',
    'apply', 'job', 'description', 'ver', 'más'
]
Case-insensitive matching prevents “Click”, “SHOW”, etc. from passing through.
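The snippets above don't show the comparison itself; one plausible implementation (whole-word, case-insensitive matching is an assumption here; the source may use substring matching instead) is:

```python
palabras_excluir = [
    'click', 'show', 'more', 'less', 'see', 'view',
    'apply', 'job', 'description', 'ver', 'más'
]

def contiene_palabra_excluida(texto: str) -> bool:
    # Lowercase and split into words, then test against the exclusion list
    return any(palabra in palabras_excluir for palabra in texto.lower().split())

print(contiene_palabra_excluida("Click here"))   # True
print(contiene_palabra_excluida("JavaScript"))   # False
```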
5. Deduplicate

Uses a case-insensitive set to track seen values:
if limpia.lower() not in vistas:
    vistas.add(limpia.lower())
    limpiadas.append(limpia)
This preserves the original casing of the first occurrence while preventing duplicates.
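Putting the five stages together, a minimal end-to-end sketch reconstructed from the snippets above (the exclusion-matching strategy is an assumption; consult the actual text_cleaner.py for the authoritative version):

```python
import re
from typing import List

class TextCleaner:
    """Utility class for processing, cleaning, and filtering text from job postings."""

    @staticmethod
    def limpiar_habilidades(habilidades: List[str]) -> List[str]:
        palabras_excluir = [
            'click', 'show', 'more', 'less', 'see', 'view',
            'apply', 'job', 'description', 'ver', 'más'
        ]
        vistas = set()
        limpiadas = []
        for habilidad in habilidades:
            # 1. Normalize whitespace
            limpia = re.sub(r'\s+', ' ', habilidad).strip()
            # 2. Strip unwanted characters (hyphen escaped to avoid a range)
            limpia = re.sub(r'[^\w\s+#.\-_,()]', '', limpia)
            # 3. Length validation
            if len(limpia) < 3 or len(limpia) > 250:
                continue
            # 4. Exclude common UI words (word-level matching is an assumption)
            if any(p in palabras_excluir for p in limpia.lower().split()):
                continue
            # 5. Case-insensitive deduplication, preserving first-seen casing
            if limpia.lower() not in vistas:
                vistas.add(limpia.lower())
                limpiadas.append(limpia)
        return limpiadas

print(TextCleaner.limpiar_habilidades(
    ["  Python  ", "python", "click here", "ab", "C# and .NET"]
))  # ['Python', 'C# and .NET']
```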

Example Usage

Basic Cleaning

from utilidades.text_cleaner import TextCleaner

# Raw extracted data from scraper
habilidades_brutas = [
    "Python programming   ",
    "   Django framework\n\n",
    "PostgreSQL database",
    "C# and .NET Core",
    "React.js + TypeScript",
    "python programming",  # Duplicate (different case)
    "click here to apply",  # UI noise
    "5+ years experience",
    "ab",  # Too short
    "see more",  # Excluded word
    "Strong communication skills and ability to work in agile teams with distributed members across multiple time zones while managing stakeholder expectations and delivering high-quality solutions..." # Too long
]

limpiadas = TextCleaner.limpiar_habilidades(habilidades_brutas)

print(limpiadas)
# Output:
# [
#     'Python programming',
#     'Django framework',
#     'PostgreSQL database',
#     'C# and .NET Core',
#     'React.js + TypeScript',
#     '5+ years experience'
# ]

Integration with Scraper

from scraping.linkedin_scraper import LinkedInScraper
from utilidades.text_cleaner import TextCleaner

# Extract raw data
scraper = LinkedInScraper()
resultado = scraper.extraer_datos("Backend Developer")

if resultado['exito']:
    # Clean the raw skills
    habilidades_brutas = resultado['habilidades_brutas']
    habilidades_limpias = TextCleaner.limpiar_habilidades(habilidades_brutas)
    
    print(f"Raw: {len(habilidades_brutas)} items")
    print(f"Cleaned: {len(habilidades_limpias)} items")
    print(f"Removed: {len(habilidades_brutas) - len(habilidades_limpias)} items")

Technical Symbols Preservation

The cleaner is designed to preserve programming language symbols:
# These pass through successfully:
texts = [
    "C++",           # Plus signs
    "C#",            # Hash symbol
    "Node.js",       # Dots
    "Vue.js",        # Dots
    "ASP.NET",       # Dots
    "Objective-C",   # Hyphens
    "snake_case",    # Underscores
    "API (REST)",    # Parentheses
    "Git, GitHub"    # Commas
]

result = TextCleaner.limpiar_habilidades(texts)
print(result)  # All preserved

Customization

You can modify the cleaning behavior by editing the source file:

Adjusting Length Limits

# In text_cleaner.py, line 34
if len(limpia) < 3 or len(limpia) > 250:  # Change these values
    continue
  • Decrease minimum (< 3) to allow abbreviations like “AI”, “ML”
  • Increase maximum (> 250) to preserve longer requirement descriptions

Adding Excluded Words

# In text_cleaner.py, lines 20-23
palabras_excluir = [
    'click', 'show', 'more', 'less', 'see', 'view',
    'apply', 'job', 'description', 'ver', 'más',
    # Add your custom exclusions:
    'sponsored', 'promoted', 'featured'
]

Adjusting Character Filtering

# In text_cleaner.py, line 31
# Current regex preserves: a-zA-Z0-9_ spaces + # . - _ , ( )
# (the hyphen must be escaped so it isn't treated as a character range)
limpia = re.sub(r'[^\w\s+#.\-_,()]', '', limpia)

# Example: also preserve ampersand and asterisk
limpia = re.sub(r'[^\w\s+#.\-_,()&*]', '', limpia)

Performance Characteristics

Time Complexity

O(n), where n is the number of input strings. Each string is processed once in a single pass through the list.

Space Complexity

O(m), where m is the number of unique cleaned strings. Maintains a set for deduplication tracking.
Typical Performance:
  • Input: 100 raw strings
  • Processing time: < 5ms
  • Output: 40-70 cleaned strings (depending on duplication)

Common Patterns

Pattern 1: Pipeline with Scraper and AI

from scraping.linkedin_scraper import LinkedInScraper
from utilidades.text_cleaner import TextCleaner
from inteligencia_artificial.gpt_analyzer import AIAnalyzer

# Extract → Clean → Analyze
scraper = LinkedInScraper()
analyzer = AIAnalyzer()

raw_data = scraper.extraer_datos("DevOps Engineer")
if raw_data['exito']:
    clean_skills = TextCleaner.limpiar_habilidades(raw_data['habilidades_brutas'])
    summary = analyzer.generar_resumen(raw_data['titulo_oferta'], clean_skills)
    print(summary)

Pattern 2: Quality Metrics

from utilidades.text_cleaner import TextCleaner

def analyze_cleaning_impact(raw_skills):
    clean_skills = TextCleaner.limpiar_habilidades(raw_skills)
    
    removed = len(raw_skills) - len(clean_skills)
    duplicate_rate = (removed / len(raw_skills)) * 100
    
    print(f"Original: {len(raw_skills)}")
    print(f"Cleaned: {len(clean_skills)}")
    print(f"Removed: {removed} ({duplicate_rate:.1f}%)")
    print(f"\nFirst 5 cleaned:")
    for skill in clean_skills[:5]:
        print(f"  - {skill}")

# Usage (requires the scraper from scraping.linkedin_scraper)
from scraping.linkedin_scraper import LinkedInScraper

scraper = LinkedInScraper()
raw = scraper.extraer_datos("Software Engineer")['habilidades_brutas']
analyze_cleaning_impact(raw)

Best Practices

Raw scraped data contains noise that degrades AI analysis quality. Always run limpiar_habilidades() before passing data to GPT or other analyzers.
# ❌ Bad: Skip cleaning
summary = analyzer.generar_resumen(titulo, raw_skills)

# ✅ Good: Clean first
clean = TextCleaner.limpiar_habilidades(raw_skills)
summary = analyzer.generar_resumen(titulo, clean)
The JobService facade already calls TextCleaner internally. Don’t clean again:
# ❌ Redundant
result = service.procesar_busqueda("Developer")
cleaned = TextCleaner.limpiar_habilidades(result['habilidades'])

# ✅ Already cleaned
result = service.procesar_busqueda("Developer")
skills = result['habilidades']  # Already cleaned by service
Track how much data is being filtered out. High removal rates may indicate:
  • Scraper extracting too much UI text
  • Need to adjust exclusion words
  • Need to tune length limits
original = len(raw_skills)
cleaned = len(TextCleaner.limpiar_habilidades(raw_skills))
removal_rate = (1 - cleaned/original) * 100

if removal_rate > 70:
    print("⚠️ Warning: Removing >70% of data")

Dependencies

import re
from typing import List
No external packages required. Uses only Python standard library.
