
Overview

This example demonstrates how to process multiple CVs in batch, extracting structured data with Pydantic models and an LLM. It is ideal for building a talent database from a collection of resumes.

Define the Data Schema

First, create a Pydantic model to structure the extracted data:
import glob
import os

import pandas as pd
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.output_parsers import JsonOutputParser
from pydantic import BaseModel, Field

# Define the data schema
class PerfilEstudiante(BaseModel):
    # Personal Information
    nombre: str = Field(description="Student's full name")
    email: str = Field(description="University or personal email")
    ubicacion: str = Field(description="City/Country")

    # Academic Profile
    universidad: str = Field(description="Name of university or institute")
    carrera: str = Field(description="Major being studied (e.g., Software Engineering)")
    ciclo_actual: str = Field(description="Current semester or cycle (e.g., 7th Semester, Graduate)")

    # Tech Talent
    stack_principal: list[str] = Field(description="List of top 5 languages/technologies they master")
    proyectos_destacados: list[str] = Field(description="Names of academic projects, thesis work, or freelance work mentioned")

    # Profile Evaluation
    tipo_perfil: str = Field(description="Classify as: Backend, Frontend, Data, Fullstack, or Management")
    potencial_contratacion: str = Field(description="Brief justification for why to hire them as an intern")

parser = JsonOutputParser(pydantic_object=PerfilEstudiante)
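Even with format instructions, an LLM occasionally drops or empties a field, so a dependency-free sanity check on each parsed dict can flag incomplete records before they reach the DataFrame. A minimal sketch (the helper name `registro_completo` is illustrative, not part of LangChain or Pydantic; the required keys mirror `PerfilEstudiante`):

```python
# Minimal completeness check for a parsed CV record (stdlib only).
CAMPOS_REQUERIDOS = {
    "nombre", "email", "ubicacion",
    "universidad", "carrera", "ciclo_actual",
    "stack_principal", "proyectos_destacados",
    "tipo_perfil", "potencial_contratacion",
}

def registro_completo(data: dict) -> bool:
    """Return True if every required field is present and non-empty."""
    return all(data.get(campo) for campo in CAMPOS_REQUERIDOS)

# Example: a record missing 'email' is flagged as incomplete
parcial = {campo: "x" for campo in CAMPOS_REQUERIDOS - {"email"}}
print(registro_completo(parcial))  # False
```

Records that fail the check can be logged for manual review instead of silently polluting the talent table.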

Create the Extraction Prompt

Design a prompt that guides the LLM to extract data focusing on potential:
from langchain_core.prompts import ChatPromptTemplate

template_estudiantes = """
You are an Expert in Youth Employability and IT Recruitment.
Analyze this student's CV and extract structured data.

USE THE FOLLOWING JSON FORMAT:
{format_instructions}

EXTRACTION RULES (Focus on Potential):

1. ACADEMIC:
   - Look for current cycle (e.g., "VI Cycle", "7th", "Graduate"). If not stated, infer from dates.
   - University: Extract main name (e.g., "UTP", "UPC", "San Marcos").

2. PROJECTS (Key for juniors):
   - Look for sections like "Academic Projects", "Freelance", or "Experience".
   - Extract concrete project names (e.g., "Library System", "Recycling App").
   - DO NOT list generic company names; look for WHAT THEY DID.

3. PROFILE TYPE:
   - Analyze their skills.
   - If they know Python + Pandas -> "Data".
   - If they know React + Node -> "Fullstack".
   - If they know Java + Spring -> "Backend".

CV TEXT:
{context}
"""

prompt_extract = ChatPromptTemplate.from_template(template_estudiantes)
# `llm` is the chat model initialized earlier in the project (this example invokes Gemini)
chain_extract = prompt_extract | llm | parser

Batch Processing Execution

Process all CVs in a directory:
resultados = []
archivos = glob.glob("cvs_estudiantes_final/*.pdf")

print(f"Analyzing potential of {len(archivos)} students with AI...")

for pdf in archivos:
    try:
        # Load PDF
        loader = PyPDFLoader(pdf)
        pages = loader.load()
        texto_completo = "\n".join([p.page_content for p in pages])

        # Invoke Gemini
        data = chain_extract.invoke({
            "context": texto_completo,
            "format_instructions": parser.get_format_instructions()
        })

        # Add source filename
        data['archivo_origen'] = os.path.basename(pdf)  # cross-platform, unlike pdf.split("/")
        resultados.append(data)

        print(f"Processed: {data['nombre']} ({data['ciclo_actual']}) -> {data['tipo_perfil']}")

    except Exception as e:
        print(f"Error reading {pdf}: {e}")

Create a DataFrame

Convert the results into a structured DataFrame:
print("\nTALENT TABLE (REVERSE MATCH):")
df_talent = pd.DataFrame(resultados)

cols = ["nombre", "universidad", "ciclo_actual", "tipo_perfil", 
        "stack_principal", "potencial_contratacion"]

# Only show columns that exist
cols_existentes = [c for c in cols if c in df_talent.columns]
print(df_talent[cols_existentes])

Expected Output

Analyzing potential of 5 students with AI...
Processed: FERNANDA PAREDES (9no Ciclo) -> Data
Processed: XIMENA RIOS (9no ciclo) -> Fullstack
Processed: NICOLAS PAREDES (7mo Ciclo) -> Fullstack
Processed: LUCIANA CORDOVA (8vo ciclo) -> Fullstack
Processed: FERNANDA MENDOZA (8vo Ciclo) -> Fullstack

TALENT TABLE (REVERSE MATCH):
| nombre | universidad | ciclo_actual | tipo_perfil | stack_principal | potencial_contratacion |
|--------|-------------|--------------|-------------|-----------------|------------------------|
| FERNANDA PAREDES | UTP | 9no Ciclo | Data | [Python, PowerBI, Java, Spring Boot] | Strong candidate for Data Analyst with Hackathon experience… |
| XIMENA RIOS | San Marcos | 9no ciclo | Fullstack | [Python, FastAPI, Java, React, Spring Boot] | Advanced student (9th cycle) with practical experience… |
| NICOLAS PAREDES | UNI | 7mo Ciclo | Fullstack | [Python, FastAPI, SQL (PostgreSQL), React, Spring Boot] | Advanced student (7th cycle) with focus on full-stack development… |
| LUCIANA CORDOVA | UNI | 8vo ciclo | Fullstack | [Python, FastAPI, Java (Spring Boot), React, SQL] | 8th cycle candidate with experience in academic projects… |
| FERNANDA MENDOZA | UNI | 8vo Ciclo | Fullstack | [React, Java, Python, PowerBI, Excel] | 8th cycle student with practical experience in web development… |
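A quick distribution count over `resultados` makes skew in the batch easy to spot, such as the Fullstack-heavy output above. A stdlib-only sketch using illustrative sample records:

```python
from collections import Counter

# Count how many CVs fell into each profile type
resultados_demo = [
    {"tipo_perfil": "Data"},
    {"tipo_perfil": "Fullstack"},
    {"tipo_perfil": "Fullstack"},
]
conteo = Counter(r.get("tipo_perfil", "Desconocido") for r in resultados_demo)
print(conteo.most_common())  # [('Fullstack', 2), ('Data', 1)]
```

The `"Desconocido"` default keeps the count robust if the LLM ever omits `tipo_perfil` from a record.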

Export to Excel

Save the talent database for further analysis:
import pandas as pd

# Export DataFrame (writing .xlsx requires the openpyxl package)
nombre_archivo = "Base_Talento_Estudiantes.xlsx"
df_talent.to_excel(nombre_archivo, index=False)

print(f"Excel file generated: {nombre_archivo}")

Key Features

Uses Pydantic models to ensure consistent data structure across all CVs
The LLM intelligently extracts and categorizes information, even inferring missing data
Process hundreds of CVs with the same code - just point to a different directory
Results are immediately ready for Excel export or database insertion

Advanced Tips

Custom Fields: Modify the PerfilEstudiante model to extract additional fields like certifications, languages, or soft skills.
API Rate Limits: When processing large batches, consider implementing rate limiting or batching to avoid API throttling.
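The rate-limiting tip can be as simple as enforcing a minimum interval between requests. A sketch (the `Throttle` helper is hypothetical; tune the interval to your provider's quota):

```python
import time

class Throttle:
    """Enforce a minimum interval between successive calls."""
    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep just long enough to honor the minimum interval
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

throttle = Throttle(min_interval=1.0)  # at most ~1 request per second
```

Calling `throttle.wait()` immediately before each `chain_extract.invoke(...)` spaces out requests without restructuring the loop.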

Next Steps

Visualization: Create dashboards from the extracted data.
Basic Query: Query individual student profiles.
