This example shows how to process multiple CVs in a batch and extract structured data from them using a Pydantic model and an LLM. It is useful for building a talent database from a collection of resumes.
First, create a Pydantic model to structure the extracted data:
    import glob
    import pandas as pd
    from langchain_core.output_parsers import JsonOutputParser
    from pydantic import BaseModel, Field
    from langchain_community.document_loaders import PyPDFLoader

    # Define the data schema
    class PerfilEstudiante(BaseModel):
        # Personal Information
        nombre: str = Field(description="Student's full name")
        email: str = Field(description="University or personal email")
        ubicacion: str = Field(description="City/Country")

        # Academic Profile
        universidad: str = Field(description="Name of university or institute")
        carrera: str = Field(description="Major being studied (e.g., Software Engineering)")
        ciclo_actual: str = Field(description="Current semester or cycle (e.g., 7th Semester, Graduate)")

        # Tech Talent
        stack_principal: list = Field(description="List of top 5 languages/technologies they master")
        proyectos_destacados: list = Field(description="Names of academic projects, thesis, or freelance work mentioned")

        # Profile Evaluation
        tipo_perfil: str = Field(description="Classify as: Backend, Frontend, Data, Fullstack, or Management")
        potencial_contratacion: str = Field(description="Brief justification for why to hire them as an intern")

    parser = JsonOutputParser(pydantic_object=PerfilEstudiante)
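`JsonOutputParser` uses the Pydantic model mainly to generate the format instructions; the parsed result is generally a plain dict, so a field can still come back missing or with the wrong type. A light sanity check on each record can catch that early. The following is a minimal standard-library sketch, not part of the original example: the helper `revisar_perfil` and the constants `CAMPOS_TEXTO`, `CAMPOS_LISTA`, and `TIPOS_VALIDOS` are hypothetical names that mirror the fields of `PerfilEstudiante`.

```python
# Hypothetical helper: light sanity check for one extracted record (a plain dict).
CAMPOS_TEXTO = [
    "nombre", "email", "ubicacion", "universidad", "carrera",
    "ciclo_actual", "tipo_perfil", "potencial_contratacion",
]
CAMPOS_LISTA = ["stack_principal", "proyectos_destacados"]
# Categories listed in the tipo_perfil field description.
TIPOS_VALIDOS = {"Backend", "Frontend", "Data", "Fullstack", "Management"}


def revisar_perfil(registro: dict) -> list[str]:
    """Return a list of problems found; an empty list means the record looks fine."""
    problemas = []
    for campo in CAMPOS_TEXTO:
        if not isinstance(registro.get(campo), str):
            problemas.append(f"{campo}: expected a string")
    for campo in CAMPOS_LISTA:
        if not isinstance(registro.get(campo), list):
            problemas.append(f"{campo}: expected a list")
    tipo = registro.get("tipo_perfil")
    if isinstance(tipo, str) and tipo not in TIPOS_VALIDOS:
        problemas.append(f"tipo_perfil: unexpected value {tipo!r}")
    return problemas
```

Records with a non-empty problem list could be logged and reviewed by hand rather than silently added to the table.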
Next, design a prompt that guides the LLM's extraction, with a focus on the candidate's potential:
    from langchain_core.prompts import ChatPromptTemplate

    template_estudiantes = """You are an Expert in Youth Employability and IT Recruitment.
    Analyze this student's CV and extract structured data.

    USE THE FOLLOWING JSON FORMAT:
    {format_instructions}

    EXTRACTION RULES (Focus on Potential):
    1. ACADEMIC:
       - Look for current cycle (e.g., "VI Cycle", "7th", "Graduate"). If not stated, infer from dates.
       - University: Extract main name (e.g., "UTP", "UPC", "San Marcos").
    2. PROJECTS (Key for juniors):
       - Look for sections like "Academic Projects", "Freelance", or "Experience".
       - Extract concrete project names (e.g., "Library System", "Recycling App").
       - DO NOT put generic company names, look for WHAT THEY DID.
    3. PROFILE TYPE:
       - Analyze their skills.
       - If they know Python + Pandas -> "Data".
       - If they know React + Node -> "Fullstack".
       - If they know Java + Spring -> "Backend".

    CV TEXT:
    {context}
    """

    prompt_extract = ChatPromptTemplate.from_template(template_estudiantes)

    # `llm` must be a chat model instance defined beforehand.
    chain_extract = prompt_extract | llm | parser
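The table code in this example reads from a list named `resultados`, so each CV has to be run through the chain and its output collected first. The following is one possible sketch of that batch step, under stated assumptions: the function `procesar_cvs`, its callable parameters `cargar_texto` and `extraer`, and the extra `archivo` key are hypothetical names introduced here. Passing the loader and the extractor in as callables keeps the loop independent of any particular PDF library or model.

```python
from typing import Callable, Iterable


def procesar_cvs(
    rutas: Iterable[str],
    cargar_texto: Callable[[str], str],
    extraer: Callable[[str], dict],
) -> list[dict]:
    """Run the extraction over each file, skipping files that fail."""
    resultados = []
    for ruta in rutas:
        try:
            texto = cargar_texto(ruta)   # e.g. read the PDF and return its text
            datos = extraer(texto)       # e.g. call the extraction chain
            datos["archivo"] = ruta      # remember which file the record came from
        except Exception as exc:
            # One unreadable or unparseable CV should not stop the whole batch.
            print(f"Skipping {ruta}: {exc}")
            continue
        resultados.append(datos)
    return resultados
```

With the objects defined above, `cargar_texto` could join the `page_content` of the pages returned by `PyPDFLoader(ruta).load()`, and `extraer` could call `chain_extract.invoke({"context": texto, "format_instructions": parser.get_format_instructions()})`. The paths could come from something like `glob.glob("cvs/*.pdf")`, where `cvs/` is a hypothetical folder name.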
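After running the chain on each CV and collecting the resulting dicts in `resultados`, display them as a table:

    print("\nTALENT TABLE (REVERSE MATCH):")

    # `resultados` is the list of dicts produced by the extraction chain, one per CV
    df_talent = pd.DataFrame(resultados)

    cols = ["nombre", "universidad", "ciclo_actual", "tipo_perfil", "stack_principal", "potencial_contratacion"]

    # Only show columns that exist
    cols_existentes = [c for c in cols if c in df_talent.columns]
    print(df_talent[cols_existentes])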
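Since the goal is a talent database, the records can also be written to disk instead of only printed. The following is a minimal sketch using only the standard library, assuming `resultados` is a list of dicts as above. The function name `guardar_csv` and the `"; "` separator used for list fields are illustrative choices, not part of the original example.

```python
import csv


def guardar_csv(resultados: list[dict], ruta_csv: str, columnas: list[str]) -> None:
    """Write the records to a CSV file, joining list fields into a single cell."""
    with open(ruta_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columnas)
        writer.writeheader()
        for registro in resultados:
            fila = {}
            for col in columnas:
                valor = registro.get(col, "")  # missing fields become empty cells
                if isinstance(valor, list):
                    # Lists such as stack_principal are flattened to "Python; SQL"
                    valor = "; ".join(map(str, valor))
                fila[col] = valor
            writer.writerow(fila)
```

Keys that are not listed in `columnas` are ignored, so the same column list used for the printed table can be reused here.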