Document metadata extraction is a common use case for LLMs, allowing you to automatically parse and structure information from various document types including research papers, product announcements, meeting notes, news articles, and technical documentation.

    from typing import List, Literal

    from pydantic import BaseModel, Field


    # Define metadata extraction schema
    class DocumentMetadata(BaseModel):
        """Pydantic model for document metadata extraction."""

        title: str = Field(
            description="The main title or subject of the document"
        )
        document_type: Literal[
            "research paper",
            "product announcement",
            "meeting notes",
            "news article",
            "technical documentation",
            "other",
        ] = Field(
            description="Type of document"
        )
        date: str = Field(
            description="Any date mentioned in the document (publication date, meeting date, etc.)"
        )
        keywords: List[str] = Field(
            description="List of key topics, technologies, or important terms mentioned in the document"
        )
        summary: str = Field(
            description="Brief one-sentence summary of the document's main purpose or content"
        )

Field descriptions are critical - they guide the LLM on what to extract and how to interpret the content.
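One way to see what those descriptions look like once they leave Python is to print the JSON schema that pydantic generates. The sketch below assumes pydantic v2 and uses a hypothetical two-field model, `MiniMetadata`, as a stand-in for the full schema. Whether a given extraction framework sends exactly this schema to the model is an assumption, not something stated above.

```python
from typing import Literal

from pydantic import BaseModel, Field


class MiniMetadata(BaseModel):
    """Hypothetical two-field stand-in for the full metadata schema."""

    title: str = Field(description="The main title or subject of the document")
    document_type: Literal["research paper", "news article", "other"] = Field(
        description="Type of document"
    )


# model_json_schema() is the pydantic v2 call; v1 exposed .schema() instead.
schema = MiniMetadata.model_json_schema()

for name, spec in schema["properties"].items():
    # Literal fields additionally carry an "enum" list of allowed values.
    print(f"{name}: {spec['description']} {spec.get('enum', '')}")
```

Each field's description travels with the schema, so a vague description is about as vague to the model as it is to a human reader.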

    documents_data = [
        {
            "id": "doc_001",
            "text": "Neural Networks for Climate Prediction: A Comprehensive Study. Published March 15, 2024. This research presents a novel deep learning approach for predicting climate patterns using multi-layered neural networks. Our methodology combines satellite imagery data with ground-based sensor readings to achieve 94% accuracy in temperature forecasting. Keywords: machine learning, climate modeling, neural networks, environmental science."
        },
        {
            "id": "doc_002",
            "text": "Introducing CloudSync Pro - Next-Generation File Synchronization. Release Date: January 8, 2024. CloudSync Pro revolutionizes how teams collaborate with real-time file synchronization across unlimited devices. Features include end-to-end encryption, automatic conflict resolution, and integration with over 50 productivity tools. Pricing starts at $12/month per user."
        },
        # ... more documents
    ]

    docs_df = session.create_dataframe(documents_data)

Classification from predefined categories using the Literal type. Options: research paper, product announcement, meeting notes, news article, technical documentation, other.

    # Good - specific and clear
    keywords: List[str] = Field(
        description="Technical terms, product names, and key concepts mentioned in the text"
    )

    # Bad - vague
    keywords: List[str] = Field(description="Keywords")


    # Use Literal when values come from a known set
    document_type: Literal["email", "report", "memo"] = Field(...)

    # Use str when values are open-ended
    title: str = Field(...)

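A side benefit of `Literal` is that the allowed values can be read back from the type itself, so the category list is defined in one place. The sketch below uses only the standard library; the alias `DocumentType` and the helper `is_known_type` are illustrative names, not part of any API mentioned above.

```python
from typing import Literal, get_args

# Illustrative alias for the closed set of categories.
DocumentType = Literal["email", "report", "memo"]

# get_args returns the Literal's values as a tuple, in declaration order.
ALLOWED_TYPES = get_args(DocumentType)


def is_known_type(value: str) -> bool:
    """Return True if value is one of the declared categories."""
    return value in ALLOWED_TYPES


print(ALLOWED_TYPES)               # ('email', 'report', 'memo')
print(is_known_type("memo"))       # True
print(is_known_type("invoice"))    # False
```

The same tuple can feed a dropdown, a report, or a downstream check without retyping the category names.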
Start with simple schemas and iterate. Test on a few documents, refine your field descriptions based on extraction quality, then scale to your full dataset.
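To make "test on a few documents" concrete, a small standard-library check can count how often each field comes back empty or out of range, which shows which descriptions to refine first. This is a minimal sketch: the helper names and the two sample records are hand-written placeholders for illustration, not real model output.

```python
from collections import Counter
from typing import Dict, List

ALLOWED_TYPES = {
    "research paper", "product announcement", "meeting notes",
    "news article", "technical documentation", "other",
}


def find_issues(record: Dict) -> List[str]:
    """List the obvious problems in one extracted metadata record."""
    issues = []
    # Text fields should be present and non-blank.
    for field in ("title", "date", "summary"):
        if not (record.get(field) or "").strip():
            issues.append(f"empty {field}")
    if record.get("document_type") not in ALLOWED_TYPES:
        issues.append("unknown document_type")
    if not record.get("keywords"):
        issues.append("no keywords")
    return issues


def summarize_issues(records: List[Dict]) -> Counter:
    """Count each kind of issue across a small sample of records."""
    counts = Counter()
    for record in records:
        counts.update(find_issues(record))
    return counts


# Hand-written placeholder records, standing in for a first test batch.
sample = [
    {"title": "Neural Networks for Climate Prediction: A Comprehensive Study",
     "document_type": "research paper", "date": "March 15, 2024",
     "keywords": ["machine learning", "climate modeling"],
     "summary": "Presents a deep learning approach to climate prediction."},
    {"title": "Introducing CloudSync Pro", "document_type": "press release",
     "date": "", "keywords": [], "summary": "Announces CloudSync Pro."},
]

for issue, count in summarize_issues(sample).most_common():
    print(f"{issue}: {count}")
```

If one issue dominates the counts, the description of that field is the first candidate for rewording before scaling up.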