Overview
Semantic operators are LLM-powered functions that transform unstructured data into structured insights. They enable natural language processing tasks like extraction, classification, summarization, and semantic search at scale.
All semantic operators are available through the semantic namespace:
```python
from fenic.api.functions import semantic
```
Core Operators
extract()
Extract structured information from unstructured text using a Pydantic schema.
Parameters:
- Column containing the text to extract from
- response_format (type[BaseModel], required): Pydantic model defining the output structure
- model_alias (str | ModelAlias, default: None): Language model to use (defaults to the configured default)
- Sampling temperature (0.0 = deterministic)
- Maximum number of tokens to generate
Basic Extraction
```python
from pydantic import BaseModel, Field
from fenic.api.functions import semantic

class Person(BaseModel):
    name: str = Field(description="Person's full name")
    age: int = Field(description="Person's age in years")
    occupation: str = Field(description="Person's job or profession")

df = df.with_column(
    "structured_data",
    semantic.extract("biography", response_format=Person)
)
```
The response_format schema supports: primitives (str, int, float, bool), Optional[T], List[T], Literal[...], and nested Pydantic models.
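For instance, a nested schema combining several of these supported types might look like the following (a sketch with hypothetical field names, not taken from the Fenic docs):

```python
from typing import List, Literal, Optional
from pydantic import BaseModel, Field

class Employment(BaseModel):
    employer: str = Field(description="Name of the employer")
    title: Optional[str] = Field(default=None, description="Job title, if mentioned")

class PersonProfile(BaseModel):
    name: str = Field(description="Person's full name")
    seniority: Literal["junior", "mid", "senior"] = Field(description="Career stage")
    jobs: List[Employment] = Field(description="Employment history, most recent first")

# A model like this could then be passed as the response_format,
# e.g. semantic.extract("biography", response_format=PersonProfile)
profile = PersonProfile(
    name="Ada Lovelace",
    seniority="senior",
    jobs=[Employment(employer="Analytical Engines Ltd")],
)
```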
map()
Apply a generation prompt to transform data using Jinja2 templates.
Parameters:
- Jinja2 template with column placeholders like {{ column_name }}
- Named columns corresponding to template variables
- If True, None values in any input column produce a None output
- examples (MapExampleCollection, default: None): Few-shot examples to guide output format
- response_format (type[BaseModel], default: None): Optional Pydantic model for structured output
Simple Mapping
```python
from fenic.api.functions import semantic, col

df = df.with_column(
    "product_description",
    semantic.map(
        "Write a compelling one-line description for {{ name }}: {{ details }}",
        name=col("product_name"),
        details=col("product_features")
    )
)
```
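Conceptually, the template is rendered once per row with that row's column values substituted for the placeholders. A minimal stdlib sketch of that substitution (not Fenic's actual implementation, which uses Jinja2):

```python
import re

def render(template: str, **values: str) -> str:
    # Replace each {{ name }} placeholder with the matching keyword value.
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(values[m.group(1)]),
        template,
    )

prompt = render(
    "Write a compelling one-line description for {{ name }}: {{ details }}",
    name="Trail Mix Pro",
    details="nuts, dried fruit, dark chocolate",
)
```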
classify()
Classify text into predefined categories.
Parameters:
- Column containing text to classify
- classes (List[str] | List[ClassDefinition], required): List of class labels or ClassDefinition objects with descriptions
- examples (ClassifyExampleCollection, default: None): Few-shot examples for classification
Simple Classification
```python
from fenic.api.functions import semantic

df = df.with_column(
    "category",
    semantic.classify(
        "message",
        classes=["Account Access", "Billing Issue", "Technical Problem"]
    )
)
```
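When classes carry descriptions, the model sees each label alongside its meaning, which helps disambiguate similar categories. A conceptual sketch of how labels and descriptions might be laid out in a classification prompt (illustrative only, not Fenic's actual prompt format):

```python
def classes_block(classes: dict[str, str]) -> str:
    # Render one "label: description" line per class for the prompt.
    return "\n".join(f"- {label}: {desc}" for label, desc in classes.items())

block = classes_block({
    "Account Access": "Login failures, password resets, locked accounts",
    "Billing Issue": "Charges, refunds, invoices, payment methods",
    "Technical Problem": "Bugs, crashes, errors in the product itself",
})
```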
predicate()
Evaluate boolean conditions for filtering.
Parameters:
- Jinja2 template containing a yes/no question or boolean claim
- Named columns corresponding to template variables
- If True, None values produce a None output
- examples (PredicateExampleCollection, default: None): Few-shot examples for consistent evaluation
```python
from textwrap import dedent
from fenic.api.functions import semantic, col

# Filter products
wireless = df.filter(
    semantic.predicate(
        dedent('''
            Product: {{ description }}
            Is this product wireless or battery-powered?'''),
        description=col("product_description")
    )
)

# Filter urgent tickets
urgent = df.filter(
    semantic.predicate(
        dedent('''
            Subject: {{ subject }}
            Body: {{ body }}
            This ticket indicates an urgent issue.'''),
        subject=col("ticket_subject"),
        body=col("ticket_body")
    )
)
```
reduce()
Aggregate multiple text values into a single summary (used with group_by).
Parameters:
- Instruction for the aggregation (supports Jinja2 templates)
- Column containing text to aggregate
- group_context (Dict[str, Column], default: None): Additional context columns (taken from the first row of each group)
- order_by (List[ColumnOrName], default: None): Columns to sort by before aggregation
Simple Reduce
```python
from fenic.api.functions import semantic, col

# Summarize documents by category
df.group_by("category").agg(
    semantic.reduce(
        "Summarize these documents",
        col("document_text")
    ).alias("summary")
)
```
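Conceptually, reduce collects each group's values and makes one aggregation call per group. A stdlib sketch of the grouping step, with the LLM call stubbed out by simple joining:

```python
from collections import defaultdict

rows = [
    {"category": "billing", "document_text": "Refund policy update"},
    {"category": "billing", "document_text": "New invoice layout"},
    {"category": "product", "document_text": "Dark mode released"},
]

# Group document texts by category, as group_by would.
groups: dict[str, list[str]] = defaultdict(list)
for row in rows:
    groups[row["category"]].append(row["document_text"])

# One aggregation per group; a real run would send each group's texts
# to the LLM with the "Summarize these documents" instruction.
summaries = {cat: " | ".join(texts) for cat, texts in groups.items()}
```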
Specialized Operators
embed()
Generate vector embeddings for semantic search.
```python
from fenic.api.functions import semantic, col

df = df.with_column(
    "embeddings",
    semantic.embed(col("text_column"))
)
```
summarize()
Generate summaries in specific formats.
```python
from fenic.api.functions import semantic, col
from fenic.core.types import KeyPoints, Paragraph

# Key points format
df = df.with_column(
    "summary",
    semantic.summarize(col("article"), format=KeyPoints(num_points=5))
)

# Paragraph format
df = df.with_column(
    "summary",
    semantic.summarize(col("article"), format=Paragraph(max_words=120))
)
```
analyze_sentiment()
Analyze sentiment (returns "positive", "negative", or "neutral").

```python
from fenic.api.functions import semantic, col

df = df.with_column(
    "sentiment",
    semantic.analyze_sentiment(col("review_text"))
)
```
parse_pdf()
Parse PDF files into markdown.
```python
from fenic.api.functions import semantic, col

# Parse PDFs
pdf_metadata = session.read.pdf_metadata("docs/**/*.pdf")
df = pdf_metadata.with_column(
    "markdown_content",
    semantic.parse_pdf(
        col("file_path"),
        page_separator="--- PAGE {page} ---",
        describe_images=True
    )
)
```
DataFrame Semantic Operations
semantic.join()
Join DataFrames using natural language predicates.
```python
from textwrap import dedent
from fenic.api.functions import col

jobs = session.read.csv("jobs.csv")
resumes = session.read.csv("resumes.csv")

matches = jobs.semantic.join(
    resumes,
    predicate=dedent('''
        Job: {{ left_on }}
        Experience: {{ right_on }}
        The candidate is qualified for this job.'''),
    left_on=col("job_description"),
    right_on=col("work_experience")
)
```
semantic.sim_join()
Join based on embedding similarity.
```python
from fenic.api.functions import semantic, col

queries = session.read.csv("queries.csv")
docs = session.read.csv("documents.csv")

matches = queries.semantic.sim_join(
    docs,
    left_on=semantic.embed(col("query_text")),
    right_on=semantic.embed(col("doc_text")),
    k=3,  # Top 3 matches per query
    similarity_metric="cosine",
    similarity_score_column="similarity"
)
```
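In effect, sim_join embeds both sides and keeps, for each left row, the k right rows with the highest similarity. A stdlib sketch of cosine top-k over toy 2D vectors (real embeddings would come from semantic.embed):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two 2D vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query: list[float], docs: dict[str, list[float]], k: int) -> list[str]:
    # Rank document names by cosine similarity to the query vector.
    scored = sorted(docs.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [name for name, _ in scored[:k]]

docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
best = top_k([1.0, 0.05], docs, k=2)
```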
semantic.with_cluster_labels()
Cluster rows using K-means on embeddings.
```python
from fenic.api.functions import semantic, col

clustered = df.semantic.with_cluster_labels(
    by=semantic.embed(col("text")),
    num_clusters=5,
    label_column="cluster_id",
    centroid_column="cluster_centroid"
)
```
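K-means alternates between assigning each vector to its nearest centroid and recomputing centroids as cluster means. A minimal 1D sketch with fixed initial centroids (illustrative only; Fenic's implementation and defaults may differ):

```python
def kmeans_1d(points: list[float], centroids: list[float], iters: int = 10) -> list[int]:
    # Alternate assignment and centroid-update steps.
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to the index of its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
                  for p in points]
        # Recompute each centroid as the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, lbl in zip(points, labels) if lbl == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels

labels = kmeans_1d([0.1, 0.2, 0.15, 5.0, 5.2, 4.9], centroids=[0.0, 1.0])
```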
Model Selection
Using Default Model
```python
from fenic.api.functions import semantic, col

# Uses the default model from the session config
semantic.map(
    "Summarize: {{ text }}",
    text=col("content")
)
```
Specifying Model Alias
```python
from fenic.api.functions import semantic, col

# Use a specific model
semantic.map(
    "Summarize: {{ text }}",
    text=col("content"),
    model_alias="gpt4"  # Must be configured in the session
)
```
Using Model Profiles
```python
from fenic.api.functions import semantic, col
from fenic.core.types.semantic import ModelAlias

# Use a specific model profile
semantic.map(
    "Construct a formal proof of {{ hypothesis }}",
    hypothesis=col("claim"),
    model_alias=ModelAlias(name="o4", profile="thorough")
)
```
Best Practices
Use structured outputs when possible
Pydantic schemas provide type safety and validation:

```python
# Good: structured output
semantic.extract("text", response_format=MySchema)

# Less reliable: free-form text
semantic.map("Extract information from {{ text }}", text=col("text"))
```
Provide clear field descriptions
LLMs use field descriptions to understand output requirements:

```python
class Person(BaseModel):
    name: str = Field(description="Person's full legal name")
    age: int = Field(description="Age in complete years")
    # Not: name: str (no description)
```
Use examples for consistency
Few-shot examples dramatically improve output quality:

```python
examples = MapExampleCollection()
examples.create_example(MapExample(
    input={"text": "..."},
    output="Expected format"
))
```
Set appropriate temperatures
Use temperature=0.0 for deterministic, factual tasks; use temperature=0.7 or higher for creative generation.

```python
# Factual extraction
semantic.extract("text", response_format=Schema, temperature=0.0)

# Creative writing
semantic.map("Write a story about {{ topic }}", topic=col("topic"), temperature=0.8)
```
Batch operations automatically
Fenic automatically batches LLM requests for efficiency, so there is no need to batch manually:

```python
# This is efficient - automatic batching
df.with_column(
    "extracted",
    semantic.extract("text", response_format=Schema)
).collect()
```
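A simple sketch of what batching means here: rows are grouped into chunks so that many rows share fewer API round trips (illustrative only, not Fenic's scheduler):

```python
def batches(items: list[str], batch_size: int) -> list[list[str]]:
    # Split items into consecutive chunks of at most batch_size.
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

calls = batches([f"row-{i}" for i in range(10)], batch_size=4)
# 10 rows are served by 3 chunked calls instead of 10 individual ones
```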
Performance Considerations
- Cost: Semantic operations call LLM APIs and incur costs. Use limit() during development.
- Latency: LLM calls add latency. Consider async execution for large datasets.
- Rate limits: Configure rpm and tpm in the session config to match provider limits.
- Caching: Results are not cached by default. Use session-level LLM response caching if available.
Error Handling
```python
from pydantic import ValidationError

try:
    result = df.with_column(
        "extracted",
        semantic.extract("text", response_format=Schema)
    ).collect()
except ValidationError as e:
    # Schema validation failed
    print(f"Schema error: {e}")
except Exception as e:
    # LLM API error or other issue
    print(f"Execution error: {e}")
```
Next Steps
- Data Types: Learn about Fenic’s type system
- Sessions: Configure models and execution