Overview

Semantic operators are LLM-powered functions that transform unstructured data into structured insights. They enable natural language processing tasks like extraction, classification, summarization, and semantic search at scale. All semantic operators are available through the semantic namespace:
from fenic.api.functions import semantic

Core Operators

extract()

Extract structured information from unstructured text using a Pydantic schema.
Parameters:
  column (ColumnOrName, required): Column containing the text to extract from
  response_format (type[BaseModel], required): Pydantic model defining the output structure
  model_alias (str | ModelAlias, default: None): Language model to use (defaults to the configured default)
  temperature (float, default: 0.0): Sampling temperature (0.0 = deterministic)
  max_output_tokens (int, default: 1024): Maximum tokens to generate
from pydantic import BaseModel, Field
from fenic.api.functions import semantic, col

class Person(BaseModel):
    name: str = Field(description="Person's full name")
    age: int = Field(description="Person's age in years")
    occupation: str = Field(description="Person's job or profession")

df = df.with_column(
    "structured_data",
    semantic.extract("biography", response_format=Person)
)
The response_format schema supports: primitives (str, int, float, bool), Optional[T], List[T], Literal[...], and nested Pydantic models.
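For instance, a schema combining these supported types might look like the following (the field names here are illustrative):

```python
from typing import List, Literal, Optional

from pydantic import BaseModel, Field

class Address(BaseModel):
    city: str = Field(description="City of residence")
    country: str = Field(description="Country of residence")

class PersonProfile(BaseModel):
    name: str = Field(description="Person's full name")
    age: Optional[int] = Field(default=None, description="Age in years, if stated")
    languages: List[str] = Field(description="Languages the person speaks")
    seniority: Literal["junior", "mid", "senior"] = Field(description="Career stage")
    address: Address = Field(description="Where the person lives")

# Extracted output is validated against the schema, so downstream code
# can rely on types; missing Optional fields fall back to their defaults.
profile = PersonProfile.model_validate({
    "name": "Ada Lovelace",
    "languages": ["English", "French"],
    "seniority": "senior",
    "address": {"city": "London", "country": "UK"},
})
```

Such a model is passed directly as `response_format=PersonProfile`.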

map()

Apply a generation prompt to transform data using Jinja2 templates.
Parameters:
  prompt (str, required): Jinja2 template with column placeholders like {{ column_name }}
  **columns (Column, required): Named columns corresponding to template variables
  strict (bool, default: True): If True, None values in any column result in None output
  examples (MapExampleCollection, default: None): Few-shot examples to guide output format
  response_format (type[BaseModel], default: None): Optional Pydantic model for structured output
from fenic.api.functions import semantic, col

df = df.with_column(
    "product_description",
    semantic.map(
        "Write a compelling one-line description for {{ name }}: {{ details }}",
        name=col("product_name"),
        details=col("product_features")
    )
)
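Conceptually, the prompt template is rendered once per row with that row's column values substituted in. Using the jinja2 package directly (an illustration only; fenic performs this rendering internally, and the values shown are made up):

```python
from jinja2 import Template

prompt = Template(
    "Write a compelling one-line description for {{ name }}: {{ details }}"
)

# One row's column values substituted into the template
rendered = prompt.render(
    name="TrailBlazer 3000",
    details="waterproof, 20h battery, 300g",
)
```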

classify()

Classify text into predefined categories.
Parameters:
  column (ColumnOrName, required): Column containing the text to classify
  classes (List[str] | List[ClassDefinition], required): List of class labels, or ClassDefinition objects with descriptions
  examples (ClassifyExampleCollection, default: None): Few-shot examples for classification
from fenic.api.functions import semantic

df = df.with_column(
    "category",
    semantic.classify(
        "message",
        classes=["Account Access", "Billing Issue", "Technical Problem"]
    )
)

predicate()

Evaluate boolean conditions for filtering.
Parameters:
  predicate (str, required): Jinja2 template with a yes/no question or boolean claim
  **columns (Column, required): Named columns corresponding to template variables
  strict (bool, default: True): If True, None values result in None output
  examples (PredicateExampleCollection, default: None): Few-shot examples for consistent evaluation
from fenic.api.functions import semantic, col
from textwrap import dedent

# Filter products
wireless = df.filter(
    semantic.predicate(
        dedent('''
            Product: {{ description }}
            Is this product wireless or battery-powered?'''),
        description=col("product_description")
    )
)

# Filter urgent tickets
urgent = df.filter(
    semantic.predicate(
        dedent('''
            Subject: {{ subject }}
            Body: {{ body }}
            This ticket indicates an urgent issue.'''),
        subject=col("ticket_subject"),
        body=col("ticket_body")
    )
)
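The strict flag's None handling can be pictured as follows (a conceptual sketch, not fenic's implementation):

```python
from typing import Callable, Optional

def apply_with_strict(
    fn: Callable[..., bool],
    strict: bool = True,
    **values: Optional[str],
) -> Optional[bool]:
    """If strict and any input is None, skip evaluation and return None."""
    if strict and any(v is None for v in values.values()):
        return None
    return fn(**values)

# A stand-in for the LLM predicate call
is_urgent = lambda subject, body: "outage" in body.lower()

# A row with a missing body short-circuits to None under strict=True
result = apply_with_strict(is_urgent, subject="Help", body=None)
```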

reduce()

Aggregate multiple text values in each group into a single result (typically used with group_by).
Parameters:
  prompt (str, required): Instruction for aggregation (supports Jinja2 templates)
  column (ColumnOrName, required): Column containing the text to aggregate
  group_context (Dict[str, Column], default: None): Additional columns for context (taken from the first row of each group)
  order_by (List[ColumnOrName], default: None): Columns to sort by before aggregation
from fenic.api.functions import semantic, col

# Summarize documents by category
df.group_by("category").agg(
    semantic.reduce(
        "Summarize these documents",
        col("document_text")
    ).alias("summary")
)

Specialized Operators

embed()

Generate vector embeddings for semantic search.
from fenic.api.functions import semantic

df = df.with_column(
    "embeddings",
    semantic.embed(col("text_column"))
)

summarize()

Generate summaries in specific formats.
from fenic.api.functions import semantic
from fenic.core.types import KeyPoints, Paragraph

# Key points format
df = df.with_column(
    "summary",
    semantic.summarize(col("article"), format=KeyPoints(num_points=5))
)

# Paragraph format
df = df.with_column(
    "summary",
    semantic.summarize(col("article"), format=Paragraph(max_words=120))
)

analyze_sentiment()

Analyze sentiment (returns “positive”, “negative”, or “neutral”).
from fenic.api.functions import semantic

df = df.with_column(
    "sentiment",
    semantic.analyze_sentiment(col("review_text"))
)

parse_pdf()

Parse PDF files into markdown.
from fenic.api.functions import semantic

# Parse PDFs
pdf_metadata = session.read.pdf_metadata("docs/**/*.pdf")
df = pdf_metadata.with_column(
    "markdown_content",
    semantic.parse_pdf(
        col("file_path"),
        page_separator="--- PAGE {page} ---",
        describe_images=True
    )
)

DataFrame Semantic Operations

semantic.join()

Join DataFrames using natural language predicates.
from textwrap import dedent
from fenic.api.functions import col

jobs = session.read.csv("jobs.csv")
resumes = session.read.csv("resumes.csv")

matches = jobs.semantic.join(
    resumes,
    predicate=dedent('''
        Job: {{ left_on }}
        Experience: {{ right_on }}
        The candidate is qualified for this job.'''),
    left_on=col("job_description"),
    right_on=col("work_experience")
)

semantic.sim_join()

Join based on embedding similarity.
from fenic.api.functions import semantic, col

queries = session.read.csv("queries.csv")
docs = session.read.csv("documents.csv")

matches = queries.semantic.sim_join(
    docs,
    left_on=semantic.embed(col("query_text")),
    right_on=semantic.embed(col("doc_text")),
    k=3,  # Top 3 matches per query
    similarity_metric="cosine",
    similarity_score_column="similarity"
)
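With similarity_metric="cosine", each candidate pair is scored by the cosine of the angle between the two embedding vectors. The computation is standard:

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical directions score 1.0; orthogonal vectors score 0.0
score = cosine_similarity([1.0, 0.0], [1.0, 1.0])  # ~0.707
```

The top-k matches per left-hand row are those with the highest scores.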

semantic.with_cluster_labels()

Cluster rows using K-means on embeddings.
from fenic.api.functions import semantic, col

clustered = df.semantic.with_cluster_labels(
    by=semantic.embed(col("text")),
    num_clusters=5,
    label_column="cluster_id",
    centroid_column="cluster_centroid"
)
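For intuition, here is a minimal sketch of the K-means loop run over the embedding vectors (illustrative only; fenic's implementation, initialization, and defaults may differ):

```python
import random
from typing import List

def kmeans(points: List[List[float]], k: int, iters: int = 20, seed: int = 0) -> List[int]:
    """Assign each point to the nearest of k centroids, refining iteratively."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its assigned points
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels

# Two well-separated groups end up in two clusters
labels = kmeans([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]], k=2)
```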

Model Selection

Using Default Model

# Uses default model from session config
semantic.map(
    "Summarize: {{ text }}",
    text=col("content")
)

Specifying Model Alias

# Use specific model
semantic.map(
    "Summarize: {{ text }}",
    text=col("content"),
    model_alias="gpt4"  # Must be configured in session
)

Using Model Profiles

from fenic.core.types.semantic import ModelAlias

# Use specific profile
semantic.map(
    "Construct a formal proof of {{ hypothesis }}",
    hypothesis=col("claim"),
    model_alias=ModelAlias(name="o4", profile="thorough")
)

Best Practices

Pydantic schemas provide type safety and validation:
# Good: Structured output
semantic.extract("text", response_format=MySchema)

# Less reliable: Free-form text
semantic.map("Extract information from {{ text }}", text=col("text"))
LLMs use field descriptions to understand output requirements:
class Person(BaseModel):
    name: str = Field(description="Person's full legal name")
    age: int = Field(description="Age in complete years")
    # Not: name: str  (no description)
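The descriptions become part of the JSON schema Pydantic generates for the model class, which is what informs the LLM's output; you can inspect it directly:

```python
from pydantic import BaseModel, Field

class Person(BaseModel):
    name: str = Field(description="Person's full legal name")
    age: int = Field(description="Age in complete years")

# Field descriptions surface in the generated JSON schema
schema = Person.model_json_schema()
```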
Few-shot examples dramatically improve output quality; build a collection and pass it via the examples parameter:
examples = MapExampleCollection()
examples.create_example(MapExample(
    input={"text": "..."},
    output="Expected format"
))
# Then pass examples=examples to semantic.map(...)
  • Use temperature=0.0 for deterministic, factual tasks
  • Use temperature=0.7+ for creative generation
# Factual extraction
semantic.extract("text", response_format=Schema, temperature=0.0)

# Creative writing
semantic.map("Write a story about {{ topic }}", topic=col("topic"), temperature=0.8)
Fenic automatically batches LLM requests for efficiency. No need to manually batch:
# This is efficient - automatic batching
df.with_column(
    "extracted",
    semantic.extract("text", response_format=Schema)
).collect()

Performance Considerations

  1. Cost: Semantic operations call LLM APIs and incur costs. Use limit() during development.
  2. Latency: LLM calls add latency. Consider async execution for large datasets.
  3. Rate limits: Configure rpm and tpm in session config to match provider limits.
  4. Caching: Results are not cached by default. Use session-level LLM response caching if available.

Error Handling

from pydantic import ValidationError

try:
    result = df.with_column(
        "extracted",
        semantic.extract("text", response_format=Schema)
    ).collect()
except ValidationError as e:
    # Schema validation failed
    print(f"Schema error: {e}")
except Exception as e:
    # LLM API error or other issue
    print(f"Execution error: {e}")

Next Steps

Data Types

Learn about Fenic’s type system

Sessions

Configure models and execution
