Overview

Semantic operators are LLM-powered functions that transform unstructured data into structured insights. They enable natural language processing tasks like extraction, classification, summarization, and semantic search at scale. All semantic operators are available through the semantic namespace:
from fenic.api.functions import semantic

Core Operators

extract()

Extract structured information from unstructured text using a Pydantic schema.
Parameters:
  column (ColumnOrName, required): Column containing the text to extract from
  response_format (type[BaseModel], required): Pydantic model defining the output structure
  model_alias (str | ModelAlias, default: None): Language model to use (defaults to the configured default)
  temperature (float, default: 0.0): Sampling temperature (0.0 = deterministic)
  max_output_tokens (int, default: 1024): Maximum tokens to generate
from pydantic import BaseModel, Field
from fenic.api.functions import semantic, col

class Person(BaseModel):
    name: str = Field(description="Person's full name")
    age: int = Field(description="Person's age in years")
    occupation: str = Field(description="Person's job or profession")

df = df.with_column(
    "structured_data",
    semantic.extract("biography", response_format=Person)
)
The response_format schema supports: primitives (str, int, float, bool), Optional[T], List[T], Literal[...], and nested Pydantic models.
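For instance, a schema combining these supported types might look like the following (the field names here are illustrative):

```python
from typing import List, Literal, Optional

from pydantic import BaseModel, Field

class Address(BaseModel):
    city: str = Field(description="City of residence")
    country: str = Field(description="Country of residence")

class PersonProfile(BaseModel):
    name: str = Field(description="Person's full name")
    age: Optional[int] = Field(default=None, description="Age in years, if stated")
    languages: List[str] = Field(description="Languages the person speaks")
    seniority: Literal["junior", "mid", "senior"] = Field(description="Career stage")
    address: Address = Field(description="Where the person lives")

# Extracted output is validated against the schema, so downstream code
# can rely on types; missing Optional fields fall back to their defaults.
profile = PersonProfile.model_validate({
    "name": "Ada Lovelace",
    "languages": ["English", "French"],
    "seniority": "senior",
    "address": {"city": "London", "country": "UK"},
})
```

Such a model is passed directly as `response_format=PersonProfile`.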

map()

Apply a generation prompt to transform data using Jinja2 templates.
Parameters:
  prompt (str, required): Jinja2 template with column placeholders like {{ column_name }}
  **columns (Column, required): Named columns corresponding to template variables
  strict (bool, default: True): If True, None values in any column result in None output
  examples (MapExampleCollection, default: None): Few-shot examples to guide output format
  response_format (type[BaseModel], default: None): Optional Pydantic model for structured output
from fenic.api.functions import semantic, col

df = df.with_column(
    "product_description",
    semantic.map(
        "Write a compelling one-line description for {{ name }}: {{ details }}",
        name=col("product_name"),
        details=col("product_features")
    )
)
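Conceptually, the prompt template is rendered once per row with that row's column values substituted in. Using the jinja2 package directly (an illustration only; fenic performs this rendering internally, and the values shown are made up):

```python
from jinja2 import Template

prompt = Template(
    "Write a compelling one-line description for {{ name }}: {{ details }}"
)

# One row's column values substituted into the template
rendered = prompt.render(
    name="TrailBlazer 3000",
    details="waterproof, 20h battery, 300g",
)
```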

classify()

Classify text into predefined categories.
Parameters:
  column (ColumnOrName, required): Column containing the text to classify
  classes (List[str] | List[ClassDefinition], required): List of class labels, or ClassDefinition objects with descriptions
  examples (ClassifyExampleCollection, default: None): Few-shot examples for classification
from fenic.api.functions import semantic

df = df.with_column(
    "category",
    semantic.classify(
        "message",
        classes=["Account Access", "Billing Issue", "Technical Problem"]
    )
)

predicate()

Evaluate boolean conditions for filtering.
Parameters:
  predicate (str, required): Jinja2 template with a yes/no question or boolean claim
  **columns (Column, required): Named columns corresponding to template variables
  strict (bool, default: True): If True, None values result in None output
  examples (PredicateExampleCollection, default: None): Few-shot examples for consistent evaluation
from fenic.api.functions import semantic, col
from textwrap import dedent

# Filter products
wireless = df.filter(
    semantic.predicate(
        dedent('''
            Product: {{ description }}
            Is this product wireless or battery-powered?'''),
        description=col("product_description")
    )
)

# Filter urgent tickets
urgent = df.filter(
    semantic.predicate(
        dedent('''
            Subject: {{ subject }}
            Body: {{ body }}
            This ticket indicates an urgent issue.'''),
        subject=col("ticket_subject"),
        body=col("ticket_body")
    )
)
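The strict flag's None handling can be pictured as follows (a conceptual sketch, not fenic's implementation):

```python
from typing import Callable, Optional

def apply_with_strict(
    fn: Callable[..., bool],
    strict: bool = True,
    **values: Optional[str],
) -> Optional[bool]:
    """If strict and any input is None, skip evaluation and return None."""
    if strict and any(v is None for v in values.values()):
        return None
    return fn(**values)

# A stand-in for the LLM predicate call
is_urgent = lambda subject, body: "outage" in body.lower()

# A row with a missing body short-circuits to None under strict=True
result = apply_with_strict(is_urgent, subject="Help", body=None)
```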

reduce()

Aggregate multiple text values in each group into a single result (typically used with group_by).
Parameters:
  prompt (str, required): Instruction for aggregation (supports Jinja2 templates)
  column (ColumnOrName, required): Column containing the text to aggregate
  group_context (Dict[str, Column], default: None): Additional columns for context (taken from the first row of each group)
  order_by (List[ColumnOrName], default: None): Columns to sort by before aggregation
from fenic.api.functions import semantic, col

# Summarize documents by category
df.group_by("category").agg(
    semantic.reduce(
        "Summarize these documents",
        col("document_text")
    ).alias("summary")
)

Specialized Operators

embed()

Generate vector embeddings for semantic search.
from fenic.api.functions import semantic

df = df.with_column(
    "embeddings",
    semantic.embed(col("text_column"))
)

summarize()

Generate summaries in specific formats.
from fenic.api.functions import semantic
from fenic.core.types import KeyPoints, Paragraph

# Key points format
df = df.with_column(
    "summary",
    semantic.summarize(col("article"), format=KeyPoints(num_points=5))
)

# Paragraph format
df = df.with_column(
    "summary",
    semantic.summarize(col("article"), format=Paragraph(max_words=120))
)

analyze_sentiment()

Analyze sentiment (returns “positive”, “negative”, or “neutral”).
from fenic.api.functions import semantic

df = df.with_column(
    "sentiment",
    semantic.analyze_sentiment(col("review_text"))
)

parse_pdf()

Parse PDF files into markdown.
from fenic.api.functions import semantic

# Parse PDFs
pdf_metadata = session.read.pdf_metadata("docs/**/*.pdf")
df = pdf_metadata.with_column(
    "markdown_content",
    semantic.parse_pdf(
        col("file_path"),
        page_separator="--- PAGE {page} ---",
        describe_images=True
    )
)

DataFrame Semantic Operations

semantic.join()

Join DataFrames using natural language predicates.
from textwrap import dedent
from fenic.api.functions import col

jobs = session.read.csv("jobs.csv")
resumes = session.read.csv("resumes.csv")

matches = jobs.semantic.join(
    resumes,
    predicate=dedent('''
        Job: {{ left_on }}
        Experience: {{ right_on }}
        The candidate is qualified for this job.'''),
    left_on=col("job_description"),
    right_on=col("work_experience")
)

semantic.sim_join()

Join based on embedding similarity.
from fenic.api.functions import semantic, col

queries = session.read.csv("queries.csv")
docs = session.read.csv("documents.csv")

matches = queries.semantic.sim_join(
    docs,
    left_on=semantic.embed(col("query_text")),
    right_on=semantic.embed(col("doc_text")),
    k=3,  # Top 3 matches per query
    similarity_metric="cosine",
    similarity_score_column="similarity"
)
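With similarity_metric="cosine", each candidate pair is scored by the cosine of the angle between the two embedding vectors. The computation is standard:

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical directions score 1.0; orthogonal vectors score 0.0
score = cosine_similarity([1.0, 0.0], [1.0, 1.0])  # ~0.707
```

The top-k matches per left-hand row are those with the highest scores.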

semantic.with_cluster_labels()

Cluster rows using K-means on embeddings.
from fenic.api.functions import semantic, col

clustered = df.semantic.with_cluster_labels(
    by=semantic.embed(col("text")),
    num_clusters=5,
    label_column="cluster_id",
    centroid_column="cluster_centroid"
)
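For intuition, here is a minimal sketch of the K-means loop run over the embedding vectors (illustrative only; fenic's implementation, initialization, and defaults may differ):

```python
import random
from typing import List

def kmeans(points: List[List[float]], k: int, iters: int = 20, seed: int = 0) -> List[int]:
    """Assign each point to the nearest of k centroids, refining iteratively."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its assigned points
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels

# Two well-separated groups end up in two clusters
labels = kmeans([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]], k=2)
```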

Model Selection

Using Default Model

# Uses default model from session config
semantic.map(
    "Summarize: {{ text }}",
    text=col("content")
)

Specifying Model Alias

# Use specific model
semantic.map(
    "Summarize: {{ text }}",
    text=col("content"),
    model_alias="gpt4"  # Must be configured in session
)

Using Model Profiles

from fenic.core.types.semantic import ModelAlias

# Use specific profile
semantic.map(
    "Construct a formal proof of {{ hypothesis }}",
    hypothesis=col("claim"),
    model_alias=ModelAlias(name="o4", profile="thorough")
)

Best Practices

Pydantic schemas provide type safety and validation:
# Good: Structured output
semantic.extract("text", response_format=MySchema)

# Less reliable: Free-form text
semantic.map("Extract information from {{ text }}", text=col("text"))
LLMs use field descriptions to understand output requirements:
class Person(BaseModel):
    name: str = Field(description="Person's full legal name")
    age: int = Field(description="Age in complete years")
    # Not: name: str  (no description)
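The descriptions become part of the JSON schema Pydantic generates for the model class, which is what informs the LLM's output; you can inspect it directly:

```python
from pydantic import BaseModel, Field

class Person(BaseModel):
    name: str = Field(description="Person's full legal name")
    age: int = Field(description="Age in complete years")

# Field descriptions surface in the generated JSON schema
schema = Person.model_json_schema()
```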
Few-shot examples dramatically improve output quality; build a collection and pass it via the examples parameter:
examples = MapExampleCollection()
examples.create_example(MapExample(
    input={"text": "..."},
    output="Expected format"
))
# Then pass examples=examples to semantic.map(...)
  • Use temperature=0.0 for deterministic, factual tasks
  • Use temperature=0.7+ for creative generation
# Factual extraction
semantic.extract("text", response_format=Schema, temperature=0.0)

# Creative writing
semantic.map("Write a story about {{ topic }}", topic=col("topic"), temperature=0.8)
Fenic automatically batches LLM requests for efficiency. No need to manually batch:
# This is efficient - automatic batching
df.with_column(
    "extracted",
    semantic.extract("text", response_format=Schema)
).collect()

Performance Considerations

  1. Cost: Semantic operations call LLM APIs and incur costs. Use limit() during development.
  2. Latency: LLM calls add latency. Consider async execution for large datasets.
  3. Rate limits: Configure rpm and tpm in session config to match provider limits.
  4. Caching: Results are not cached by default. Use session-level LLM response caching if available.

Error Handling

from pydantic import ValidationError

try:
    result = df.with_column(
        "extracted",
        semantic.extract("text", response_format=Schema)
    ).collect()
except ValidationError as e:
    # Schema validation failed
    print(f"Schema error: {e}")
except Exception as e:
    # LLM API error or other issue
    print(f"Execution error: {e}")

Next Steps

Data Types

Learn about Fenic’s type system

Sessions

Configure models and execution
