Overview

Fenic provides a rich type system designed for both traditional data processing and AI workloads. Types are used for schema definition, casting, validation, and optimization throughout the DataFrame API.

Type Categories

Fenic types fall into three main categories:
  1. Primitive Types: Basic types like strings, integers, and booleans
  2. Composite Types: Arrays and structs containing other types
  3. Logical Types: Specialized types for AI workloads (embeddings, markdown, JSON, etc.)

Primitive Types

StringType

Represents UTF-8 encoded string values.
from fenic.core.types import StringType

# Used in schema definitions
StructField("name", StringType)

# Cast to string
df.with_column("str_value", col("number").cast(StringType))

IntegerType

Represents signed integer values.
from fenic.core.types import IntegerType

StructField("age", IntegerType)
df.with_column("int_value", col("string_number").cast(IntegerType))

FloatType

Represents 32-bit floating-point numbers.
from fenic.core.types import FloatType

StructField("score", FloatType)
df.with_column("float_value", col("decimal_str").cast(FloatType))

DoubleType

Represents 64-bit floating-point numbers (higher precision than FloatType).
from fenic.core.types import DoubleType

StructField("precise_value", DoubleType)
df.with_column("double_value", col("high_precision").cast(DoubleType))

BooleanType

Represents boolean True/False values.
from fenic.core.types import BooleanType

StructField("is_active", BooleanType)
df.with_column("bool_value", col("flag").cast(BooleanType))

DateType

Represents date values (year, month, day).
from fenic.core.types import DateType

StructField("birth_date", DateType)
df.with_column("date_value", col("date_string").cast(DateType))

TimestampType

Represents timestamp values with date and time.
from fenic.core.types import TimestampType

StructField("created_at", TimestampType)
df.with_column("ts_value", col("timestamp_str").cast(TimestampType))

Composite Types

ArrayType

Represents a homogeneous variable-length array (list) of elements.
Parameters:
  • element_type (DataType, required): The data type of each element in the array
from fenic.core.types import ArrayType, StringType

# Define schema with array
ArrayType(StringType)
ArrayType(element_type=StringType)

# Example: tags column
StructField("tags", ArrayType(StringType))

Working with Arrays

from fenic.api.functions import col, array

# Create array column
df.with_column("tags", array(["tag1", "tag2", "tag3"]))

# Explode array into rows
df.select("id", col("tags").explode())

# Array length
df.with_column("tag_count", col("tags").list.len())

# Access array element
df.with_column("first_tag", col("tags").list.get(0))

StructType

Represents a struct (record) with named fields.
Parameters:
  • struct_fields (List[StructField], required): List of field definitions (name and type pairs)
from fenic.core.types import StructType, StructField, StringType, IntegerType

address_type = StructType([
    StructField("street", StringType),
    StructField("city", StringType),
    StructField("zip_code", IntegerType)
])

StructField("address", address_type)

Working with Structs

from fenic.api.functions import col

# Access struct field
df.select(col("address").struct.field("city"))

# Unnest struct into separate columns
df.unnest("address")  # Creates: address_street, address_city, address_zip_code

# Create struct from columns
from fenic.api.functions import struct
df.with_column(
    "location",
    struct([col("latitude"), col("longitude")])
)

Logical Types

Logical types are specialized string types that preserve semantic meaning for AI operations.

EmbeddingType

Represents a fixed-length embedding vector.
Parameters:
  • dimensions (int, required): Number of dimensions in the embedding vector
  • embedding_model (str, required): Name of the model used to generate the embedding
from fenic.core.types import EmbeddingType

# OpenAI text-embedding-3-small (1536 dimensions)
EmbeddingType(1536, embedding_model="text-embedding-3-small")

# Cohere embed-v4 (1024 dimensions)
EmbeddingType(1024, embedding_model="embed-v4.0")

Generating Embeddings

from fenic.api.functions import semantic

# Generate embeddings
df = df.with_column(
    "text_embeddings",
    semantic.embed(col("text_column"))
)

# The column will have EmbeddingType(dimensions, model)
# based on your configured embedding model
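An embedding column holds fixed-length float vectors, which downstream operations compare by similarity. As intuition for what those vectors are for, here is a plain-Python sketch of cosine similarity (independent of Fenic's own semantic operators, which may expose similarity natively):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

The fixed `dimensions` in EmbeddingType matters precisely because operations like this require equal-length vectors.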

MarkdownType

Represents a string containing Markdown-formatted text.
from fenic.core.types import MarkdownType

StructField("document", MarkdownType)

Use Cases

  • Storing formatted documentation
  • Output from PDF parsing
  • Rich text content for LLM processing

from fenic.api.functions import semantic

# Parse PDF to markdown
df = df.with_column(
    "markdown_content",
    semantic.parse_pdf(col("pdf_path"))
)
# Result column has MarkdownType

HtmlType

Represents a string containing raw HTML markup.
from fenic.core.types import HtmlType

StructField("webpage_content", HtmlType)

JsonType

Represents a string containing valid JSON data.
from fenic.core.types import JsonType

StructField("api_response", JsonType)

Working with JSON

from fenic.api.functions import col

# Parse JSON string to struct
df.with_column(
    "parsed",
    col("json_column").str.json_extract()
)

# Convert to JSON string
df.with_column(
    "json_str",
    col("struct_column").to_json()
)
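Under the hood a JsonType column is still a string; the round trip it represents is the familiar serialize/parse cycle, shown here with Python's standard json module (Fenic's own helpers operate on whole columns rather than single values):

```python
import json

# A JSON string, as a JsonType column would store it
api_response = '{"status": "ok", "items": [1, 2, 3]}'

# Parsing yields structured data (the analogue of extracting to a struct)
parsed = json.loads(api_response)
print(parsed["status"])      # ok
print(len(parsed["items"]))  # 3

# Serializing structured data back to a JSON string
round_tripped = json.dumps(parsed)
```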

TranscriptType

Represents a string containing a transcript in a specific format.
Parameters:
  • format (Literal['generic', 'srt', 'webvtt'], required): The transcript format
from fenic.core.types import TranscriptType

# Generic transcript format
TranscriptType(format="generic")

StructField("transcript", TranscriptType(format="generic"))

DocumentPathType

Represents a string containing a document’s local (file system) or remote (URL) path.
from fenic.core.types import DocumentPathType

# PDF path type
DocumentPathType(format="pdf")

StructField("pdf_path", DocumentPathType(format="pdf"))

Type Inspection

Getting DataFrame Schema

# Get schema
schema = df.schema
print(schema)
# Schema([
#     ColumnField('name', StringType),
#     ColumnField('age', IntegerType),
#     ColumnField('tags', ArrayType(StringType))
# ])

# Print formatted schema
df.print_schema()
# root
#  |-- name: StringType
#  |-- age: IntegerType
#  |-- tags: ArrayType(StringType)

# Get column names
columns = df.columns  # ['name', 'age', 'tags']

# Get specific field type
name_field = schema.field("name")
print(name_field.data_type)  # StringType

Type Checking

from fenic.core.types import StringType, IntegerType, ArrayType

# Check if types match
StringType == StringType  # True
StringType == IntegerType  # False

# Check array element type
array_type = ArrayType(StringType)
array_type.element_type == StringType  # True

# Check struct fields
from fenic.core.types import StructType, StructField

struct_type = StructType([
    StructField("name", StringType),
    StructField("age", IntegerType)
])

struct_type.struct_fields[0].name  # "name"
struct_type.struct_fields[0].data_type  # StringType

Type Casting

from fenic.api.functions import col
from fenic.core.types import StringType, IntegerType, DoubleType

# Cast to string
df.with_column("age_str", col("age").cast(StringType))

# Cast to integer
df.with_column("age_int", col("age_str").cast(IntegerType))

# Cast to double
df.with_column("score_double", col("score").cast(DoubleType))

Schema Definition

Define explicit schemas for reading data:
from fenic.core.types import (
    Schema,
    ColumnField,
    StringType,
    IntegerType,
    ArrayType,
    StructType,
    StructField
)

schema = Schema([
    ColumnField("id", IntegerType),
    ColumnField("name", StringType),
    ColumnField("tags", ArrayType(StringType)),
    ColumnField("metadata", StructType([
        StructField("created_at", StringType),
        StructField("version", IntegerType)
    ]))
])

df = session.read.csv("data.csv", schema=schema)

Type Inference

Fenic automatically infers types when reading data:
# From CSV - infers types from data
df = session.read.csv("data.csv")

# From Parquet - uses Parquet schema
df = session.read.parquet("data.parquet")

# From dictionary - infers from Python types
df = session.create_dataframe({
    "name": ["Alice", "Bob"],      # -> StringType
    "age": [25, 30],                # -> IntegerType
    "score": [95.5, 87.3],          # -> DoubleType
    "active": [True, False]         # -> BooleanType
})
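The dictionary case follows Python's own types. A minimal sketch of that mapping (a hypothetical helper for illustration, not Fenic's actual inference code):

```python
def infer_type_name(values: list) -> str:
    """Map a column of Python values to a Fenic-style type name (sketch)."""
    sample = values[0]
    if isinstance(sample, bool):  # check bool before int: bool subclasses int
        return "BooleanType"
    if isinstance(sample, int):
        return "IntegerType"
    if isinstance(sample, float):
        return "DoubleType"
    if isinstance(sample, str):
        return "StringType"
    raise TypeError(f"unsupported value type: {type(sample).__name__}")

print(infer_type_name(["Alice", "Bob"]))  # StringType
print(infer_type_name([25, 30]))          # IntegerType
print(infer_type_name([95.5, 87.3]))      # DoubleType
print(infer_type_name([True, False]))     # BooleanType
```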

Best Practices

Logical types preserve semantic information for LLM operations:
# Good: Preserves type information
StructField("document", MarkdownType)
StructField("embeddings", EmbeddingType(1536, "text-embedding-3-small"))

# Avoid: Loses semantic meaning
StructField("document", StringType)

Choose appropriate numeric types based on precision needs:
# For precise calculations
StructField("price", DoubleType)  # 64-bit precision

# For memory efficiency
StructField("count", IntegerType)
StructField("ratio", FloatType)  # 32-bit sufficient

Explicit schemas prevent inference errors:
# Good: Explicit schema
schema = Schema([
    ColumnField("id", IntegerType),
    ColumnField("metadata", StructType([
        StructField("tags", ArrayType(StringType))
    ]))
])
df = session.read.json("data.json", schema=schema)

Match embedding dimensions to your model:
# text-embedding-3-small: 1536 dimensions
EmbeddingType(1536, "text-embedding-3-small")

# text-embedding-3-large: 3072 dimensions
EmbeddingType(3072, "text-embedding-3-large")

# Cohere embed-v4: 1024 dimensions
EmbeddingType(1024, "embed-v4.0")

Type Compatibility

Numeric Type Hierarchy

IntegerType
    ↓ (can cast to)
FloatType
    ↓ (can cast to)
DoubleType
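The hierarchy reflects widening precision. A plain-Python way to see why FloatType (32-bit) can lose digits that DoubleType (64-bit) keeps, using the standard struct module to force a value through a 32-bit representation:

```python
import struct

value = 0.1234567890123456789  # more digits than 32 bits can hold

# Round-trip through a 32-bit float: trailing digits are lost
as_float32 = struct.unpack("f", struct.pack("f", value))[0]

# Python floats are already 64-bit doubles, so this round trip is lossless
as_float64 = struct.unpack("d", struct.pack("d", value))[0]

print(as_float32 == value)  # False: 32 bits dropped precision
print(as_float64 == value)  # True
```

This is why casting down the hierarchy (DoubleType to FloatType) can silently lose precision, while casting up cannot.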

String to Other Types

# String can be cast to most types
StringType → IntegerType    # "123" → 123
StringType → DoubleType     # "123.45" → 123.45
StringType → BooleanType    # "true" → True
StringType → DateType       # "2024-01-01" → Date(2024, 1, 1)
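For intuition, these casts parallel plain-Python parsing, sketched below with the standard library (note: Fenic's exact cast rules, e.g. which boolean spellings are accepted, may differ in detail):

```python
from datetime import date

# StringType -> IntegerType
as_int = int("123")           # 123

# StringType -> DoubleType
as_double = float("123.45")   # 123.45

# StringType -> BooleanType (one common parse rule; Fenic's may differ)
as_bool = "true".strip().lower() in ("true", "1")

# StringType -> DateType (ISO-8601 form)
as_date = date.fromisoformat("2024-01-01")
```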

Logical Types

Logical types are specialized strings; they can be used anywhere strings are accepted, but they preserve additional semantic meaning:
# These are all string-based
MarkdownType   # String with markdown formatting
JsonType       # String with JSON content
HtmlType       # String with HTML content
TranscriptType # String with transcript content

Next Steps

DataFrames

Work with typed data in DataFrames

Semantic Operators

Use logical types with LLM operations
